This is a static archive of our old Q&A Site. Please post any new questions and answers at ask.wireshark.org.

Capturing url from tcp packets

Hi all, I have gone through packet-tcp.c, but I am not sure which section deals with extracting the URL if one exists in the TCP packets. I have the whole packet payload, but I need this specific function.

Tags: url, dissector, tcp

asked 12 Aug '13, 06:09

newbie14
26●3●3●8
accept rate: 0%

3 Answers:

I have gone through the packet-tcp.c but I am not sure which section deals with extracting the url

There are no URLs in TCP itself; URLs are an HTTP concept. You need to look in packet-http.c.

UPDATE

I am capturing the packets via pf_ring. So in my case I am interested for fly-by analysis and later store into database.

OK, in that case Wireshark is kind of overkill for you. Please have a look at the following tools (and their code): ngrep, xplico, tcpextract, etc.

http://ngrep.sourceforge.net/
http://www.xplico.org/
https://pypi.python.org/pypi/tcpextract

Regards
Kurt

answered 12 Aug '13, 06:10

Kurt Knochner ♦
24.8k●10●39●237
accept rate: 15%

edited 13 Aug '13, 04:38

I am lost here. Say I have a TCP packet: how do I decide whether it will contain a URL or not? I will go through packet-http.c, but where should I start? Where does each of these packet-*** files start from? Sorry, I am very new to this.

(12 Aug '13, 06:13) newbie14

Take a look at the function basic_request_dissector() and what is called therein.

BTW: Why do you need 'that specific function'? Maybe there is a better solution to your problem.

So say I got a tcp packet how to decide if it will have url or not

Within your own Wireshark dissector or in general?

(12 Aug '13, 06:17) Kurt Knochner ♦

I have captured packets using another tool, and I want to read the URL in those packets, for which I have the whole payload. What is your best suggestion? I know there are solutions available, so there is no point in reinventing the wheel. I have seen basic_request_dissector(), but I am not too sure about the parameters passed into it.

(12 Aug '13, 08:22) newbie14

I have capture packets using another tool.

Do you have those packets in a pcap file or are you interested in a fly-by analysis?

If it is a pcap file, you can still run tshark and print the payload of the packets, then use some Perl/Python scripts to search for URLs in the output with regular expressions.

(12 Aug '13, 14:21) Kurt Knochner ♦

I am capturing the packets via pf_ring, so it is all in hex format. In my case I am interested in fly-by analysis and later storing the results in a database. The issue is that I can get all the IP layer details, but the tool does not go any further, so I need to dissect the higher layers myself based on the layer types. Any idea? I prefer everything to be in C, as the capture engine is all in C too.

(12 Aug '13, 19:34) newbie14

So say I got a tcp packet how to decide if it will have url or not ?

You first decide whether it's traffic for a protocol such as HTTP that has URLs; if not, it doesn't have a URL. Wireshark decides whether traffic is HTTP based on the TCP port it's going to or from; ports such as 80 and 8080 are assumed to be HTTP.
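A minimal sketch of such a port check in C, for a capture tool that already has both TCP ports. The function name is_probably_http and the port list are illustrative assumptions, not Wireshark code, and the ports are assumed to be in host byte order already (e.g. after ntohs()):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Ports treated as HTTP in this sketch; the list is an assumption and
 * should be adjusted to the traffic you expect. */
static const uint16_t http_ports[] = { 80, 8080 };

/* Returns true if either endpoint uses one of the ports listed above.
 * Both ports must be in host byte order. */
static bool is_probably_http(uint16_t src_port, uint16_t dst_port)
{
    for (size_t i = 0; i < sizeof http_ports / sizeof http_ports[0]; i++) {
        if (src_port == http_ports[i] || dst_port == http_ports[i])
            return true;
    }
    return false;
}
```

A port match only says the traffic is likely HTTP; HTTP on other ports is missed, and non-HTTP traffic on these ports still has to be rejected by the payload parser.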

Then you have to parse the HTTP data to see whether it contains an HTTP request or response and, if it does, extract the request URL from requests and other URLs from responses (e.g., a 301 Moved Permanently response has the URL to which the item has moved).
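As a rough illustration in C, the sketch below classifies a payload as request or response by its first bytes and pulls the Location header out of a response. The names classify_http and get_location are hypothetical, the method list is deliberately short, and the code assumes the whole header block sits in one TCP segment (no reassembly):

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum http_kind { HTTP_UNKNOWN, HTTP_REQUEST, HTTP_RESPONSE };

/* True if buf[0..len) begins with the literal string prefix. */
static int starts_with(const unsigned char *buf, size_t len, const char *prefix)
{
    size_t n = strlen(prefix);
    return len >= n && memcmp(buf, prefix, n) == 0;
}

/* Classify a TCP payload by looking at its first bytes only. */
static enum http_kind classify_http(const unsigned char *buf, size_t len)
{
    static const char *const methods[] = { "GET ", "POST ", "HEAD " };

    if (starts_with(buf, len, "HTTP/1."))
        return HTTP_RESPONSE;      /* status line, e.g. "HTTP/1.1 301 ..." */
    for (size_t i = 0; i < sizeof methods / sizeof methods[0]; i++)
        if (starts_with(buf, len, methods[i]))
            return HTTP_REQUEST;   /* request line, e.g. "GET /x HTTP/1.1" */
    return HTTP_UNKNOWN;
}

/* Copy the value of the first "Location:" header line into out
 * (NUL-terminated). Header names are matched case-insensitively.
 * Returns the value length, or -1 if absent or too long for out.
 * Note: the whole payload is scanned, so a matching line inside a
 * message body would also be picked up. */
static int get_location(const unsigned char *buf, size_t len,
                        char *out, size_t out_size)
{
    static const char name[] = "location:";
    const size_t name_len = sizeof name - 1;

    for (size_t i = 0; i + 2 + name_len <= len; i++) {
        if (buf[i] != '\r' || buf[i + 1] != '\n')
            continue;
        size_t j = 0;
        while (j < name_len && tolower(buf[i + 2 + j]) == name[j])
            j++;
        if (j < name_len)
            continue;
        size_t start = i + 2 + name_len;
        while (start < len && (buf[start] == ' ' || buf[start] == '\t'))
            start++;               /* skip optional whitespace */
        size_t end = start;
        while (end < len && buf[end] != '\r' && buf[end] != '\n')
            end++;
        if (end - start >= out_size)
            return -1;
        memcpy(out, buf + start, end - start);
        out[end - start] = '\0';
        return (int)(end - start);
    }
    return -1;
}
```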

(12 Aug '13, 20:08) Guy Harris ♦♦

@GuyHarris OK, I have captured the source and destination ports. So if either of them is 80, I move on to the next level; with this port I know it will be HTTP traffic, right? What I need help with now is parsing the payload to capture the URL. Thank you for your insights.

(12 Aug '13, 20:56) newbie14

see the UPDATE in my answer

(13 Aug '13, 04:26) Kurt Knochner ♦

@Kurt thank you for the links; let me go through them. I think the second link looks good.

(13 Aug '13, 10:58) newbie14

Hi Newbie,

Open the Wireshark app on your laptop and make sure the laptop/PC is connected to the internet. Then, from Wireshark, start a packet capture on the network interface. Open a browser, type a URL, and browse. Stop the packet capture. Open the pcap file and type "http" in the display filter field; you should then see the HTTP packets.

answered 12 Aug '13, 21:08

pundalik
11●1
accept rate: 0%

The problem is that I am not going to use Wireshark for the capture; I am using another tool called pf_ring. I have captured most of the data, but I still need the URL from the payload.

(12 Aug '13, 21:10) newbie14

So you have your own application running on top of a modified libpcap using PF_RING? If so, you will have to reinvent part of Wireshark/tcpdump to parse what you need of all the protocol layers up to and including HTTP; this is not really a Wireshark question.

(12 Aug '13, 21:49) Anders ♦

Yes, I am using PF_RING to capture the packets. I can determine the ports, no issue with that; it is just that I now need the URL parser, which I think is available in Wireshark, so there is no point in my reinventing the wheel?

(12 Aug '13, 22:57) newbie14

I need the url parser now which I think is available in wireshark

There is no "URL parser" in Wireshark. There is an HTTP parser in Wireshark, which is in epan/dissectors/packet-http.c, and it (and the routines it calls, such as the ones in epan/req_resp_hdrs.c) parses HTTP in its entirety. It uses a lot of other parts of Wireshark, including the "tvbuff" code in epan/tvbuff.c for handling buffers with packet data and the "protocol tree" code in epan/proto.c for constructing a tree of protocol fields. It is not code that's going to be easy to pull out of the rest of Wireshark and use by itself; given that it's part of Wireshark, and written to be part of Wireshark, it was not written to be easy to pull out of the rest of Wireshark and use by itself. I.e., it's like a wheel that depends heavily on the rest of the car, so if you don't want to reinvent the wheel, you'll have to use the transmission and axles and engine and... of the car, in addition to the wheel.

(12 Aug '13, 23:47) Guy Harris ♦♦

@Guy I appreciate your explanation. So what I should do is reinvent the wheel, because I need to store the data in a database, which is not feasible via Wireshark, right? Yes, I have looked at tvbuff, which I still don't understand as I am new to it.

(13 Aug '13, 01:20) newbie14

As @Guy has said, a lot of work is done by different parts of Wireshark before the URL is extracted. If you're just interested in the URLs, and you assume that each HTTP request starts a new TCP packet (which is usually true, although the nature of TCP does not guarantee it) and that the requested URL fits in one TCP segment (which is not true for networks with small MTUs and large request URLs), then you can skip all reassembly and just parse each TCP packet on its own.
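Parsing each packet on its own first requires finding where the TCP payload begins. A minimal sketch in C, assuming the buffer starts at the IPv4 header (any link-layer header already stripped by the capture code); the function name tcp_payload is hypothetical, and IPv6 and IP fragments are simply rejected:

```c
#include <stddef.h>
#include <stdint.h>

/* Locate the TCP payload in a buffer that starts at the IPv4 header.
 * On success returns a pointer to the payload and stores its length in
 * *payload_len; returns NULL if the packet is not unfragmented IPv4/TCP
 * or is too short. caplen is the number of captured bytes. */
static const uint8_t *tcp_payload(const uint8_t *ip, size_t caplen,
                                  size_t *payload_len)
{
    if (caplen < 20 || (ip[0] >> 4) != 4)
        return NULL;                        /* not IPv4 */

    size_t ip_hlen  = (size_t)(ip[0] & 0x0F) * 4;   /* IHL in 32-bit words */
    size_t ip_total = ((size_t)ip[2] << 8) | ip[3]; /* total length field */
    size_t frag_off = ((size_t)(ip[6] & 0x1F) << 8) | ip[7];

    if (ip_hlen < 20 || ip[9] != 6 || frag_off != 0)
        return NULL;                        /* protocol 6 = TCP */
    if (ip_total > caplen)
        ip_total = caplen;                  /* capture was truncated */
    if (ip_total < ip_hlen + 20)
        return NULL;                        /* no room for a TCP header */

    const uint8_t *tcp = ip + ip_hlen;
    size_t tcp_hlen = (size_t)(tcp[12] >> 4) * 4;   /* data offset */
    if (tcp_hlen < 20 || ip_hlen + tcp_hlen > ip_total)
        return NULL;

    *payload_len = ip_total - ip_hlen - tcp_hlen;
    return tcp + tcp_hlen;
}
```

The source and destination ports are the first two 16-bit big-endian fields of the TCP header (bytes 0-1 and 2-3 at ip + ip_hlen), so the same offsets can feed a port check.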

When parsing the payload, look for a pattern like "<method> <url> HTTP/<version>" at the start of each TCP payload, where <method> can be "GET", "POST", "HEAD", etc. Look for the methods in which you're interested. The <url> normally starts with "/" (requests sent to a proxy carry an absolute URL instead) and will not contain spaces. Finally, <version> will currently be "1.0" or "1.1". In short, anything between "GET " and " HTTP/1.", "POST " and " HTTP/1." or "HEAD " and " HTTP/1." (watch the spaces) will be your URL and should be quite easy to extract.
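A minimal C sketch of this pattern match. The function name extract_request_target and the three-method list are illustrative assumptions. The payload bytes are already ASCII text, so they can be compared directly with memcmp() without converting any hex dump first; the buffer is not assumed to be NUL-terminated:

```c
#include <stddef.h>
#include <string.h>

/* Copy the URL part of an HTTP/1.x request line into out (NUL-terminated).
 * buf/len is the raw TCP payload. Returns the URL length, or -1 if the
 * payload does not start with "<method> <url> HTTP/1." or the URL does
 * not fit in out. */
static int extract_request_target(const unsigned char *buf, size_t len,
                                  char *out, size_t out_size)
{
    static const char *const methods[] = { "GET ", "POST ", "HEAD " };
    static const char tail[] = " HTTP/1.";
    const size_t tail_len = sizeof tail - 1;
    size_t start = 0;

    for (size_t i = 0; i < sizeof methods / sizeof methods[0]; i++) {
        size_t n = strlen(methods[i]);
        if (len >= n && memcmp(buf, methods[i], n) == 0) {
            start = n;                  /* URL begins right after "GET " */
            break;
        }
    }
    if (start == 0)
        return -1;                      /* no known method at offset 0 */

    /* The URL ends at the next space; hitting CR or LF first means the
     * line is not a request line. */
    size_t end = start;
    while (end < len && buf[end] != ' ' && buf[end] != '\r' && buf[end] != '\n')
        end++;
    if (end == start || len - end < tail_len ||
        memcmp(buf + end, tail, tail_len) != 0)
        return -1;
    if (end - start >= out_size)
        return -1;                      /* caller's buffer is too small */

    memcpy(out, buf + start, end - start);
    out[end - start] = '\0';
    return (int)(end - start);
}
```

What comes back is usually only the path and query string (e.g. "/index.html?x=1"); to build a full URL for the database, the value of the Host header from the same request would have to be prepended.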

Downsides of this method:

URLs in requests that do not start at a packet boundary will not be extracted
URLs that are too large to fit in one TCP segment will not be extracted
Any packet that has the same pattern at the start of its payload will be treated as containing a URL (think of a web page containing an example of an HTTP request)

So depending on how foolproof your tool must be, this could be a simple solution to your problem... If you need 100% exact results, there is nothing you can do but follow all the steps that Wireshark takes (TCP reassembly and fully parsing the reassembled TCP stream to determine exactly where a new request starts).

answered 13 Aug '13, 04:08

SYN-bit ♦♦
17.1k●9●57●245
accept rate: 20%

@SYN-bit so, if I understand your answer correctly (please correct me if not), what I need to do now, if it is a TCP packet, is transform all the hex payload values into human-readable form. Next, I look for anything between "GET " and " HTTP/1.", "POST " and " HTTP/1." or "HEAD " and " HTTP/1.". Is that correct?

(13 Aug '13, 10:54) newbie14