tshark alternative (tcpdump?) for long-time on-the-fly capturing and analysis

Question

Hi, I have developed a Java program in which one thread runs a tshark command, reads its command-line output, and analyses it in real time.

The network topology is that dozens of clients request HTTP content from a server, and I'm running the program on the server:8080.

The tshark command I'm using is:

tshark -B 40 -i any -l -f tcp -t e -n -Y tcp.port==8080

The requirements that made me choose tshark are:

I need to identify packets that contain an HTTP GET request
I need to see each data packet that is sent from the server:8080 to any client (most likely with a length of 1516 bytes)

I just discovered that tshark's memory footprint keeps increasing (which has been extensively discussed on this site), which makes it unsuitable for long-term running (months). For example, if I run tshark for 10 minutes on my server, its RAM usage reaches 4GB (according to "top"). Note that I need to analyse each packet on-the-fly, so a ring buffer is not an option either.

I understand that many folks on this site recommend tcpdump as an alternative to tshark. However, for performance reasons, I cannot afford to let tcpdump output every single packet's payload and then look for HTTP GET.

Ideally, I'm looking for a program that can produce output like this to command-line (I'm using tcpdump output as a template):

[timestamp] IP user1.38572 > server.8080 ...
HTTP GET url/url...
[timestamp] IP server.8080 > user1.38572 flags [.] ack xxx, length 1516
[timestamp] IP server.8080 > user1.38572 flags [.] ack xxx, length 1516
[timestamp] IP server.8080 > user1.38572 flags [.] ack xxx, length 1516

Is there a one-liner for tcpdump that can do this? Any other advice is highly appreciated.

Answer 1

1

You can use tcpdump to do what you want. The filters are pretty powerful and flexible. In your case, you probably want anything headed for a given port (say 80), and with the string 'GET ' as the first 4 bytes in the payload. You can use the expression below to create such a filter:

( dst port 80 ) and ( tcp[20:4] == 0x47455420 )

The tcp[20:4] == 0x47455420 part tells tcpdump to keep anything where the 4 bytes starting at byte #20 in the TCP segment equal the bytes 0x47 0x45 0x54 0x20 (which is just hex for 'GET '). The number of bytes to match must be 1, 2, or 4, so you need that 4th byte corresponding to the space. This assumes, of course, that you have a 20-byte TCP header with no options.
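
As a quick sanity check of that constant, here is a minimal Python sketch (not part of tcpdump; the helper name bpf_word is made up for illustration) that turns a short ASCII prefix into the hex literal the filter compares against:

```python
def bpf_word(prefix: bytes) -> str:
    """Return the hex literal for a 1-, 2- or 4-byte prefix,
    as it would appear on the right-hand side of tcp[off:len] == ..."""
    if len(prefix) not in (1, 2, 4):
        raise ValueError("a filter comparison loads 1, 2 or 4 bytes at a time")
    # bytes.hex() renders the bytes in order, i.e. network (big-endian) order
    return "0x" + prefix.hex()


print(bpf_word(b"GET "))
```

For 'GET ' this should give back the same 0x47455420 used in the filter above.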

As far as performance goes (if you are running a modern Linux system), tcpdump should be compiling the filter into bytecode and handing it off to the kernel to execute against packets as they come in. If you are interested, you can actually see the little program it creates using the -d option. If you are load/latency tolerant enough that you were considering running tshark on the server anyway, you will probably be OK.

tcpdump -d '( dst port 80 ) and ( tcp[20:4] == 0x47455420 )'
(000) ldh      [12]
(001) jeq      #0x86dd          jt 14   jf 2
(002) jeq      #0x800           jt 3    jf 14
(003) ldb      [23]
(004) jeq      #0x84            jt 14   jf 5
(005) jeq      #0x6             jt 6    jf 14
(006) ldh      [20]
(007) jset     #0x1fff          jt 14   jf 8
(008) ldxb     4*([14]&0xf)
(009) ldh      [x + 16]
(010) jeq      #0x50            jt 11   jf 14
(011) ld       [x + 34]
(012) jeq      #0x47455420      jt 13   jf 14
(013) ret      #65535
(014) ret      #0

answered 03 Nov '16, 14:00

ryber

1

I'm afraid the filter suggested above fails to match GET packets whose TCP header carries options. On the Wireshark wiki page on capture filters, there is a capture filter which addresses this:

port 80 and tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x47455420

Even with this filter, there is still a theoretical chance of false positives if a GET string occurs somewhere in the headers of an HTTP request which spans more than a single segment and falls at the beginning of a TCP segment. Also, I would add dst before port 80, as we are not interested in packets carrying HTTP responses and their bodies; these are much more likely to contain a GET string than the headers of requests.
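
To illustrate what the (tcp[12:1] & 0xf0) >> 2 part computes, here is a small Python sketch that mirrors the arithmetic on raw TCP segment bytes. The two segments are hand-built placeholders (all header fields zeroed except the data-offset byte), not captured traffic, and the function names are invented for this example:

```python
def payload_offset(tcp_segment: bytes) -> int:
    """Mirror (tcp[12:1] & 0xf0) >> 2: the upper nibble of byte 12 is the
    header length in 32-bit words, so shifting right by 2 yields bytes."""
    return (tcp_segment[12] & 0xF0) >> 2


def starts_with_get(tcp_segment: bytes) -> bool:
    """True if the TCP payload begins with 'GET ', wherever the header ends."""
    off = payload_offset(tcp_segment)
    return tcp_segment[off:off + 4] == b"GET "


# Hypothetical segments: a 20-byte header (nibble 5) and a 32-byte header (nibble 8).
plain = bytes(12) + bytes([0x50]) + bytes(7) + b"GET / HTTP/1.1\r\n"
with_options = bytes(12) + bytes([0x80]) + bytes(19) + b"GET / HTTP/1.1\r\n"

print(payload_offset(plain), starts_with_get(plain))
print(payload_offset(with_options), starts_with_get(with_options))
# A fixed offset of 20 misses the request when options are present:
print(with_options[20:24] == b"GET ")
```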

(03 Nov '16, 14:42) sindy

Good suggestion about handling the possibility of TCP options.

The filter as written above does already have 'dst' before 'port'.

Good point about the chance of false positives in the event of a fragmented HTTP request that starts with 'GET ' as the first four bytes. How to handle that (as well as requests fragmented across multiple TCP packets) would come down to the use case. It might be easier to handle when the data is being analyzed.

@Chang could also build a more complex filter that looks for a few more characters following GET (such as '/...') to reduce false positives.
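
As a rough sketch of that idea, the Python helper below (prefix_filter is a made-up name) splits a longer prefix such as 'GET /' into 4-, 2- and 1-byte comparisons and emits the filter text. I have not run the generated expression through tcpdump, so treat it as an assumption and check it with tcpdump -d before relying on it:

```python
def prefix_filter(prefix: bytes, port: int) -> str:
    """Build a capture filter matching `prefix` at the start of the TCP
    payload of packets sent to `port`, honouring TCP options."""
    base = "((tcp[12:1] & 0xf0) >> 2)"   # payload offset, as in the wiki filter
    terms = [f"dst port {port}"]
    pos = 0
    while pos < len(prefix):
        remaining = len(prefix) - pos
        # a single comparison can cover 4, 2 or 1 bytes
        size = 4 if remaining >= 4 else 2 if remaining >= 2 else 1
        chunk = prefix[pos:pos + size]
        index = base if pos == 0 else f"{base} + {pos}"
        terms.append(f"tcp[{index}:{size}] = 0x{chunk.hex()}")
        pos += size
    return " and ".join(terms)


print(prefix_filter(b"GET /", 8080))
```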

(03 Nov '16, 14:55) ryber

The filter as written above does already have dst before port.

Yours did, but the one at the wiki page didn't, that's why I've mentioned that.

(03 Nov '16, 15:00) sindy

Thanks @ryber and @sindy, I have tried the capture filters you mentioned above, and I can see that TCP packets containing HTTP GET requests are correctly captured.

However, this still does not meet my requirements:

I need to see the URL of the GET request.
I need to see all other data packets sent from server:8080 to client printed out too (in the format of ACKs sent as segmented HTTP response).

Any advice?

(04 Nov '16, 02:29) Chang

I just tried using the filter below (changed "and" to "or"):

port 8080 or tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x47455420

and used -A option of tcpdump to print out the payload content.

Now I can see the GET request content (great!), but I can also see tons of data of HTTP response printed out (yuck!). For the HTTP response packets, I just need to see one line showing the packet without the payload (like when there is no -A option).

(04 Nov '16, 02:41) Chang

To see the url of the GET request, it should be enough to use tcpdump's command line options making it print the contents of the packets, and apply a sed or awk filter on it to extract only the url.

But to see the contents of the server responses, you need to track the context (state), and tracking state is exactly what causes the continuous growth of memory consumption in tshark. So I'm afraid that your only way out is to let tcpdump or dumpcap capture into a circular buffer of files, using only a basic capture filter like tcp port 8080, and some automated post-processing of these files.

(04 Nov '16, 02:45) sindy

1

I can also see tons of data of HTTP response printed out

In fact, the or between the port 8080 and the rest of the filter expression effectively masks out the rest of the expression for packets which have source or destination port 8080. Also, you cannot use -A for some packets and not use it for others for the same instance of tcpdump.

(04 Nov '16, 02:51) sindy

@sindy I don't need to track the states of HTTP response - all I need to see is the packets themselves that are being sent from server to client (so that I can manually add up packet-by-packet how many bytes have been sent in response to a previous GET request).

Based on your comments so far, I think sed might be the way to do this - I just need to filter & print 2 types of lines:

All tcpdump regular output (i.e., the output without -A)
The line containing GET and URL

Unfortunately I'm not very good at regex...can you please provide some initial guidelines on where I should start, thanks!

(04 Nov '16, 03:14) Chang

2

We are getting close to the edge of the site scope...

I'm afraid you neglect the possibility of packet loss and subsequent retransmissions, i.e. you assume that each GET request in client->server direction is followed by properly ordered packets of a matching response in the server->client direction. This is not always the case, that's why I've mentioned the context - HTTP uses TCP as transport, and in tshark or wireshark, the TCP dissector takes care of hiding network problems from the HTTP dissector, not giving it any piece of data until all previous ones have been completely received, and not giving it any retransmitted data once again.

Leaving that aside, you may use -x rather than -A, making tcpdump print out hex dumps of the packets rather than their text representations, as the contents of the responses may be encoded in many ways, not all of them printable. Your post-process would then evaluate the headers and handle the requests one way and the responses another way.

Or, if your software can handle two input pipes, you can feed one with the GETs as ASCII and the other one with the responses as hex dumps.
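
A minimal sketch of how the two tcpdump invocations could be assembled, in Python for brevity (the same argument lists would work with Java's ProcessBuilder). The function name, the port and the interface are assumptions taken from the question; -l, -n, -tt, -A and -x are standard tcpdump options:

```python
import shlex

# Payload starts with 'GET ', with the offset taken from the TCP header length
GET_MATCH = "tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x47455420"


def build_commands(port: int, iface: str = "any"):
    """Return (requests_cmd, responses_cmd) argument lists for two
    independent tcpdump processes, one per pipe."""
    # -l: line-buffered output, -n: no name resolution, -tt: epoch timestamps
    common = ["tcpdump", "-i", iface, "-l", "-n", "-tt"]
    requests = common + ["-A", f"dst port {port} and {GET_MATCH}"]   # GETs as ASCII
    responses = common + ["-x", f"src port {port}"]                  # responses as hex
    return requests, responses


requests_cmd, responses_cmd = build_commands(8080)
print(shlex.join(requests_cmd))
print(shlex.join(responses_cmd))
```

Each list can be handed to subprocess.Popen with stdout=subprocess.PIPE and read line by line. If only the summary line of each response is needed, dropping -x from the second command should be enough.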

To answer your question, if the request packet looks as follows:

GET some/url/ HTTP/1.1\r\n HeaderX: ...\r\n HeaderY: ...\r\n

you may use sed 's/^GET \([^ ]*\) HTTP.*/\1/' to extract only some/url/ from the first line, but I cannot tell you how to get rid of the rest of the lines, as that largely depends on your system; handling of control characters differs a lot.
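
If the post-processing is done in a program rather than in sed, the same idea can be sketched in Python as below. The sample lines are invented to resemble tcpdump -tt -n -A output and may not match your version's format exactly. The request pattern is deliberately not anchored at the start of the line, on the assumption that -A may print header bytes in front of the request line:

```python
import re

# Summary line: "<epoch ts> ... <src> > <dst>: ... length <n>"
SUMMARY = re.compile(r"^(\d+\.\d+) .*?(\S+) > (\S+): .*\blength (\d+)")
# Request line, searched anywhere in the line (see the assumption above)
REQUEST = re.compile(r"GET (\S+) HTTP/")


def condense(lines):
    """Yield (ts, src, dst, length) for summary lines and ('GET', url)
    for request lines; every other payload line is dropped."""
    for line in lines:
        m = SUMMARY.match(line)
        if m:
            yield (m.group(1), m.group(2), m.group(3), int(m.group(4)))
        m = REQUEST.search(line)
        if m:
            yield ("GET", m.group(1))


# Made-up sample resembling tcpdump output, for illustration only
sample = [
    "1478229000.123456 IP 10.0.0.5.38572 > 10.0.0.1.8080: Flags [P.], seq 1:120, ack 1, win 229, length 119",
    "E.....@.@.......GET /some/url HTTP/1.1",
    "Host: example.test",
    "1478229000.125000 IP 10.0.0.1.8080 > 10.0.0.5.38572: Flags [.], seq 1:1449, ack 120, win 227, length 1448",
]

for item in condense(sample):
    print(item)
```

Summing the length field per (src, dst) pair then gives the number of bytes sent in each direction, retransmissions included.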

(04 Nov '16, 04:07) sindy

@sindy when I was using tshark, I had already disabled all dissecting options, and I have already built my software to handle retransmissions on a per-segment level.

I think I will follow your suggestion on using 2 tcpdump input pipes in parallel and work something out...thanks again.

(04 Nov '16, 04:32) Chang


Answer 2

0

Seems to me like you should be looking at a tool like Snort, Suricata or Bro instead, which are made for on-the-fly packet content matching and alerting.

answered 01 Nov '16, 07:47

Jasper ♦♦

Thanks, I've looked at Snort's "sniffer mode" - how is this different from plain tcpdump?

(01 Nov '16, 08:06) Chang

I think the sniffer mode isn't that much different. What you need is the pattern matching capabilities of Snort which can run for years without running into the same memory trouble while being able to match patterns on-the-fly. I think you'll need to do a full packet capture at the same time, and then script something to use the hits found by Snort to extract all packets belonging to that hit.

(01 Nov '16, 08:11) Jasper ♦♦

@Jasper are you suggesting that there's no way to do "sniffer" and "alert" mode at the same time? I.e., I might need to start 2 separate threads with one running on sniffer mode and the other one that alerts me when an HTTP GET is detected?

(02 Nov '16, 02:50) Chang