Processing a huge .pcap file with tshark

Question

I am trying to process a huge .pcap file, more than 50 GB in size. I also have a separate file with the following parameters for each flow: the start time (epoch time), the end time (epoch time), and the socket information (source IP, destination IP, source port, destination port and transport protocol). I want to extract every packet of each flow and write a separate .csv file per flow, containing the information of every packet of that flow. The flows in the file are either TCP or UDP.

I've written a bash script that starts by splitting the file into 100 MB files so that it can be processed. Then I read the epoch time of the first packet and of the last packet of each 100 MB file and write them to a file. After that, I read the separate file line by line and compare the start and end times of each flow with the first and last packet times of each 100 MB file. That way I know which files I have to process to gather all the packets of a flow. Then I apply a filter with tshark and save the parameters I need to a separate file. As I said, I create a different file for each flow.
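
The chunk-selection step described above comes down to an interval-overlap test. A minimal sketch, assuming a hypothetical index file with one line per chunk in the form `name first_epoch last_epoch` (the helper name and the file layout are illustrative, not taken from the original script):

```shell
#!/bin/sh
# chunks_for_flow INDEX START END
# Prints the names of all chunks whose [first, last] time range
# overlaps the flow's [START, END] range.
# INDEX is a hypothetical file with lines: <chunk name> <first epoch> <last epoch>
chunks_for_flow() {
    awk -v s="$2" -v e="$3" '
        # two closed intervals overlap when each one starts
        # no later than the other one ends
        ($2 + 0) <= (e + 0) && ($3 + 0) >= (s + 0) { print $1 }
    ' "$1"
}

# Usage:
# for f in $(chunks_for_flow chunks.idx "$start_time" "$end_time"); do
#     ...run tshark on "$f"...
# done
```

Only the chunks printed by the function need to be read for a given flow; a flow that spans a chunk boundary simply yields more than one name.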

This process takes a lot of time. How can I speed it up? I've been thinking about the following possibilities:

1) Use 10 MB files instead of 100 MB files. Could that improve the speed? Is there any known size that yields good performance when processing a huge file?

2) Once I've read a flow with tshark, is it possible to delete that flow from the pcap file? Is it possible to generate a separate .pcap file with only that flow and delete it from the original file? That way I would read each flow only once. It looks like tshark filters by comparing the desired information against every packet in the .pcap file, and that slows things down a lot.

Thanks

Edit:

I read the start time (epoch), end time (epoch), source IP, destination IP, source port, destination port and transport protocol from my separate file. Then, using the time information, I check which 100 MB .pcap files I should use for this flow. (If you split a large file, some flows may start in one file and end in another.) Then I run the following lines with tshark.

For TCP

tshark -r pkt.pcap$i -n -Y "frame.time_epoch >= $start_time && frame.time_epoch <= $end_time && ip.src==$local_ip && ip.dst==$remote_ip && tcp.srcport == $local_port && tcp.dstport ==$remote_port" -T fields -E separator=, -e frame.time_epoch -e frame.cap_len -e ip.src -e ip.dst -e tcp.srcport -e tcp.dstport >> temp1.csv
tshark -r pkt.pcap$i -n -Y "frame.time_epoch >= $start_time && frame.time_epoch <= $end_time && ip.dst==$local_ip && ip.src==$remote_ip && tcp.dstport == $local_port && tcp.srcport ==$remote_port" -T fields -E separator=, -e frame.time_epoch -e frame.cap_len -e ip.src -e ip.dst -e tcp.srcport -e tcp.dstport >> temp1.csv

For UDP it is the same, replacing tcp.srcport and tcp.dstport with udp.srcport and udp.dstport.

Then I add the transport protocol to the csv file

add="$transport_protocol"
awk -v d="$add" -F"," 'BEGIN { OFS = "," } {$7=d; print}' temp1.csv > flow$flow.csv

Thanks

Answer 1

What's included in "the information of every packet of that flow"?

Does it involve any dissection of the TCP/UDP payloads? Is this a one-time project or something that will be used repeatedly?

In any case, a couple of thoughts:

Disable all protocols except ethernet/ip/tcp/udp/etc. That might speed up tshark processing significantly. There's a tremendous amount of work being done by tshark to dissect all the layers of each frame.
Or: write a program (I expect it would not need to be very large) that reads the pcap file directly and does only the minimal dissection needed to get the info you want. It's been quite a while since I've done this sort of thing, but I expect there are libraries for reading pcap files.

Answer 2

1) Changing the file size will not improve speed, because your processing is the problem, not the file size.

2) you can't remove flows from pcap files, except if you parse it again and leave out all packets matching a certain filter - but that's going to make things slower, not faster.
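
On point 1: since the time goes into processing, one way to reduce it might be to read each chunk once per flow instead of once per direction. A minimal sketch, assuming the variable names from the question; `flow_filter` is a hypothetical helper (not part of tshark) that builds one display filter matching both directions of a flow:

```shell
#!/bin/sh
# flow_filter PROTO LOCAL_IP LOCAL_PORT REMOTE_IP REMOTE_PORT START END
# PROTO is "tcp" or "udp". Prints a display filter that matches the flow's
# packets in both directions within the [START, END] time window.
flow_filter() {
    proto=$1 lip=$2 lport=$3 rip=$4 rport=$5 start=$6 end=$7
    fwd="ip.src==$lip && ip.dst==$rip && $proto.srcport==$lport && $proto.dstport==$rport"
    rev="ip.src==$rip && ip.dst==$lip && $proto.srcport==$rport && $proto.dstport==$lport"
    printf '%s\n' "frame.time_epoch >= $start && frame.time_epoch <= $end && (($fwd) || ($rev))"
}

# Usage (one tshark run per chunk instead of two):
# tshark -r "pkt.pcap$i" -n \
#     -Y "$(flow_filter tcp "$local_ip" "$local_port" "$remote_ip" "$remote_port" "$start_time" "$end_time")" \
#     -T fields -E separator=, -e frame.time_epoch -e frame.cap_len \
#     -e ip.src -e ip.dst -e tcp.srcport -e tcp.dstport >> temp1.csv
```

Note that with a combined filter the packets of both directions come out interleaved in capture order, whereas the two separate runs in the question write one direction after the other.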

By the way, what exactly are you trying to get as a result? CSV, yes, but how many files and what should they contain? Maybe there is an easier way to achieve what you need, e.g. by using tools like tcpsplit.

Answer 3

tshark -r pkt.pcap$i -n -Y "frame.time_epoch >= $start_time && frame.time_epoch <= $end_time && ip.src==$local_ip && ip.dst==$remote_ip && tcp.srcport == $local_port && tcp.dstport == $remote_port" -T fields -E separator=, -e frame.time_epoch -e frame.cap_len -e ip.src -e ip.dst -e tcp.srcport -e tcp.dstport >> temp1.csv

Hm... why are you using tshark at all? All you do is write the IP addresses, the ports and the time stamps into a CSV file, but you already know the IP addresses and the ports, as you have them in variables ($local_port, $remote_port, etc.). All you actually extract from the capture file are the time stamps and the captured frame length.

Maybe I'm missing something, but your whole process to extract data seems to have another "loop" somewhere to get the IP addresses and ports, and maybe that additional step creates the extra processing time. Can you please describe in more detail what you are doing right now and what you want to achieve?

++ UPDATE ++

O.K. In your case you won't need the full dissection capabilities of Wireshark/tshark. A simple Perl script that reads the pcap file and extracts the same information would be way faster.

Sample script:

use strict;
use warnings;
use Net::Pcap;
use NetPacket::Ethernet qw(:types);
use NetPacket::IP qw(:protos);
use NetPacket::TCP;
use NetPacket::UDP;

my $pcap_file = $ARGV[0];
if (not $pcap_file) {
    die("ERROR: please give pcap file name on the cli\n");
}

my $err = undef;

# read data from pcap file.
my $pcap = pcap_open_offline($pcap_file, \$err) or die "Can't read $pcap_file : $err\n";
pcap_loop($pcap, -1, \&process_packet, "just for the demo");

# close the device
pcap_close($pcap);

# called once per packet: prints one CSV line for each IPv4 TCP/UDP packet
sub process_packet {
    my ($user_data, $header, $packet) = @_;

    my $cap_len = $header->{caplen};
    my $frame_len = $header->{len};
    # zero-pad the microseconds (tv_usec = 42 must become .000042, not .42)
    my $time_epoch = sprintf("%d.%06d", $header->{tv_sec}, $header->{tv_usec});

    my $ethernet = NetPacket::Ethernet->decode($packet);

    # skip everything that is not IPv4
    if ($ethernet->{type} != ETH_TYPE_IP) {
        return;
    }

    my $ip = NetPacket::IP->decode($ethernet->{data});

    my $src_ip = $ip->{src_ip};
    my $dst_ip = $ip->{dest_ip};

    # decode the transport header; skip everything that is not TCP or UDP
    my $payload;
    if ($ip->{proto} == IP_PROTO_TCP) {
        $payload = NetPacket::TCP->decode($ip->{data});
    } elsif ($ip->{proto} == IP_PROTO_UDP) {
        $payload = NetPacket::UDP->decode($ip->{data});
    } else {
        return;
    }

    my $src_port = $payload->{src_port};
    my $dst_port = $payload->{dest_port};

    #print "$time_epoch;$cap_len;$frame_len;$src_ip;$dst_ip;$src_port;$dst_port\n";
    print "$time_epoch,$cap_len,$src_ip,$dst_ip,$src_port,$dst_port\n";
}

Test

I tested the script and tshark against the same file (90 MByte, several http/https downloads):

tshark: ~20 seconds
script: ~4 seconds

BTW: If you need the output only for one direction (as implied by your tshark filter), you can either post process the output file or change the perl script to use filters.
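
The post-processing option can be sketched with awk. This is only an illustration, assuming the script's output format (time,cap_len,src_ip,dst_ip,src_port,dst_port); `one_direction` is a hypothetical helper name. Because the script's output carries no protocol column, a TCP and a UDP flow with the same addresses and ports would not be told apart here:

```shell
#!/bin/sh
# one_direction SRC_IP SRC_PORT DST_IP DST_PORT < packets.csv
# Reads lines of the form time,cap_len,src_ip,dst_ip,src_port,dst_port
# and prints only those sent from SRC_IP:SRC_PORT to DST_IP:DST_PORT.
one_direction() {
    awk -F, -v sip="$1" -v sport="$2" -v dip="$3" -v dport="$4" '
        $3 == sip && $4 == dip && $5 == sport && $6 == dport
    '
}

# Usage (script.pl stands for the Perl script above):
# perl script.pl pkt.pcap | one_direction "$local_ip" "$local_port" "$remote_ip" "$remote_port"
```

Swapping the two endpoints in the call gives the reverse direction.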

http://search.cpan.org/dist/Net-Pcap/Pcap.pm

BTW: Net::Pcap only supports libpcap files. So, if your capture file is pcap-ng, you’ll have to convert it first with editcap, which takes only a few seconds on a fast system.

editcap -F pcap input.pcapng output.pcap
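
If it is unclear which format a capture file has, the first four bytes tell. A minimal sketch, assuming only the standard magic numbers of the two formats; `capture_format` is a hypothetical helper name:

```shell
#!/bin/sh
# capture_format FILE
# Prints "pcapng", "pcap" or "unknown" based on the first four bytes of FILE.
capture_format() {
    magic=$(od -An -tx1 -N4 "$1" | tr -d '[:space:]')
    case "$magic" in
        # pcapng starts with a Section Header Block (type 0x0A0D0D0A)
        0a0d0d0a) echo pcapng ;;
        # classic pcap magic, either byte order, microsecond or nanosecond variant
        a1b2c3d4|d4c3b2a1|a1b23c4d|4d3cb2a1) echo pcap ;;
        *) echo unknown ;;
    esac
}

# Usage: convert only when needed
# [ "$(capture_format input.pcapng)" = pcapng ] && editcap -F pcap input.pcapng output.pcap
```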

Regards
Kurt