I am trying to process a huge .pcap file, more than 50 GB. I also have a separate file with the following parameters for each flow: start time (epoch), end time (epoch), and socket information (source IP, destination IP, source port, destination port and transport protocol). I want to extract the packets of each flow and write a separate .csv file per flow with the information of every packet of that flow. The flows in the file are either TCP or UDP.

I've written a bash script that starts by splitting the file into 100 MB files so that I can process it. Then I read the epoch time of the first and last packet of each 100 MB file and write them to a file. After that, I read the separate flow file line by line and compare the start and end times of each flow with the first and last packet times of each 100 MB file. That way I know which files I have to process to gather all the packets of a flow. Finally I apply a filter with tshark and save the parameters I need to a separate file. As I said, I create a different file for each flow.

This process takes a lot of time. How can I speed it up? I've been thinking about the following possibilities:

1) Use 10 MB files instead of 100 MB files. Could that improve the speed? Is there a known size that yields good performance when processing a huge file?

2) Once I've read a flow with tshark, is it possible to delete that flow from the pcap file? Is it possible to generate a separate .pcap file with only that flow and delete it from the original file? That way I would only read each flow once. It looks like tshark filters by comparing the desired information against every packet in the .pcap file, and that slows things down a lot.

Thanks

Edit: I read the start time (epoch), end time (epoch), source IP, destination IP, source port, destination port and transport protocol from my separate file.
Then, using the time information, I check which 100 MB .pcap files I should use for this flow (if you split a large file, some flows may start in one file and end in another). Then I run the following lines with tshark. For TCP:
For UDP it is the same, replacing tcp.srcport and so on with udp.srcport and so on. Then I add the transport protocol to the CSV file.
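The chunk-selection step described above amounts to an interval-overlap test between a flow's time range and each chunk's time range. A minimal Python sketch, assuming the per-chunk first and last packet times have already been collected; the chunk names, the timestamps and the function name `chunks_for_flow` are hypothetical and only serve as illustration:

```python
from typing import List, Tuple

# (file name, epoch time of first packet, epoch time of last packet)
Chunk = Tuple[str, float, float]

def chunks_for_flow(chunks: List[Chunk], flow_start: float, flow_end: float) -> List[str]:
    """Return the chunk files whose time range overlaps [flow_start, flow_end]."""
    return [
        name
        for name, first_ts, last_ts in chunks
        # Two closed intervals overlap when each one starts before the other ends.
        if first_ts <= flow_end and last_ts >= flow_start
    ]

# Hypothetical index of three consecutive chunks.
chunks = [
    ("chunk_000.pcap", 1000.0, 1100.0),
    ("chunk_001.pcap", 1100.0, 1250.0),
    ("chunk_002.pcap", 1250.0, 1400.0),
]

# A flow that starts in the first chunk and ends in the second one.
selected = chunks_for_flow(chunks, 1090.0, 1120.0)
print(selected)  # ['chunk_000.pcap', 'chunk_001.pcap']
```

Note that this only decides which chunks to read. Every selected chunk is still scanned in full for each flow, which is likely where most of the time goes.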
Thanks

asked 30 Oct '14, 14:32 Xavi1618
edited 30 Oct '14, 15:01
3 Answers:
What's included in "the information of every packet of that flow"? Does it involve any dissection of the TCP/UDP payloads? Is this a one-time project or something that will be used repeatedly? In any case, a couple of thoughts:
answered 30 Oct '14, 14:53 Bill Meier ♦♦
1) A smaller file size will not improve speed, because your processing is the problem, not the file size.

2) You can't remove flows from pcap files, except by parsing the file again and leaving out all packets that match a certain filter, but that's going to make things slower, not faster.

By the way, what exactly are you trying to get as a result? CSV, yes, but how many files and what should they contain? Maybe there is an easier way to achieve what you need, e.g. by using tools like tcpsplit.

answered 30 Oct '14, 14:49 Jasper ♦♦

Edited, does it help? (30 Oct '14, 15:00) Xavi1618

Processing smaller files helps a lot. With 1 MB to 5 MB files the improvement is considerable; I just tried it out. However, the time the tcpdump command takes to split the file is also considerable. (31 Oct '14, 14:04) Xavi1618
Hm... why are you using tshark at all? All you do is write the IP addresses, the ports and the time stamps into a capture file, but you already know the IP addresses and the ports, as you have them in variables ($local_port, $remote_port, etc.). All you actually extract from the capture file are the time stamps and the frame length.

Maybe I'm missing something, but your whole process to extract data seems to have another "loop" somewhere to get the IP addresses and ports, and maybe that additional step creates the extra processing time. Can you please describe in more detail what you are doing right now and what you want to achieve?

++ UPDATE ++

O.K., in your case you won't need the whole dissection capabilities of Wireshark/tshark. A simple Perl script that reads the pcap file and extracts the same information would be way faster. Sample script:
Test

I tested the script and tshark against the same file (90 MByte, several HTTP/HTTPS downloads):

tshark: ~20 seconds

BTW: If you need the output for only one direction (as implied by your tshark filter), you can either post-process the output file or change the Perl script to use filters.

BTW: Net::Pcap only supports libpcap files. So, if your capture file is pcap-ng, you'll have to convert it first with editcap, which takes only a few seconds on a fast system.
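To illustrate the single-pass idea suggested in the update above, here is a minimal Python sketch. It is not the Perl script from this answer, and it makes several assumptions: a classic libpcap file with microsecond time stamps, Ethernet framing, IPv4 only, and no VLAN tags. The function names (`iter_packets`, `flow_key`, `split_by_flow`), the flow tuple layout and the `flow_<n>.csv` output naming are hypothetical.

```python
import contextlib
import csv
import os
import socket
import struct

def iter_packets(path):
    """Yield (epoch, frame_len, proto, src_ip, src_port, dst_ip, dst_port)
    for each IPv4 TCP/UDP packet in a classic libpcap file (Ethernet only)."""
    with open(path, "rb") as f:
        header = f.read(24)
        if header[:4] == b"\xd4\xc3\xb2\xa1":
            endian = "<"  # file written on a little-endian host
        elif header[:4] == b"\xa1\xb2\xc3\xd4":
            endian = ">"  # file written on a big-endian host
        else:
            raise ValueError("not a classic microsecond libpcap file")
        if struct.unpack(endian + "I", header[20:24])[0] != 1:
            raise ValueError("only Ethernet (link type 1) is handled")
        while True:
            rec = f.read(16)
            if len(rec) < 16:
                break
            ts_sec, ts_usec, incl_len, orig_len = struct.unpack(endian + "IIII", rec)
            data = f.read(incl_len)
            if len(data) < incl_len:
                break  # truncated file
            if len(data) < 34 or data[12:14] != b"\x08\x00":
                continue  # not plain IPv4 over Ethernet
            proto = data[23]
            if proto not in (6, 17):
                continue  # neither TCP nor UDP
            if struct.unpack("!H", data[20:22])[0] & 0x1FFF:
                continue  # non-first IP fragment carries no port numbers
            l4 = 14 + (data[14] & 0x0F) * 4  # Ethernet header + IP header length
            if len(data) < l4 + 4:
                continue
            src_port, dst_port = struct.unpack("!HH", data[l4:l4 + 4])
            yield (ts_sec + ts_usec / 1e6, orig_len, "tcp" if proto == 6 else "udp",
                   socket.inet_ntoa(data[26:30]), src_port,
                   socket.inet_ntoa(data[30:34]), dst_port)

def flow_key(proto, ip_a, port_a, ip_b, port_b):
    """Direction-independent key: both directions map to the same tuple."""
    return (proto,) + tuple(sorted([(ip_a, port_a), (ip_b, port_b)]))

def split_by_flow(pcap_path, flows, out_dir):
    """flows: list of (start, end, proto, src_ip, src_port, dst_ip, dst_port).
    Reads the capture once, writes out_dir/flow_<n>.csv for each flow that has
    packets, and returns a dict of packet counts keyed by flow index."""
    index = {}
    for n, (start, end, proto, sip, sport, dip, dport) in enumerate(flows):
        key = flow_key(proto, sip, sport, dip, dport)
        index.setdefault(key, []).append((start, end, n, (sip, sport)))
    counts = {}
    # Output files stay open until the pass is done; with very many flows the
    # number of simultaneously open files would have to be bounded.
    with contextlib.ExitStack() as stack:
        writers = {}
        for ts, length, proto, sip, sport, dip, dport in iter_packets(pcap_path):
            key = flow_key(proto, sip, sport, dip, dport)
            for start, end, n, flow_src in index.get(key, ()):
                if not start <= ts <= end:
                    continue
                if n not in writers:
                    path = os.path.join(out_dir, f"flow_{n}.csv")
                    fh = stack.enter_context(open(path, "w", newline=""))
                    writers[n] = csv.writer(fh)
                    writers[n].writerow(["time_epoch", "frame_len", "direction"])
                direction = "fwd" if (sip, sport) == flow_src else "rev"
                writers[n].writerow([f"{ts:.6f}", length, direction])
                counts[n] = counts.get(n, 0) + 1
    return counts
```

A call such as `split_by_flow("capture.pcap", flows, "out")` (hypothetical names) reads the capture once and uses a dictionary lookup per packet, instead of re-filtering the capture for every flow. Whether this is fast enough for a 50 GB file in pure Python has not been measured here.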
Regards

answered 31 Oct '14, 01:56 Kurt Knochner ♦
edited 02 Nov '14, 10:14

I have the information about each flow in a separate file. I want to know the parameters of every packet of a flow, such as its direction, bytes and frame time. (31 Oct '14, 14:05) Xavi1618

See the UPDATE in my answer. (02 Nov '14, 08:57) Kurt Knochner ♦
How can I disable the other protocols?
See:
https://ask.wireshark.org/questions/9544/how-to-disable-dissectors-in-tshark
Essentially:
Using Wireshark:
Then: pass the profile created above to tshark with the -C option.
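As a rough illustration of the last step, the following Python sketch builds a tshark command line that selects a configuration profile with -C. The profile name "minimal", the file name and the display filter are hypothetical. The output fields (frame.time_epoch and frame.len) are just the time stamp and frame length mentioned elsewhere in this thread.

```python
import shlex
from typing import List

def build_tshark_cmd(pcap_path: str, profile: str, display_filter: str) -> List[str]:
    """Build a tshark command that reads a capture using a given profile."""
    return [
        "tshark",
        "-C", profile,            # configuration profile (e.g. with dissectors disabled)
        "-r", pcap_path,          # read packets from this capture file
        "-Y", display_filter,     # display filter selecting the flow
        "-T", "fields",           # print selected fields only
        "-E", "separator=,",      # comma-separated output
        "-e", "frame.time_epoch",
        "-e", "frame.len",
    ]

# Hypothetical profile name, file name and filter.
cmd = build_tshark_cmd("chunk_000.pcap", "minimal",
                       "ip.addr==10.0.0.1 && tcp.port==80")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True, stdout=csv_file)
```

Passing the arguments as a list avoids shell quoting problems with the `&&` in the display filter.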
Thanks, it helps, but not that much. It improves the processing time, but not significantly: a 5-10% improvement.