I have a 9.59 GB pcap on which I am running editcap -D with window sizes from 1 to 1000000, spaced exponentially. There are 13751611 packets in the file. I have a dedicated Windows 7 Enterprise VM on an ESXi server with 26 GB RAM and two dual-core 3 GHz CPUs. It took 234991 seconds to process the command with a window of 1000000. The command used only 1.5 GB of RAM and the CPUs didn't seem very taxed. Is there a way to get this to finish faster? I realize that best practice is to cut the file into smaller chunks for faster analysis, but doing that breaks the coherence of the -D window.

asked 08 Oct '13, 14:50 karl, edited 08 Oct '13, 14:51
2 Answers:
Is there a reason why you use a window of 1 million frames? If you're deduplicating frames that come from multiple SPAN sources, you should get good to perfect results with windows a fraction of that size. I often use 100 myself, and it almost never fails. The duplicates that editcap removes appear within a couple of milliseconds at most, often within single-digit microseconds, so it makes no sense to waste performance on a huge window of 1 million frames. I don't think RAM size and CPU are the problem; it's more likely the disk I/O and the searching through the huge list of MD5 hashes that take the longest time. If I were you I'd do a test run with a window of 100 frames to see how fast it performs, maybe on a smaller trace at first.

Update: wait, what do you mean by "running exponentially"? Are you saying that you start with -D 1, then -D 2, -D 3, and so on up to -D 1000000? If so: seriously, why would you do that?

answered 08 Oct '13, 15:28 Jasper ♦♦, edited 08 Oct '13, 15:30
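To gauge the speed difference before committing to the full 9.59 GB file, one option along the lines of Jasper's suggestion (my sketch, not from the answer; the file names, slice size, and window values are placeholders, and it assumes editcap is on the PATH) is to carve out a slice of the capture with editcap -r and time a small versus a large dedup window on it:

```python
# Sketch: time editcap -D with a small vs. a large window on a slice of the capture.
# File names, the slice size, and the window values are placeholders.
import subprocess, time

INFILE = "big.pcapng"      # the original capture (placeholder name)
SLICE = "slice.pcapng"     # test bed: first 500000 packets only

# editcap -r keeps the listed packet range instead of deleting it
subprocess.run(["editcap", "-r", INFILE, SLICE, "1-500000"], check=True)

for window in (100, 100000):
    out = f"dedup_w{window}.pcapng"
    start = time.time()
    subprocess.run(["editcap", "-D", str(window), SLICE, out], check=True)
    print(f"-D {window}: {time.time() - start:.1f} s")
```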
From the editcap man page:

"NOTE: Specifying large <dup window> values with large tracefiles can result in very long processing times for editcap."
I guess you hit that constraint ;-)
This will end up in 13751611 MD5 calculations and roughly 13751611 * 1000000 MD5 comparisons (minus a few at the beginning, while there are still fewer MD5 sums than the window size). The latter, the MD5 sum comparisons, will kill your execution time.
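For illustration only (a sketch of the general technique, not editcap's actual code): window-based deduplication boils down to keeping the hashes of the last W frames and comparing every new frame against all of them, which is where the roughly N * W comparisons come from.

```python
# Sketch of window-based deduplication: keep the MD5 digests of the last
# `window` frames and compare every new frame against all of them.
# (Illustration of the technique, not editcap's actual implementation.)
import hashlib
from collections import deque

def dedup(frames, window):
    recent = deque(maxlen=window)          # bounded history of (length, digest)
    for frame in frames:
        key = (len(frame), hashlib.md5(frame).digest())
        if key in recent:                  # up to `window` comparisons per frame
            continue                       # duplicate within the window: drop it
        recent.append(key)
        yield frame                        # unique within the window: keep it

# Toy demo: the third frame is a byte-for-byte copy of the first.
frames = [b"\x01" * 60, b"\x02" * 60, b"\x01" * 60]
print(len(list(dedup(frames, window=100))))    # -> 2
```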
Try to narrow down the window and then run editcap with the 'optimized' window size. Here is how I would do it.
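A rough sketch of that pre-processing idea (my sketch, not Kurt's actual script): hash every frame once, remember where each hash was last seen, and track the largest distance between two identical hashes; that distance, plus some headroom, is the smallest -D window that would still catch every duplicate. Here `frames` is a placeholder for an iterable of raw frame bytes, e.g. produced with scapy or tshark; Kurt's own approach apparently sorted the MD5 sums, whereas this sketch uses a dictionary and avoids the sort.

```python
# Sketch: find the smallest -D window that would still catch every duplicate
# by measuring the largest gap (in frame numbers) between identical MD5 sums.
import hashlib

def max_duplicate_distance(frames):
    last_seen = {}        # MD5 digest -> frame number of its last occurrence
    max_gap = 0
    for number, frame in enumerate(frames, start=1):
        digest = hashlib.md5(frame).digest()
        if digest in last_seen:
            max_gap = max(max_gap, number - last_seen[digest])
        last_seen[digest] = number
    return max_gap        # 0 means there are no duplicates at all

# Example: duplicates occur 3 frames apart, so a window of 4-5 is already enough.
frames = [b"A" * 60, b"B" * 60, b"C" * 60, b"A" * 60]
print(max_duplicate_distance(frames))    # -> 3
```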
Take the max window value of your script, add 15% (just to be safe ;-)) and use that value as the input for editcap -D. BTW: if your max window is 0 (you did not find any duplicate MD5 hashes), you can skip editcap altogether, as there are no duplicates. I guess that this will be faster, because there are far fewer comparisons to make; however, I have no proof (yet) :-) However: maybe the sort operation will be heavy as well (or heavy as hell). I'm running some tests right now ;-)

++ UPDATE ++ O.K., you can ignore the sort operation. I created 100,000 fake frames of 1500 bytes length, calculated the MD5 sums and then sorted them. Result: creating 100,000 frames (/dev/urandom) and calculating their MD5 sums took 6m37.485s. So, sorting is way faster than creating the MD5 sums. Who would have thought? ;-) My test was done in a VM on a laptop and it took ~400 seconds for 100,000 frames. So it will take ~137 times longer for your capture file, which is ~55,000 seconds. Although this is much faster than your time, it is still 15 hours (mostly MD5 calculation)!! However, your server is probably faster than my laptop.

Pros and cons: Pro: if you're lucky, there are no duplicate MD5 sums, and you're done after the first step. I guess that in real-world scenarios you will find the max window between 100 and 1000, as @Jasper also mentioned. So, if you want the 'exact' result, you can use my pre-processing method to speed things up. Otherwise, just use a window of 1000+ and rely on the 'rule of thumb' ;-)

Regards

answered 08 Oct '13, 16:29 Kurt Knochner ♦

If all duplicate packet numbers can be deduced this way, couldn't those numbers be passed to editcap as parameters so the resultant files could be created more quickly? (08 Oct '13, 17:33) karl

Well ... no, because you obviously have way too many packets to remove (you mention 128987 in one comment). You simply cannot give that many options (frame numbers) to editcap. Anyway, I currently don't get what you are trying to do.
(08 Oct '13, 17:56) Kurt Knochner ♦

I don't know why there are still duplicates with a window size of 177827. The network is a lot slower. See the capinfos output for the original file:

File type: Wireshark - pcapng
File encapsulation: Ethernet
File size: 10305318996 bytes
Data size: 9848966269 bytes
Capture duration:
Data byte rate: 30071.70 bytes/sec
Data bit rate: 240573.61 bits/sec
Average packet size: 716.20 bytes
Average packet rate: 41.99 packets/sec

This implies the window (177827) is on average 1 hour 10 minutes and 35 seconds wide. The capture was on a mirror port of a switch; a router took all users and then sent the traffic over our [slow] WAN link. I am eliminating duplicates because a "pro" said to. I am running the exponential series to test the rule-of-thumb feeling expressed by @Jasper, you, me, and the man page. It's real-world only for the sake of proving the capabilities of editcap -D. It's fun because it's a simple test... just super long. My problem is that editcap doesn't seem to allocate memory correctly when memory is available. (09 Oct '13, 10:20) karl
Because there are a lot of possible 'candidates' if you look back that far.

So, if you just go back far enough, you will always find some duplicates! But they are duplicates by nature and not due to an error on the network.

It really, really does not make sense to search for duplicates in that time window. Where should the real duplicate frames come from? Did they circulate in a dark corner of the network, just to pop out an hour later? Nah...

Greetings to your "pro". If I were a "pro", I would eliminate duplicate frames only if they caused a problem during analysis, not as a precaution. It costs time, it might cause confusion (sounds familiar ;-)), etc.
How did you come to that conclusion? editcap 'allocates' the memory for the maximum number of MD5 hashes as a static data structure and then only a few more things dynamically. Why does it not allocate more memory? Because it does not need it. See the code.

My overall recommendation: simply stop doing what you are doing, as it does not give you any real benefit unless you have a real problem with real duplicate frames during the analysis phase. On the contrary, it leads to massive confusion, as we have seen in this discussion about phantom duplicate frames ;-)) Nevertheless, I thank you for asking this question, as it gave me a reason to check the editcap code, and now I understand how that stuff works ;-)) (09 Oct '13, 13:26) Kurt Knochner ♦
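As a rough back-of-the-envelope check on the memory question (my own numbers, not from the thread, and the per-slot overhead is an assumption): an MD5 digest is only 16 bytes, so even a window of a million hashes accounts for a few tens of megabytes, which fits Kurt's point that editcap simply does not need more RAM.

```python
# Rough estimate of the dedup window's memory footprint (illustrative numbers).
window = 1000000            # -D 1000000
digest_bytes = 16           # size of an MD5 digest
per_slot_overhead = 16      # assumed extra bytes per slot (frame length, timestamp, ...)
print((window * (digest_bytes + per_slot_overhead)) / 2**20, "MiB")   # ~30 MiB
```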
I am comparing the removal behavior of editcap -D.
By "running exponentially" I mean -D 1, -D 2, -D 3, -D 5, -D 6, -D 10, -D 17, -D 31, -D 56, -D 100, -D 177, -D 316, -D 562, -D 1000, ..., -D 562341, -D 1000000.
That way, when I chart the results on a log scale, I have discrete points early on, and the window sizes then ramp up exponentially so that each decade has a midpoint and two points near it (one to the left and one to the right) in addition to the major magnitudes.
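For reference (my sketch, not part of the original comment; the file names are placeholders and it assumes editcap and capinfos are on the PATH), a quarter-decade series like that, together with the per-window packet counts, could be scripted roughly like this:

```python
# Sketch: run editcap -D over a quarter-decade series of window sizes and
# report the number of surviving packets for each run via capinfos -c.
import subprocess

INFILE = "big.pcapng"    # placeholder name for the 9.59 GB capture
windows = [int(10 ** (k / 4)) for k in range(25)]   # 1, 2, 3, 5, 10, 17, ..., 562341, 1000000

for w in windows:
    out = f"dedup_w{w}.pcapng"
    subprocess.run(["editcap", "-D", str(w), INFILE, out], check=True)
    info = subprocess.run(["capinfos", "-c", out], capture_output=True, text=True)
    print(w, info.stdout.strip())
```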
I have run the test on 3 other files; the removed packet counts are shown below.

93578 pkts, 55 MB: default 153 (0.12%), max 3412 (3.65%)
578197 pkts, 340 MB: default 2736 (0.47%), max 21036 (3.64%)
1393760 pkts, 948 MB: default 6126 (0.44%), max 14329 (1.03%)
This file, 13751611 pkts, 9.59 GB: default 61701 (0.45%), max 129128 (0.94%)
The resultant curves are not purely linear, but have steps/ramps.
Why would there be any difference in results between two gigantic window sizes of the same order of magnitude?
Different frame sizes will take different times for the MD5 hash calculation.
Sorry, I mean differences in packets removed.
What do you mean? I assume the capture files are different!?
I ran -D 100000 and had 128,887 pkts removed, and then -D 177827 and had 128,987 pkts removed. By extending the window size by 77,827, another 100 pkts were removed. I wouldn't expect any packets to be removed at that point, but they are. Is there a way to get a feel for how long 177827 packets are in seconds, in relation to my capture?
Just look at the time stamps. Select one frame, set a time reference (CTRL-T), then scroll forward 177827 frames and check the delta in the time column.
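Alternatively (my addition, using the average packet rate from the capinfos output quoted above), divide the window by the average packet rate; this matches the roughly 1 hour 10 minutes karl derived.

```python
# Window width expressed as time, from the capinfos average packet rate above.
window_frames = 177827
avg_packet_rate = 41.99      # packets/sec, from capinfos
seconds = window_frames / avg_packet_rate
print(round(seconds), "s, ~", round(seconds / 60), "min")    # ~4235 s, ~71 min
```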
editcap -D may also remove frames that are far apart from each other and are not true duplicates (false positives). A common example is BPDU frames: if the Spanning Tree is stable (and it should be), ALL BPDU frames are bitwise identical, but 3 seconds apart. With huge windows like yours it is quite possible you'll detect them as duplicates at some point (3 seconds is an eternity in networks, but with your range you'll eventually catch things like that).
good one! +1
Is there a way to put the duplicate packets in their own file to inspect them?