This is a static archive of our old Q&A Site. Please post any new questions and answers at ask.wireshark.org.

Editcap [and tshark] performance


I have a 9.59 GB pcap on which I am running editcap -D with window sizes from 1 to 1000000, spaced exponentially. There are 13751611 packets in the file. I have a dedicated Windows 7 Enterprise VM on an ESXi server with 26 GB of RAM and two dual-core 3 GHz CPUs. The run with a window of 1000000 took 234991 seconds. The command used only 1.5 GB of RAM and the CPUs didn't seem very taxed. Is there a way to get this to finish faster?

I realize the best practice is to cut the file into smaller chunks for faster analysis, but if I do that it breaks the coherence of the -D window.
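The sweep looks roughly like this (a sketch; file names are placeholders and the window list is generated with four steps per decade, so it is close to, but not exactly, the values I actually used):

    # Sketch of the exponential editcap -D sweep; file names are placeholders.
    # Window sizes are spaced evenly on a log scale, four per decade, 1 .. 1000000.
    import subprocess

    windows = sorted({round(10 ** (k / 4)) for k in range(25)})   # 1, 2, 3, 6, 10, 18, ..., 562341, 1000000

    for w in windows:
        out = f"dedup_w{w:07d}.pcapng"
        subprocess.run(["editcap", "-D", str(w), "big_capture.pcapng", out], check=True)
        # comparing the packet counts of input vs. output (e.g. via capinfos)
        # gives the number of duplicates removed for this window size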

asked 08 Oct '13, 14:50

karl

edited 08 Oct '13, 14:51


2 Answers:


Is there a reason why you use a window of 1 million frames? If you're deduplicating frames that are the result of multiple SPAN sources, you should be getting good to perfect results with windows a fraction of that size - I'm using 100 myself quite often, and it almost never fails. The duplicates that editcap removes appear within a couple of milliseconds of each other at most, often within single-digit microseconds - it makes no sense to waste performance on a huge window of 1 million frames.

I don't think RAM size and CPU are the problem; it's more likely the disk I/O and the search through the huge list of MD5 hashes that take the longest. If I were you I'd do a test run with a window of 100 frames to see how fast it performs, maybe on a smaller trace at first.
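A quick test along those lines could look like this (a sketch using subprocess; file names and the packet range are placeholders):

    # Sketch: cut a smaller slice out of the big trace, then dedup it with a small window.
    # File names and the packet range are placeholders.
    import subprocess

    # -r keeps (rather than deletes) the selected packet range
    subprocess.run(["editcap", "-r", "big_capture.pcapng", "slice.pcapng", "1-500000"], check=True)

    # dedup the slice with a window of 100 to get a feel for the runtime
    subprocess.run(["editcap", "-D", "100", "slice.pcapng", "slice_dedup.pcapng"], check=True)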

Update: wait, what do you mean by "running exponentially"? Are you saying that you start with -D 1, then -D 2, -D 3, until -D 1000000? If so: seriously? why would you do that?

answered 08 Oct '13, 15:28

Jasper ♦♦

edited 08 Oct '13, 15:30

I am comparing the removal behavior of editcap -D.

By "running exponentially" I mean -D 1, -D 2, -D 3, -D 5, -D 6, -D 10, -D 17, -D 31, -D 56, -D 100, -D 177, -D 316, -D 562, -D 1000, ..., -D 562341, -D 1000000.

With this, when I chart the results on a log scale, I get individual points early on, and the spacing then ramps up exponentially so that each decade has a midpoint and two points next to it (one to the left and one to the right) in addition to the major orders of magnitude.

I have run the test on 3 other files; the removed packet counts are shown below.

  • 93578 pkts, 55 MB: default 153 (0.12%), max 3412 (3.65%)
  • 578197 pkts, 340 MB: default 2736 (0.47%), max 21036 (3.64%)
  • 1393760 pkts, 948 MB: default 6126 (0.44%), max 14329 (1.03%)
  • This file, 13751611 pkts, 9.59 GB: default 61701 (0.45%), max 129128 (0.94%)

The resultant curves are not purely linear, but have steps/ramps.

Why would there be any difference between two gigantic window sizes of the same order of magnitude?

(08 Oct '13, 16:17) karl

Why would there be any difference between two gigantic window sizes of the same order of magnitude?

Different frame sizes will take different times for the MD5 hash calculation.

(08 Oct '13, 16:44) Kurt Knochner ♦

Sorry, I mean differences in packets removed.

(08 Oct '13, 16:47) karl

What do you mean? I assume the capture files are different !?

(08 Oct '13, 17:13) Kurt Knochner ♦

I ran -D 100000 and had 128,887 pkts removed, and then -D 177827 and had 128,987 pkts removed. By extending the window size by 77,827 frames, another 100 pkts were removed. I wouldn't expect any packets to be removed at that point, but they are. Is there a way to get a feel for how long 177827 packets are in seconds in relation to my capture?

(08 Oct '13, 17:41) karl

Is there a way to get a feel for how long 177827 packets are in seconds in relation to my capture?

Just look at the time stamps. Select one frame, set a time reference (CTRL-T), then scroll forward 177827 frames and check the delta in the time column.
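If you'd rather not scroll that far in the GUI, something like this should give the same number (a sketch; the file name and the starting frame are placeholders):

    # Sketch: print the time span covered by 177827 consecutive frames using tshark.
    # File name and starting frame number are placeholders.
    import subprocess

    start = 1
    end = start + 177827
    out = subprocess.run(
        ["tshark", "-r", "big_capture.pcapng",
         "-Y", f"frame.number == {start} || frame.number == {end}",
         "-T", "fields", "-e", "frame.time_epoch"],
        capture_output=True, text=True, check=True)

    t_first, t_last = (float(x) for x in out.stdout.split())
    print(f"{end - start} frames span {t_last - t_first:.1f} seconds")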

(08 Oct '13, 17:52) Kurt Knochner ♦

editcap -D may also remove frames that are very distant from each other but are not duplicates (false positives). A common example is BPDU frames - if the Spanning Tree is stable (and it should be) ALL BPDU frames are bitwise identical, but 3 seconds apart. It is quite possible that with huge ranges like yours you'll detect them as duplicates at some point (3 seconds being an eternity in networks, but with a range that size you'll eventually catch things like that).

(08 Oct '13, 23:24) Jasper ♦♦

ALL BPDU frames are bitwise identical, but 3 seconds apart. It is quite possible that with huge ranges like yours you'll detect them as duplicates at some point

good one! +1

(09 Oct '13, 01:46) Kurt Knochner ♦

Is there a way to put the duplicate packets in their own file to inspect them?

(09 Oct '13, 10:23) karl
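One way to collect the would-be duplicates in their own file (editcap itself does not appear to offer this directly) is a small script that keeps the same kind of sliding hash window and writes the matching frames to a separate capture - a rough sketch with Scapy, with file names and window size as placeholders:

    # Sketch: collect the frames a sliding-window dedup would remove into their own file.
    # Window size and file names are placeholders; hashes are computed over the raw frame bytes.
    import hashlib
    from collections import deque
    from scapy.utils import PcapReader, PcapWriter   # recent Scapy versions auto-detect pcapng

    WINDOW = 100000

    recent = deque(maxlen=WINDOW)                    # sliding window of recent MD5 digests
    dups = PcapWriter("duplicates.pcap", sync=True)

    with PcapReader("big_capture.pcapng") as reader:
        for pkt in reader:
            digest = hashlib.md5(bytes(pkt)).digest()
            if digest in recent:                     # a matching frame was seen within the window
                dups.write(pkt)
            recent.append(digest)

    dups.close()

Note that the linear search through the window is the same kind of work editcap has to do, so don't expect this to be fast either.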


From the editcap man page

NOTE: Specifying large <dup time window> values with large tracefiles can result in very long processing times for editcap.

I guess you hit that constraint ;-)

I am running editcap -D with window sizes from 1 to 1000000, spaced exponentially. There are 13751611 packets in the file

This will end up in 13751611 MD5 calculations and roughly 13751611 * 1000000 MD5 comparisons (minus a few at the beginning, because there are fewer MD5 sums than the window size). The latter (the MD5 sum comparisons) will kill your execution time.
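Back of the envelope (worst case, ignoring the ramp-up at the start of the file):

    # Worst-case work for a full 1000000-frame window
    packets = 13751611
    window = 1000000
    print(packets)            # MD5 calculations: ~1.4e7
    print(packets * window)   # MD5 comparisons: ~1.4e13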

Is there a way to get this to finish faster?

Try to narrow down the window and then run editcap with the 'optimized' window size.

Here is how I would do it (a rough sketch follows the list).

  • Read the pcap file with a script (like Perl Net::Pcap or Python Scapy)
  • Create an MD5 sum of each packet (full bytes!)
  • Print: 'MD5 hash'; 'Frame number'
  • At the end of the pcap file: Sort the output (MD5 hashes)
  • Read the sorted output and look for two consecutive identical (duplicate) MD5 hashes. If you find a pair: subtract their frame numbers and store the difference as the max window. Repeat this until the end; whenever you find a larger value, replace the stored max window with it.
  • At the end: Print the max window value
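A minimal sketch of such a script, assuming Scapy for reading the capture (the file name is a placeholder, and it trades memory for simplicity by keeping all hashes in RAM):

    # Sketch of the pre-processing step: hash every frame, sort by hash,
    # then find the largest frame-number distance between identical hashes.
    # That distance (plus some margin) is the smallest useful -D window.
    import hashlib
    from scapy.utils import PcapReader

    entries = []                                   # (md5 digest, frame number)
    with PcapReader("big_capture.pcapng") as reader:
        for num, pkt in enumerate(reader, start=1):
            entries.append((hashlib.md5(bytes(pkt)).digest(), num))

    entries.sort()                                 # identical hashes become neighbours

    max_window = 0
    for (h1, n1), (h2, n2) in zip(entries, entries[1:]):
        if h1 == h2:                               # duplicate frame content
            max_window = max(max_window, n2 - n1)

    print("max window needed:", max_window)        # 0 means there are no duplicates at all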

Take the max window value of your script, add 15% (just to be safe ;-)) and use that value as an input for editcap -D. BTW: If your max window is 0 (you did not find any duplicate MD5 hashes), you can skip editcap, as there are no duplicates.

I guess that this will be faster, because there are far fewer comparisons to make; however, I have no proof (yet) :-) Then again, maybe the sort operation will be heavy as well (or heavy as hell). I'm running some tests right now ;-)

++ UPDATE ++

O.K., you can ignore the sort operation. I created 100,000 fake frames of 1500 bytes each, calculated the MD5 sums and then sorted them.

Result for 100,000 frames:

  Create (/dev/urandom) and calculate MD5 sums: 6m37.485s
  Sort the MD5 sums: 0m0.186s

So, sorting is way faster than creating the MD5 sums. Who would have thought? ;-)

My test was done in a VM on a laptop and it took ~400 seconds for 100,000 frames, so it will take ~137 times longer for your capture file, which is ~55,000 seconds. Although this is much faster than your run time, it is still about 15 hours (mostly MD5 calculation)!! However, your server is probably faster than my laptop.
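The extrapolation, spelled out:

    # Scale the 100,000-frame laptop test up to the full 13751611-frame capture
    test_time = 397.5                  # 6m37.485s, in seconds
    factor = 13751611 / 100000         # ~137.5
    print(test_time * factor / 3600)   # ~15.2 hours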

Pros and Cons:

Pro: If you're lucky, there are no duplicate MD5 sums, so you're done after the first step.
Con: If things go wrong, you'll find the max window to be 1000000 (the maximum editcap supports), and then you'll have to run editcap on top of it, which means you just lost the ~15 hours of pre-processing. However, how likely is it to have duplicates within a range of 1000000 packets?!

I guess that in real-world scenarios you will find the max window to be somewhere between 100 and 1000, as @Jasper also mentioned. So, if you want the 'exact' result, you can use my pre-processing method to speed things up. Otherwise, just use a window of 1000+ and rely on the 'rule of thumb' ;-)

Regards
Kurt

answered 08 Oct '13, 16:29

Kurt Knochner ♦

If all duplicate packet numbers can be deduced this way, could those numbers then be passed to editcap as parameters so the resulting files can be created more quickly?

(08 Oct '13, 17:33) karl

well ... no, because you obviously have way too many packets to remove (you mention 128,987 in one comment). You simply cannot pass that many options (frame numbers) to editcap.

Anyway, currently I don't get what you are trying to do.

  • Why do you see that many duplicate frames with such a large gap between them (177827 frames)?

  • What kind of network are we talking about (1G/s, 10G/s, 40G/s)?

  • How do you capture?

  • Why do you need to eliminate duplicate frames?

  • Why did you run editcap "exponentially"? Is this just for fun, or a real world problem?

(08 Oct '13, 17:56) Kurt Knochner ♦

I don't know why there are still duplicates with window size of 177827.

The network is a lot slower. See the capinfos output for the original file:

  File type: Wireshark - pcapng
  File encapsulation: Ethernet
  File size: 10305318996 bytes
  Data size: 9848966269 bytes
  Capture duration:
  Data byte rate: 30071.70 bytes/sec
  Data bit rate: 240573.61 bits/sec
  Average packet size: 716.20 bytes
  Average packet rate: 41.99 packets/sec

This implies the window (177827) is on average 1 hour 10 minutes and 35 seconds wide.
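That figure comes straight from the capinfos averages above:

    # 177827 frames at the average packet rate reported by capinfos
    print(177827 / 41.99)   # ~4235 seconds, i.e. about 1 h 10 min 35 s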

The capture was taken on a mirror port of a switch. A router took all the user traffic and sent it over our [slow] WAN link.

I am eliminating duplicates because a "pro" said to.

I am running it exponentially to test the rule-of-thumb feeling that @Jasper, you, I, and the man page have all expressed. It's real-world only for the sake of proving the capabilities of editcap -D. It's fun because it's a simple test... just a super long one.

My problem is that editcap doesn't seem to allocate memory correctly when memory is available.

(09 Oct '13, 10:20) karl

I don't know why there are still duplicates with window size of 177827.

because there are a lot of possible 'candidates' if you look back that far.

  • ARP requests for the same address are identical
  • Cisco CDP frames are identical
  • Spanning Tree BPDUs are identical
  • many other broadcast frames are identical

So, if you just go back far enough, you will always find some duplicates! But they are duplicates by nature and not due to an error on the network.
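If you want to see how many of those candidates are in your capture, a quick count along these lines would do (a sketch; the file name is a placeholder and the filter only covers the examples above):

    # Sketch: count ARP, CDP and STP frames in the capture with tshark
    import subprocess

    out = subprocess.run(
        ["tshark", "-r", "big_capture.pcapng", "-Y", "arp || cdp || stp"],
        capture_output=True, text=True, check=True)
    print(len(out.stdout.splitlines()), "candidate frames")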

This implies the window (177827) is on average 1 hour 10 minutes and 35 seconds wide.

It really, really does not make sense to search for duplicates in a time window that large. Where would the real duplicate frames come from? Did they circulate in some dark corner of the network, just to pop out an hour later? Nah...

I am eliminating duplicates because a "pro" said to.

Greetings to your "pro".

If I were a "pro", I would eliminate duplicate frames only if they caused a problem during analysis, and not as a precaution. It costs time, it might cause confusion (sounds familiar? ;-)), etc.

My problem is that editcap doesn't seem to allocate memory correctly when memory is available.

How did you reach that conclusion? editcap 'allocates' the memory for the maximum number of MD5 hashes as a static data structure and then only a few more things dynamically. Why does it not allocate more memory? Because it does not need it. See the code.

My overall recommendation: simply stop doing what you are doing, as it does not give you any real benefit unless you have a real problem with real duplicate frames during the analysis phase. On the contrary, it leads to massive confusion, as we have seen in this discussion about phantom duplicate frames ;-)) Nevertheless, I thank you for asking this question, as it gave me a reason to check the editcap code, and now I understand how that stuff works ;-))

(09 Oct '13, 13:26) Kurt Knochner ♦