About a week ago I started the thread linked below. I have since changed our code to ensure a random sequence number every time. The problem is that the connection fails whenever the sequence number used is greater than 2147483647. As long as it is below that, connections work repeatedly using the same port. The people who wrote the third-party TCP/IP stack for our embedded device are asking for an RFC number confirming that this is part of the TCP specification. I can't find one; the best I could find is this link: http://www-01.ibm.com/support/docview.wss?uid=isg1IZ05180 Is it part of the standard, and if so, which RFC is it? Thanks in advance! http://ask.wireshark.org/questions/32932/can-identical-initial-sequence-numbers-cause-issues asked 02 Jun '14, 08:31 wes000000 |
3 Answers:
WHAT? Seriously? How were they able to write a TCP stack without knowing the RFCs???? Well, here we go: RFC 793 defines TCP and section 3.3 defines the range of sequence numbers. Sequence numbers are in the range of 0 to 2^32 - 1, which is roughly 4 billion (unsigned 32 bit integer). So, I don't see any good reason why the sequence number should be only 2^16 (~ 2 billion), unless the embedded device does not offer 'native' 32 bit integers and those developers did not care to implement the RFC in the right way, so they limited the SEQ number to what the hardware is offering, in your case apparently 2^16 - 1 (~ 2 billion). Regards answered 02 Jun '14, 08:41 Kurt Knochner ♦ edited 02 Jun '14, 08:43 |
First of all, 2 billion is 2^31, not 2^16 (which is 65,536). I can confirm this bug in the 32-bit Debian kernel 2.6.X, and it probably affects many other Linux distros. answered 04 Feb '15, 06:48 siemin81 |
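To put numbers on the ranges mentioned in these answers, here is a small Python check. The constant names are illustrative only and do not come from any TCP stack.

```python
# Sizes of the integer ranges mentioned in this thread.
SEQ_SPACE = 2**32            # TCP sequence numbers are 32-bit: 0 .. 2**32 - 1
MAX_SEQ = SEQ_SPACE - 1      # 4294967295, roughly 4 billion
HALF_SPACE = 2**31           # 2147483648, roughly 2 billion
MAX_INT32 = HALF_SPACE - 1   # 2147483647, largest signed 32-bit value
MAX_UINT16 = 2**16 - 1       # 65535, largest unsigned 16-bit value

print(MAX_SEQ, MAX_INT32, MAX_UINT16)  # 4294967295 2147483647 65535
```

The threshold from the question, 2147483647, is exactly the largest signed 32-bit value.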
I guess what you need is RFC 6528, concerning random initial sequence number generation. As far as I know the ISN can range all over the 32 bit range, which goes up to 4,294,967,295. answered 02 Jun '14, 08:42 Jasper ♦♦ |
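As a rough illustration of the scheme RFC 6528 describes, ISN = M + F(localip, localport, remoteip, remoteport, secretkey) with M a timer that ticks every 4 microseconds, here is a minimal Python sketch. The function name, the example addresses, and the key are hypothetical, and SHA-256 is swapped in for the hash purely for illustration; this is not the code of any real stack.

```python
import hashlib

def isn_rfc6528_sketch(local_ip, local_port, remote_ip, remote_port,
                       secret_key, now_us):
    """Sketch of ISN = M + F(4-tuple, secret), everything modulo 2**32."""
    # M: a timer that ticks once every 4 microseconds.
    m = (now_us // 4) % 2**32
    # F: a keyed hash of the connection 4-tuple, truncated to 32 bits.
    material = f"{local_ip}|{local_port}|{remote_ip}|{remote_port}|".encode()
    digest = hashlib.sha256(material + secret_key).digest()
    f = int.from_bytes(digest[:4], "big")
    return (m + f) % 2**32

key = b"example-secret"  # hypothetical per-boot secret
a = isn_rfc6528_sketch("192.168.0.2", 1024, "192.168.0.1", 25, key, now_us=0)
b = isn_rfc6528_sketch("192.168.0.2", 1024, "192.168.0.1", 25, key,
                       now_us=4_000_000)
print((b - a) % 2**32)  # 1000000: same 4-tuple, four seconds later
```

For a fixed 4-tuple the ISN only moves forward while the timer and the key persist. In this sketch, a device that restarts its timer or regenerates its key at power-on loses that ordering.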
The device does support native 32 bit numbers and it does generate sequence numbers using the full available range.
The issue we were seeing was that if the initial sequence number was greater than the initial sequence number of the last successful connection by more than ~2 billion, the connection would be ignored by the SMTP server we want to connect to.
That is why I posted previously, and why I was curious when a user posted about 2 billion being the greatest difference. Then I found the IBM site, which seems to confirm that... Not sure, to be honest.
And I think they know the RFCs (but I can't be sure, as they are a third party); they were just trying to find one that confirms the 2 billion rule. Maybe they can't and won't be able to because there is no such rule. Again, I'm not an expert at networking stuff.
O.K. maybe I misunderstood your question. Now it sounds more like a problem with the SMTP server.
So, what is
Is there any security device and/or software between the client and the server, like a firewall, IDS, etc.? Your capture files in the other question show all systems in the same subnet, but the capture could have been anonymized.
You mean the same source port, right?
So, here is what I know and what I don't know.
I don't know the 2 billion delta "effect", hence my question about the OS and OS version.
What I know is this:
http://www.rfc-base.org/txt/rfc-1122.txt
Cite:
So, yes, the new ISN must be larger than any previous SEQ (obvious), but there is nothing about a magic 2 billion delta border you’ll have to cross.
I guess that’s an implementation detail of the TCP stack of the SMTP server’s OS (call it a bug or freedom of choice or whatever).
Maybe there are other RFCs, that change this defined behavior. If so, I was unable to find them ;-)
Regards
Kurt
For all the logs shown I had our embedded device plugged directly into my laptop, which was running hMailServer as a test environment. Both the laptop and the device had static IPs on the same subnet, so it was a direct connection. And I did have the firewall and antivirus off during testing. My laptop is running Windows 7 Premium 64-bit and our device is ARM based with the third party stack on it.
Prior to simplifying setup, we were originally trying to connect to ‘mail.lookoutportablesecurity.com’ SMTP server which is hosted by HostMonster. I doubt two servers both have bugs though.
When I said ‘same port’ I meant the same source and destination port. It was sending from port 1024 to destination port 25 every time. We actually send two emails per device cycle, so it would use port 1024 and then 1025, but after a reset it would go back to 1024, and that’s when the problems started.
Maybe we are having an issue with it thinking it’s an old duplicate… I will try and find some RFC describing what an old duplicate is. Then I can send that off to the TCP/IP stack people and hopefully they will know what to do with it!
well… unless proven otherwise I tend to not believe anything ;-))
O.K. Let’s go back to your other question and to the root of the problem. But you are right. It’s a first indicator that there could be something wrong ;-)
I think I don’t fully understand the capture files you posted. You posted a file called wirshark_before_reset.pcap and a file called wireshark_after_reset.pcap, however the absolute time stamp in the ‘before’ file (17:04:3x) is after the ‘after’ file (16:12:35) !?! So, what exactly are those files showing?
Would it be possible to show the whole effect in a single capture file with a brief explanation and a reference to frame numbers, where you identified the problem, like: working, not working, working again?
Otherwise it would take too much time to get into the details. You know it all, because you spent hours and days with the problem. For us this is totally new and it would be easier to discuss the problem if there was a common base that everybody understands ;-)
BTW: While you are testing/recording, could you please check whether the connection is still visible on the SMTP server during the different phases of the test?
It would be interesting to see how long it is in ESTABLISHED and TIME_WAIT state.
I am working on putting together a new capture all in a single file right now. I will run netstat several times to get an idea of when and how long it stays in TIME_WAIT etc.
The reason the first two files had different times was because I actually ran the ‘before reset’ file after I captured the ‘after reset’ file. It is kind of confusing with the names.
Basically, what is happening is that I am physically removing and restoring power to the embedded processor (resetting it) and restarting our code after every reset. Our code is written to use this library to send two emails, so per power-on it attempts to send 2 emails. The trick is that if you wait several minutes and then reset the device, both emails send fine. If I reset the device immediately after those two emails are sent, and keep resetting it each time the 2 attempts complete, only certain emails succeed. For a long time I saw no pattern. Then someone posted about the 2 billion thing, so I put together a spreadsheet, entered the ISNs, and the pattern matched the issues I was seeing.
https://drive.google.com/file/d/0B-pYPAmyNVqbMUFNOVRXTk1vc3c/edit?usp=sharing
Here is the unified capture file. I actually did three resets, so 6 email attempts. The first two went fine but after that is where things get weird. In certain cases the server just ignores the device and pretends it’s not there. I ran netstat… and at first it said LISTENING, then it said TIME_WAIT and it stayed that way from after the first email until about 3 minutes after the last email.
If you apply the following filter, you’ll see, that the client clearly violates RFC 1122, hence the server is ‘not allowed’ to answer.
There are two source ports, 1024 and 1025 and you’ll have to look at them separately.
++ Source port: 1025 ++
frame #42: SEQ = 3114737405. Connection works
frame #88: SEQ = 2363438278. Server does not answer, as the SEQ is lower than the last SEQ of the previous connection. This is correct according to RFC 1122. The client violates the RFC!
frame #89 - #93: The client retries until it gives up and sends a RESET. The RESET must be ignored by the server to stay in TIME_WAIT!
frame #134: SEQ = 1182260416. Server does not answer, as the SEQ is still lower than the last SEQ of the previous connection. This is correct according to RFC 1122. The client violates the RFC!
frame #135 - #139: The client retries until it gives up and sends a RESET. The RESET must be ignored by the server to stay in TIME_WAIT!
RESULT: Everything absolutely normal for port 1025. The client violates RFC 1122 and the server does not answer the new connection requests, which is correct.
++ Source port: 1024 ++
basically the same as for port 1025, however, with an important and strange difference at the end.
frame #3: SEQ = 2929729191. Connection works
frame #82: SEQ = 2771582515. Server does not answer, as the SEQ is lower than the last SEQ of the previous connection. This is correct according to RFC1122. The client violates the RFC!
frame #83 - #87: The client retries until it gives up and sends a RESET. The RESET must be ignored by the server to stay in TIME_WAIT!
NOW for the strange thing.
frame #96: SEQ = 742881266. Connection works! This should not have worked, as the SEQ is much lower than the last one for the working connection (frame #3 - 2929729191).
So the interesting question is: Have you seen the connection for port 1024 in TIME_WAIT while the last connection was established? If so, this is a clear violation of RFC 1122. If the connection was no longer in TIME_WAIT it would have been O.K. to answer the SYN.
All the rest is totally normal and adheres to RFC 1122.
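One possible reading of frame #96, offered only as a sketch: RFC 793 specifies that arithmetic on sequence numbers is performed modulo 2^32. If the server applies such a wrap-around comparison when it decides whether a new SYN is "larger" than the previous connection (an assumption, not something the capture proves), then an ISN counts as larger when it lies less than 2^31 ahead of the old one, even if its plain integer value is smaller. The helper name `seq_after` is hypothetical, and the ISN of the last working connection is used as a stand-in for its last SEQ.

```python
def seq_after(a, b):
    """True if sequence number a is 'after' b in modulo-2**32 arithmetic."""
    diff = (a - b) % 2**32
    return 0 < diff < 2**31

# (frame, source port, ISN of last working connection, new ISN, server answered?)
attempts = [
    (88, 1025, 3114737405, 2363438278, False),
    (134, 1025, 3114737405, 1182260416, False),
    (82, 1024, 2929729191, 2771582515, False),
    (96, 1024, 2929729191, 742881266, True),
]

for frame, port, prev_isn, new_isn, answered in attempts:
    # Print the modular verdict next to what the capture showed.
    print(frame, port, seq_after(new_isn, prev_isn), answered)
```

Under that assumption the modular verdict agrees with the observed behavior in all four cases, including frame #96, where 742881266 is 2108119371 ahead of 2929729191 modulo 2^32, just under 2^31. That would also be consistent with the roughly 2 billion pattern in the spreadsheet mentioned earlier in the thread.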
Now my question: What is it that you are identifying as a problem? Only the last connection for port 1024 or the whole behavior of the server?
TOTAL RESULT: In your test case, the client clearly violates RFC 1122. The easiest way to fix the problem would be to not reuse the source ports while the device is running, until there is a ‘natural’ wrap around at 65k. If the device is not doing that, the TCP/IP stack needs to be fixed.
As the developers asked for links, send them the link to RFC 1122 and to this discussion.
BTW: Did you reboot the device or reset the TCP/IP stack during your tests?
If so, then the port reuse is ‘normal’ and you are facing a problem that cannot be fixed easily. Reason: if you reboot the device (or reset the TCP/IP stack) the first TCP connection after the reboot will again start at source port 1024 and the OS will know nothing about the last SEQ for the same source port. So, all the OS can do is to choose a random ISN, which might be lower than the last SEQ, which is a violation of RFC 1122.
So, if you rebooted or reset the device during the test, stop doing that, as there is no way to get your test going. It could cause the same problem on any OS, although it would be much harder to actually trigger the problem, as (for example) a Windows/Linux system might open some TCP connections right after it has booted, so the chances of using the same source port for a mail to a certain server are rather small. In contrast, your embedded device might not open any TCP connection until you send the test mail, so the chances of hitting the same source port are rather high.
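The port reuse after a reset can be pictured with a small Python sketch. The class and function names are hypothetical and only model a stack that hands out source ports sequentially from 1024 and keeps that counter in RAM; the real third-party stack may work differently.

```python
class SequentialPortAllocator:
    """Hypothetical allocator: hands out source ports in order from 1024."""
    FIRST, LAST = 1024, 65535

    def __init__(self):
        # The counter lives in RAM, so every power cycle starts over here.
        self.next_port = self.FIRST

    def allocate(self):
        port = self.next_port
        # Natural wrap-around once the top of the range is reached.
        self.next_port = self.FIRST if port == self.LAST else port + 1
        return port

def ports_for_power_cycle(emails=2):
    stack = SequentialPortAllocator()  # fresh stack state after power-on
    return [stack.allocate() for _ in range(emails)]

print([ports_for_power_cycle() for _ in range(3)])
# [[1024, 1025], [1024, 1025], [1024, 1025]]
```

Every power cycle lands on the same two source ports, which matches the 1024 and 1025 pattern in the capture, while a stack that keeps running would not return to 1024 until the wrap-around at 65535.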
O.K. I have not seen your previous comment, but that perfectly explains what is happening, see my last comment. I suspected that you were booting/resetting the device.
Yes, because the connection on the server will be removed from TIME_WAIT after several minutes.
Unfortunately there is not much you can do yourself to fix this problem. There is no magical 2 billion delta/limit, it’s just sh.. happening combined with bad luck ;-)) The client violated RFC 1122 and the server does not answer, which is correct. The client violates RFC 1122, due to the following circumstances:
What could be done is this:
Your comments have been extremely helpful, thank you very much!
you’re welcome!