Help analyzing connection timeout

Question

Hello,

I have been having issues with long-running ssh connections dropping. I want to blame our firewall, but I see no evidence of it timing out states (normally I'd see packets associated with a timed-out session being blocked). To help figure this out, I started a tcpdump trace on both hosts and then opened an ssh session from one host to the other.

Looking at the captures side-by-side in Wireshark, I'm a bit perplexed. Can anyone explain what I'm seeing? Below is the text output of the "flow graph" from both ends. The first output is from the host that initiated the ssh session (sanitized to 10.1.1.1); the second output is from the ssh session's destination host (sanitized to 10.2.2.2). 10.1.1.1 has a firewall between it and the internet (no NAT, though). 10.2.2.2 has a host-based firewall, but tcpdump sees the packets before any filtering.

I couldn't get the code formatting here to work, so I've put the output on pastebin instead:

1 - http://pastebin.ca/2092041

2 - http://pastebin.ca/2092042

Answer 1

It looks like both sides are sending keep-alive packets after 2 hours, but the packets never reach the other side, most probably because the firewall in between has timed out the session. You can change the keep-alive interval on your ssh session to prevent the session from being dropped. Here is some info from the ssh_config manpage of ssh on my Mac:

ServerAliveCountMax
     Sets the number of server alive messages (see below) which may be sent without ssh(1) receiving any
     messages back from the server.  If this threshold is reached while server alive messages are being sent,
     ssh will disconnect from the server, terminating the session.  It is important to note that the use of
     server alive messages is very different from TCPKeepAlive (below).  The server alive messages are sent
     through the encrypted channel and therefore will not be spoofable.  The TCP keepalive option enabled by
     TCPKeepAlive is spoofable.  The server alive mechanism is valuable when the client or server depend on
     knowing when a connection has become inactive.

     The default value is 3.  If, for example, ServerAliveInterval (see below) is set to 15 and
     ServerAliveCountMax is left at the default, if the server becomes unresponsive, ssh will disconnect after
     approximately 45 seconds.  This option applies to protocol version 2 only.

ServerAliveInterval
     Sets a timeout interval in seconds after which if no data has been received from the server, ssh(1) will
     send a message through the encrypted channel to request a response from the server.  The default is 0,
     indicating that these messages will not be sent to the server.  This option applies to protocol version 2
     only.

TCPKeepAlive
     Specifies whether the system should send TCP keepalive messages to the other side.  If they are sent,
     death of the connection or crash of one of the machines will be properly noticed.  However, this means
     that connections will die if the route is down temporarily, and some people find it annoying.

     The default is ``yes'' (to send TCP keepalive messages), and the client will notice if the network goes
     down or the remote host dies.  This is important in scripts, and many users want it too.

     To disable TCP keepalive messages, the value should be set to ``no''.

answered 21 Oct '11, 00:21
SYN-bit
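
As a minimal sketch of the suggestion above, the client-side ssh configuration on 10.1.1.1 could enable server alive messages for the destination host. The 60-second interval is an assumption for illustration; the thread does not state the firewall's idle timeout, so the real value needs to be shorter than whatever that timeout turns out to be.

```
# ~/.ssh/config on the client (10.1.1.1)
# The 60-second interval is an assumed value, not taken from the thread;
# pick one shorter than the firewall's idle timeout.
Host 10.2.2.2
    ServerAliveInterval 60
    # 3 is the default; with a 60-second interval ssh gives up after
    # roughly 180 seconds without a response from the server.
    ServerAliveCountMax 3
```

The same options can be passed for a single session with `ssh -o ServerAliveInterval=60 10.2.2.2`. Because these messages travel inside the encrypted channel as regular traffic, they should also refresh the state entry on the firewall in between, provided the interval is short enough.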
