When it comes to network security and the performance of network services, an important concept is how the UNIX kernel handles establishing TCP connections. Whilst the three-way handshake is commonly known, unless you write communications code (e.g. use the listen() call and its backlog parameter), you may not have looked at what is referred to on Solaris as Qmax.
Qmax is incredibly important. As TCP connections come into a listening socket and complete the three-way handshake, they get put on a kernel TCP queue (listed as Q on Solaris). When a (userland) application is ready to receive its next connection, it calls accept() to take the next connection off that queue.
However, the queue cannot grow indefinitely. When a process opens a listening socket using listen(), it supplies a backlog – a limit on the number of outstanding TCP connections. Once this limit is reached, the OS – Solaris, for example – will no longer accept new connections at the kernel level; it will behave as if the port is firewalled off and silently ignore the initial SYN packet.
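For reference, the backlog is the second argument to listen():

    #include <sys/socket.h>

    int listen(int sockfd, int backlog);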
This has numerous performance implications. For example, TCP at the client end will back off and retry. Typically such backoff starts at 1–3 seconds and increases with each non-response, so for a latency-sensitive system this can be catastrophic for that business function.
From a security point of view, if you can establish new connections at a quicker rate than the application can accept them, then you can DoS the system. This isn't as hard as you might think in many cases. Think of a Java application and safepoints (stop-the-world events).
There is also a scanning angle: if a server has multiple IP addresses and a single listener bound to the "any" address (i.e. accepting on all interfaces), a parallel full-connect scan may fill the queue and produce false negatives.
Within Linux, Solaris and other OSes there are parameters to handle these limits, including for connections that have not yet completed the three-way handshake.
On Linux, incomplete connections are constrained by /proc/sys/net/ipv4/tcp_max_syn_backlog, and the backlog (from listen()) is capped by /proc/sys/net/core/somaxconn. Both can be set via sysctl or sysctl.conf.
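For example, to inspect and raise them (the values here are illustrative, not recommendations):

    # current values
    sysctl net.ipv4.tcp_max_syn_backlog net.core.somaxconn

    # change until next reboot
    sysctl -w net.ipv4.tcp_max_syn_backlog=2048
    sysctl -w net.core.somaxconn=1024

    # or persist across reboots with a line in /etc/sysctl.conf:
    #   net.core.somaxconn = 1024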
On Solaris, incomplete connections are constrained by the /dev/tcp ndd parameter tcp_conn_req_max_q0, and the backlog by tcp_conn_req_max_q. Both can be set via ndd or /etc/system.
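For example (again, the value is illustrative):

    # current values
    ndd /dev/tcp tcp_conn_req_max_q0
    ndd /dev/tcp tcp_conn_req_max_q

    # change until next reboot
    ndd -set /dev/tcp tcp_conn_req_max_q 1024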
So, how do you look at the queue? It depends on the OS, but let's do the easy case first.
Viewing the counts on Solaris
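On Solaris 10 one way to see the per-listener counters is the ndd tcp_listen_hash report. The output below is a mock-up of the shape of the fields rather than verbatim:

    # ndd /dev/tcp tcp_listen_hash
        TCP        zone IP addr         port  seqnum   backlog (q0/q/max)
    ffbe3b60   0    ::ffff:0.0.0.0      00022 00000000 0/0/8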
Here we can see that the listener on port 22/tcp has q0, q and max (i.e. Qmax) as 0, 0 and 8. Q0 represents embryonic connections, Q represents connections waiting to be accepted by the application, and max represents the maximum size Q can grow to.
Viewing the counts on Linux
On Linux we can use ss, for example to see listening tcp over ipv4 sockets:
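For example (the output is illustrative):

    # ss -lt4n
    State      Recv-Q Send-Q    Local Address:Port      Peer Address:Port
    LISTEN     0      128                   *:22                   *:*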
Here Recv-Q is the number of outstanding TCP connections waiting for the application to accept, and Send-Q is Qmax. The reason is a bit of a cheat in the kernel structures: a listening socket reuses fields that are otherwise redundant in that state.
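We can see this reuse in the kernel's socket-diag code that feeds ss – roughly the following, abridged from a 3.x-era tcp_diag_get_info() in net/ipv4/tcp_diag.c:

    if (sk->sk_state == TCP_LISTEN) {
            r->idiag_rqueue = sk->sk_ack_backlog;     /* Recv-Q: connections awaiting accept() */
            r->idiag_wqueue = sk->sk_max_ack_backlog; /* Send-Q: Qmax */
    } else {
            r->idiag_rqueue = max_t(int, tp->rcv_nxt - tp->copied_seq, 0);
            r->idiag_wqueue = tp->write_seq - tp->snd_una;
    }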
We can figure out (some of) this directly, but it is less obvious. If we look in /proc/net/tcp we see a number of interesting numbers, especially for listening sockets – but what do they mean?
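An abridged, illustrative example (trailing fields trimmed for width):

    # cat /proc/net/tcp
      sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
       ...
       3: 00000000:0016 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 ...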
If we look at index 3, we see the local address ending :0016. This is 22/tcp, but in hex – i.e. the listening ssh port.
How do we know it is the listening port? Well, the st column is the state. If we look in include/net/tcp_states.h in the kernel source we see various forms describing this, including an enumeration with TCP_LISTEN as item 10, or 0x0A.
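From include/net/tcp_states.h (as in the 3.10-era kernels used here):

    enum {
            TCP_ESTABLISHED = 1,
            TCP_SYN_SENT,
            TCP_SYN_RECV,
            TCP_FIN_WAIT1,
            TCP_FIN_WAIT2,
            TCP_TIME_WAIT,
            TCP_CLOSE,
            TCP_CLOSE_WAIT,
            TCP_LAST_ACK,
            TCP_LISTEN,     /* item 10, i.e. 0x0A */
            TCP_CLOSING,    /* Now a valid state */

            TCP_MAX_STATES  /* Leave at the end! */
    };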
What about the rest of the fields? Are they overloaded for listening sockets in the same way as with the ss command? The answer lies in the kernel source file net/ipv4/tcp_ipv4.c and the function get_tcp4_sock().
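The relevant part looks roughly like this (abridged from a 3.10-era kernel):

    if (sk->sk_state == TCP_LISTEN)
            rx_queue = sk->sk_ack_backlog;
    else
            /* unlocked read: may transiently go negative */
            rx_queue = max_t(int, tp->rcv_nxt - tp->copied_seq, 0);

    seq_printf(f, "%4d: %08X:%04X %08X:%04X %02X %08X:%08X %02X:%08lX "
                    "%08X %5u %8d %lu %d %pK %lu %lu %u %u %d%n",
            i, src, srcp, dest, destp, sk->sk_state,
            tp->write_seq - tp->snd_una,
            rx_queue,
            ...);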
Here we see that the 8th argument to the printf is rx_queue – i.e. the Recv-Q in the ss command.
Further down we see the last value may be the Qmax value:
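(Again abridged from a 3.10-era kernel; this is the final value passed to that seq_printf():)

    sk->sk_state == TCP_LISTEN ?
        (fastopenq ? fastopenq->max_qlen : 0) :
        (tcp_in_initial_slowstart(tp) ? -1 : tp->snd_ssthresh),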
It is only populated if the listener is using TCP Fast Open (see e.g. the sysctl option net.ipv4.tcp_fastopen). So we cannot always directly get Qmax via this route.
I wrote a simple program that would open a listening socket on 2000/tcp with a small backlog. It would only accept a connection when I told it to. As usual, things are subtly different between OSes.
For info, sol10-u9-t4 (10.255.0.17) is the Solaris server, centos-7-2-t1 (10.255.0.13) is the Linux server and pandora (10.255.0.12) is the client.
Of course I also do something a bit nonsensical; I set the listen queue to one:
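The core of it looked something like this – a minimal sketch rather than the exact program; here the pause before each accept() is driven by reading a line from stdin:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    int main(void)
    {
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        if (lfd < 0) { perror("socket"); exit(1); }

        struct sockaddr_in sa;
        memset(&sa, 0, sizeof(sa));
        sa.sin_family = AF_INET;
        sa.sin_port = htons(2000);              /* listen on 2000/tcp */
        sa.sin_addr.s_addr = htonl(INADDR_ANY); /* the "any" address */

        if (bind(lfd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
            perror("bind"); exit(1);
        }
        if (listen(lfd, 1) < 0) {               /* the nonsensical backlog of one */
            perror("listen"); exit(1);
        }

        char line[16];
        for (;;) {
            printf("press return to accept()...\n");
            if (fgets(line, sizeof(line), stdin) == NULL)
                break;
            int cfd = accept(lfd, NULL, NULL);  /* take one connection off Q */
            if (cfd < 0)
                perror("accept");
            else
                printf("accepted fd %d\n", cfd);
            /* deliberately keep cfd open; we are only exercising the queue */
        }
        return 0;
    }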
First let's start the listener on Solaris.
Having a look at the queue states via ndd shows us Solaris was wise to the fact that setting qmax to one is a bit stupid.
So, now let's start a session, which appears to hang, and look at the result.
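From the client (pandora), something as simple as this will do – telnet here stands in for whichever client tool you prefer:

    telnet 10.255.0.17 2000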
Solaris has it in Q, pending the application acceptance:
So let's accept it and see the result:
We are no longer on Q as the application has accepted this connection.
If we now connect two clients without accepting them we see the following change to the queue:
We now have the situation of Q == Qmax. On Solaris 10, whilst this is the case, new connections will not be accepted. In reality, embryonic connections already on Q0 still complete the handshake but are not moved to Q until there is space; we can see this via kernel tracing (or by reading the OpenSolaris source).
If we now establish a third connection whilst running a packet capture, we see this in action.
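On Solaris the bundled capture tool is snoop; a filter like this is enough to watch the SYN retries (-r simply suppresses name resolution):

    snoop -r port 2000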
If we repeat this third connection but accept a connection within the application prior to the timeout, it will get through:
As you can see, whilst we got through, the fact that the kernel queue filled up resulted in this case in a delay of roughly two seconds (1.00021 + 0.91266 ≈ 1.91 seconds). Depending on when the application accepts relative to the first SYN packet, the exponential backoff in retries could result in a significant delay or outright failure.
Now let's repeat the exercise on the Linux server. When we look at the queue after starting the listener, we see that Linux accepted our qmax of one:
So, let's connect from the client, just as we did on Solaris:
As we can see we now have a connection in Q:
If we accept and complete the client connection we then see that Q is now clear:
Now let's do two without an accept, but with tcpdump running as well:
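For the capture, something like this on either end will do (the interface name is an assumption):

    tcpdump -n -i eth0 port 2000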
Looks like it ignored my Qmax of one and accepted both. (In fact, the Linux queue-full test in sk_acceptq_is_full() uses "greater than" rather than "greater than or equal to", so a backlog of one effectively admits two queued connections.) Let's check on the client:
.. and the server ..
Let's try one more connection:
And on the client:
It also appears to work. But if we look at the kernel state, something interesting shows up. The client side confirms an established connection.
But the server side lists it as SYN-RECV, even though the packet captures on both sides prove we've gone through the three-way handshake.
If we then accept the first connection on the queue, we see this change to ESTAB.
Interestingly, if we then try the same thing but send data from the client, that data is never acked: the connection the client thinks is established is still marked as SYN-RECV on the server, because with the accept queue full the server never promoted it (it effectively dropped the final ACK of the handshake). Eventually the connection is cleared on the server side, causing a reset to the client:
As you can see, the way things actually behave under stress can differ between OSes, can differ from what you may think, tools can mislead, and the results can be problematic for application performance and reliability.
Careful analysis will expose such anomalies, giving you the opportunity to diagnose and resolve the true root cause of the problem you are investigating.
In a later article I'll look at dynamic kernel tracing to track these changes; yes, fun with DTrace.