Saturday, 2 July 2016

Beyond TCP Qmax



When it comes to the security and performance of network services, an important concept is how the UNIX kernel handles establishing TCP connections. Whilst the three-way handshake is commonly known, unless you write communications code (e.g. using the listen() call and its backlog parameter), you may not have looked at what is referred to on Solaris as Qmax.

Qmax is incredibly important. As TCP connections come in to a listening socket and complete the three-way handshake, they get put on a kernel TCP queue (listed as Q on Solaris). When a (userland) application is ready to receive its next connection, it calls accept() to take the next connection off that queue.

However, the queue cannot grow indefinitely. When a process opens a listening socket using listen(), it supplies a backlog – a limit – on the number of outstanding TCP connections. As soon as this limit is reached, the OS stops accepting new connections at the kernel level. On Solaris, for example, it behaves as if the port were firewalled off: the OS ignores the initial SYN packet.

This has numerous performance implications. For example, TCP at the client end will back off and retry. Typically the backoff starts at 1 – 3 seconds and increases with each non-response, so on a low-latency system this can be catastrophic for that business function.
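
To get a feel for what a dropped SYN costs, the retry timings can be tabulated. This is a rough sketch that assumes the client simply doubles its timeout after each unanswered SYN; the function name and default values are illustrative, and real stacks differ in their initial timeout and retry count.

```python
def syn_retry_offsets(initial_timeout=1.0, retries=5):
    """Return the seconds after the first SYN at which each retry is sent.

    Assumes a plain doubling backoff; real TCP stacks vary.
    """
    offsets = []
    elapsed = 0.0
    timeout = initial_timeout
    for _ in range(retries):
        elapsed += timeout      # wait out the current timeout
        offsets.append(elapsed)
        timeout *= 2            # then double it for the next attempt
    return offsets
```

With a one-second initial timeout, a client that is ignored three times has already waited seven seconds before its fourth SYN goes out.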

From a security point of view, if you can establish new connections at a quicker rate than the application can accept them, then you can DoS the system. This isn't as hard as you might think in many cases. Think of a Java application and safepoints (stop-the-world events), during which the application cannot call accept().

It also matters from a scanning point of view. If a server has multiple IP addresses and a single listener (on the “any” interface, i.e. it accepts on all addresses), a parallel full-connect scan may produce false negatives, as all the addresses share the one queue.

Within Linux, Solaris and other OSes there are parameters to handle limits, including for those connections that have not completed the three-way handshake.

On Linux, incomplete connections are constrained by /proc/sys/net/ipv4/tcp_max_syn_backlog and the backlog (from listen()) by /proc/sys/net/core/somaxconn. Set them via sysctl or sysctl.conf.
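
As a small illustration, the two Linux limits can be read straight from /proc. The helper names below are made up for this sketch; the only behaviour assumed beyond the text is that the kernel silently caps the listen() backlog at somaxconn, and that the requested backlog is non-negative.

```python
from pathlib import Path

# The two paths named in the text; they only exist on Linux.
SYN_BACKLOG = "/proc/sys/net/ipv4/tcp_max_syn_backlog"
SOMAXCONN = "/proc/sys/net/core/somaxconn"

def read_limit(path):
    """Return the integer stored in a /proc/sys file, or None if unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None

def effective_backlog(requested, somaxconn):
    """Backlog left after the kernel caps the listen() value at somaxconn."""
    return min(requested, somaxconn)
```

For example, effective_backlog(1024, read_limit(SOMAXCONN)) shows what an application asking for 1024 would actually get on a Linux host.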

On Solaris, incomplete connections are constrained by the /dev/tcp ndd parameter tcp_conn_req_max_q0 and the backlog by tcp_conn_req_max_q. Set them via ndd or /etc/system.

So, how do you look at the queue? It depends on the OS, but let's do the easy case first.

Viewing the counts on Solaris

Here we can see that the listener on port 22/tcp has q0, q, and max (i.e. Qmax) as 0, 0 and 8. Q0 represents embryonic connections, Q represents connections waiting to be accepted by the application and max represents the maximum size Q can grow to.

root@sol10-u9-t4# ndd /dev/tcp tcp_listen_hash | head -2
    TCP            zone IP addr         port  seqnum   backlog (q0/q/max)
022 ffffffff879e9d00 0 :: 00022 00000019 0/0/8
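
If you want to watch these counters from a script, the row can be split mechanically. This is only a sketch: it assumes the column layout shown above (port third from the end, q0/q/max last), and the function name is hypothetical.

```python
def parse_listen_hash_line(line):
    """Parse one data row of `ndd /dev/tcp tcp_listen_hash` output.

    Assumes the final column is q0/q/max and the port is the third
    column from the end, as in the sample output above.
    """
    fields = line.split()
    q0, q, qmax = (int(n) for n in fields[-1].split("/"))
    port = int(fields[-3])  # zero-padded decimal, e.g. 00022
    return {"port": port, "q0": q0, "q": q, "max": qmax}
```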

Viewing the counts on Linux

On Linux we can use ss, for example to see listening tcp over ipv4 sockets:

[root@centos-7-2-t1 qmax]# ss -4tnl
State      Recv-Q Send-Q Local Address:Port               Peer Address:Port             
LISTEN     0      128           *:111                       *:*                 
LISTEN     0      10            *:2000                      *:*                 
LISTEN     0      128           *:20048                     *:*                 
LISTEN     0      128           *:22                        *:*

Here Recv-Q is the number of outstanding TCP connections waiting for the application to accept and Send-Q is Qmax. The reason is that there is a bit of a cheat in the kernel structures, in that a listening socket reuses some redundant fields in the data structure.
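
The same reading of the columns can be applied in a script. The sketch below pulls the queue length and Qmax out of ss output for LISTEN rows only; the function name and dictionary keys are illustrative, and it assumes the column order shown above.

```python
def parse_ss_listeners(text):
    """Extract accept-queue figures from `ss -tnl` style output.

    For LISTEN rows, Recv-Q is the current accept queue length and
    Send-Q is the configured maximum (Qmax).
    """
    listeners = []
    for line in text.splitlines():
        fields = line.split()
        # Some invocations prefix a Netid column (e.g. "tcp"); skip it.
        if fields and fields[0] == "tcp":
            fields = fields[1:]
        if len(fields) >= 4 and fields[0] == "LISTEN":
            port = int(fields[3].rsplit(":", 1)[1])
            listeners.append(
                {"port": port, "queued": int(fields[1]), "qmax": int(fields[2])}
            )
    return listeners
```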

We can figure out some of this directly, but it is less obvious. If we look in /proc/net/tcp we see a number of interesting numbers, especially for listening sockets – but what do they mean?

[root@centos-7-2-t1 qmax]# cat /proc/net/tcp
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode                                                    
   0: 00000000:006F 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 18921 1 ffff88003a520000 100 0 0 10 0                    
   1: 00000000:07D0 00000000:0000 0A 00000000:00000001 00:00000000 00000000     0        0 75449 1 ffff88003a525280 100 0 0 10 0                    
   2: 00000000:4E50 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 20578 1 ffff880037790000 100 0 0 10 0                    
   3: 00000000:0016 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 18923 1 ffff88003a520780 100 0 0 10 0

If we look at index 3, we see the local address with 0016. This is 22/tcp, but in hex – i.e. the listening ssh port.

How do we know it is the listening port? Well, the st column is the state. If we look in include/net/tcp_states.h in the kernel source we see various forms describing this, including an enumeration with TCP_LISTEN as item 10, or 0x0A.
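
Decoding a row by hand gets tedious, so here is a minimal sketch that does it. It relies only on the column order in the header above; the function name is hypothetical, and the address decoding assumes a little-endian host such as x86.

```python
import socket
import struct

TCP_LISTEN = 0x0A  # TCP_LISTEN in include/net/tcp_states.h

def parse_proc_net_tcp_row(row):
    """Decode the leading fields of one /proc/net/tcp data row."""
    fields = row.split()
    addr_hex, port_hex = fields[1].split(":")
    # The IPv4 address is printed as a native-endian 32-bit word;
    # little-endian is an assumption of this sketch.
    addr = socket.inet_ntoa(struct.pack("<I", int(addr_hex, 16)))
    tx_queue, rx_queue = (int(h, 16) for h in fields[4].split(":"))
    return {
        "local_addr": addr,
        "local_port": int(port_hex, 16),
        "state": int(fields[3], 16),
        "tx_queue": tx_queue,
        "rx_queue": rx_queue,
    }
```

Feeding it the row at index 3 gives local port 22 in state TCP_LISTEN, matching the manual reading.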

What about the rest of the fields? As with the ss command, are these different? The answer lies in the kernel source file net/ipv4/tcp_ipv4.c and the function get_tcp4_sock().

Here we see that the 8th argument to the printf is rx_queue – i.e. the Recv-Q in the ss command.

seq_printf(f, "%4d: %08X:%04X %08X:%04X %02X %08X:%08X %02X:%08lX "
                        "%08X %5u %8d %lu %d %pK %lu %lu %u %u %d",
                i, src, srcp, dest, destp, sk->sk_state,
                tp->write_seq - tp->snd_una,
                rx_queue,

Further down we see the last value may be the Qmax value:

sk->sk_state == TCP_LISTEN ?
                    (fastopenq ? fastopenq->max_qlen : 0) :
                    (tcp_in_initial_slowstart(tp) ? -1 : tp->snd_ssthresh));

For a listening socket this field is only populated if TCP Fast Open is in use on it (see e.g. the sysctl option net.ipv4.tcp_fastopen); otherwise it is 0. So, we cannot always directly get Qmax via this route.

Demo

I wrote a simple program that opens a listening socket on 2000/tcp with a small backlog. It only accepts a connection when I tell it to. As usual, things are subtly different between OSes.

For info, sol10-u9-t4 (10.255.0.17) is the Solaris Server, centos-7-2-t1 (10.255.0.13) is the Linux server and pandora (10.255.0.12) is the client.

Of course I also do something a bit nonsensical; I set the listen queue to one:

l = listen(s, 1);
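
For anyone wanting to reproduce this without the original program, here is a rough Python stand-in. It is not the tcpq source: the function names, the greeting and the choice to close each connection straight after greeting it are assumptions made for this sketch.

```python
import socket

def open_listener(port=2000, backlog=1, host="0.0.0.0"):
    """Open a TCP listener with a deliberately small backlog."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((host, port))
    s.listen(backlog)
    return s

def accept_one(listener, greeting=b"Hello world\n"):
    """Take one connection off the accept queue, greet it and close it."""
    conn, peer = listener.accept()
    with conn:
        conn.sendall(greeting)
    return peer

def prompt_loop(listener):
    """Only call accept() when the operator asks, so the queue can fill."""
    while True:
        choice = input("listening .... (a)ccept, e(x)it >").strip()
        if choice == "a":
            print("waiting to accept ... got", accept_one(listener))
        elif choice == "x":
            break
```

Calling prompt_loop(open_listener()) gives a prompt similar to the one in the transcripts that follow.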

First let's start the listener on Solaris.

paul@sol10-u9-t4$ ./tcpq
listening .... (a)ccept, e(x)it >

Having a look at the queue states via ndd shows us Solaris was wise to the fact that setting qmax to one is a bit stupid; it has quietly set max to 2.

root@sol10-u9-t4# ndd /dev/tcp tcp_listen_hash | egrep '(max|02000)'
    TCP            zone IP addr         port  seqnum   backlog (q0/q/max)
215 ffffffff897658c0 0 ::ffff:0.0.0.0 02000 00000000 0/0/2

So, now let's start a session, which appears to hang, and look at the result.

[paul@pandora ~]$ nc 10.255.0.17 2000

Solaris has it in Q, pending the application acceptance:

root@sol10-u9-t4# ndd /dev/tcp tcp_listen_hash | egrep '(max|02000)'
    TCP            zone IP addr         port  seqnum   backlog (q0/q/max)
215 ffffffff897658c0 0 ::ffff:0.0.0.0 02000 00000001 0/1/2

So let's accept it and see the result:

paul@sol10-u9-t4$ ./tcpq
listening .... (a)ccept, e(x)it >a

waiting to accept ... got FD
listening .... (a)ccept, e(x)it >

[paul@pandora ~]$ nc 10.255.0.17 2000
Hello world

root@sol10-u9-t4# ndd /dev/tcp tcp_listen_hash | egrep '(max|02000)'
    TCP            zone IP addr         port  seqnum   backlog (q0/q/max)
215 ffffffff897658c0 0 ::ffff:0.0.0.0 02000 00000001 0/0/2

We are no longer on Q as the application has accepted this connection.

If we now connect two clients without accepting them we see the following change to the queue:

root@sol10-u9-t4# ndd /dev/tcp tcp_listen_hash | egrep '(max|02000)'
    TCP            zone IP addr         port  seqnum   backlog (q0/q/max)
215 ffffffff897658c0 0 ::ffff:0.0.0.0 02000 00000003 0/2/2

We now have the situation of Q == Qmax. On Solaris 10, whilst this is the case, new connections will not be accepted. In reality, embryonic connections already on Q0 still complete the handshake but are not moved to Q until there is space; we can see this via kernel tracing (or by reading the OpenSolaris source).

If we now establish a third connection whilst running a packet capture, we see this in action.

root@sol10-u9-t4# snoop -t d -d e1000g0 port 2000
Using device e1000g0 (promiscuous mode)
  0.00000  10.255.0.12 -> sol10-u9-t4  TCP D=2000 S=36419 Syn Seq=2910631562 Len=0 Win=29200 Options=<mss 1460,sackOK,tstamp 1486046 0,nop,wscale 7>
  1.00110  10.255.0.12 -> sol10-u9-t4  TCP D=2000 S=36419 Syn Seq=2910631562 Len=0 Win=29200 Options=<mss 1460,sackOK,tstamp 1487048 0,nop,wscale 7>
  2.00416  10.255.0.12 -> sol10-u9-t4  TCP D=2000 S=36419 Syn Seq=2910631562 Len=0 Win=29200 Options=<mss 1460,sackOK,tstamp 1489052 0,nop,wscale 7>
  4.00408  10.255.0.12 -> sol10-u9-t4  TCP D=2000 S=36419 Syn Seq=2910631562 Len=0 Win=29200 Options=<mss 1460,sackOK,tstamp 1493056 0,nop,wscale 7>

[paul@pandora ~]$ nc 10.255.0.17 2000
Ncat: Connection timed out.

If we repeat this third connection but accept a connection within the application prior to the timeout, it will get through:

 73.53762  10.255.0.12 -> sol10-u9-t4  TCP D=2000 S=36421 Syn Seq=4195269135 Len=0 Win=29200 Options=<mss 1460,sackOK,tstamp 1566591 0,nop,wscale 7>
  1.00021  10.255.0.12 -> sol10-u9-t4  TCP D=2000 S=36421 Syn Seq=4195269135 Len=0 Win=29200 Options=<mss 1460,sackOK,tstamp 1567592 0,nop,wscale 7>
  0.91266  sol10-u9-t4 -> 10.255.0.12  TCP D=36416 S=2000 Push Ack=3871252411 Seq=1827671955 Len=12 Win=49232 Options=<nop,nop,tstamp 7781354 1279657>

As you can see, we got through, but because the kernel queue had filled up the connection was delayed by roughly two seconds (1.00021 + 0.91266 seconds). Depending on when the application accepts relative to the first SYN packet, the exponential backoff in retries could result in a significant delay or outright failure.
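
The effect of the retry timing can be modelled in a few lines. This sketch assumes SYNs are silently dropped while the queue is full and that the client sends them at the offsets seen in the first capture (roughly 0, 1, 3 and 7 seconds); the function name is hypothetical.

```python
import bisect

def connect_delay(queue_free_at, syn_offsets=(0.0, 1.0, 3.0, 7.0)):
    """Return when a client gets through, or None if it gives up.

    syn_offsets are the seconds at which the client sends each SYN.
    A SYN only succeeds if it arrives once the accept queue has
    space again, i.e. at or after queue_free_at.
    """
    i = bisect.bisect_left(syn_offsets, queue_free_at)
    return syn_offsets[i] if i < len(syn_offsets) else None
```

Under these assumptions, an application that frees a slot 1.5 seconds after the first SYN leaves the client waiting until the 3 second retry, and one that takes 8 seconds leaves it with a timeout.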

Now Linux.

[paul@centos-7-2-t1 qmax]$ ./tcpq
listening .... (a)ccept, e(x)it >

Yet, when we look at the queue we see that Linux accepted our qmax of one:

[root@centos-7-2-t1 qmax]# ss -4tln 'sport = 2000'
Netid State      Recv-Q Send-Q Local Address:Port               Peer Address:Port             
tcp   LISTEN     0      1           *:2000                    *:*

So, let's connect from the client, just as we did on Solaris:

[paul@pandora ~]$ nc 10.255.0.13 2000

As we can see we now have a connection in Q:

[root@centos-7-2-t1 qmax]# ss -4tln 'sport = 2000'
Netid State      Recv-Q Send-Q Local Address:Port               Peer Address:Port             
tcp   LISTEN     1      1           *:2000                    *:*                 
tcp   ESTAB      0      0      10.255.0.13:2000               10.255.0.12:51496

If we accept and complete the client connection we then see that Q is now clear:

[root@centos-7-2-t1 qmax]# ss -4tln 'sport = 2000'
Netid State      Recv-Q Send-Q Local Address:Port               Peer Address:Port             
tcp   LISTEN     0      1           *:2000                    *:*                 
tcp   FIN-WAIT-2 0      0      10.255.0.13:2000               10.255.0.12:51496

Now let's do two without an accept but with tcpdump as well:

[root@centos-7-2-t1 qmax]# tcpdump -i eno16777728 port 2000
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eno16777728, link-type EN10MB (Ethernet), capture size 65535 bytes
11:45:37.769911 IP 10.255.0.12.51506 > centos-7-2-t1.m0noc.net.sieve-filter: Flags [S], seq 4210720999, win 29200, options [mss 1460,sackOK,TS val 2909683 ecr 0,nop,wscale 7], length 0
11:45:37.769988 IP centos-7-2-t1.m0noc.net.sieve-filter > 10.255.0.12.51506: Flags [S.], seq 2491402835, ack 4210721000, win 28960, options [mss 1460,sackOK,TS val 55031849 ecr 2909683,nop,wscale 7], length 0
11:45:37.770028 IP 10.255.0.12.51506 > centos-7-2-t1.m0noc.net.sieve-filter: Flags [.], ack 1, win 229, options [nop,nop,TS val 2909683 ecr 55031849], length 0
11:45:39.384387 IP 10.255.0.12.51507 > centos-7-2-t1.m0noc.net.sieve-filter: Flags [S], seq 4241347616, win 29200, options [mss 1460,sackOK,TS val 2911297 ecr 0,nop,wscale 7], length 0
11:45:39.384440 IP centos-7-2-t1.m0noc.net.sieve-filter > 10.255.0.12.51507: Flags [S.], seq 2296846025, ack 4241347617, win 28960, options [mss 1460,sackOK,TS val 55033463 ecr 2911297,nop,wscale 7], length 0
11:45:39.384524 IP 10.255.0.12.51507 > centos-7-2-t1.m0noc.net.sieve-filter: Flags [.], ack 1, win 229, options [nop,nop,TS val 2911297 ecr 55033463], length 0

Looks like it ignored my Qmax of one and accepted both. Let's check on the client:

[root@pandora ~]# ss -a4tn 'dport = 2000'
Netid  State      Recv-Q Send-Q Local Address:Port               Peer Address:Port             
tcp    ESTAB      0      0      10.255.0.12:51506              10.255.0.13:2000               
tcp    ESTAB      0      0      10.255.0.12:51507              10.255.0.13:2000

.. and the server ..

[root@centos-7-2-t1 ~]# ss -a4tn 'sport = 2000'
Netid  State      Recv-Q Send-Q Local Address:Port               Peer Address:Port             
tcp    LISTEN     2      1           *:2000                    *:*                 
tcp    ESTAB      0      0      10.255.0.13:2000               10.255.0.12:51507             
tcp    ESTAB      0      0      10.255.0.13:2000               10.255.0.12:51506

Let's try one more connection:

[root@centos-7-2-t1 qmax]# tcpdump -i eno16777728 port 2000
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eno16777728, link-type EN10MB (Ethernet), capture size 65535 bytes
11:49:20.326595 IP 10.255.0.12.51510 > centos-7-2-t1.m0noc.net.sieve-filter: Flags [S], seq 3247629228, win 29200, options [mss 1460,sackOK,TS val 3132239 ecr 0,nop,wscale 7], length 0
11:49:20.326672 IP centos-7-2-t1.m0noc.net.sieve-filter > 10.255.0.12.51510: Flags [S.], seq 2163865433, ack 3247629229, win 28960, options [mss 1460,sackOK,TS val 55254406 ecr 3132239,nop,wscale 7], length 0

And on the client:

[root@pandora ~]# tcpdump -i em1 port 2000
11:49:19.677814 IP pandora.51510 > 10.255.0.13.sieve-filter: Flags [S], seq 3247629228, win 29200, options [mss 1460,sackOK,TS val 3132239 ecr 0,nop,wscale 7], length 0
11:49:19.678209 IP 10.255.0.13.sieve-filter > pandora.51510: Flags [S.], seq 2163865433, ack 3247629229, win 28960, options [mss 1460,sackOK,TS val 55254406 ecr 3132239,nop,wscale 7], length 0

It also appears to work. But if we look at the kernel states, something interesting shows up. The client side confirms an established connection.

[root@pandora ~]# ss -a4tn 'dport = 2000'
Netid  State      Recv-Q Send-Q Local Address:Port               Peer Address:Port             
tcp    ESTAB      0      0      10.255.0.12:51510              10.255.0.13:2000              
tcp    ESTAB      0      0      10.255.0.12:51506              10.255.0.13:2000               
tcp    ESTAB      0      0      10.255.0.12:51507              10.255.0.13:2000

But the server side lists it as SYN-RECV, even though the packet captures on both sides prove we've gone through the three-way handshake.

[root@centos-7-2-t1 ~]# ss -a4tn 'sport = 2000'
Netid  State      Recv-Q Send-Q Local Address:Port               Peer Address:Port             
tcp    LISTEN     2      1           *:2000                    *:*                 
tcp    SYN-RECV   0      0      10.255.0.13:2000               10.255.0.12:51510             
tcp    ESTAB      0      0      10.255.0.13:2000               10.255.0.12:51507             
tcp    ESTAB      0      0      10.255.0.13:2000               10.255.0.12:51506

If we then accept the first connection on the queue, we then see this change to ESTAB.
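
The tell-tale signs in the server output (a listener whose Recv-Q exceeds Send-Q, alongside a SYN-RECV entry) can be picked out mechanically. The sketch below assumes ss output with a leading Netid column, as in the listings above; the function name and result keys are illustrative.

```python
def summarise_server(ss_text, port):
    """Summarise listener queue and connection states for one local port."""
    summary = {"queued": None, "qmax": None, "states": {}}
    for line in ss_text.splitlines():
        fields = line.split()
        if len(fields) < 6 or fields[0] != "tcp":
            continue  # header or unrelated line
        state, recv_q, send_q, local = fields[1:5]
        if not local.endswith(f":{port}"):
            continue
        if state == "LISTEN":
            summary["queued"], summary["qmax"] = int(recv_q), int(send_q)
        else:
            summary["states"][state] = summary["states"].get(state, 0) + 1
    return summary
```

A result where queued is greater than qmax and SYN-RECV is non-zero matches the situation shown in the listing.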

Interestingly, if we try the same thing but send data from the client, the server does not ACK the data on the connection that the client thinks is established but the server has marked as SYN-RECV. The client keeps retransmitting, and eventually the server clears the connection, sending a reset to the client:

12:07:05.090381 IP 10.255.0.13.sieve-filter > pandora.51515: Flags [S.], seq 950779250, ack 3352895372, win 28960, options [mss 1460,sackOK,TS val 56319818 ecr 4187968,nop,wscale 7], length 0
12:07:05.090471 IP pandora.51515 > 10.255.0.13.sieve-filter: Flags [.], ack 1, win 229, options [nop,nop,TS val 4197652 ecr 56288216], length 0
12:07:08.238307 IP pandora.51515 > 10.255.0.13.sieve-filter: Flags [P.], seq 1:3, ack 1, win 229, options [nop,nop,TS val 4200800 ecr 56288216], length 2
12:07:33.870349 IP pandora.51515 > 10.255.0.13.sieve-filter: Flags [P.], seq 1:3, ack 1, win 229, options [nop,nop,TS val 4226432 ecr 56288216], length 2

12:08:25.198303 IP pandora.51515 > 10.255.0.13.sieve-filter: Flags [P.], seq 1:3, ack 1, win 229, options [nop,nop,TS val 4277760 ecr 56288216], length 2
12:08:25.198518 IP 10.255.0.13.sieve-filter > pandora.51515: Flags [R], seq 950779251, win 0, length 0

As you can see, the way things actually behave under stress can differ between OSes and from what you might expect; tools can mislead, and the results can be problematic for application performance and reliability.

Careful analysis will expose anomalies, giving you the opportunity to diagnose and resolve the true root cause of the problem you are investigating.

In a later article I'll look at dynamic kernel tracing to track these changes; yes, fun with DTrace.