When it comes to network security and the performance of network services, an important concept is how the UNIX kernel handles the establishment of TCP connections. Whilst the three-way handshake is commonly known, unless you write communications code (e.g. the listen() call and its backlog parameter), you may not have looked at what is referred to on Solaris as Qmax.
Qmax is incredibly important. As TCP connections come into a listening socket and complete the three-way handshake, they get put on a kernel TCP queue (listed as Q on Solaris). When a (userland) application is ready to receive its next connection, it calls accept() to take the next connection off that queue.
However, the queue cannot grow indefinitely. When a process opens a listening socket using listen(), it supplies a backlog – a limit on the number of outstanding TCP connections. Once this limit is reached, the behaviour depends on the OS: Solaris, for example, will no longer accept new connections at the kernel level – it behaves as if the port were firewalled off and simply ignores the initial SYN packet.
This has numerous performance implications. For example, TCP at the client end will back off and retry. Typically such backoff starts at 1–3 seconds and increases exponentially for each non-response, so for a latency-sensitive business function this can be catastrophic.
From a security point of view, if you can establish new connections at a faster rate than the application can accept them, then you can DoS the system. This isn't as hard as you might think in many cases. Think of a Java application and safepoints (stop-the-world events).
There are also implications for scanning: if a server has multiple IP addresses and you have one listener on the “any” interface – i.e. accepting from all of them – a full-connect scan run in parallel may produce false negatives.
Linux, Solaris and other OSes provide parameters to control these limits, including limits for connections that have not yet completed the three-way handshake.
On Linux, incomplete connections are constrained by /proc/sys/net/ipv4/tcp_max_syn_backlog, and the backlog supplied to listen() is capped by /proc/sys/net/core/somaxconn. Set these via sysctl or sysctl.conf.
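For illustration only, here is a minimal C sketch (my own, not from any particular tool) that reads the two paths above and prints the current values; note that on Linux the backlog passed to listen() is silently capped at somaxconn.

/* Illustrative only: print the Linux listen-queue limits discussed above. */
#include <stdio.h>

static long read_limit(const char *path)
{
    FILE *f = fopen(path, "r");
    long val = -1;

    if (f) {
        if (fscanf(f, "%ld", &val) != 1)
            val = -1;
        fclose(f);
    }
    return val;
}

int main(void)
{
    printf("net.core.somaxconn           = %ld\n",
           read_limit("/proc/sys/net/core/somaxconn"));
    printf("net.ipv4.tcp_max_syn_backlog = %ld\n",
           read_limit("/proc/sys/net/ipv4/tcp_max_syn_backlog"));
    return 0;
}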
On Solaris, incomplete connections are constrained by the /dev/tcp ndd parameter tcp_conn_req_max_q0, and the backlog by tcp_conn_req_max_q. Set these via ndd or /etc/system.
So, how do you look at the queue? It depends on the OS, but let's do the easy case first.
Viewing the counts on Solaris
In the output below we can see that the listener on port 22/tcp has q0, q and max (i.e. Qmax) of 0, 0 and 8 respectively. Q0 represents embryonic connections, Q represents connections waiting to be accepted by the application, and max represents the maximum size Q can grow to.
root@sol10-u9-t4# ndd /dev/tcp tcp_listen_hash | head -2
TCP zone IP addr port seqnum backlog (q0/q/max)
022 ffffffff879e9d00 0 :: 00022 00000019 0/0/8
Viewing the counts on Linux
On Linux we can use ss, for example, to see listening TCP-over-IPv4 sockets:
[root@centos-7-2-t1 qmax]# ss -4tnl
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 128 *:111 *:*
LISTEN 0 10 *:2000 *:*
LISTEN 0 128 *:20048 *:*
LISTEN 0 128 *:22 *:*
For a listening socket, Recv-Q is the number of outstanding TCP connections waiting for the application to accept, and Send-Q is Qmax. The reason is a bit of a cheat in the kernel structures: a listening socket reuses fields that would otherwise be redundant.
We can figure some of this out directly, but it is less obvious. If we look in /proc/net/tcp we see a number of interesting values, especially for listening sockets – but what do they mean?
[root@centos-7-2-t1 qmax]# cat /proc/net/tcp
sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode
0: 00000000:006F 00000000:0000 0A 00000000:00000000 00:00000000 00000000 0 0 18921 1 ffff88003a520000 100 0 0 10 0
1: 00000000:07D0 00000000:0000 0A 00000000:00000001 00:00000000 00000000 0 0 75449 1 ffff88003a525280 100 0 0 10 0
2: 00000000:4E50 00000000:0000 0A 00000000:00000000 00:00000000 00000000 0 0 20578 1 ffff880037790000 100 0 0 10 0
3: 00000000:0016 00000000:0000 0A 00000000:00000000 00:00000000 00000000 0 0 18923 1 ffff88003a520780 100 0 0 10 0
If we look at index 3, we see a local address ending in 0016. This is 22/tcp in hex – i.e. the listening ssh port.
How do we know it is listening? Well, the st column is the state. If we look in include/net/tcp_states.h in the kernel source we find an enumeration with TCP_LISTEN as item 10, i.e. 0x0A.
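To make the decoding concrete, here is a small illustrative C sketch (my own, not code from the kernel or from the demo program below) that walks /proc/net/tcp and prints, in decimal, the local port of every socket whose st field is 0x0A (TCP_LISTEN):

/* Illustrative only: list local ports of sockets in state 0x0A (TCP_LISTEN)
 * by decoding the hex fields of /proc/net/tcp described above. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/net/tcp", "r");
    char line[512];
    unsigned int port, state;

    if (!f) {
        perror("/proc/net/tcp");
        return 1;
    }
    if (!fgets(line, sizeof(line), f)) {   /* skip the header row */
        fclose(f);
        return 0;
    }
    while (fgets(line, sizeof(line), f)) {
        /* local_address is "addr:port" in hex; st is the fourth field */
        if (sscanf(line, " %*d: %*x:%x %*x:%*x %x", &port, &state) == 2 &&
            state == 0x0A)
            printf("listening on port %u/tcp\n", port);
    }
    fclose(f);
    return 0;
}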
What about the rest of the fields? As with the ss command, do they mean something different for a listening socket? The answer lies in the kernel source file net/ipv4/tcp_ipv4.c, in the function get_tcp4_sock(). There we see that the 8th argument to the seq_printf() is rx_queue – i.e. the Recv-Q shown by ss.
	seq_printf(f, "%4d: %08X:%04X %08X:%04X %02X %08X:%08X %02X:%08lX "
			"%08X %5u %8d %lu %d %pK %lu %lu %u %u %d",
		i, src, srcp, dest, destp, sk->sk_state,
		tp->write_seq - tp->snd_una,
		rx_queue,
Further down we see that the last value may be the Qmax value:
		sk->sk_state == TCP_LISTEN ?
			(fastopenq ? fastopenq->max_qlen : 0) :
			(tcp_in_initial_slowstart(tp) ? -1 : tp->snd_ssthresh));
However, that value is only populated if the listener is using TCP Fast Open (see e.g. the sysctl option net.ipv4.tcp_fastopen). So, we cannot always get Qmax directly via this route.
Demo
I wrote a simple program that opens a listening socket on 2000/tcp with a small backlog. It only accepts a connection when I tell it to. As usual, things are subtly different between OSes.
For info, sol10-u9-t4 (10.255.0.17) is the Solaris server, centos-7-2-t1 (10.255.0.13) is the Linux server, and pandora (10.255.0.12) is the client.
Of course I also do something a bit nonsensical; I set the listen queue to one:
l = listen(s, 1);
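For reference, the listener is roughly the following – an illustrative sketch of a tcpq-style program rather than the exact source, but it shows the idea: a tiny backlog, and accept() called only when the operator asks for it.

/* Illustrative sketch of a tcpq-style listener: bind 2000/tcp, listen with a
 * backlog of 1, and only accept() when the operator presses 'a'. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    struct sockaddr_in addr;
    int s, l;

    s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0) {
        perror("socket");
        return 1;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(2000);
    addr.sin_addr.s_addr = htonl(INADDR_ANY);

    if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        return 1;
    }

    l = listen(s, 1);                      /* the deliberately tiny backlog */
    if (l < 0) {
        perror("listen");
        return 1;
    }

    for (;;) {
        int c, ch;

        printf("listening .... (a)ccept, e(x)it >");
        fflush(stdout);

        c = getchar();
        if (c == 'x' || c == EOF)
            break;
        if (c == 'a') {
            /* accept() takes the next completed connection off Q */
            printf("waiting to accept ...");
            fflush(stdout);
            int fd = accept(s, NULL, NULL);
            printf(" got FD %d\n", fd);
        }
        /* swallow the rest of the input line before re-prompting */
        if (c != '\n')
            while ((ch = getchar()) != '\n' && ch != EOF)
                ;
    }

    close(s);
    return 0;
}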
First let's start the listener on Solaris.
paul@sol10-u9-t4$ ./tcpq
listening .... (a)ccept, e(x)it >
Having a look at the queue state via ndd shows that Solaris was wise to the fact that setting qmax to one is a bit silly – it has quietly bumped the maximum to two.
root@sol10-u9-t4# ndd /dev/tcp tcp_listen_hash | egrep '(max|02000)'
TCP zone IP addr port seqnum backlog (q0/q/max)
215 ffffffff897658c0 0 ::ffff:0.0.0.0 02000 00000000 0/0/2
So, now let's start a session, which appears to hang, and look at the result.
[paul@pandora ~]$ nc 10.255.0.17 2000
Solaris has it in Q, pending the application
acceptance:
root@sol10-u9-t4# ndd /dev/tcp tcp_listen_hash | egrep '(max|02000)'
TCP zone IP addr port seqnum backlog (q0/q/max)
215 ffffffff897658c0 0 ::ffff:0.0.0.0 02000 00000001 0/1/2
So let's accept it and see the result:
paul@sol10-u9-t4$ ./tcpq
listening .... (a)ccept, e(x)it >a
waiting to accept ... got FD
listening .... (a)ccept, e(x)it >
[paul@pandora ~]$ nc 10.255.0.17 2000
Hello world
root@sol10-u9-t4# ndd /dev/tcp tcp_listen_hash | egrep '(max|02000)'
TCP zone IP addr port seqnum backlog (q0/q/max)
215 ffffffff897658c0 0 ::ffff:0.0.0.0 02000 00000001 0/0/2
We are no longer on Q as the application has
accepted this connection.
If we now connect two clients without accepting
them we see the following change to the queue:
root@sol10-u9-t4# ndd /dev/tcp tcp_listen_hash | egrep '(max|02000)'
TCP zone IP addr port seqnum backlog (q0/q/max)
215 ffffffff897658c0 0 ::ffff:0.0.0.0 02000 00000003 0/2/2
We now have the situation of Q == Qmax. On Solaris 10, whilst this is the case, new connections will not be accepted. In reality, embryonic connections already on Q0 still complete the handshake but are not moved to Q until there is space; we can see this via kernel tracing (or by reading the OpenSolaris source).
If we now establish a third connection whilst
running a packet capture, we see this in action.
root@sol10-u9-t4# snoop -t d -d e1000g0 port 2000
Using device e1000g0 (promiscuous mode)
0.00000 10.255.0.12 -> sol10-u9-t4 TCP D=2000 S=36419 Syn Seq=2910631562 Len=0 Win=29200 Options=<mss 1460,sackOK,tstamp 1486046 0,nop,wscale 7>
1.00110 10.255.0.12 -> sol10-u9-t4 TCP D=2000 S=36419 Syn Seq=2910631562 Len=0 Win=29200 Options=<mss 1460,sackOK,tstamp 1487048 0,nop,wscale 7>
2.00416 10.255.0.12 -> sol10-u9-t4 TCP D=2000 S=36419 Syn Seq=2910631562 Len=0 Win=29200 Options=<mss 1460,sackOK,tstamp 1489052 0,nop,wscale 7>
4.00408 10.255.0.12 -> sol10-u9-t4 TCP D=2000 S=36419 Syn Seq=2910631562 Len=0 Win=29200 Options=<mss 1460,sackOK,tstamp 1493056 0,nop,wscale 7>
[paul@pandora ~]$ nc 10.255.0.17 2000
Ncat: Connection timed out.
If we repeat this third connection but accept a
connection within the application prior to the timeout, it will get through:
73.53762 10.255.0.12 -> sol10-u9-t4 TCP D=2000 S=36421 Syn Seq=4195269135 Len=0 Win=29200 Options=<mss 1460,sackOK,tstamp 1566591 0,nop,wscale 7>
1.00021 10.255.0.12 -> sol10-u9-t4 TCP D=2000 S=36421 Syn Seq=4195269135 Len=0 Win=29200 Options=<mss 1460,sackOK,tstamp 1567592 0,nop,wscale 7>
0.91266 sol10-u9-t4 -> 10.255.0.12 TCP D=36416 S=2000 Push Ack=3871252411 Seq=1827671955 Len=12 Win=49232 Options=<nop,nop,tstamp 7781354 1279657>
As you can see, whilst we got through, the fact that the kernel queue filled up resulted in this case in a roughly two-second delay (1.00021 + 0.91266 seconds). Depending on when the application accepts relative to the first SYN packet, the exponential backoff in retries can result in a significant delay or outright failure.
Now Linux.
[paul@centos-7-2-t1 qmax]$ ./tcpq
listening .... (a)ccept, e(x)it >
When we look at the queue we see that Linux accepted our qmax of one:
[root@centos-7-2-t1 qmax]# ss -4tln 'sport = 2000'
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port
tcp LISTEN 0 1 *:2000 *:*
So, let's connect from the client, just as we did on Solaris:
[paul@pandora ~]$ nc 10.255.0.13 2000
As we can see we now have a connection in Q:
[root@centos-7-2-t1 qmax]# ss -4tln 'sport = 2000'
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port
tcp LISTEN 1 1 *:2000 *:*
tcp ESTAB 0 0 10.255.0.13:2000 10.255.0.12:51496
If we accept and complete the client connection
we then see that Q is now clear:
[root@centos-7-2-t1 qmax]# ss -4tln 'sport = 2000'
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port
tcp LISTEN 0 1 *:2000 *:*
tcp FIN-WAIT-2 0 0 10.255.0.13:2000 10.255.0.12:51496
Now let's make two connections without an accept, but with tcpdump running as well:
[root@centos-7-2-t1 qmax]# tcpdump -i eno16777728 port 2000
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eno16777728, link-type EN10MB (Ethernet), capture size 65535 bytes
11:45:37.769911 IP 10.255.0.12.51506 > centos-7-2-t1.m0noc.net.sieve-filter: Flags [S], seq 4210720999, win 29200, options [mss 1460,sackOK,TS val 2909683 ecr 0,nop,wscale 7], length 0
11:45:37.769988 IP centos-7-2-t1.m0noc.net.sieve-filter > 10.255.0.12.51506: Flags [S.], seq 2491402835, ack 4210721000, win 28960, options [mss 1460,sackOK,TS val 55031849 ecr 2909683,nop,wscale 7], length 0
11:45:37.770028 IP 10.255.0.12.51506 > centos-7-2-t1.m0noc.net.sieve-filter: Flags [.], ack 1, win 229, options [nop,nop,TS val 2909683 ecr 55031849], length 0
11:45:39.384387 IP 10.255.0.12.51507 > centos-7-2-t1.m0noc.net.sieve-filter: Flags [S], seq 4241347616, win 29200, options [mss 1460,sackOK,TS val 2911297 ecr 0,nop,wscale 7], length 0
11:45:39.384440 IP centos-7-2-t1.m0noc.net.sieve-filter > 10.255.0.12.51507: Flags [S.], seq 2296846025, ack 4241347617, win 28960, options [mss 1460,sackOK,TS val 55033463 ecr 2911297,nop,wscale 7], length 0
11:45:39.384524 IP 10.255.0.12.51507 > centos-7-2-t1.m0noc.net.sieve-filter: Flags [.], ack 1, win 229, options [nop,nop,TS val 2911297 ecr 55033463], length 0
Looks like it ignored my Qmax of one and
accepted both. Let's check on the client:
[root@pandora ~]# ss -a4tn 'dport = 2000'
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port
tcp ESTAB 0 0 10.255.0.12:51506 10.255.0.13:2000
tcp ESTAB 0 0 10.255.0.12:51507 10.255.0.13:2000
.. and the server ..
[root@centos-7-2-t1 ~]# ss -a4tn 'sport = 2000'
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port
tcp LISTEN 2 1 *:2000 *:*
tcp ESTAB 0 0 10.255.0.13:2000 10.255.0.12:51507
tcp ESTAB 0 0 10.255.0.13:2000 10.255.0.12:51506
Let's try one more connection:
[root@centos-7-2-t1 qmax]# tcpdump -i eno16777728 port 2000
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eno16777728, link-type EN10MB (Ethernet), capture size 65535 bytes
11:49:20.326595 IP 10.255.0.12.51510 > centos-7-2-t1.m0noc.net.sieve-filter: Flags [S], seq 3247629228, win 29200, options [mss 1460,sackOK,TS val 3132239 ecr 0,nop,wscale 7], length 0
11:49:20.326672 IP centos-7-2-t1.m0noc.net.sieve-filter > 10.255.0.12.51510: Flags [S.], seq 2163865433, ack 3247629229, win 28960, options [mss 1460,sackOK,TS val 55254406 ecr 3132239,nop,wscale 7], length 0
And on the client:
[root@pandora ~]# tcpdump -i em1 port 2000
11:49:19.677814 IP pandora.51510 > 10.255.0.13.sieve-filter: Flags [S], seq 3247629228, win 29200, options [mss 1460,sackOK,TS val 3132239 ecr 0,nop,wscale 7], length 0
11:49:19.678209 IP 10.255.0.13.sieve-filter > pandora.51510: Flags [S.], seq 2163865433, ack 3247629229, win 28960, options [mss 1460,sackOK,TS val 55254406 ecr 3132239,nop,wscale 7], length 0
It also appears to work. But if we look at the kernel states, something interesting shows up. The client side confirms an established connection:
[root@pandora ~]# ss -a4tn 'dport = 2000'
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port
tcp ESTAB 0 0 10.255.0.12:51510 10.255.0.13:2000
tcp ESTAB 0 0 10.255.0.12:51506 10.255.0.13:2000
tcp ESTAB 0 0 10.255.0.12:51507 10.255.0.13:2000
But the server side lists it as SYN-RECV, even though the packet captures on both sides prove we have completed the three-way handshake.
[root@centos-7-2-t1 ~]# ss -a4tn 'sport = 2000'
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port
tcp LISTEN 2 1 *:2000 *:*
tcp SYN-RECV 0 0 10.255.0.13:2000 10.255.0.12:51510
tcp ESTAB 0 0 10.255.0.13:2000 10.255.0.12:51507
tcp ESTAB 0 0 10.255.0.13:2000 10.255.0.12:51506
If we then accept the first connection on the queue, we see this entry change to ESTAB.
Interestingly, if we try the same thing but send data from the client, the connection that the client thinks is established – but which is marked SYN-RECV on the server – never ACKs that data on the server side; eventually the server clears the connection, sending a reset to the client:
12:07:05.090381 IP 10.255.0.13.sieve-filter > pandora.51515: Flags [S.], seq 950779250, ack 3352895372, win 28960, options [mss 1460,sackOK,TS val 56319818 ecr 4187968,nop,wscale 7], length 0
12:07:05.090471 IP pandora.51515 > 10.255.0.13.sieve-filter: Flags [.], ack 1, win 229, options [nop,nop,TS val 4197652 ecr 56288216], length 0
12:07:08.238307 IP pandora.51515 > 10.255.0.13.sieve-filter: Flags [P.], seq 1:3, ack 1, win 229, options [nop,nop,TS val 4200800 ecr 56288216], length 2
12:07:33.870349 IP pandora.51515 > 10.255.0.13.sieve-filter: Flags [P.], seq 1:3, ack 1, win 229, options [nop,nop,TS val 4226432 ecr 56288216], length 2
12:08:25.198303 IP pandora.51515 > 10.255.0.13.sieve-filter: Flags [P.], seq 1:3, ack 1, win 229, options [nop,nop,TS val 4277760 ecr 56288216], length 2
12:08:25.198518 IP 10.255.0.13.sieve-filter > pandora.51515: Flags [R], seq 950779251, win 0, length 0
As you can see, the way things actually behave under stress differs between OSes, may differ from what you expect, tools can mislead, and the results can be problematic for application performance and reliability.
Careful analysis will expose anomalies, giving you the opportunity to diagnose and resolve the true root cause of the problem you are investigating.
In a later article I'll look at dynamic kernel
tracing to track these changes; yes, fun with DTrace.