Search This Blog

Wednesday, 27 July 2016

Kernel Tracing Qmax on Solaris – Part 1

Time to poke around the kernel and do a simple bit of reversing.

Whether you are a sysadmin, penetration tester, or reverse engineer, if you don't know about Solaris DTrace you will want to. It allows for low-latency instrumentation of the system. This includes function boundary tracing (FBT) of pretty much any function in the kernel and any application running on the server.

People have been able to solve complex problems in applications on other platforms simply because the application also ran on Solaris and they could use DTrace on that port of the application.

You can inspect memory in the kernel and application based on events, and much more. You can even get the kernel to lie to userland processes (with filtering on what it will lie to).

But we are getting ahead of ourselves.

As an example, which I'll cover a very small part here, I had an application that had multiple serious issues that even the vendor was struggling to solve.

To identify the problem areas the most powerful script I wrote was a single DTrace script. This single script was able to trace all the individual requests from the arrival of the initial connection and the three-way handshake using kernel FBT (including when Q = Qmax), through to the system call (syscall tracing), on to the listening thread within the Java app, and then the handover to the worker thread (user FBT) and keep an eye on all Java safepoints and then tell me what was blocking the threads (filesystem issues – sometimes closing a file descriptor took several seconds). All these stages had issues, and sometimes the Java safepoints aggravated the situation; but the picture it painted was invaluable – we could prove what was the cause before fixing it, not just say something was the probable cause, and do this directly on the live system. Tracing all of this without impacting the live application.

To start we will look at monitoring the incoming connection (the first syn packet), and whilst we are in the listener perimeter (mutual exclusion of the listener in interrupt context) we will report on the value of q0/q/qmax. i.e. we know if a connection is ignored because we can prove q==qmax. Using ndd only gives you a point in time, here we are provably showing what the settings are at the time that specific connection is evaluated within the kernel.

It helps to have the Solaris source code (or a version of it) – see

Our first DTrace scripts

In source file usr/src/stand/lib/tcp/tcp.c we have tcp_conn_request(). It's first parameter is a tcp_t, which is defined in /usr/include/inet/tcp.h. As a reverse engineer we could figure this out without the source code, but I will leave that as an exercise.

The tcp_t has tcp_conn_req_cnt_q0, tcp_conn_req_cnt_q, and tcp_conn_req_max for Q0, Q and Qmax respectively.

Within this structure we also have a struct conn_s (tcp_connp) pointer, which is defined in /usr/include/inet/ipclassifier.h. Then in tcp_connp we have a union of structs, such that the local port is at u_port.tcpu_ports.tcp_lport.
So, the script would be:

#!/usr/sbin/dtrace -Cs

#pragma D option quiet

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/ip.h>

        self->tcpq = (tcp_t*)arg0;

        printf("tcpq: %p\n", self->tcpq );

        printf("q0/q/qmax: %d/%d/%d\n",
        printf("conn_lport: %d\n", (uint32_t)self->tcpq->tcp_connp->u_port.tcpu_ports.tcpu_lport );

If we run the script and then create a new connection to 22/tcp (e.g. using nc or telnet with a dst port) we see this:

# ./tcpq_ex1.d
dtrace: error on enabled probe ID 1 (ID 29873: fbt:ip:tcp_conn_request:entry): invalid alignment (0x603184280103) in action #5 at DIF offset 32

So, what is wrong. First, lets comment out all the printf's except the one that prints the address of arg0 and re-try:

root@sol10-u9-t4# ./tcpq_ex2.d
tcpq: ffffffff879e9b00

If we then look for the address using ndd we see that this doesn't look right:

root@sol10-u9-t4# ndd /dev/tcp tcp_listen_hash | egrep '(TCP|00022)'
    TCP            zone IP addr         port  seqnum   backlog (q0/q/max)
022 ffffffff879e9d00 0 :: 00022 00000081 0/0/8

After a bit of diagnostics it appears that we may have made a false assumption. After all, we are basing the functionality on an open source version of Solaris and a major version of Solaris has many major changes between updates, and I'm running update 9 (on Intel).

root@sol10-u9-t4# head -1 /etc/release
                    Oracle Solaris 10 9/10 s10x_u9wos_14a X86

If we look at the open source code we see that all versions have the following as the first test. It is therefore reasonable to expect that the closed source version will have that first (or near the first) test.

if (tcp->tcp_conn_req_cnt_q >= tcp->tcp_conn_req_max) {

So, lets run up a kernel debugger and have a look at the assembly of tcp_conn_request(). Here is the truncated and annotated output.

root@sol10-u9-t4# mdb -k
Loading modules: ..
> ::dis tcp_conn_request
tcp_conn_request+0xc:           movq   %rdi,%r15           ; arg0 into r15
tcp_conn_request+0xf:           movq   %r12,-0x20(%rbp)
tcp_conn_request+0x13:          movq   0x28(%r15),%r12     ; arg0 offset 0x28 into r12
tcp_conn_request+0x39:          cmpl   $-0x3,0x20(%r12)     ; some other test
tcp_conn_request+0x3f:          je     +0x21    <tcp_conn_request+0x60>
tcp_conn_request+0x41:          call   +0x7e70c2f       <freemsg>
tcp_conn_request+0x74:          movl   0x1d0(%r12),%edx
tcp_conn_request+0x7c:          cmpl   %edx,0x1cc(%r12)     ; cmp q and qmax?
tcp_conn_request+0x84:          jge    +0x8fd   <tcp_conn_request+0x981>

It appears that arg0 may not be tcp_t after all, but if we look at the offset 0x28 from arg0 we will find a pointer to it. In this case, there may be a public structure this maps to, there may not. I will leave that as an exercise for the reader.

Lets re-write the setting of self->tcpq to be the following and try again.

self->tcpq = *(tcp_t**)((char*)arg0+0x28);

This time it looks better:

root@sol10-u9-t4# ./tcpq_ex3.d
tcpq: ffffffff879e9d00
q0/q/qmax: 0/0/8
conn_lport: 0

So, Qmax etc look good but conn_lport doesn't.

We know that the tcp_t address is right from ndd, and that from tcp_lookup_listener_ipv4() we are looking in the right place for the local port, so in this case it may be that the metadata DTrace is using is not the actual structure (something is different). In this case lets go back to the kernel debugger to see if we can find something that could be a “port 22” in the conn_s structure.

> ffffffff879e9d00::print -t tcp_t tcp_connp
struct conn_s *tcp_connp = 0xffffffff879e9b00
> 0xffffffff879e9b00/300B
0xffffffff879e9b00:             0       0       0       0       0       0       0       0       3       0       0       0      
                0       0       0       0       0       0       0       0       0       0       0       0       9       0      
                0       0       0       0       0       0       0       0       0       0       0       0       0       0      
                0       0       0       16      6       0       0       0       0       0       0       0       0       0

This is at offset 266 (0x10a); so lets change this value to the following and re-run.

printf("conn_lport: %d\n", *(uint16_t*)((char*)self->tcpq->tcp_connp+0x10a) );

This time we get a value, but it doesn't look right:

root@sol10-u9-t4# ./tcpq_ex4.d
tcpq: ffffffff879e9d00
q0/q/qmax: 0/0/8
conn_lport: 5632

As we are on Intel and we are printing a network port, there is a good change this is in network byte order. Simple to test; we pass it through htons().

printf("conn_lport: %d\n", htons(*(uint16_t*)((char*)self->tcpq->tcp_connp+0x10a)) );

This time all the values look right:

root@sol10-u9-t4# ./tcpq_ex5.d
tcpq: ffffffff879e9d00
q0/q/qmax: 0/0/8
conn_lport: 22

As this is event driven and we haven't added any predicates we can just try another port to confirm things look ok. e.g. 111/tcp:

tcpq: ffffffff800d8240
q0/q/qmax: 0/0/64
conn_lport: 111

And a quick check with ndd confirms this:

root@sol10-u9-t4# ndd /dev/tcp tcp_listen_hash | egrep '(TCP|00111)'
    TCP            zone IP addr         port  seqnum   backlog (q0/q/max)
367 ffffffff800d8240 0 ::ffff: 00111 00000010 0/0/64

A quick note, the reason q0 is not at least one. We are evaluating the parameters on entry to the function that will update them.

If we take a step back and think what we have done. We are now dynamically tracing incoming SYN packets caused by a network interrupt whilst within the mutual exclusion of the listener perimeter in interrupt context on a running (live) system. We are therefore certain as to the values of q0/q/qmax at that point in time. How awesome is that.

In the next part we will update the script so we also report on the origin of the packet. We will also add some predicates to only look at a particular listener.

I would appreciate some feedback on my articles, at least let me know you are reading them. Ideally, let me know where I could improve the Blog.