About me and this Blog

Thursday, 11 August 2016

Kernel Tracing Qmax on Solaris – Part 2



Following on from Part 1, we will further enhance our script to trace incoming connections through the kernel to the application.

Before doing that we will tweak the script to provide a CSV output. We do this using a BEGIN probe to print the header and then update the probes. We will also split the probes. Where there is a match, multiple probes are triggered in the order listed in the script. We can therefore say we are only interested in a specific port and then do the heavy lifting later on.

It is also important that you clear variables once you finish with them so as not to fill up the kernel buffers (you will loose traces if you don't; given enough probes firing).

#!/usr/sbin/dtrace -Cs

#pragma D option quiet

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/ip.h>

BEGIN
{
        printf("TS,Date,Action,Source IP,Source Port,Dest Port,Q0,Q,Qmax\n");
}

fbt:ip:tcp_conn_request:entry
{
        self->tcpq = *(tcp_t**)((char*)arg0+0x28);
        self->lport = ntohs(*(uint16_t*)((char*)self->tcpq->tcp_connp+0x10a));
}

fbt:ip:tcp_conn_request:entry
/self->lport == $1/
{
        this->srcAddr = 0;
        this->srcPort = 0;

        printf("%d,%Y,%s,0x%08x,%d,%d,%d,%d,%d\n",
                timestamp, walltimestamp, "syn",
                this->srcAddr, this->srcPort,
                self->lport,
                self->tcpq->tcp_conn_req_cnt_q0,
                self->tcpq->tcp_conn_req_cnt_q,
                self->tcpq->tcp_conn_req_max
        );
}

fbt:ip:tcp_conn_request:return
/self->lport/
{
        self->tcpq = 0;
        self->lport = 0;
}

Whilst in tcp_conn_request() we don't actually have a connection from the client; the kernel is about to build one.

It is the message in the call that is actually the incoming packet to process. IP has dealt with it; now it is TCP's turn. Remember that the tcpq we have used so far is the listener's connection (the one that is in the listen state), not the eager's (new connection); so looking at that is not going to be productive.

We could probe another function or we could decode the message; tcp_accept_comm() is called by tcp_conn_request() to accept the connection providing a few things are satisfied such as Qmax not been reached. If we are only interested in connections that aren't dropped due to, e.g. Q==Qmax, then this would be a good place. However, if we also want to capture failed connections, we need to do the processing ourselves.

So, let's decode the message :D

Looking at the source for tcp_conn_request() we can see that arg1 (mp, the mblk_t) is the message. Not surprising as this is a STREAMS module. Further down we can see where the data is and the format of the data. i.e. we have a pointer to the IP header here.

ipha = (struct ip *)mp->b_rptr;

As mblk_t is a standard structure we can easily map this within our DTrace script. We can therefore change the setting of srcAddr to the following:

        this->mblk = (mblk_t *)arg1;
        this->iphdr = (struct ip*)this->mblk->b_rptr;
        this->srcAddr = this->iphdr->ip_src.s_addr;

If we then run the script and connect to the server running on 172.16.170.133 from client 172.16.170.1 we get the following:

root@sol10-u9-t4# ./tcpq_ex7.d 22
TS,Date,Action,Source IP,Source Port,Dest Port,Q0,Q,Qmax
180548302609318,2016 Aug 11 14:37:02,syn,0x85aa10ac,0,22,0,0,8

As you can see, the hex address is that of the destination and not the source. Given that an IP header is a standard format (defined in rfc791) this is a bit surprising. So, what is going on?

A Slight Detour

The IP header is defined in /usr/include/netinet/ip.h. It is as expected. Whilst I won't go into all the steps to find the cause, here is the quick version.

The DTrace compiler uses the C pre-processor, but compiles it's own code for the kernel to use. DTrace should be able to import C headers to allow you to use names, etc, but clearly something has gone wrong.

One test I did is to create my own header using the constituent parts and a key tweak; one struct doesn't use fields (bitmasks) but both are the same size.

#ifndef MYIP_H
#define MYIP_H

#include <sys/types.h>

typedef struct {
        uchar_t         c1;
        uchar_t         c2,c3,c4;
        ushort_t        s1,s2,s3,s4;
        uint32_t        src,dst;
} myip_t;

typedef struct {
        uchar_t         c1_1:4,c1_2:4;
        uchar_t         c2,c3,c4;
        ushort_t        s1,s2,s3,s4;
        uint32_t        src,dst;
} myip2_t;

#endif

Then, if we run this through a simple C program that just prints the sizes all looks good:

#include <stdio.h>
#include "myip.h"
int main(int argc, char *argv[]) {
        printf("myip_t : %u\n", sizeof(myip_t));
        printf("myip2_t: %u\n", sizeof(myip2_t));
        return 0;
}

Which gives the following output:

root@sol10-u9-t4# ./a.out
myip_t : 20
myip2_t: 20

Then within the BEGIN probe of the DTrace we do the same (not forgetting to include the header):

        printf("myip_t : %d\n", sizeof(myip_t));
        printf("myip2_t: %d\n", sizeof(myip2_t));

Which gives the following output:

root@sol10-u9-t4# ./tcpq_ex8.d -I `pwd` 22
TS,Date,Action,Source IP,Source Port,Dest Port,Q0,Q,Qmax
myip_t : 20
myip2_t: 24

This is the problem. A few more tests show that the DTrace compiler cannot handle fields (bitmasks) in the C structs correctly, at least in the situations I've tested (me'thinks a bug). I will leave it as an exercise for the reader to analyse this further.

Back to the Task at Hand

As the IP header is well-defined we can just define the offset without further analysis.

this->srcAddr = *(uint32_t*)((char*)(this->iphdr)+0xc);

This yields the correct IP address:

root@sol10-u9-t4# ./tcpq_ex9.d 22
TS,Date,Action,Source IP,Source Port,Dest Port,Q0,Q,Qmax
181523558976884,2016 Aug 11 14:53:17,syn,0x01aa10ac,0,22,0,0,8

As the header length of the IP header and the TCP header has the bitmask after the port, we should be able to extract the port quite easily (setting srcPort to the result):

        this->ihl = 4 * ((int)*((char*)(this->iphdr)) & 0xf);
        this->tcphdr = (struct tcphdr*)((char*)(this->iphdr)+(this->ihl));
        this->srcPort = ntohs(this->tcphdr->th_sport);

This yields the following output:

root@sol10-u9-t4# ./tcpq_ex10.d 22
TS,Date,Action,Source IP,Source Port,Dest Port,Q0,Q,Qmax
181903328660708,2016 Aug 11 14:59:37,syn,0x01aa10ac,40950,22,0,0,8

Which is confirmed via netstat:

root@sol10-u9-t4# netstat -an | fgrep 40950
172.16.170.133.22    172.16.170.1.40950   29312      0 49232      0 ESTABLISHED

Completion of the handshake

As you can guess, there are various locations that we can probe the kernel in order to mark events. In the case of the completion of the three-way handshake, I'm going to go for tcp_send_conn_ind() in usr/src/uts/common/inet/tcp/tcp_tpi.c. The key function is actual tcp_rput_data() in usr/src/stand/lib/tcp/tcp.c where the state is changed to EST.

Anyway, in tcp_send_conn_ind() arg0 is the listener and arg1 is a mblk_t containing a pointer to a struct T_conn_ind. However, as we have already processed the original syn packet, we also have an eager tcp_t. We can extract that, and use the offsets we know to get the address and port of the peer.

We can combine probes where there is common processing. In this particular kernel arg0 is the same.

fbt:ip:tcp_conn_request:entry,
fbt:ip:tcp_send_conn_ind:entry
{
        self->tcpq = *(tcp_t**)((char*)arg0+0x28);
        self->lport = htons(*(uint16_t*)((char*)self->tcpq->tcp_connp+0x10a));
}

We can now predicate all the other probes within the kernel thread to where the port maps what we are after. This includes a simple way to label the action. We can also change the variable context from 'this' (lexical scope) to 'self' (thread scope) to pass the parameters that are in different probes but the same thread.

First, tcp_conn_request(), which should be familiar:

fbt:ip:tcp_conn_request:entry
/self->lport == $1/
{
        self->action = "syn";
        this->mblk = (mblk_t *)arg1;
        this->iphdr = (struct ip*)this->mblk->b_rptr;

        this->ihl = 4 * ((int)*((char*)(this->iphdr)) & 0xf);
        this->tcphdr = (struct tcphdr*)((char*)(this->iphdr)+(this->ihl));

        self->srcAddr = *(uint32_t*)((char*)(this->iphdr)+0xc);
        self->srcPort = ntohs(this->tcphdr->th_sport);
}

Then tcp_send_conn_ind():

fbt:ip:tcp_send_conn_ind:entry
/self->lport == $1/
{
        self->action = "syn-aa";
        this->mblk = (mblk_t *)arg1;
        this->tconnind = (struct T_conn_ind*)(this->mblk->b_rptr);
        this->tcpp = *(tcp_t**)(((char*)this->tconnind)+(this->tconnind->OPT_offset));

        self->srcPort = ntohs(*(uint16_t*)((char *)this->tcpp->tcp_connp+0x108));
        self->srcAddr = *(uint32_t*)((char*)this->tcpp->tcp_connp+0x104);
}

The printout then is common to both, since all of it is data driven from the previous stages:

fbt:ip:tcp_conn_request:entry,
fbt:ip:tcp_send_conn_ind:entry
/self->lport == $1/
{
        printf("%d,%Y,%s,0x%08x,%d,%d,%d,%d,%d\n",
                timestamp, walltimestamp,
                self->action,
                self->srcAddr, self->srcPort,
                self->lport,
                self->tcpq->tcp_conn_req_cnt_q0,
                self->tcpq->tcp_conn_req_cnt_q,
                self->tcpq->tcp_conn_req_max
        );
}

Finally we clear the buffers to tidy up:

fbt:ip:tcp_conn_request:return,
fbt:ip:tcp_send_conn_ind:return
/self->lport/
{
        self->tcpq = 0;
        self->lport = 0;
        self->action = 0;
        self->srcAddr = 0;
        self->secPort = 0;
}

If we run this on a connection we can see the two parts. Notice the value of Q0. On entry it is one as we have an embryonic connection which is about to transition to Q.

root@sol10-u9-t4# ./tcpq_ex11.d 22
TS,Date,Action,Source IP,Source Port,Dest Port,Q0,Q,Qmax
185255847646301,2016 Aug 11 15:55:29,syn,0x01aa10ac,41010,22,0,0,8
185255847845639,2016 Aug 11 15:55:29,syn-aa,0x01aa10ac,41010,22,1,0,8

If we fill up Q so it equals Qmax (using my test script from a previous blog) and retry, accepting a connection on the queue part way through to free the kernel backlog we get this, clearly showing the remote servers tcp retries.

root@sol10-u9-t4# ./tcpq_ex11.d 2000
TS,Date,Action,Source IP,Source Port,Dest Port,Q0,Q,Qmax
185458485779874,2016 Aug 11 15:58:52,syn,0x01aa10ac,50066,2000,0,2,2
185459488194513,2016 Aug 11 15:58:53,syn,0x01aa10ac,50066,2000,0,2,2
185461492294103,2016 Aug 11 15:58:55,syn,0x01aa10ac,50066,2000,0,2,2
185465496397317,2016 Aug 11 15:58:59,syn,0x01aa10ac,50066,2000,0,1,2
185465496517076,2016 Aug 11 15:58:59,syn-aa,0x01aa10ac,50066,2000,1,1,2

As you can see, the max'ed out queue resulted in the connection taking 185465496517076 minus 185458485779874 = 7010737202 ns (or 7.01 seconds) to establish the connection.

In the next part we will look at extending this into the application, using both syscall tracing and userspace tracing.

As always, please can you provide feedback so I can improve the blog.

No comments:

Post a Comment