Time to poke around the kernel and do a simple
bit of reversing.
Whether you are a sysadmin, penetration tester,
or reverse engineer, if you don't know about Solaris DTrace you will want to.
It allows for low-latency instrumentation of the system. This includes function
boundary tracing (FBT) of pretty much any function in the kernel and any
application running on the server.
People have been able to solve complex problems
in applications on other platforms simply because the application also ran on
Solaris and they could use DTrace on that port of the application.
You can inspect memory in the kernel and
application based on events, and much more. You can even get the kernel to lie
to userland processes (with filtering on what it will lie to).
But we are getting ahead of ourselves.
As an example, which I'll cover a very small
part here, I had an application that had multiple serious issues that even the
vendor was struggling to solve.
To identify the problem areas the most powerful
script I wrote was a single DTrace
script. This single script was able to trace all the individual requests from
the arrival of the initial connection and the three-way handshake using kernel
FBT (including when Q = Qmax), through to the system call (syscall tracing), on
to the listening thread within the Java app, and then the handover to the
worker thread (user FBT) and keep an
eye on all Java safepoints and then
tell me what was blocking the threads (filesystem issues – sometimes closing a
file descriptor took several seconds). All these stages had issues, and
sometimes the Java safepoints aggravated the situation; but the picture it
painted was invaluable – we could prove
what was the cause before fixing it, not just say something was the probable
cause, and do this directly on the live system. Tracing all of this without
impacting the live application.
To start we will look at monitoring the
incoming connection (the first syn packet), and whilst we are in the listener
perimeter (mutual exclusion of the listener in interrupt context) we will
report on the value of q0/q/qmax. i.e. we know if a connection is ignored
because we can prove q==qmax. Using ndd
only gives you a point in time, here we are provably showing what the settings
are at the time that specific connection is evaluated within the kernel.
It helps to have the Solaris source code (or a
version of it) – see https://github.com/illumos/illumos-gate.
Our
first DTrace scripts
In source file usr/src/stand/lib/tcp/tcp.c we have tcp_conn_request().
It's first parameter is a tcp_t,
which is defined in /usr/include/inet/tcp.h.
As a reverse engineer we could figure this out without the source code, but I
will leave that as an exercise.
The tcp_t
has tcp_conn_req_cnt_q0,
tcp_conn_req_cnt_q,
and tcp_conn_req_max
for Q0, Q and Qmax respectively.
Within this structure we also have a struct conn_s (tcp_connp) pointer, which is defined
in /usr/include/inet/ipclassifier.h.
Then in tcp_connp
we have a union of structs, such that the local port is at u_port.tcpu_ports.tcp_lport.
So, the script would be:
#!/usr/sbin/dtrace -Cs
#pragma D option quiet
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/ip.h>
fbt:ip:tcp_conn_request:entry
{
self->tcpq =
(tcp_t*)arg0;
printf("tcpq:
%p\n", self->tcpq );
printf("q0/q/qmax: %d/%d/%d\n",
self->tcpq->tcp_conn_req_cnt_q0,
self->tcpq->tcp_conn_req_cnt_q,
self->tcpq->tcp_conn_req_max
);
printf("conn_lport: %d\n",
(uint32_t)self->tcpq->tcp_connp->u_port.tcpu_ports.tcpu_lport );
}
If we run the script and then create a new
connection to 22/tcp (e.g. using nc
or telnet
with a dst port) we see this:
# ./tcpq_ex1.d
dtrace: error on enabled probe ID 1 (ID 29873:
fbt:ip:tcp_conn_request:entry): invalid alignment (0x603184280103) in action #5
at DIF offset 32
So, what is wrong. First, lets comment out all
the printf's except the one that prints the address of arg0 and re-try:
root@sol10-u9-t4# ./tcpq_ex2.d
tcpq: ffffffff879e9b00
If we then look for the address using ndd we see that this doesn't
look right:
root@sol10-u9-t4# ndd /dev/tcp tcp_listen_hash | egrep
'(TCP|00022)'
TCP zone IP addr port
seqnum backlog (q0/q/max)
022 ffffffff879e9d00 0 :: 00022
00000081 0/0/8
After a bit of diagnostics it appears that we
may have made a false assumption. After all, we are basing the functionality on
an open source version of Solaris and a major version of Solaris has many major
changes between updates, and I'm running update 9 (on Intel).
root@sol10-u9-t4# head -1 /etc/release
Oracle
Solaris 10 9/10 s10x_u9wos_14a X86
If we look at the open source code we see that
all versions have the following as the first test. It is therefore reasonable
to expect that the closed source version will have that first (or near the
first) test.
if (tcp->tcp_conn_req_cnt_q >= tcp->tcp_conn_req_max) {
So, lets run up a kernel debugger and have a
look at the assembly of tcp_conn_request().
Here is the truncated and annotated output.
root@sol10-u9-t4# mdb -k
Loading modules: …..
> ::dis tcp_conn_request
...
tcp_conn_request+0xc:
movq %rdi,%r15 ; arg0 into r15
tcp_conn_request+0xf:
movq %r12,-0x20(%rbp)
tcp_conn_request+0x13:
movq 0x28(%r15),%r12 ; arg0 offset 0x28 into r12
…
tcp_conn_request+0x39:
cmpl $-0x3,0x20(%r12) ; some other test
tcp_conn_request+0x3f:
je +0x21 <tcp_conn_request+0x60>
tcp_conn_request+0x41:
call +0x7e70c2f <freemsg>
…
tcp_conn_request+0x74:
movl 0x1d0(%r12),%edx
tcp_conn_request+0x7c:
cmpl %edx,0x1cc(%r12) ; cmp q and qmax?
tcp_conn_request+0x84:
jge +0x8fd <tcp_conn_request+0x981>
It appears that arg0 may not be tcp_t
after all, but if we look at the offset 0x28 from arg0 we will find a pointer
to it. In this case, there may be a public structure this maps to, there may
not. I will leave that as an exercise for the reader.
Lets re-write the setting of self->tcpq to be the following and try
again.
self->tcpq = *(tcp_t**)((char*)arg0+0x28);
This time it looks better:
root@sol10-u9-t4# ./tcpq_ex3.d
tcpq: ffffffff879e9d00
q0/q/qmax: 0/0/8
conn_lport: 0
So, Qmax etc look good but conn_lport doesn't.
We know that the tcp_t address is right from ndd, and that from tcp_lookup_listener_ipv4()
we are looking in the right place for the local port, so in this case it may be
that the metadata DTrace is using is not the actual structure (something is
different). In this case lets go back to the kernel debugger to see if we can
find something that could be a “port 22” in the conn_s structure.
> ffffffff879e9d00::print -t tcp_t tcp_connp
struct conn_s *tcp_connp = 0xffffffff879e9b00
> 0xffffffff879e9b00/300B
0xffffffff879e9b00:
0 0 0
0 0 0
0 0 3
0 0 0
0
0 0 0
0 0 0
0 0 0
0 0 9
0
…
0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0
0 16 6
0 0 0
0 0 0
0 0 0
…
This is at offset 266 (0x10a); so lets change
this value to the following and re-run.
printf("conn_lport: %d\n", *(uint16_t*)((char*)self->tcpq->tcp_connp+0x10a)
);
This time we get a value, but it doesn't look
right:
root@sol10-u9-t4# ./tcpq_ex4.d
tcpq: ffffffff879e9d00
q0/q/qmax: 0/0/8
conn_lport: 5632
As we are on Intel and we are printing a
network port, there is a good change this is in network byte order. Simple to
test; we pass it through htons().
printf("conn_lport: %d\n", htons(*(uint16_t*)((char*)self->tcpq->tcp_connp+0x10a))
);
This time all the values look right:
root@sol10-u9-t4# ./tcpq_ex5.d
tcpq: ffffffff879e9d00
q0/q/qmax: 0/0/8
conn_lport: 22
As this is event driven and we haven't added
any predicates we can just try another port to confirm things look ok. e.g.
111/tcp:
tcpq: ffffffff800d8240
q0/q/qmax: 0/0/64
conn_lport: 111
And a quick check with ndd confirms this:
root@sol10-u9-t4# ndd /dev/tcp tcp_listen_hash | egrep
'(TCP|00111)'
TCP zone IP addr port
seqnum backlog (q0/q/max)
367 ffffffff800d8240 0
::ffff:0.0.0.0 00111 00000010 0/0/64
A quick note, the reason q0 is not at least one. We are
evaluating the parameters on entry to the function that will update them.
If we take a step back and think what we have
done. We are now dynamically tracing incoming SYN packets caused by a network
interrupt whilst within the mutual exclusion of the listener perimeter in
interrupt context on a running (live) system. We are therefore certain as to
the values of q0/q/qmax at that point in time. How awesome is that.
In the next part we will update the script so
we also report on the origin of the packet. We will also add some predicates to
only look at a particular listener.