Thursday, 2 June 2016

Working round problems with DTrace



In this article I'm going to introduce an awesome bit of technology developed by Sun Microsystems (now part of Oracle) called DTrace. Along the way I'll also look at setting up a privileged role to perform DTrace operations.

DTrace allows you to instrument any part of the system from internal kernel functions, through syscalls, then into the userland processes (all in the *same* script). In short, if you wish to diagnose a performance issue, this is the tool to use. Not only that, but the probes have zero impact if you do not use them, and virtually no impact (very low latency) when you do use them.
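As a taste, here is a minimal, illustrative D script that counts system calls by executable name and exits after five seconds (it needs root, or the DTrace privileges we set up below; aggregations are printed automatically on exit):

```d
/* count system calls by executable name */
syscall:::entry
{
        @counts[execname] = count();
}

/* stop after 5 seconds */
tick-5sec
{
        exit(0);
}
```

Save it as, say, countcalls.d and run it with /usr/sbin/dtrace -s countcalls.d.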

Think of it as low-impact, simplified kernel debugging; without debug traps and the associated race conditions (remember truss and tracing the X server!).

We also have destructive DTrace, which allows you (with restrictions) to modify stuff.

Just as with many tools, it can also be used for other things. It is an awesome tool for reverse engineering (you can monitor live systems in real time) and for evil (it is just as easy to sniff for credentials, for example; and, as part of the OS, it can be “below the radar”).

You can find information on DTrace on Oracle's website e.g. the user guide http://docs.oracle.com/cd/E19253-01/819-5488/ and the reference http://docs.oracle.com/cd/E19253-01/817-6223/ .

If you wish to delve deeper, I would recommend Brendan Gregg's website http://www.brendangregg.com and his DTrace books.

Once we have set up the role I'll go through how I used DTrace to solve a problem I was having. There was a Java application, supplied and maintained by a vendor, running on a box we maintain. Unfortunately there were problems, and as part of my diagnostics I wanted to enable verbose GC logging to see if there was an issue in this area.

Due to the politics of the situation, I wasn't permitted to modify the application or use 'application tools' like jstat, and the vendor wouldn't agree to do it either at that time. I was permitted to use OS tools; and DTrace was an OS tool... so that was my way in.

Setting up a role

There are a few ways you can use DTrace. The easiest is to run it as root. However, as we are security conscious, we want a least-privilege setup.

You can also assign certain DTrace privileges directly to other users. However, you may not want a user who can log in remotely to have privileged access by default.

Finally, you can create a role. In this case you may need to consider other factors (giving a user privileges over other users, for example), but a role cannot log in remotely... basically, risk-assess your situation and decide.

So, let's create a role.

root@sol10-u9-t4# roleadd -m -K defaultpriv=basic,dtrace_proc,dtrace_user,dtrace_kernel,proc_owner -d /export/home/dbguser dbguser
64 blocks
root@sol10-u9-t4# rolemod -s /bin/bash dbguser

There appears to be a bug in Solaris (the version I'm using) in that, in order to use the tick probe (in the profile provider), you also need the dtrace_kernel privilege for the probe to fire. As we are tracing other users' processes we also need proc_owner. Note that you can only trace processes whose owners don't have more privileges than you; see the man page privileges(5).

Let's allow a normal user to access that role:

root@sol10-u9-t4# usermod -K roles=dbguser paul

If another user attempts to su to that role it will fail:

joe@sol10-u9-t4$ su - dbguser
Password:
Roles can only be assumed by authorized users
su: Sorry

But we can access it:

paul@sol10-u9-t4$ su - dbguser
Password:
Oracle Corporation      SunOS 5.10      Generic Patch   January 2005
dbguser@sol10-u9-t4$ ppriv $$
3974:   -bash
flags = <none>
        E: basic,dtrace_kernel,dtrace_proc,dtrace_user,proc_owner
        I: basic,dtrace_kernel,dtrace_proc,dtrace_user,proc_owner
        P: basic,dtrace_kernel,dtrace_proc,dtrace_user,proc_owner
        L: all

Solving my problem

I needed to find evidence to prove whether or not various areas could be the cause of some problems in a live environment. One area of interest was the Java JVM, which was running on Solaris 10.

A bit of background on Java JVMs. One thing that surprisingly few people fully appreciate is that a Java application will regularly freeze (halt). It is part of the design. This is done by the VM Thread and is colloquially known as a stop-the-world (STW) event; these events are usually very short in duration. Technically it is known as a safepoint, and it is only initiated within the VM Thread.

One of the main users of this is the Java garbage collector (GC). For example, the CMS collector will freeze the JVM during the initial-mark and remark phases. So, if the JVM isn't tuned well, or is having memory or demand issues, this can cause longer and/or more frequent GC events, and thus more time spent frozen and less time doing productive work; in serious situations this can cascade into complete failure of the app.

Within DTrace we have a number of providers. There are two that perform function boundary tracing, fbt for kernel functions (this one is really fun to play with) and the pid provider for userland functions. There are many other providers, such as syscall, io, sched, etc.
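You can explore what probes each provider offers with dtrace -l. A couple of illustrative invocations (the PID 1234 is a placeholder; substitute a real Java process ID):

```
root@sol10-u9-t4# dtrace -l -P syscall | head
root@sol10-u9-t4# dtrace -l -n 'pid1234:libjvm::entry' | wc -l
```

The first lists the syscall provider's probes; the second counts how many libjvm functions the pid provider can instrument in that process.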

Secondly, if we look at the OpenJDK source code (e.g. https://java.net/projects/openjdk6/downloads/), we see a couple of interesting functions:

RuntimeService::record_safepoint_begin()
RuntimeService::record_safepoint_end()

The key files you will want to see, if you wish to examine it further, are:

hotspot/src/share/vm/runtime/safepoint.cpp
hotspot/src/share/vm/runtime/vmthread.cpp
hotspot/src/share/vm/services/runtimeService.cpp

Now, on the (closed-source) JVM we can have a look at the symbols via nm. We see that the closed-source version also has these functions. You can use other techniques to verify the functionality if you wish.

paul@sol10-u9-t4$ /usr/ccs/bin/nm -p libjvm.so | fgrep record_safepoint
0001366368 t __1cORuntimeServicebDrecord_safepoint_synchronized6F_v_
0001369140 t __1cORuntimeServiceUrecord_safepoint_end6F_v_
0001366776 t __1cORuntimeServiceWrecord_safepoint_begin6F_v_

So, let's start writing the DTrace script.

Within DTrace there are loads of ways we can handle this, but let's just look at summing the time we are in a safepoint over a defined interval.

We create a text file (say sumstw.d) starting with the following header:

#!/usr/sbin/dtrace -Cs

#pragma D option quiet

We then add our two probes. The first fires on return from the safepoint-begin function, storing the timestamp (in nanoseconds) in the thread-local variable ts. $1 is the first argument to the script and will be the PID.

pid$1:libjvm:__1cORuntimeServiceWrecord_safepoint_begin6F_v_:return
{
        self->ts = timestamp;
}

The second fires on entry to the safepoint-end function. It is an aggregation (sum) of the difference in the timestamps, i.e. how long we were in a STW event. You can also create frequency distributions using, for example, quantize().

pid$1:libjvm:__1cORuntimeServiceUrecord_safepoint_end6F_v_:entry
/self->ts/
{
        @total = sum(timestamp - self->ts);
        self->ts = 0;
}
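If you would rather see a distribution of individual pause durations than a running total, swap sum() for quantize(); a sketch of the alternative clause:

```d
pid$1:libjvm:__1cORuntimeServiceUrecord_safepoint_end6F_v_:entry
/self->ts/
{
        /* power-of-two histogram of pause durations in nanoseconds */
        @dist = quantize(timestamp - self->ts);
        self->ts = 0;
}
```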

Finally, we use the tick probe to fire at a specified interval; in this case, every second.

tick-1sec
{
        printf("Time: %Y", walltimestamp );
        printa(@total);
        clear(@total);
}

And that's it.
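For convenience, here is the assembled sumstw.d (just the pieces above in one file):

```d
#!/usr/sbin/dtrace -Cs

#pragma D option quiet

pid$1:libjvm:__1cORuntimeServiceWrecord_safepoint_begin6F_v_:return
{
        self->ts = timestamp;
}

pid$1:libjvm:__1cORuntimeServiceUrecord_safepoint_end6F_v_:entry
/self->ts/
{
        @total = sum(timestamp - self->ts);
        self->ts = 0;
}

tick-1sec
{
        printf("Time: %Y", walltimestamp);
        printa(@total);
        clear(@total);
}
```

Remember to chmod +x it if you want to run it directly.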

I have written a simple Java program that just creates and destroys Java objects (forgetting to release a few, i.e. a memory leak) so as to exercise the GC code. If I run it and then compare with the output from a verbose GC log, I can see it all appears to match. NOTE: other things cause STW events, so it will not always match; but in this sense DTrace is better, as I'm looking for all the times when the JVM is frozen.
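The original test program isn't shown; a minimal sketch in the same spirit (the class name testRig matches the command below; the allocation sizes and leak ratio are arbitrary choices of mine) might look like:

```java
import java.util.ArrayList;
import java.util.List;

// Churn the heap: allocate lots of short-lived objects and deliberately
// retain a small fraction of them (the "leak") so the GC has work to do.
public class testRig {
    private static final List<byte[]> leaked = new ArrayList<>();

    // Allocate 'iterations' small arrays, retaining one in every thousand.
    // Returns the number of retained ("leaked") arrays.
    static int churn(int iterations) {
        int retained = 0;
        for (int i = 0; i < iterations; i++) {
            byte[] garbage = new byte[4096]; // eligible for collection immediately
            if (i % 1000 == 0) {
                leaked.add(new byte[4096]);  // never released: the leak
                retained++;
            }
        }
        return retained;
    }

    public static void main(String[] args) {
        while (true) {          // run until interrupted so we can attach DTrace
            churn(1_000_000);
        }
    }
}
```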

So run the JVM first to get the PID:

paul@sol10-u9-t4$
java -verbosegc -server \
        -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
        -Xms32m -Xmx32m \
        -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
        testRig

Start the DTrace script; the numbers are in nanoseconds.

./sumstw.d `pgrep -u paul java`
Time: 2016 May 11 04:49:39
                0
Time: 2016 May 11 04:49:40
         22953072
Time: 2016 May 11 04:49:41
         36557806
Time: 2016 May 11 04:49:42
                0

Comparing we see:

2016-05-11T04:49:38.560+0100: [CMS-concurrent-reset: 0.000/0.000 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2016-05-11T04:49:40.570+0100: [GC [1 CMS-initial-mark: 7258K(16384K)] 22042K(31168K), 0.0227668 secs] [Times: user=0.02 sys=0.00, real=0.02 secs]
2016-05-11T04:49:40.598+0100: [CMS-concurrent-mark: 0.006/0.006 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
2016-05-11T04:49:40.599+0100: [CMS-concurrent-preclean: 0.000/0.000 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2016-05-11T04:49:41.547+0100: [GC [ParNew: 14784K->1600K(14784K), 0.0193115 secs] 22042K->10379K(31168K), 0.0193506 secs] [Times: user=0.03 sys=0.00, real=0.02 secs]
2016-05-11T04:49:41.573+0100: [CMS-concurrent-abortable-preclean: 0.045/0.974 secs] [Times: user=0.08 sys=0.00, real=0.97 secs]
2016-05-11T04:49:41.573+0100: [GC[YG occupancy: 9174 K (14784 K)][Rescan (parallel) , 0.0167713 secs][weak refs processing, 0.0000035 secs] [1 CMS-remark: 8779K(16384K)] 17953K(31168K), 0.0168503 secs] [Times: user=0.02 sys=0.00, real=0.02 secs]
2016-05-11T04:49:41.598+0100: [CMS-concurrent-sweep: 0.008/0.008 secs] [Times: user=0.00 sys=0.00, real=0.01 secs]
2016-05-11T04:49:41.599+0100: [CMS-concurrent-reset: 0.000/0.000 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]

As you can see, via DTrace we are now tracking what we want. So we can go to the live system and, without enabling verbose GC logging or any application tool, trace all the events of interest.

This is the tip of the iceberg when it comes to DTrace. The stuff you can do with this is truly awesome. Enjoy.