In this article I'm going to introduce an
awesome bit of technology developed by Sun Microsystems (now part of Oracle),
called DTrace. Adding to the mix I'll look at setting up a privileged role to
perform DTrace operations.
DTrace allows you to instrument almost any part of the
system, from internal kernel functions, through syscalls, into userland
processes (all in the *same* script). In short, if you wish to diagnose a
performance issue, this is the tool to use. Not only that, but the probes have
zero impact when they are not enabled, and very low overhead when they are.
Think of it as low-impact, simplified
kernel debugging, without debug traps and the associated race conditions (remember truss and tracing the X server!).
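To give a tiny flavour before we get to the real problem, here is a minimal sketch of my own (nothing to do with the later Java example) that watches two very different layers in one script: it counts every system call on the box by executable and call name, then uses a timed probe to print the summary and exit.
#!/usr/sbin/dtrace -qs

/* count every system call, keyed by executable name and syscall name */
syscall:::entry
{
        @calls[execname, probefunc] = count();
}

/* after ten seconds, print the summary and exit */
tick-10sec
{
        printa(@calls);
        exit(0);
}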
We also have destructive DTrace, which allows
you (with restrictions) to modify the running system rather than just observe it.
As with many tools, it can also be used
for other things. It is an awesome tool for reverse engineering (you can monitor
live systems in real time) and for evil (it is just as easy to sniff for
credentials, for example; and as part of the OS it can sit "below the radar").
Once we have set up the role I'll go through
how I solved a problem I was having by using DTrace. There was a Java
application, supplied and maintained by a vendor, running on a box we
maintain. Unfortunately there were problems, and as part of my diagnostics I
wanted to enable verbose GC logging to see if there was a problem in this area.
Due to the politics of the situation, I wasn't
permitted to modify the application or use 'application tools' like jstat, and
the vendor wouldn't agree to do it either at that time. I was permitted to use
OS tools; and DTrace was an OS tool... so that was my way in.
Setting up a role
There are a few ways you can use DTrace. The
easiest is to run it as root. However, as we are security conscious we want
a least-privilege setup.
You can also assign certain DTrace privileges
to other users. However, you may not want a user that can log in remotely to
have default privileged access.
Finally, you can create a role. In this case
you may need to consider other factors (giving a user privileges over other
users' processes, for example), but a role cannot be logged into directly; you have
to su to it from an authorized user. Basically, risk-assess your situation and decide.
So, let's create a role.
root@sol10-u9-t4# roleadd -m -K defaultpriv=basic,dtrace_proc,dtrace_user,dtrace_kernel,proc_owner -d /export/home/dbguser dbguser
64 blocks
root@sol10-u9-t4# rolemod -s /bin/bash dbguser
There appears to be a bug in Solaris (the
version I'm using) in that, in order to use the tick probe (in the profile
provider), you also need the dtrace_kernel
privilege for the probe to fire. As we are tracing processes belonging to other users we also need proc_owner. Note that you can only
trace processes whose owners don't have more privileges than you; see the
privileges(5) man page.
Let's allow a normal user to access that role:
root@sol10-u9-t4# usermod -K roles=dbguser paul
If another user attempts to su to that role it
will fail:
joe@sol10-u9-t4$ su - dbguser
Password:
Roles can only be assumed by authorized users
su: Sorry
But we can access it:
paul@sol10-u9-t4$ su - dbguser
Password:
Oracle Corporation   SunOS 5.10   Generic Patch   January 2005
dbguser@sol10-u9-t4$ ppriv $$
3974:   -bash
flags = <none>
        E: basic,dtrace_kernel,dtrace_proc,dtrace_user,proc_owner
        I: basic,dtrace_kernel,dtrace_proc,dtrace_user,proc_owner
        P: basic,dtrace_kernel,dtrace_proc,dtrace_user,proc_owner
        L: all
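As a quick sanity check that the role really can use the profile provider (this is what the dtrace_kernel note above was about), a hypothetical one-liner run from within the role, which should print once and then exit after about a second:
dbguser@sol10-u9-t4$ /usr/sbin/dtrace -qn 'tick-1sec { printf("profile provider OK\n"); exit(0); }'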
Solving my problem
I needed to find evidence to prove whether or
not various areas could be the cause of some problems in a live environment.
One area of interest was the Java JVM, which was running on Solaris 10.
A bit of background on the Java JVM. One thing
that surprisingly few people fully appreciate is that a Java application will
regularly freeze (halt); it is part of the design. This is colloquially known as
a stop-the-world (STW) event, and these events are usually very short in duration.
Technically it is known as a safepoint, and it is only initiated from within the VM Thread.
One of the main users of this mechanism is the
garbage collector (GC). For example, the CMS collector will freeze the JVM
during the initial-mark and the remark phases. So, if the JVM isn't tuned well,
or is having memory or demand issues, this can cause longer and/or more
frequent GC events, and thus more time spent frozen and less time doing
productive work; this can cascade into complete failure of the app in serious
situations.
Within DTrace we have a number of providers.
There are two that perform function boundary tracing: fbt for kernel functions (this
one is really fun to play with) and the pid
provider for userland functions. There are many other providers, such as syscall, io, sched, etc.
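As a purely illustrative taste of fbt (assuming the role above, since it needs dtrace_kernel), this one-liner counts which processes are driving calls to the kernel memory allocator; kmem_alloc is a real Solaris kernel function, but treat the example as a sketch rather than something you'd leave running:
/usr/sbin/dtrace -n 'fbt::kmem_alloc:entry { @[execname] = count(); }'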
Back to our problem. In the HotSpot JVM the start and end of each safepoint are recorded by the following functions (you can see this in the OpenJDK source):
RuntimeService::record_safepoint_begin()
RuntimeService::record_safepoint_end()
The key files to look at, if you wish
to examine this further, are:
hotspot/src/share/vm/runtime/safepoint.cpp
hotspot/src/share/vm/runtime/vmthread.cpp
hotspot/src/share/vm/services/runtimeService.cpp
Now, on the (closed-source) JVM we can have a
look at the symbols via nm.
We see that the closed-source version also has these functions. You can use
other techniques to verify the functionality if you wish.
paul@sol10-u9-t4$ /usr/ccs/bin/nm -p libjvm.so | fgrep record_safepoint
0001366368 t __1cORuntimeServicebDrecord_safepoint_synchronized6F_v_
0001369140 t __1cORuntimeServiceUrecord_safepoint_end6F_v_
0001366776 t __1cORuntimeServiceWrecord_safepoint_begin6F_v_
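Rather than copying mangled names around, you can also ask DTrace itself to list the matching probes on a running JVM; the PID below is hypothetical, and the same wildcard pattern could equally be used in the probe descriptions of the script that follows:
/usr/sbin/dtrace -ln 'pid12345:libjvm:*record_safepoint*:entry'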
So, let's start writing the DTrace script.
Within DTrace there are plenty of ways we could
handle this, but let's just look at summing the time spent in safepoints over
a defined interval.
We create an empty text file (say traceSTW-sum.d) with the following header:
#!/usr/sbin/dtrace -Cs
#pragma D option quiet
We then add our two probes. The first fires
on return from the begin function, storing the current timestamp (in nanoseconds) in
the thread-local variable ts.
$1
is the first argument passed to the script, which will be the PID of the JVM.
pid$1:libjvm:__1cORuntimeServiceWrecord_safepoint_begin6F_v_:return
{
        self->ts = timestamp;
}
The second fires on entry into the end
function. It aggregates (sums) the difference between the two timestamps,
i.e. how long we were in an STW event. You can also create frequency
distributions using, for example, quantize(); a sketch of that variant appears after the full script.
pid$1:libjvm:__1cORuntimeServiceUrecord_safepoint_end6F_v_:entry
/self->ts/
{
        @total = sum(timestamp - self->ts);
        self->ts = 0;
}
Finally, we use the tick probe to fire at a
specified interval, in this case every second; it prints the running total and then clears it.
tick-1sec
{
        printf("Time: %Y", walltimestamp);
        printa(@total);
        clear(@total);
}
And that's it.
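For the quantize() variant mentioned above, only the end probe needs to change (and the tick probe would then printa(@dist) instead of @total); a sketch:
pid$1:libjvm:__1cORuntimeServiceUrecord_safepoint_end6F_v_:entry
/self->ts/
{
        /* distribution of individual pause lengths, instead of a running sum */
        @dist = quantize(timestamp - self->ts);
        self->ts = 0;
}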
I have written a simple Java program that just
creates and discards Java objects (keeping references to a few of them, i.e. a
deliberate memory leak) so as to exercise the GC code. If I run it and then compare with the output from a
verbose GC log, I can see it all appears to match. NOTE: other things also cause STW
events, so it will not always match, but in this sense the DTrace approach is better, as
I'm looking for all the times when the JVM is frozen.
So run the JVM first to get the PID:
paul@sol10-u9-t4$ java -verbosegc -server \
    -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
    -Xms32m -Xmx32m \
    -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
    testRig
Then start the DTrace script; the numbers reported are in
nanoseconds.
./traceSTW-sum.d `pgrep -u paul java`
…
Time: 2016 May 11 04:49:39
0
Time: 2016 May 11 04:49:40
22953072
Time: 2016 May 11 04:49:41
36557806
Time: 2016 May 11 04:49:42
0
Comparing with the verbose GC log, we see:
2016-05-11T04:49:38.560+0100: [CMS-concurrent-reset: 0.000/0.000 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2016-05-11T04:49:40.570+0100: [GC [1 CMS-initial-mark: 7258K(16384K)] 22042K(31168K), 0.0227668 secs] [Times: user=0.02 sys=0.00, real=0.02 secs]
2016-05-11T04:49:40.598+0100: [CMS-concurrent-mark: 0.006/0.006 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
2016-05-11T04:49:40.599+0100: [CMS-concurrent-preclean: 0.000/0.000 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2016-05-11T04:49:41.547+0100: [GC [ParNew: 14784K->1600K(14784K), 0.0193115 secs] 22042K->10379K(31168K), 0.0193506 secs] [Times: user=0.03 sys=0.00, real=0.02 secs]
2016-05-11T04:49:41.573+0100: [CMS-concurrent-abortable-preclean: 0.045/0.974 secs] [Times: user=0.08 sys=0.00, real=0.97 secs]
2016-05-11T04:49:41.573+0100: [GC[YG occupancy: 9174 K (14784 K)][Rescan (parallel) , 0.0167713 secs][weak refs processing, 0.0000035 secs] [1 CMS-remark: 8779K(16384K)] 17953K(31168K), 0.0168503 secs] [Times: user=0.02 sys=0.00, real=0.02 secs]
2016-05-11T04:49:41.598+0100: [CMS-concurrent-sweep: 0.008/0.008 secs] [Times: user=0.00 sys=0.00, real=0.01 secs]
2016-05-11T04:49:41.599+0100: [CMS-concurrent-reset: 0.000/0.000 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
…
As you can see, via DTrace we are now tracking
exactly what we want. So we can go to the live system and, without enabling verbose
GC logging or any application tool, trace all the events of interest.
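One simple extension (my own sketch, with an arbitrary 50 ms threshold) is to log each long pause individually as it happens on the live system. Note that this clause must appear before the existing end-probe clause, since clauses for the same probe fire in the order they appear in the script and the existing one zeroes self->ts.
/* place before the clause that clears self->ts; the 50 ms (50000000 ns) threshold is arbitrary */
pid$1:libjvm:__1cORuntimeServiceUrecord_safepoint_end6F_v_:entry
/self->ts && timestamp - self->ts > 50000000/
{
        printf("%Y: long STW pause of %d ns\n", walltimestamp, timestamp - self->ts);
}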
This is the tip of the iceberg
when it comes to DTrace. The stuff you can do with this is truly awesome.
Enjoy.