illumos hackathon 2012: user-land types for DTrace
At the illumos hackathon last week, Robert Mustacchi and I prototyped better support for manipulating user-land structures. As anyone who’s used it knows, DTrace is currently very kernel-centric — this both reflects the reality of how operating systems and DTrace are constructed, and the origins of DTrace itself in the Solaris Kernel Group. Discussions at dtrace.conf(12) this spring prompted me to chart a path to better user-land support. This prototype of copyin-automagic was a first step.
What we implemented was a new ‘user’ keyword to denote that a type is a user-land structure. For example, let’s say we had the address of a 4-byte integer; today we’d access its value using copyin():
this->i = *(int *)copyin(this->addr, sizeof (int));
With our prototype, this gets simpler and more intuitive:
this->i = *(user int *)addr;
The impact is even more apparent when it comes to pointer chasing through structures. Today if we need to get to the third element of a linked list, the D code would look like this:
this->p = (node_t *)copyin(this->addr, sizeof (node_t)); this->p = (node_t *)copyin((uintptr_t)this->p->next, sizeof (node_t)); this->p = (node_t *)copyin((uintptr_t)this->p->next, sizeof (node_t)); trace(this->p->value);
Again, it’s much simpler with the user keyword:
this->p = (user node_t *)this->addr; trace(this->p->next->next->value);
D programs are compiled into a series of instructions — DIF — that are executed by the code DTrace framework when probes fire. We use the new keyword to generate instructions that load from the address space of the currently executing process rather than that of the kernel.
Adding a new keyword feels a little clunky, and I’m not sure if it’s the right way forward, but it does demonstrate a simpler model for accessing user-land structures, and was a critical first step. We already have three main sources of user-land values; the next steps are to make use of those.
Types for system calls
Arguments to system calls (mostly) have well-known types. Indeed those types are encoded in truss in excruciating and exotic detail. We should educate DTrace about those types. What I’d propose is that we create a single repository of all system call metadata. This could be, for example, and XML file that listed every system call, its name, code, subcode, types, etc. Of course we could use that as the source of type information for the syscall provider, but we could also use that to generate everything from the decoding tables in truss to the libc and kernel stubs for system calls.
As an aside, there are a couple of system calls whose parameter types — ioctl(2) is the obvious example. It would be interesting to assess the utility of an ioctl provider whose probes would be the various codes that are passed as the second parameter. Truss already has this information; why not DTrace?
Types for the pid provider
Another obvious source of type information is the process being traced. When a user specifies the -p or -c option to dtrace(1M) we’re examining a particular process, and that process can have embedded type information. We could use those types and implicitly identify them as belonging to user-space rather than the kernel. Pid provider probes correspond to the entry and return from user-land functions; we could identify the appropriate types for those parameters. We could simplify it further by doing all type handling in libdtrace (in user-land) rather than pushing the types into the kernel.
Types for USDT
User-land statically defined tracing — tracepoints explicitly inserted into code — can already have types associated with them. A first step would be to implicitly identify those types as belonging to user-land. I believe that this could even be done without adversely affecting existing scripts.
Thorny issues
While there are some clear paths forward, there are some tricky issues that remain. In particular that processes can have different data models — 32-bit v. 64-bit — presents a real challenge. Both the width of a load and offsets into structures change depending on the process that’s running. There might be some shortcuts for system calls, and we might be able to constrain the problem for the pid provider by requiring -p or -c, or we might have to compile our D program twice and then choose which version to run based on the data model of the process. In the spirit of the hackathon, Robert and I punted for our ‘user’ keyword prototype, but these problems need to be well understood and sufficiently solved.
Next steps
I’ll be working on some of these problems on the back burner; I’m especially interested in the Grand Unified Syscall Project — an idea I’ve been touting for more years than I care to relate — to bring types to the syscall provider. If you have ideas for user-land tracing with DTrace, or want to work on anything I’ve mention, leave a comment or drop me a note.
My New DTrace Favorite
The mantra as we initially developed DTrace was to make impossible things possible, not easy things easier. Since codifying that, the tendency toward the latter in developer tools has been apparent. Our focus on the former however has left certain usability burrs that stymie newbies, and annoy vets. Much of the DTrace development of late has focused on a middle category: simplifying hard things that should be simple.
The print() action
In that vein, my colleague, Eric Schrock, added the print() action to DTrace back in November. Before then, my workflow used to look like this:
- “I want to trace
xdr_bytes()“ - Drop into mdb(1) to print out the structure I want to examine
- Write a D script to emit the members I’m interested in:
fbt::xdr_bytes:entry
{
trace(args[0]->x_base);
trace(args[0]->x_handy);
}
Repeat times a thousand, allow for errors, iterate on chased pointers, and sum up the time. With Eric’s fix, DTrace is a lot easier to use:
- “I want to trace
xdr_bytes()“ - Boom:
fbt::xdr_bytes:entry
{
print(args[0]);
}
print() for translated types
Of course, in addition to tracing any kernel function, DTrace has stable probes that identify points of well-known, (reasonably) well-documented activity. Those probes don’t correspond to kernel functions so mdb isn’t as useful. The workflow is a little more annoying:
- “I want to trace IO”
- either
- google “dtrace io provider”, find the argument and the type definitions
- or (more often)
dtrace -lvn io:::to see the type namesless /usr/lib/dtrace/io.dto find the type definitions- write the D script:
io:::start
{
trace(args[1]->dev_name);
trace(args[1]->dev_pathname);
}
Repeat another thousand, much more annoying times.
Unfortunately, print() wasn’t as helpful in this case:
# dtrace -n 'io:::start{ print(*args[1]); }'
dtrace: invalid probe specifier io:::start{ trace(*args[1]); }: print( ) may not be applied to a dynamic expression
Stable probes such as the io:::start probe can use translated arguments, synthetic types that DTrace populates with stable data from the unstable underlying implementation. For example, despite very different implementations, the io:::start provider exposes the same data on illumos, FreeBSD, Mac OS X, and Oracle Solaris. Parameters are effectively translated one at a time; the * (dereference) operator was invalid for these expressions.
In a recent push to illumos, I added this support:
# dtrace -n 'io:::start{ print(*args[1]); }'
dtrace: description 'io:::start' matched 6 probes
CPU ID FUNCTION:NAME
0 11307 bdev_strategy:start devinfo_t {
int dev_major = 0x62
int dev_minor = 0x40
int dev_instance = 0x1
string dev_name = [ "sd" ]
string dev_statname = [ "sd1" ]
string dev_pathname = [ "/devices/pci@0,0/pci15ad,1976@10/sd@0,0:a" ]
}
Between Eric’s addition and my own, my most commonly encountered DTrace annoyances are no more.
Behind the scenes
For the DTrace super-nerds out there, I thought I’d share a bit of the implementation. In order to trace() or print() an expression, it needs to exist in memory somewhere. Translated types don’t exist in memory, rather individual members are translated statically. We can see this in the output of the DTrace DIF (D intermediate form) disassembler:
# dtrace -n 'io:::start{ trace(args[1]->dev_name); }' -Se
DIFO 0x75e940 returns string (unknown) by ref (size 256)
OFF OPCODE INSTRUCTION
00: 25000001 setx DT_INTEGER[0], %r1 ! 0x0
01: 28000101 ldga DT_VAR(0), %r1, %r1
02: 0e010002 mov %r1, %r2
03: 25000103 setx DT_INTEGER[1], %r3 ! 0xe0
04: 07020302 add %r2, %r3, %r2
05: 22020002 ldx [%r2], %r2
06: 25000003 setx DT_INTEGER[0], %r3 ! 0x0
07: 0f020300 cmp %r2, %r3
08: 1200000b be 11
09: 0e000002 mov %r0, %r2
10: 1100000c ba 12
11: 25000202 setx DT_INTEGER[2], %r2 ! 0x1
12: 10020000 tst %r2
13: 12000011 be 17
14: 26000102 sets DT_STRING[1], %r2 ! "nfs"
15: 0e020002 mov %r2, %r2
16: 1100001e ba 30
17: 25000302 setx DT_INTEGER[3], %r2 ! 0xfffffffffc031110
18: 22020002 ldx [%r2], %r2
19: 0e010003 mov %r1, %r3
20: 25000404 setx DT_INTEGER[4], %r4 ! 0xa8
21: 07030403 add %r3, %r4, %r3
22: 22030003 ldx [%r3], %r3
23: 33000000 flushts
24: 31000003 pushtv DT_TYPE(0), %r3 ! DT_TYPE(0) = D type
25: 2f001403 call DIF_SUBR(20), %r3 ! getmajor
26: 25000504 setx DT_INTEGER[5], %r4 ! 0x70
27: 08030403 mul %r3, %r4, %r3
28: 07020302 add %r2, %r3, %r2
29: 22020002 ldx [%r2], %r2
30: 23000002 ret %r2
In this case, this logic comes from /usr/lib/io.d, and — in particular — this translation:
dev_name = B->b_dip == NULL ? "nfs" :
stringof(`devnamesp[getmajor(B->b_edev)].dn_name);
To implement allow trace() and print() to work on translated types, we now generate code to first use the DTrace build-in alloca() function to get some scratch space, and then generate the translation for each member of the translated type. For example:
# dtrace -n 'io:::start{ print(*args[1]); }' -Se
DIFO 0x9466b0 returns D type (struct) by ref (size 780)
OFF OPCODE INSTRUCTION
00: 25000001 setx DT_INTEGER[0], %r1 ! 0x0
01: 28000101 ldga DT_VAR(0), %r1, %r1
02: 25000102 setx DT_INTEGER[1], %r2 ! 0x30c
03: 33000000 flushts
04: 31000002 pushtv DT_TYPE(0), %r2 ! DT_TYPE(0) = D type
05: 2f000f02 call DIF_SUBR(15), %r2 ! alloca
06: 0e010003 mov %r1, %r3
07: 25000204 setx DT_INTEGER[2], %r4 ! 0xe0
08: 07030403 add %r3, %r4, %r3
09: 22030003 ldx [%r3], %r3
10: 25000004 setx DT_INTEGER[0], %r4 ! 0x0
11: 0f030400 cmp %r3, %r4
12: 1300000f bne 15
13: 0e000003 mov %r0, %r3
14: 11000010 ba 16
15: 25000303 setx DT_INTEGER[3], %r3 ! 0x1
...
316: 2f001603 call DIF_SUBR(22), %r3 ! ddi_pathname
317: 25001204 setx DT_INTEGER[18], %r4 ! 0x20c
318: 07020404 add %r2, %r4, %r4
319: 25000e05 setx DT_INTEGER[14], %r5 ! 0x100
320: 3b030504 copys %r3, %r5, %r4
321: 23000002 ret %r2
More to come
Usability was a big topic at dtrace.conf a few months ago. Expect to see more contributions along this theme.
BTrace: DTrace for Java… ish
DTrace first peered into Java in early 2005 thanks to an early prototype by Jarod Jenson that led eventually to the inclusion of USDT probes in the HotSpot JVM. If you want to see where, say, the java.net.SocketOutputStream.write() method is called, you can simply run this DTrace script:
hotspot$target:::method-entry
/copyinstr(arg1, arg2) == "java/net/SocketOutputStream" &&
copyinstr(arg3, arg4) == "write"/
{
jstack(50, 8000);
}
And that will work as long as you rememember to start your JVM with the -XX:+ExtendedDTraceProbes option or you use the jinfo utility to enable it after the fact. And as long as you don’t mind a crippling performance penalty (hint: you probably do).
Inspired by dtrace.conf a few weeks ago, I wanted to sketch out what the real Java provider would look like:
java$target:java.net.SocketOutputStream:write:entry
{
jstack(50,8000);
}
And check it out:
# jdtrace.pl -p $(pgrep java) -n 'java$target:java.net.SocketOutputStream::entry{ jstack(50,8000); }'
dtrace: script '/tmp/jdtrace.19092/jdtrace.d' matched 0 probes
CPU ID FUNCTION:NAME
0 64991 Java_com_sun_btrace_BTraceRuntime_dtraceProbe0:event
libbtrace.so`Java_com_sun_btrace_BTraceRuntime_dtraceProbe0+0xbb
com/sun/btrace/BTraceRuntime.dtraceProbe0(Ljava/lang/String;Ljava/lang/String;II)I
com/sun/btrace/BTraceRuntime.dtraceProbe(Ljava/lang/String;Ljava/lang/String;II)I
com/sun/btrace/BTraceUtils$D.probe(Ljava/lang/String;Ljava/lang/String;II)I
com/sun/btrace/BTraceUtils$D.probe(Ljava/lang/String;Ljava/lang/String;)I
java/net/SocketOutputStream.$btrace$jdtrace$probe1(Ljava/lang/String;Ljava/lang/String;)V
java/net/SocketOutputStream.write([BII)V
sun/nio/cs/StreamEncoder.writeBytes()V
sun/nio/cs/StreamEncoder.implFlushBuffer()V
sun/nio/cs/StreamEncoder.implFlush()V
sun/nio/cs/StreamEncoder.flush()V
java/io/OutputStreamWriter.flush()V
java/io/BufferedWriter.flush()V
java/io/PrintWriter.newLine()V
java/io/PrintWriter.println()V
java/io/PrintWriter.println(Ljava/lang/String;)V
com/delphix/appliance/server/ham/impl/HAMonitorServerThread.run()V
java/util/concurrent/ThreadPoolExecutor$Worker.runTask(Ljava/lang/Runnable;)V
java/util/concurrent/ThreadPoolExecutor$Worker.run()V
java/lang/Thread.run()V
StubRoutines (1)
libjvm.so`__1cJJavaCallsLcall_helper6FpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThread__v_+0x21d
libjvm.so`__1cCosUos_exception_wrapper6FpFpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThread__v2468_v_+0x27
libjvm.so`__1cJJavaCallsMcall_virtual6FpnJJavaValue_nGHandle_nLKlassHandle_nMsymbolHandle_5pnGThread__v_+0x149
libjvm.so`__1cMthread_entry6FpnKJavaThread_pnGThread__v_+0x113
libjvm.so`__1cKJavaThreadDrun6M_v_+0x2c6
libjvm.so`java_start+0x1f2
libc.so.1`_thrp_setup+0x9b
libc.so.1`_lwp_start
Obviously there's something fishy going on. First, we're using perl -- the shibboleth of fake-o-ware -- and there's this BTrace stuff in the output.
Faking it with BTrace
BTrace is a dynamic instrumentation tool for Java; it is both inspired by DTrace and contains some DTrace integration. The perl script above takes the DTrace syntax and generates a DTrace script and a BTrace-enabled Java source file.
Like DTrace, BTrace lets you specify the points of instrumentation in your Java program as well as the actions to take. Here's what our generated source file looks like.
import com.sun.btrace.annotations.*;
import static com.sun.btrace.BTraceUtils.*;
@BTrace
public class jdtrace {
@OnMethod(clazz="java.net.SocketOutputStream", method="write", location=@Location(Kind.ENTRY))
public static void probe1(@ProbeClassName String c,
@ProbeMethodName String m) {
String name = "entry";
String p = Strings.strcat(c, Strings.strcat(":",
Strings.strcat(m, Strings.strcat(":", name))));
D.probe(p, "");
}
}
Note that we specify where to trace (this can be a regular expression), and then take the action of joining the class, method, and "entry" string into a single string that we pass to the D.probe() method that causes a BTrace USDT probe to fire.
Here's what the D script looks like:
btrace$target:::event
{
this->__jd_arg = copyinstr(arg0);
this->__jd_mod = strtok(this->__jd_arg, ":");
this->__jd_func = strtok(NULL, ":");
this->__jd_name = strtok(NULL, ":");
}
btrace$target:::event
/((this->__jd_mod == "java.net.SocketOutputStream" &&
this->__jd_func == "write" &&
this->__jd_name == "entry"))/
{
jstack(50,8000);
}
It's pretty simple. We parse the string that was passed to D.probe(), and disassemble it into the DTrace notion of module, function, and name. We then use that information so that the specified actions are executed as appropriate (we could have specified different Java methods to probe, and different actions to take for each). Here's the code if you're interested.
This isn't the real Java provider, but is it close enough? Unfortunately not. The most glaring problem is that BTrace sometimes renders my Java process unresponsive. Other times it leaves instrumentation behind with no way of extracting it. The word "safe" appears as the third word on the BTrace website ("BTrace is safe"), but apparently there's still some way to go to achieve the requisite level of safety.
A Better BTrace
BTrace is an interesting tool for examining Java programs, but one obvious obstacle is that the programs are pretty cumbersome to write. With BTrace, we should be able to write a simple one-liner to see where we are when the java.net.SocketOutputStream.write() method is called, but instead we have to write a fairly cumbersome program:
import com.sun.btrace.annotations.*;
import static com.sun.btrace.BTraceUtils.*;
@BTrace
public class TraceWrite {
@OnMethod(clazz="java.net.SocketOutputStream", method="write", location=@Location(Kind.ENTRY))
public static void onWrite() {
jstack();
}
}
DTrace-inspired syntax would let users iterate much more quickly:
$ dbtrace -p $(pgrep -n java) -n 'java.net.SocketOutputStream:write:entry{ jstack(); }'
java.net.SocketOutputStream.write(SocketOutputStream.java)
sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202)
sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:272)
sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:276)
sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:122)
java.io.OutputStreamWriter.flush(OutputStreamWriter.java:212)
java.io.BufferedWriter.flush(BufferedWriter.java:236)
java.io.PrintWriter.newLine(PrintWriter.java:438)
java.io.PrintWriter.println(PrintWriter.java:585)
java.io.PrintWriter.println(PrintWriter.java:696)
com.delphix.appliance.server.ham.impl.HAMonitorServerThread.run(HAMonitorServerThread.java:56)
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
java.lang.Thread.run(Thread.java:662)
With BTrace, you can trace nearly arbitrary information about a program's state, but instead of doing something like this:
dbtrace -p $(pgrep -n java) -n 'java.net.SocketOutputStream:write:entry{ printFields(this.impl); }'
You have to do this:
import com.sun.btrace.annotations.*;
import com.sun.btrace.AnyType;
import static com.sun.btrace.BTraceUtils.Reflective.*;
@BTrace
public class TraceWrite {
@OnMethod(clazz="java.net.SocketOutputStream", method="write", location=@Location(Kind.ENTRY))
public static void onWrite(@Self Object self) {
Object impl = get(field(classOf(self), "impl"), self);
printFields(impl);
}
}
$ ./bin/btrace $(pgrep -n java) TraceWrite.java
{server=null, port=1080, external_address=null, useV4=false, cmdsock=null, cmdIn=null, cmdOut=null, applicationSetProxy=false, timeout=0, trafficClass=0, shut_rd=false, shut_wr=false, socketInputStream=java.net.SocketInputStream@9993a1, fdUseCount=0, fdLock=java.lang.Object@ab5443, closePending=false, CONNECTION_NOT_RESET=0, CONNECTION_RESET_PENDING=1, CONNECTION_RESET=2, resetState=0, resetLock=java.lang.Object@292936, fd1=null, anyLocalBoundAddr=null, lastfd=-1, stream=false, socket=Socket[addr=/127.0.0.1,port=38832,localport=8765], serverSocket=null, fd=java.io.FileDescriptor@50abcc, address=/127.0.0.1, port=38832, localport=8765, }
BTrace needs a language that enables rapid iteration — piggybacking on Java is holding it back — and it needs some hard safety guarantees. With those, many developers and support engineers would use BTrace as part of their daily work — we certainly would here at Delphix.
Back to DTrace. Even with a useable solution for Java only, the ability to have lightweight and focused tracing for Java (and other dynamic languages) could be highly valuable. We’ll see how far BTrace can take us.
dtrace.conf(12) wrap-up
For the second time in as many quadrennial dtrace.confs, I was impressed at how well the unconference format worked out. Sharing coffee with the DTrace community, it was great to see some of the oldest friends of DTrace — Jarod Jenson, Stephen O’Grady, Jonathan Adams to name a few — and to put faces to names — Scott Fritchie, Dustin Sallings, Blake Irvin, etc — of the many new additions to the DTrace community. You can see all the slides and videos; these are my thoughts and notes on the day.
Bryan provided a typically eloquent review of the state of the community. DTrace development is alive and well — after a lull while Oracle’s acquisition of Sun settled in — with new support for a variety of languages and runtimes, and new products that rely heavily on DTrace as a secret sauce. Bryan laid out some important development goals, areas where many have started straying from the edges of the completed DTrace features into the partially complete or starkly missing. We all then set to work hammering out a loose schedule for the day; I’ll admit that at first I was worried that we’d have too many listeners and not enough presenters, but the schedule quickly filled — and with more topics than we’d end up having time to cover.
User-land CTF and Dynamic Translators
DTrace, from its inception, has been a systemic analysis tool, but the earliest development focused on kernel observability — not a surprise since Bryan, Mike, and I developed it while working in the Solaris kernel development. After its use spread (quickly) beyond the kernel team, use shifted more and more to features focused on understanding C and C++ applications in user-land, and then to applications written in a variety of higher-level languages — Java, Ruby, Perl, Javascript, Erlang, etc. User-land Statically Defined Tracing (USDT) is the DTrace facility that enables rich tracing of higher-level languages. It was a relatively late addition to DTrace (integrated in 2004, well after the initial integration in 2003), and since then we’ve learned a lot about what we got right, what we got wrong, and where it’s rough — in some cases very rough — around the edges.
In his opening remarks, Bryan identified USDT improvements as a key area for the community’s focus. In DTrace development we tried to focus on making the impossible possible rather than making the possible easier. In its current form, some things are still impossible with DTrace, namely consumption of type structures from user-land programs; stable, non-privileged use of DTrace; and support for different runtime versions. Dave Pacheco and I took the first slot on the schedule and spoke (at length — sorry) about solutions to these problems.
While others had the benefit of a bit more time to prepare, I did have the advantage of spending many years idly contemplating the problem space and possible solutions. On the subject of user-land type information (in the form of CTF), I identified the key parts of the code that would would need some work. For the USDT enhancements, we discussed dynamic translators — D code that would be linked and executed at runtime, contrasted with today’s static translators that are compiled into a D program — how they would address the problem, and how these ideas could be extended to the kernel (for once, user-land is actually a bit ahead).
I’ll go into the details of our off the cuff proposals, and delve into the code to firm up those ideas in a future blog post. Beyond the extensive implementation work we laid out, the next step is to gather the most complicated, extant USDT providers and proposals for other providers, and figure out what they should look like in the new, dynamic translator world.
The D Language
Next up, my long-time colleague, DTrace contributor, Eric Schrock led the discussion on D language additions. The format of a D program is heavily tied to DTrace’s implementation: all clauses must trace a fixed amount of data, and infinite loops are forbidden. For this reason, D lacks the backward branches needed for traditional looping, subroutines for common code, and if/else clauses for control flow. Each of these has a work-alike — unrolled loops, macros, and predicates or the ternary operator — but their absence renders D confusing to some — especially those unaware of the motivation. Further, the D language need not necessarily hold the underlying implementation so central.
Eric discussed some proposals for how each might be addressed, and I noted that it would be possible to create a prototype environment where we could try out these “D++” features by compiling into D work-alikes. The next step is to identify the most complicated D scripts, and see what they might look like for various incarnations of those language features.
Work with DTrace
The next few sessions focused not on changes to DTrace, but interesting work done using DTrace:
John Thompson of Sony talked about their port of DTrace to the Playstation Vita (!). Sony developers are given access to DTrace, but found it to be unfamiliar and unapproachable. John spoke his attempts to remedy this by replacing D with a C++-like interface which he implemented by replacing the D compiler with Clang.
My Fishworks colleague, Brendan Gregg, showed some of beautiful visualizations they’ve been developing at Joyent, and talked about the analyses those visualizations enabled. As always, it was fascinating stuff. If you don’t read Brendan’s blog, you really should. Long-time DTrace advocate, Theo Schlossnagle, talked about the visualizations they’re doing in Circonus — also fascinating stuff for anyone thinking about how to present system activity in comprehensible ways. Richard Elling showed the DTrace-based visualizations Nexenta used at VMworld to rave reviews.
Mark Cavage presented Joyent’s work bringing DTrace to node.js; Scott Fritchie talked about DTrace for Erlang. Both were useful sources of ideas for how we could improve USDT.
Ryan Stone presented the state of DTrace on FreeBSD. That DTrace is not enabled in the build by default remains a key obstacle for adoption. I hope that Ryan et al. are able to persuade the FreeBSD leadership that their licensing fears are misguided.
DTrace for OEL
I was delighted that Kris van Hees was able to attend to present the Oracle port to Linux. DTrace for OEL was announced at Oracle Open World 2011, but the initial beta didn’t live up to its billing at OOW. As is often the case, this was more a failure of messaging than of engineering. Kris and his team are making steady progress. While it’s not yet in the public beta, they have the kernel function boundary tracing provider (fbt) implemented. Most heartening of all, Oracle intends to keep DTrace for OEL moving forward as the community evolves and improves DTrace — rather than forking it. How that plays out, and what that means for DTrace on Oracle Solaris will be interesting to see, but it’s great to hear that Kris sees the value of DTrace ubiquity and DTrace compatibility.
As was remarked several times, having DTrace available on the fastest growing deployment platform will be the single most significant accelerator for DTrace adoption. The work Kris and his team at Oracle are doing is probably the most important in the DTrace ecosystem, and I think that I speak for the entire DTrace community in offering to assist in any way possible.
A ZFS DTrace Provider
Matt Ahrens and George Wilson — respectively the co-inventor of ZFS, and the preeminent SPA developer — presented a proposal for a DTrace provider for ZFS. ZFS is a highly sophisticated filesystem, but one that is also difficult to understand. Building in rich instrumentation is going to be a tremendous step forward for anyone using ZFS (for example, our mutual employer, Delphix).
Whither DTrace?
Jarod Jenson — the first DTrace user outside of Sun — took the stage in the final session to talk about DTrace adoption. Jarod has made DTrace a significant part of his business for many years. What continues to amazing him, despite numerous presentations, demonstrations, and lessons, is the relatively low level of DTrace adoption. DTrace is a tool that comes alive in the hands of a skilled, scientific, incisive practitioner — and in all of those, Jarod is superlative — but it can have a high bar of entry. There were many concrete suggestions for how to improve DTrace adoption. Most of them didn’t hold water for me — different avenue for education, further documentation, community outreach, higher level tools, visualizations, etc. — but two were quite compelling: DTrace for Linux, and DTrace on stackoverflow.com (and the like). I don’t know how much room there is to participate in the former, but by all means if there are DTrace one-liners that solve problems (on Mac OS X for example), post them, and get people covertly using DTrace.
The core DTrace community is growing. It was great to see old friends like Steve Peters who worked on porting DTrace to Mac OS X in the same room as Kris van Hees as he spoke about his port to Linux. It was inspiring to see so many new members of the community, eager to use, build and improve DTrace. And personally it inspired me to get back into the code to finish up some projects I had in flight, and to chart out the course for some of the projects we discussed.
Thanks to everyone who attended dtrace.conf in person or online. And thanks especially to Deirdre Straughan who made it happen.
DTrace OEL update
A few months ago I took DTrace on OEL for a spin after Oracle announced it. The results were ugly; as one of the authors of DTrace, I admit to being shocked by shoddiness of the effort. Yesterday, Oracle dropped an updated beta so I wanted to see how far they’ve come in the 4+ months since that initial false start.
Whither the probes?
Back in October there were 574 functional probes (and 13 more that didn’t work). Here’s the quantitative state of DTrace for OEL today:
[root@screven drivers]# dtrace -l | wc -l 618
Okay. Steady improvement. By way of unfair comparison, here’s what it looks like on my Mac OS X laptop:
qadi /Users/ahl # dtrace -l | wc -l 578044
What’s new?
Back in October, I tried enabling all system call probes (i.e. all functional probes); the result was that ssh started failing mysteriously. It was a gross violation of the core principles — it would be unacceptable for DTrace to cause harm to the production systems on which it operates. Good to see that Oracle fixed it.
Previously, profile provider probes weren’t working. The profile probes have been removed — you can’t do arbitrary resolution timer-based profiling — but the simple, tick probes are there:
[root@screven drivers]# dtrace -l -n profile::: ID PROVIDER MODULE FUNCTION NAME 612 profile tick-1 613 profile tick-10 614 profile tick-100 615 profile tick-500 616 profile tick-1000 617 profile tick-5000
… and seem to work:
[root@screven ~]# dtrace -n 'tick-1{ printf("%Y", walltimestamp); }'
dtrace: description 'tick-1' matched 1 probe
CPU ID FUNCTION:NAME
0 612 :tick-1 2012 Feb 23 04:31:27
0 612 :tick-1 2012 Feb 23 04:31:28
0 612 :tick-1 2012 Feb 23 04:31:29
They’ve also added some inscrutable SDT (statically defined tracing) probes:
[root@screven ~]# dtrace -l -n sdt::: ID PROVIDER MODULE FUNCTION NAME 597 sdt vmlinux __handle_sysrq -handle_sysrq 601 sdt vmlinux oom_kill_process oom_kill_process 602 sdt vmlinux check_hung_task check_hung_task 603 sdt vmlinux sys_init_module init_module 604 sdt vmlinux sys_delete_module delete_module 611 sdt vmlinux signal_fault signal_fault
More usefully, the beta includes a partially implemented proc provider; the proc provider traces high level process activity (check the docs).
[root@screven ~]# dtrace -l -n proc::: ID PROVIDER MODULE FUNCTION NAME 598 proc vmlinux do_execve_common exec-success 599 proc vmlinux do_execve_common exec-failure 600 proc vmlinux do_execve_common exec 605 proc vmlinux get_signal_to_deliver signal-handle 606 proc vmlinux __send_signal signal-send 607 proc vmlinux do_exit exit 608 proc vmlinux do_exit lwp-exit 609 proc vmlinux do_fork create 610 proc vmlinux do_fork lwp-create
For reference, here’s what it looks like on DelphixOS, an illumos derivative (which of course includes DTrace):
root@argos:~# dtrace -l -n proc::: ID PROVIDER MODULE FUNCTION NAME 10589 proc unix lwp_rtt_initial lwp-start 10629 proc unix lwp_rtt_initial start 10631 proc unix trap fault 10761 proc genunix sigtimedwait signal-clear 10762 proc genunix psig signal-handle 10763 proc genunix sigtoproc signal-discard 10764 proc genunix sigtoproc signal-send 10831 proc genunix lwp_create lwp-create 10868 proc genunix cfork create 10870 proc genunix proc_exit exit 10871 proc genunix lwp_exit lwp-exit 10872 proc genunix proc_exit lwp-exit 10873 proc genunix exec_common exec-success 10874 proc genunix exec_common exec-failure 10875 proc genunix exec_common exec
Each DTrace probe has arguments that convey information about the activity that caused the probe to fire. For example, with the kernel function boundary tracing (fbt) provider (not yet implemented in OEL), the arguments for the function entry probe correspond to the arguments passed to the function. With static providers such as the proc provider, the parameters include useful information… but I can never seem to remember the types and order. Fortunately, DTrace lets you add in the -v option to get more information about a probe. Unfortunately, this hasn’t been hooked up in Oracle’s port (just an bug, I’m sure):
[root@screven ~]# dtrace -lv -n proc:::signal-send ID PROVIDER MODULE FUNCTION NAME 606 proc vmlinux __send_signal signal-send Probe Description Attributes Identifier Names: Private Data Semantics: Private Dependency Class: Unknown Argument Attributes Identifier Names: Evolving Data Semantics: Evolving Dependency Class: ISA Argument Types args[0]: (unknown) args[1]: (unknown) args[2]: (unknown) args[3]: (unknown) args[4]: (unknown) args[5]: (unknown) args[6]: (unknown) args[7]: (unknown) args[8]: (unknown) args[9]: (unknown) args[10]: (unknown) args[11]: (unknown) args[12]: (unknown) args[13]: (unknown) args[14]: (unknown) args[15]: (unknown) args[16]: (unknown) args[17]: (unknown) args[18]: (unknown) args[19]: (unknown) args[20]: (unknown) args[21]: (unknown) args[22]: (unknown) args[23]: (unknown) args[24]: (unknown) args[25]: (unknown) args[26]: (unknown) args[27]: (unknown) args[28]: (unknown) args[29]: (unknown) args[30]: (unknown) args[31]: (unknown)
Here’s what it looks like on DelphixOS:
root@argos:~# dtrace -lv -n proc:::signal-send
ID PROVIDER MODULE FUNCTION NAME
10764 proc genunix sigtoproc signal-send
Probe Description Attributes
Identifier Names: Private
Data Semantics: Private
Dependency Class: Unknown
Argument Attributes
Identifier Names: Evolving
Data Semantics: Evolving
Dependency Class: ISA
Argument Types
args[0]: lwpsinfo_t *
args[1]: psinfo_t *
args[2]: int
Even without the type system being hooked up, you can definitely do some useful work with this beta. For example, I can use the proc provider to look at what commands are executing on my system:
[root@screven ~]# dtrace -n proc:::exec'{ trace(stringof(arg0)); }'
dtrace: description 'proc:::exec' matched 1 probe
CPU ID FUNCTION:NAME
0 600 do_execve_common:exec /usr/bin/staprun
0 600 do_execve_common:exec /usr/sbin/perf
0 600 do_execve_common:exec /bin/uname
0 600 do_execve_common:exec /usr/libexec/perf.2.6.39-101.0.1.el6uek.x86_64
On his blog, Wim Coekaerts showed some examples of use of the proc provider that included this common idiom:
proc:::create
{
this->pid = *((int *)arg0 + 171);
...
It’s hard to know where that 171 constant came from or how a user would figure that out. I assume that this is because OEL doesn’t yet have proper types and it’s a hardcoded offset into some structure. Here’s what that would look like on completed DTrace implementations:
proc:::create
{
this->pid = args[0]->pr_pid;
...
Progress
There’s a long way to go, but it looks like the folks at Oracle are making progress. It will be interesting to see the source code that goes along with this updated beta — as of this writing, the git repository has not been updated. Personally, I’m eager to see what user-land tracing looks like in the form of the pid provider and USDT. In the tradition of other ports such as Apple’s and FreeBSD’s, I’d invite the Oracle team to present their work at the upcoming DTrace conference, dtrace.conf.
The case of the un-unmountable tmpfs
Every once in a rare while our development machines encounter an fatal error during boot because we couldn't unmount tmpfs. This weekend I cracked the case, so I thought I'd share my uses of boot-time DTrace, and the musty corners of the operating systems that I encountered along the way. First I should explain a little bit about what happens during boot and why we were unmounting a tmpfs filesystem.
Upgrade and boot
When you upgrade your Delphix system from one version to the next, we perserve your configuration. Part of that system configuration lives in SMF, the Service Management Facility, which is stored in the filesystem at /etc/svc. We keep /etc/svc in its own ZFS dataset so that we can snapshot and clone it for an upgrade to save the old data (in case we need to roll back) and keep the settings. This gets a tad tricky because the kernel mounts tmpfs on /etc/svc/volatile to provide scratch space for early processes like init(1); before we can mount on /etc/svc, we have to unmount /etc/svc/volatile. Here's what that part of our boot script looks like:
# # The kernel mounts tmpfs on /etc/svc/volatile so we we need to # unmount that before mounting the svc dataset on /etc/svc. # umount /etc/svc/volatile mount -F zfs $base/running/svc /etc/svc mount -F tmpfs /etc/svc/volatile
The problem
Infrequently -- but not never -- we'd see that unmount of/etc/svc/volatile fail with EBUSY. The boot script would stop and report the error; subsequent attempts to unmount /etc/svc/volatile would succeed. So there wasn't much to go on. The tmpfs code did reveal this:
static int
tmp_unmount(struct vfs *vfsp, int flag, struct cred *cr)
{
...
tnp = tm->tm_rootnode;
if (TNTOV(tnp)->v_count > 1) {
mutex_exit(&tm->tm_contents);
return (EBUSY);
}
for (tnp = tnp->tn_forw; tnp; tnp = tnp->tn_forw) {
if ((vp = TNTOV(tnp))->v_count > 0) {
...
return (EBUSY);
}
VN_HOLD(vp);
}
...
So someone had an additional hold on either the root of the filesystem or a file in it. I looked at the contents of /etc/svc/volatile and found one file: init.state. Digging through the code for init(1) I was surprised to find that init(1) keeps a state file around with a list of processes of interest. It doesn't keep the file descriptor open (which would prevent us from unmounting the filesystem), but it does rewrite the file from time to time. I was worried that init(1) might be racing with our script. I didn't want to understand the brutal compexity of the code, so I amended our boot script to do the following:
pstop 1 # stop init umount /etc/svc/volatile mount -F zfs $base/running/svc /etc/svc mount -F tmpfs /etc/svc/volatile prun 1 # resume init
If the unmount failed, I'd be able to use pfiles(1) to see if init(1) did, in fact, have something open in /etc/svc/volatile. I was convinced that in trying to observe the problem, I'd chase it away -- a Heisenbug -- but after a short while of running a reboot loop, we hit the problem, and init(1) didn't have anything open in /etc/svc/volatile. What next…
Boot-time DTrace
The problem was that by the time I'd get to the system, the conditions that caused the error had resolved themselves. What I wanted to do was panic the system when tmp_unmount() returned EBUSY so that I could poke around with a debugger. On many systems that would entail compiling in debug logic, but fortunately a DTrace-enabled system has a better option. My former colleague Bryan Cantrill invented anonymous DTrace for looking at boot-time issues -- earlier in boot than when one could execute the dtrace(1M) command-line utility. To use boot-time DTrace, specify the D program as usual, but add the -A option to add the D program to the DTrace kernel module's boot-time configuration. After rebooting, DTrace will enable the program whose output can later be retreived with dtrace -a. In my case, I wanted to drop into the kernel debugger when tmp_unmount() returned EBUSY, so I ran DTrace like this:
dtrace -A -w -n 'fbt:tmpfs:tmp_unmount:return/arg1 == EBUSY/{ panic(); }'
Again after many reboots, we hit the problem and dropped into the debugger. Thanks to infrastructure put into the kernel by my colleague Eric Schrock, I was able to quickly see the identity of the file whose reference count prevented us from unmounting tmpfs:
[5]> ffffff0197321b00::print vnode_t v_path v_path = 0xffffff0197125d60 "/etc/svc/volatile/init-next.state"
It's worth noting that v_path isn't guaranteed to be correct, but may reflect some state state. In this case, I examined the directory structure of tmpfs and found that the filename was actually /etc/svc/volatile/init.state -- v_path isn't updated on a rename. But I couldn't for the life of me figure out who had that file open. None of the (few) other processes were touching the file. I looked through the fsflush code which flushes cached data back to disk, but that didn't make a lot of sense, and didn't seem to be causing problems. The pageout thread isn't supposed to run unless the system is low on memory. I used kmdb's ::kgrep command to find places where the vnode_t or its associated page were referenced. There were many, and I quickly got lost in the bowels of the VM system. Rather than groveling through the kernel's structures, I decided to turn back to DTrace. The next question I wanted to answer was this: after tmp_unmount() returns EBUSY, who is it that releases the reference on that tmpfs vnode_t? To answer it, I wrote this D script:
fbt:tmpfs:tmp_unmount:entry
{
self->vfs = args[0];
}
fbt:tmpfs:tmp_unmount:return
/arg1 == EBUSY/
{
gotit = self->vfs;
}
fbt:tmpfs:tmp_unmount:return
{
self->vfs = NULL;
}
fbt:genunix:vn_rele:entry
/gotit != NULL && args[0]->v_vfsp = gotit/
{
panic();
}
I installed that as my anonymous DTrace enabling, rebooted, and waited.
Who dunnit
Like the Mystery Inc. gang unmasking the criminal, helplessly caught by the elaborate trap, I used the kernel debugger to identify the subsystem to find that it was none other than harmless old Mr. Pageout. Gasp! Why was pageout running at all? The system had plenty of memory so it wouldn't normally be running except there's an exception made very very early in boot (it turns out). In the first second after boot, pageout will execute exactly four times in order to fill in certain performance-related parameters that let it predict how long it will take to page out memory in the future. When it executes, pageout will identify unused pages and take a temporary hold on them -- this is exactly the pathology at the root of our problem!
Solution
I'm still working on exactly how to solve this. The simplest approach would be to sleep for a second before trying the unmount. Slightly more complicated would be to try unmounting in a loop until a second had passed (checking $SECONDS in bash). More complicated still would be to do a rethink of pageout -- I still don't fully understand how it works, but it really seems like it's making assumptions that have been invalidated in the past decade and contains this gem of a comment:
For now, this thing is in very rough form.
Note that "now" in this case referred to 1987 or possibly earlier -- as Roger Faulkner would say, "it came from New Jersey."
Conclusion
Pageout would have gotten away with it if it hadn't been for these meddling tools! DTrace during boot is awesome -- when you need it, it's a life saver. There are some places so early in boot that DTrace can't help; for that VProbes can give you some DTrace-like functionality. And mature systems can have some musty corners so your tools had better be up to the task.