illumos hackathon 2012: user-land types for DTrace

Thursday, October 11th, 2012Posted by ahl

At the illumos hackathon last week, Robert Mustacchi and I prototyped better support for manipulating user-land structures. As anyone who’s used it knows, DTrace is currently very kernel-centric — this both reflects the reality of how operating systems and DTrace are constructed, and the origins of DTrace itself in the Solaris Kernel Group. Discussions at dtrace.conf(12) this spring prompted me to chart a path to better user-land support. This prototype of copyin-automagic was a first step.

What we implemented was a new ‘user’ keyword to denote that a type is a user-land structure. For example, let’s say we had the address of a 4-byte integer; today we’d access its value using copyin():

this->i = *(int *)copyin(this->addr, sizeof (int));

With our prototype, this gets simpler and more intuitive:

this->i = *(user int *)addr;

The impact is even more apparent when it comes to pointer chasing through structures. Today if we need to get to the third element of a linked list, the D code would look like this:

this->p = (node_t *)copyin(this->addr, sizeof (node_t));
this->p = (node_t *)copyin((uintptr_t)this->p->next, sizeof (node_t));
this->p = (node_t *)copyin((uintptr_t)this->p->next, sizeof (node_t));
trace(this->p->value);

Again, it’s much simpler with the user keyword:

this->p = (user node_t *)this->addr;
trace(this->p->next->next->value);

D programs are compiled into a series of instructions — DIF — that are executed by the code DTrace framework when probes fire. We use the new keyword to generate instructions that load from the address space of the currently executing process rather than that of the kernel.

Adding a new keyword feels a little clunky, and I’m not sure if it’s the right way forward, but it does demonstrate a simpler model for accessing user-land structures, and was a critical first step. We already have three main sources of user-land values; the next steps are to make use of those.

Types for system calls

Arguments to system calls (mostly) have well-known types. Indeed those types are encoded in truss in excruciating and exotic detail. We should educate DTrace about those types. What I’d propose is that we create a single repository of all system call metadata. This could be, for example, and XML file that listed every system call, its name, code, subcode, types, etc. Of course we could use that as the source of type information for the syscall provider, but we could also use that to generate everything from the decoding tables in truss to the libc and kernel stubs for system calls.

As an aside, there are a couple of system calls whose parameter types — ioctl(2) is the obvious example. It would be interesting to assess the utility of an ioctl provider whose probes would be the various codes that are passed as the second parameter. Truss already has this information; why not DTrace?

Types for the pid provider

Another obvious source of type information is the process being traced. When a user specifies the -p or -c option to dtrace(1M) we’re examining a particular process, and that process can have embedded type information. We could use those types and implicitly identify them as belonging to user-space rather than the kernel. Pid provider probes correspond to the entry and return from user-land functions; we could identify the appropriate types for those parameters. We could simplify it further by doing all type handling in libdtrace (in user-land) rather than pushing the types into the kernel.

Types for USDT

User-land statically defined tracing — tracepoints explicitly inserted into code — can already have types associated with them. A first step would be to implicitly identify those types as belonging to user-land. I believe that this could even be done without adversely affecting existing scripts.

Thorny issues

While there are some clear paths forward, there are some tricky issues that remain. In particular that processes can have different data models — 32-bit v. 64-bit — presents a real challenge. Both the width of a load and offsets into structures change depending on the process that’s running. There might be some shortcuts for system calls, and we might be able to constrain the problem for the pid provider by requiring -p or -c, or we might have to compile our D program twice and then choose which version to run based on the data model of the process. In the spirit of the hackathon, Robert and I punted for our ‘user’ keyword prototype, but these problems need to be well understood and sufficiently solved.

Next steps

I’ll be working on some of these problems on the back burner; I’m especially interested in the Grand Unified Syscall Project — an idea I’ve been touting for more years than I care to relate — to bring types to the syscall provider. If you have ideas for user-land tracing with DTrace, or want to work on anything I’ve mention, leave a comment or drop me a note.

My New DTrace Favorite

Saturday, July 28th, 2012Posted by ahl

The mantra as we initially developed DTrace was to make impossible things possible, not easy things easier. Since codifying that, the tendency toward the latter in developer tools has been apparent. Our focus on the former however has left certain usability burrs that stymie newbies, and annoy vets. Much of the DTrace development of late has focused on a middle category: simplifying hard things that should be simple.

The print() action

In that vein, my colleague, Eric Schrock, added the print() action to DTrace back in November. Before then, my workflow used to look like this:

  • “I want to trace xdr_bytes()
  • Drop into mdb(1) to print out the structure I want to examine
  • Write a D script to emit the members I’m interested in:
fbt::xdr_bytes:entry
{
        trace(args[0]->x_base);
        trace(args[0]->x_handy);
}

Repeat times a thousand, allow for errors, iterate on chased pointers, and sum up the time. With Eric’s fix, DTrace is a lot easier to use:

  • “I want to trace xdr_bytes()
  • Boom:
fbt::xdr_bytes:entry
{
        print(args[0]);
}

print() for translated types

Of course, in addition to tracing any kernel function, DTrace has stable probes that identify points of well-known, (reasonably) well-documented activity. Those probes don’t correspond to kernel functions so mdb isn’t as useful. The workflow is a little more annoying:

  • “I want to trace IO”
  • either
    • google “dtrace io provider”, find the argument and the type definitions
  • or (more often)
    • dtrace -lvn io::: to see the type names
    • less /usr/lib/dtrace/io.d to find the type definitions
  • write the D script:
io:::start
{
        trace(args[1]->dev_name);
        trace(args[1]->dev_pathname);
}

Repeat another thousand, much more annoying times.

Unfortunately, print() wasn’t as helpful in this case:

# dtrace -n 'io:::start{ print(*args[1]); }'
dtrace: invalid probe specifier io:::start{ trace(*args[1]); }: print( ) may not be applied to a dynamic expression

Stable probes such as the io:::start probe can use translated arguments, synthetic types that DTrace populates with stable data from the unstable underlying implementation. For example, despite very different implementations, the io:::start provider exposes the same data on illumos, FreeBSD, Mac OS X, and Oracle Solaris. Parameters are effectively translated one at a time; the * (dereference) operator was invalid for these expressions.

In a recent push to illumos, I added this support:

# dtrace -n 'io:::start{ print(*args[1]); }'
dtrace: description 'io:::start' matched 6 probes
CPU ID FUNCTION:NAME
0 11307 bdev_strategy:start devinfo_t {
    int dev_major = 0x62
    int dev_minor = 0x40
    int dev_instance = 0x1
    string dev_name = [ "sd" ]
    string dev_statname = [ "sd1" ]
    string dev_pathname = [ "/devices/pci@0,0/pci15ad,1976@10/sd@0,0:a" ]
}

Between Eric’s addition and my own, my most commonly encountered DTrace annoyances are no more.

Behind the scenes

For the DTrace super-nerds out there, I thought I’d share a bit of the implementation. In order to trace() or print() an expression, it needs to exist in memory somewhere. Translated types don’t exist in memory, rather individual members are translated statically. We can see this in the output of the DTrace DIF (D intermediate form) disassembler:

# dtrace -n 'io:::start{ trace(args[1]->dev_name); }' -Se
DIFO 0x75e940 returns string (unknown) by ref (size 256)
OFF OPCODE INSTRUCTION
00: 25000001 setx DT_INTEGER[0], %r1 ! 0x0
01: 28000101 ldga DT_VAR(0), %r1, %r1
02: 0e010002 mov %r1, %r2
03: 25000103 setx DT_INTEGER[1], %r3 ! 0xe0
04: 07020302 add %r2, %r3, %r2
05: 22020002 ldx [%r2], %r2
06: 25000003 setx DT_INTEGER[0], %r3 ! 0x0
07: 0f020300 cmp %r2, %r3
08: 1200000b be 11
09: 0e000002 mov %r0, %r2
10: 1100000c ba 12
11: 25000202 setx DT_INTEGER[2], %r2 ! 0x1
12: 10020000 tst %r2
13: 12000011 be 17
14: 26000102 sets DT_STRING[1], %r2 ! "nfs"
15: 0e020002 mov %r2, %r2
16: 1100001e ba 30
17: 25000302 setx DT_INTEGER[3], %r2 ! 0xfffffffffc031110
18: 22020002 ldx [%r2], %r2
19: 0e010003 mov %r1, %r3
20: 25000404 setx DT_INTEGER[4], %r4 ! 0xa8
21: 07030403 add %r3, %r4, %r3
22: 22030003 ldx [%r3], %r3
23: 33000000 flushts
24: 31000003 pushtv DT_TYPE(0), %r3 ! DT_TYPE(0) = D type
25: 2f001403 call DIF_SUBR(20), %r3 ! getmajor
26: 25000504 setx DT_INTEGER[5], %r4 ! 0x70
27: 08030403 mul %r3, %r4, %r3
28: 07020302 add %r2, %r3, %r2
29: 22020002 ldx [%r2], %r2
30: 23000002 ret %r2

In this case, this logic comes from /usr/lib/io.d, and — in particular — this translation:

        dev_name = B->b_dip == NULL ? "nfs" :
            stringof(`devnamesp[getmajor(B->b_edev)].dn_name);

To implement allow trace() and print() to work on translated types, we now generate code to first use the DTrace build-in alloca() function to get some scratch space, and then generate the translation for each member of the translated type. For example:

# dtrace -n 'io:::start{ print(*args[1]); }' -Se
DIFO 0x9466b0 returns D type (struct) by ref (size 780)
OFF OPCODE INSTRUCTION
00: 25000001 setx DT_INTEGER[0], %r1 ! 0x0
01: 28000101 ldga DT_VAR(0), %r1, %r1
02: 25000102 setx DT_INTEGER[1], %r2 ! 0x30c
03: 33000000 flushts
04: 31000002 pushtv DT_TYPE(0), %r2 ! DT_TYPE(0) = D type
05: 2f000f02 call DIF_SUBR(15), %r2 ! alloca
06: 0e010003 mov %r1, %r3
07: 25000204 setx DT_INTEGER[2], %r4 ! 0xe0
08: 07030403 add %r3, %r4, %r3
09: 22030003 ldx [%r3], %r3
10: 25000004 setx DT_INTEGER[0], %r4 ! 0x0
11: 0f030400 cmp %r3, %r4
12: 1300000f bne 15
13: 0e000003 mov %r0, %r3
14: 11000010 ba 16
15: 25000303 setx DT_INTEGER[3], %r3 ! 0x1
...
316: 2f001603 call DIF_SUBR(22), %r3 ! ddi_pathname
317: 25001204 setx DT_INTEGER[18], %r4 ! 0x20c
318: 07020404 add %r2, %r4, %r4
319: 25000e05 setx DT_INTEGER[14], %r5 ! 0x100
320: 3b030504 copys %r3, %r5, %r4
321: 23000002 ret %r2

More to come

Usability was a big topic at dtrace.conf a few months ago. Expect to see more contributions along this theme.

Filed under: DTrace No Comments

BTrace: DTrace for Java… ish

Tuesday, April 24th, 2012Posted by ahl

DTrace first peered into Java in early 2005 thanks to an early prototype by Jarod Jenson that led eventually to the inclusion of USDT probes in the HotSpot JVM. If you want to see where, say, the java.net.SocketOutputStream.write() method is called, you can simply run this DTrace script:

hotspot$target:::method-entry
/copyinstr(arg1, arg2) == "java/net/SocketOutputStream" &&
 copyinstr(arg3, arg4) == "write"/
{
        jstack(50, 8000);
}

And that will work as long as you rememember to start your JVM with the -XX:+ExtendedDTraceProbes option or you use the jinfo utility to enable it after the fact. And as long as you don’t mind a crippling performance penalty (hint: you probably do).

Inspired by dtrace.conf a few weeks ago, I wanted to sketch out what the real Java provider would look like:

java$target:java.net.SocketOutputStream:write:entry
{
        jstack(50,8000);
}

And check it out:

# jdtrace.pl -p $(pgrep java) -n 'java$target:java.net.SocketOutputStream::entry{ jstack(50,8000); }'
dtrace: script '/tmp/jdtrace.19092/jdtrace.d' matched 0 probes
CPU     ID                    FUNCTION:NAME
0  64991 Java_com_sun_btrace_BTraceRuntime_dtraceProbe0:event
libbtrace.so`Java_com_sun_btrace_BTraceRuntime_dtraceProbe0+0xbb
com/sun/btrace/BTraceRuntime.dtraceProbe0(Ljava/lang/String;Ljava/lang/String;II)I
com/sun/btrace/BTraceRuntime.dtraceProbe(Ljava/lang/String;Ljava/lang/String;II)I
com/sun/btrace/BTraceUtils$D.probe(Ljava/lang/String;Ljava/lang/String;II)I
com/sun/btrace/BTraceUtils$D.probe(Ljava/lang/String;Ljava/lang/String;)I
java/net/SocketOutputStream.$btrace$jdtrace$probe1(Ljava/lang/String;Ljava/lang/String;)V
java/net/SocketOutputStream.write([BII)V
sun/nio/cs/StreamEncoder.writeBytes()V
sun/nio/cs/StreamEncoder.implFlushBuffer()V
sun/nio/cs/StreamEncoder.implFlush()V
sun/nio/cs/StreamEncoder.flush()V
java/io/OutputStreamWriter.flush()V
java/io/BufferedWriter.flush()V
java/io/PrintWriter.newLine()V
java/io/PrintWriter.println()V
java/io/PrintWriter.println(Ljava/lang/String;)V
com/delphix/appliance/server/ham/impl/HAMonitorServerThread.run()V
java/util/concurrent/ThreadPoolExecutor$Worker.runTask(Ljava/lang/Runnable;)V
java/util/concurrent/ThreadPoolExecutor$Worker.run()V
java/lang/Thread.run()V
StubRoutines (1)
libjvm.so`__1cJJavaCallsLcall_helper6FpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThread__v_+0x21d
libjvm.so`__1cCosUos_exception_wrapper6FpFpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThread__v2468_v_+0x27
libjvm.so`__1cJJavaCallsMcall_virtual6FpnJJavaValue_nGHandle_nLKlassHandle_nMsymbolHandle_5pnGThread__v_+0x149
libjvm.so`__1cMthread_entry6FpnKJavaThread_pnGThread__v_+0x113
libjvm.so`__1cKJavaThreadDrun6M_v_+0x2c6
libjvm.so`java_start+0x1f2
libc.so.1`_thrp_setup+0x9b
libc.so.1`_lwp_start

Obviously there's something fishy going on. First, we're using perl -- the shibboleth of fake-o-ware -- and there's this BTrace stuff in the output.

Faking it with BTrace

BTrace is a dynamic instrumentation tool for Java; it is both inspired by DTrace and contains some DTrace integration. The perl script above takes the DTrace syntax and generates a DTrace script and a BTrace-enabled Java source file.

Like DTrace, BTrace lets you specify the points of instrumentation in your Java program as well as the actions to take. Here's what our generated source file looks like.

import com.sun.btrace.annotations.*;
import static com.sun.btrace.BTraceUtils.*;
@BTrace
public class jdtrace {
        @OnMethod(clazz="java.net.SocketOutputStream", method="write", location=@Location(Kind.ENTRY))
        public static void probe1(@ProbeClassName String c,
            @ProbeMethodName String m) {
                String name = "entry";
                String p = Strings.strcat(c, Strings.strcat(":",
                    Strings.strcat(m, Strings.strcat(":", name))));
                D.probe(p, "");
        }
}

Note that we specify where to trace (this can be a regular expression), and then take the action of joining the class, method, and "entry" string into a single string that we pass to the D.probe() method that causes a BTrace USDT probe to fire.

Here's what the D script looks like:

btrace$target:::event
{
        this->__jd_arg = copyinstr(arg0);
        this->__jd_mod = strtok(this->__jd_arg, ":");
        this->__jd_func = strtok(NULL, ":");
        this->__jd_name = strtok(NULL, ":");
}

btrace$target:::event
/((this->__jd_mod == "java.net.SocketOutputStream" &&
 this->__jd_func == "write" &&
 this->__jd_name == "entry"))/
{
        jstack(50,8000);
}

It's pretty simple. We parse the string that was passed to D.probe(), and disassemble it into the DTrace notion of module, function, and name. We then use that information so that the specified actions are executed as appropriate (we could have specified different Java methods to probe, and different actions to take for each). Here's the code if you're interested.

This isn't the real Java provider, but is it close enough? Unfortunately not. The most glaring problem is that BTrace sometimes renders my Java process unresponsive. Other times it leaves instrumentation behind with no way of extracting it. The word "safe" appears as the third word on the BTrace website ("BTrace is safe"), but apparently there's still some way to go to achieve the requisite level of safety.

A Better BTrace

BTrace is an interesting tool for examining Java programs, but one obvious obstacle is that the programs are pretty cumbersome to write. With BTrace, we should be able to write a simple one-liner to see where we are when the java.net.SocketOutputStream.write() method is called, but instead we have to write a fairly cumbersome program:

import com.sun.btrace.annotations.*;
import static com.sun.btrace.BTraceUtils.*;
@BTrace
public class TraceWrite {
        @OnMethod(clazz="java.net.SocketOutputStream", method="write", location=@Location(Kind.ENTRY))
        public static void onWrite() {
                jstack();
        }
}

DTrace-inspired syntax would let users iterate much more quickly:

$ dbtrace -p $(pgrep -n java) -n 'java.net.SocketOutputStream:write:entry{ jstack(); }'
java.net.SocketOutputStream.write(SocketOutputStream.java)
sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202)
sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:272)
sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:276)
sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:122)
java.io.OutputStreamWriter.flush(OutputStreamWriter.java:212)
java.io.BufferedWriter.flush(BufferedWriter.java:236)
java.io.PrintWriter.newLine(PrintWriter.java:438)
java.io.PrintWriter.println(PrintWriter.java:585)
java.io.PrintWriter.println(PrintWriter.java:696)
com.delphix.appliance.server.ham.impl.HAMonitorServerThread.run(HAMonitorServerThread.java:56)
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
java.lang.Thread.run(Thread.java:662)

With BTrace, you can trace nearly arbitrary information about a program's state, but instead of doing something like this:

dbtrace -p $(pgrep -n java) -n 'java.net.SocketOutputStream:write:entry{ printFields(this.impl); }'

You have to do this:

import com.sun.btrace.annotations.*;
import com.sun.btrace.AnyType;
import static com.sun.btrace.BTraceUtils.Reflective.*;
@BTrace
public class TraceWrite {
        @OnMethod(clazz="java.net.SocketOutputStream", method="write", location=@Location(Kind.ENTRY))
        public static void onWrite(@Self Object self) {
                Object impl = get(field(classOf(self), "impl"), self);
                printFields(impl);
        }
}
$ ./bin/btrace $(pgrep -n java) TraceWrite.java
{server=null, port=1080, external_address=null, useV4=false, cmdsock=null, cmdIn=null, cmdOut=null, applicationSetProxy=false, timeout=0, trafficClass=0, shut_rd=false, shut_wr=false, socketInputStream=java.net.SocketInputStream@9993a1, fdUseCount=0, fdLock=java.lang.Object@ab5443, closePending=false, CONNECTION_NOT_RESET=0, CONNECTION_RESET_PENDING=1, CONNECTION_RESET=2, resetState=0, resetLock=java.lang.Object@292936, fd1=null, anyLocalBoundAddr=null, lastfd=-1, stream=false, socket=Socket[addr=/127.0.0.1,port=38832,localport=8765], serverSocket=null, fd=java.io.FileDescriptor@50abcc, address=/127.0.0.1, port=38832, localport=8765, }

BTrace needs a language that enables rapid iteration — piggybacking on Java is holding it back — and it needs some hard safety guarantees. With those, many developers and support engineers would use BTrace as part of their daily work — we certainly would here at Delphix.

Back to DTrace. Even with a useable solution for Java only, the ability to have lightweight and focused tracing for Java (and other dynamic languages) could be highly valuable. We’ll see how far BTrace can take us.

Tagged as: , No Comments

dtrace.conf(12) wrap-up

Monday, April 9th, 2012Posted by ahl

For the second time in as many quadrennial dtrace.confs, I was impressed at how well the unconference format worked out. Sharing coffee with the DTrace community, it was great to see some of the oldest friends of DTrace — Jarod Jenson, Stephen O’Grady, Jonathan Adams to name a few — and to put faces to names — Scott Fritchie, Dustin Sallings, Blake Irvin, etc — of the many new additions to the DTrace community. You can see all the slides and videos; these are my thoughts and notes on the day.

Bryan provided a typically eloquent review of the state of the community. DTrace development is alive and well — after a lull while Oracle’s acquisition of Sun settled in — with new support for a variety of languages and runtimes, and new products that rely heavily on DTrace as a secret sauce. Bryan laid out some important development goals, areas where many have started straying from the edges of the completed DTrace features into the partially complete or starkly missing. We all then set to work hammering out a loose schedule for the day; I’ll admit that at first I was worried that we’d have too many listeners and not enough presenters, but the schedule quickly filled — and with more topics than we’d end up having time to cover.

User-land CTF and Dynamic Translators

DTrace, from its inception, has been a systemic analysis tool, but the earliest development focused on kernel observability — not a surprise since Bryan, Mike, and I developed it while working in the Solaris kernel development. After its use spread (quickly) beyond the kernel team, use shifted more and more to features focused on understanding C and C++ applications in user-land, and then to applications written in a variety of higher-level languages — Java, Ruby, Perl, Javascript, Erlang, etc. User-land Statically Defined Tracing (USDT) is the DTrace facility that enables rich tracing of higher-level languages. It was a relatively late addition to DTrace (integrated in 2004, well after the initial integration in 2003), and since then we’ve learned a lot about what we got right, what we got wrong, and where it’s rough — in some cases very rough — around the edges.

In his opening remarks, Bryan identified USDT improvements as a key area for the community’s focus. In DTrace development we tried to focus on making the impossible possible rather than making the possible easier. In its current form, some things are still impossible with DTrace, namely consumption of type structures from user-land programs; stable, non-privileged use of DTrace; and support for different runtime versions. Dave Pacheco and I took the first  slot on the schedule and spoke (at length — sorry) about solutions to these problems.

While others had the benefit of a bit more time to prepare, I did have the advantage of spending many years idly contemplating the problem space and possible solutions. On the subject of user-land type information (in the form of CTF), I identified the key parts of the code that would would need some work. For the USDT enhancements, we discussed dynamic translators — D code that would be linked and executed at runtime, contrasted with today’s static translators that are compiled into a D program — how they would address the problem, and how these ideas could be extended to the kernel (for once, user-land is actually a bit ahead).

I’ll go into the details of our off the cuff proposals, and delve into the code to firm up those ideas in a future blog post. Beyond the extensive implementation work we laid out, the next step is to gather the most complicated, extant USDT providers and proposals for other providers, and figure out what they should look like in the new, dynamic translator world.

The D Language

Next up, my long-time colleague, DTrace contributor, Eric Schrock led the discussion on D language additions. The format of a D program is heavily tied to DTrace’s implementation: all clauses must trace a fixed amount of data, and infinite loops are forbidden. For this reason, D lacks the backward branches needed for traditional looping, subroutines for common code, and if/else clauses for control flow. Each of these has a work-alike — unrolled loops, macros, and predicates or the ternary operator — but their absence renders D confusing to some — especially those unaware of the motivation. Further, the D language need not necessarily hold the underlying implementation so central.

Eric discussed some proposals for how each might be addressed, and I noted that it would be possible to create a prototype environment where we could try out these “D++” features by compiling into D work-alikes. The next step is to identify the most complicated D scripts, and see what they might look like for various incarnations of those language features.

Work with DTrace

The next few sessions focused not on changes to DTrace, but interesting work done using DTrace:

John Thompson of Sony talked about their port of DTrace to the Playstation Vita (!). Sony developers are given access to DTrace, but found it to be unfamiliar and unapproachable. John spoke his attempts to remedy this by replacing D with a C++-like interface which he implemented by replacing the D compiler with Clang.

My Fishworks colleague, Brendan Gregg, showed some of beautiful visualizations they’ve been developing at Joyent, and talked about the analyses those visualizations enabled. As always, it was fascinating stuff. If you don’t read Brendan’s blog, you really should. Long-time DTrace advocate, Theo Schlossnagle, talked about the visualizations they’re doing in Circonus — also fascinating stuff for anyone thinking about how to present system activity in comprehensible ways. Richard Elling showed the DTrace-based visualizations Nexenta used at VMworld to rave reviews.

Mark Cavage presented Joyent’s work bringing DTrace to node.js; Scott Fritchie talked about DTrace for Erlang. Both were useful sources of ideas for how we could improve USDT.

Ryan Stone presented the state of DTrace on FreeBSD. That DTrace is not enabled in the build by default remains a key obstacle for adoption. I hope that Ryan et al. are able to persuade the FreeBSD leadership that their licensing fears are misguided.

DTrace for OEL

I was delighted that Kris van Hees was able to attend to present the Oracle port to Linux. DTrace for OEL was announced at Oracle Open World 2011, but the initial beta didn’t live up to its billing at OOW. As is often the case, this was more a failure of messaging than of engineering. Kris and his team are making steady progress. While it’s not yet in the public beta, they have the kernel function boundary tracing provider (fbt) implemented. Most heartening of all, Oracle intends to keep DTrace for OEL moving forward as the community evolves and improves DTrace — rather than forking it. How that plays out, and what that means for DTrace on Oracle Solaris will be interesting to see, but it’s great to hear that Kris sees the value of DTrace ubiquity and DTrace compatibility.

As was remarked several times, having DTrace available on the fastest growing deployment platform will be the single most significant accelerator for DTrace adoption. The work Kris and his team at Oracle are doing is probably the most important in the DTrace ecosystem, and I think that I speak for the entire DTrace community in offering to assist in any way possible.

A ZFS DTrace Provider

Matt Ahrens and George Wilson — respectively the co-inventor of ZFS, and the preeminent SPA developer — presented a proposal for a DTrace provider for ZFS. ZFS is a highly sophisticated filesystem, but one that is also difficult to understand. Building in rich instrumentation is going to be a tremendous step forward for anyone using ZFS (for example, our mutual employer, Delphix).

Whither DTrace?

Jarod Jenson — the first DTrace user outside of Sun — took the stage in the final session to talk about DTrace adoption. Jarod has made DTrace a significant part of his business for many years. What continues to amazing him, despite numerous presentations, demonstrations, and lessons, is the relatively low level of DTrace adoption. DTrace is a tool that comes alive in the hands of a skilled, scientific, incisive practitioner — and in all of those, Jarod is superlative — but it can have a high bar of entry. There were many concrete suggestions for how to improve DTrace adoption. Most of them didn’t hold water for me — different avenue for education, further documentation, community outreach, higher level tools, visualizations, etc. — but two were quite compelling: DTrace for Linux, and DTrace on stackoverflow.com (and the like). I don’t know how much room there is to participate in the former, but by all means if there are DTrace one-liners that solve problems (on Mac OS X for example), post them, and get people covertly using DTrace.

The core DTrace community is growing. It was great to see old friends like Steve Peters who worked on porting DTrace to Mac OS X in the same room as Kris van Hees as he spoke about his port to Linux. It was inspiring to see so many new members of the community, eager to use, build and improve DTrace. And personally it inspired me to get back into the code to finish up some projects I had in flight, and to chart out the course for some of the projects we discussed.

Thanks to everyone who attended dtrace.conf in person or online. And thanks especially to Deirdre Straughan who made it happen.

DTrace OEL update

Thursday, February 23rd, 2012Posted by ahl

A few months ago I took DTrace on OEL for a spin after Oracle announced it. The results were ugly; as one of the authors of DTrace, I admit to being shocked by shoddiness of the effort. Yesterday, Oracle dropped an updated beta so I wanted to see how far they’ve come in the 4+ months since that initial false start.

Whither the probes?

Back in October there were 574 functional probes (and 13 more that didn’t work). Here’s the quantitative state of DTrace for OEL today:

[root@screven drivers]# dtrace -l | wc -l
618

Okay. Steady improvement. By way of unfair comparison, here’s what it looks like on my Mac OS X laptop:

qadi /Users/ahl # dtrace -l | wc -l
  578044

What’s new?

Back in October, I tried enabling all system call probes (i.e. all functional probes); the result was that ssh started failing mysteriously. It was a gross violation of the core principles — it would be unacceptable for DTrace to cause harm to the production systems on which it operates. Good to see that Oracle fixed it.

Previously, profile provider probes weren’t working. The profile probes have been removed — you can’t do arbitrary resolution timer-based profiling — but the simple, tick probes are there:

[root@screven drivers]# dtrace -l -n profile:::
ID   PROVIDER            MODULE                          FUNCTION NAME
612    profile                                                     tick-1
613    profile                                                     tick-10
614    profile                                                     tick-100
615    profile                                                     tick-500
616    profile                                                     tick-1000
617    profile                                                     tick-5000

… and seem to work:

[root@screven ~]# dtrace -n 'tick-1{ printf("%Y", walltimestamp); }'
dtrace: description 'tick-1' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0    612                          :tick-1 2012 Feb 23 04:31:27
  0    612                          :tick-1 2012 Feb 23 04:31:28
  0    612                          :tick-1 2012 Feb 23 04:31:29

They’ve also added some inscrutable SDT (statically defined tracing) probes:

[root@screven ~]# dtrace -l -n sdt:::
   ID   PROVIDER            MODULE                          FUNCTION NAME
  597        sdt           vmlinux                    __handle_sysrq -handle_sysrq
  601        sdt           vmlinux                  oom_kill_process oom_kill_process
  602        sdt           vmlinux                   check_hung_task check_hung_task
  603        sdt           vmlinux                   sys_init_module init_module
  604        sdt           vmlinux                 sys_delete_module delete_module
  611        sdt           vmlinux                      signal_fault signal_fault

More usefully, the beta includes a partially implemented proc provider; the proc provider traces high level process activity (check the docs).

[root@screven ~]# dtrace -l -n proc:::
   ID   PROVIDER            MODULE                          FUNCTION NAME
  598       proc           vmlinux                  do_execve_common exec-success
  599       proc           vmlinux                  do_execve_common exec-failure
  600       proc           vmlinux                  do_execve_common exec
  605       proc           vmlinux             get_signal_to_deliver signal-handle
  606       proc           vmlinux                     __send_signal signal-send
  607       proc           vmlinux                           do_exit exit
  608       proc           vmlinux                           do_exit lwp-exit
  609       proc           vmlinux                           do_fork create
  610       proc           vmlinux                           do_fork lwp-create

For reference, here’s what it looks like on DelphixOS, an illumos derivative (which of course includes DTrace):

root@argos:~# dtrace -l -n proc:::
   ID   PROVIDER            MODULE                          FUNCTION NAME
10589       proc              unix                   lwp_rtt_initial lwp-start
10629       proc              unix                   lwp_rtt_initial start
10631       proc              unix                              trap fault
10761       proc           genunix                      sigtimedwait signal-clear
10762       proc           genunix                              psig signal-handle
10763       proc           genunix                         sigtoproc signal-discard
10764       proc           genunix                         sigtoproc signal-send
10831       proc           genunix                        lwp_create lwp-create
10868       proc           genunix                             cfork create
10870       proc           genunix                         proc_exit exit
10871       proc           genunix                          lwp_exit lwp-exit
10872       proc           genunix                         proc_exit lwp-exit
10873       proc           genunix                       exec_common exec-success
10874       proc           genunix                       exec_common exec-failure
10875       proc           genunix                       exec_common exec

Each DTrace probe has arguments that convey information about the activity that caused the probe to fire. For example, with the kernel function boundary tracing (fbt) provider (not yet implemented in OEL), the arguments for the function entry probe correspond to the arguments passed to the function. With static providers such as the proc provider, the parameters include useful information… but I can never seem to remember the types and order. Fortunately, DTrace lets you add in the -v option to get more information about a probe. Unfortunately, this hasn’t been hooked up in Oracle’s port (just an bug, I’m sure):

[root@screven ~]# dtrace -lv -n proc:::signal-send
   ID   PROVIDER            MODULE                          FUNCTION NAME
  606       proc           vmlinux                     __send_signal signal-send

	Probe Description Attributes
		Identifier Names: Private
		Data Semantics:   Private
		Dependency Class: Unknown

	Argument Attributes
		Identifier Names: Evolving
		Data Semantics:   Evolving
		Dependency Class: ISA

	Argument Types
		args[0]: (unknown)
		args[1]: (unknown)
		args[2]: (unknown)
		args[3]: (unknown)
		args[4]: (unknown)
		args[5]: (unknown)
		args[6]: (unknown)
		args[7]: (unknown)
		args[8]: (unknown)
		args[9]: (unknown)
		args[10]: (unknown)
		args[11]: (unknown)
		args[12]: (unknown)
		args[13]: (unknown)
		args[14]: (unknown)
		args[15]: (unknown)
		args[16]: (unknown)
		args[17]: (unknown)
		args[18]: (unknown)
		args[19]: (unknown)
		args[20]: (unknown)
		args[21]: (unknown)
		args[22]: (unknown)
		args[23]: (unknown)
		args[24]: (unknown)
		args[25]: (unknown)
		args[26]: (unknown)
		args[27]: (unknown)
		args[28]: (unknown)
		args[29]: (unknown)
		args[30]: (unknown)
		args[31]: (unknown)

Here’s what it looks like on DelphixOS:

root@argos:~# dtrace -lv -n proc:::signal-send
   ID   PROVIDER            MODULE                          FUNCTION NAME
10764       proc           genunix                         sigtoproc signal-send

        Probe Description Attributes
                Identifier Names: Private
                Data Semantics:   Private
                Dependency Class: Unknown

        Argument Attributes
                Identifier Names: Evolving
                Data Semantics:   Evolving
                Dependency Class: ISA

        Argument Types
                args[0]: lwpsinfo_t *
                args[1]: psinfo_t *
                args[2]: int

Even without the type system being hooked up, you can definitely do some useful work with this beta. For example, I can use the proc provider to look at what commands are executing on my system:

[root@screven ~]# dtrace -n proc:::exec'{ trace(stringof(arg0)); }'
dtrace: description 'proc:::exec' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0    600            do_execve_common:exec   /usr/bin/staprun
  0    600            do_execve_common:exec   /usr/sbin/perf
  0    600            do_execve_common:exec   /bin/uname
  0    600            do_execve_common:exec   /usr/libexec/perf.2.6.39-101.0.1.el6uek.x86_64

On his blog, Wim Coekaerts showed some examples of use of the proc provider that included this common idiom:

proc:::create
{
        this->pid = *((int *)arg0 + 171);
        ...

It’s hard to know where that 171 constant came from or how a user would figure that out. I assume that this is because OEL doesn’t yet have proper types and it’s a hardcoded offset into some structure. Here’s what that would look like on completed DTrace implementations:

proc:::create
{
        this->pid = args[0]->pr_pid;
        ...

Progress

There’s a long way to go, but it looks like the folks at Oracle are making progress. It will be interesting to see the source code that goes along with this updated beta — as of this writing, the git repository has not been updated. Personally, I’m eager to see what user-land tracing looks like in the form of the pid provider and USDT. In the tradition of other ports such as Apple’s and FreeBSD’s, I’d invite the Oracle team to present their work at the upcoming DTrace conference, dtrace.conf.

Tagged as: , No Comments

The case of the un-unmountable tmpfs

Monday, December 12th, 2011Posted by ahl

Every once in a rare while our development machines encounter an fatal error during boot because we couldn't unmount tmpfs. This weekend I cracked the case, so I thought I'd share my uses of boot-time DTrace, and the musty corners of the operating systems that I encountered along the way. First I should explain a little bit about what happens during boot and why we were unmounting a tmpfs filesystem.

Upgrade and boot

When you upgrade your Delphix system from one version to the next, we perserve your configuration. Part of that system configuration lives in SMF, the Service Management Facility, which is stored in the filesystem at /etc/svc. We keep /etc/svc in its own ZFS dataset so that we can snapshot and clone it for an upgrade to save the old data (in case we need to roll back) and keep the settings. This gets a tad tricky because the kernel mounts tmpfs on /etc/svc/volatile to provide scratch space for early processes like init(1); before we can mount on /etc/svc, we have to unmount /etc/svc/volatile. Here's what that part of our boot script looks like:

#
# The kernel mounts tmpfs on /etc/svc/volatile so we we need to
# unmount that before mounting the svc dataset on /etc/svc.
#
umount /etc/svc/volatile
mount -F zfs $base/running/svc /etc/svc
mount -F tmpfs /etc/svc/volatile

The problem

Infrequently -- but not never -- we'd see that unmount of/etc/svc/volatile fail with EBUSY. The boot script would stop and report the error; subsequent attempts to unmount /etc/svc/volatile would succeed. So there wasn't much to go on. The tmpfs code did reveal this:

static int
tmp_unmount(struct vfs *vfsp, int flag, struct cred *cr)
{
        ...
        tnp = tm->tm_rootnode;
        if (TNTOV(tnp)->v_count > 1) {
                mutex_exit(&tm->tm_contents);
                return (EBUSY);
        }
        for (tnp = tnp->tn_forw; tnp; tnp = tnp->tn_forw) {
                if ((vp = TNTOV(tnp))->v_count > 0) {
                        ...
                        return (EBUSY);
                }
                VN_HOLD(vp);
        }
        ...

 
So someone had an additional hold on either the root of the filesystem or a file in it. I looked at the contents of /etc/svc/volatile and found one file: init.state. Digging through the code for init(1) I was surprised to find that init(1) keeps a state file around with a list of processes of interest. It doesn't keep the file descriptor open (which would prevent us from unmounting the filesystem), but it does rewrite the file from time to time. I was worried that init(1) might be racing with our script. I didn't want to understand the brutal compexity of the code, so I amended our boot script to do the following:

pstop 1 # stop init
umount /etc/svc/volatile
mount -F zfs $base/running/svc /etc/svc
mount -F tmpfs /etc/svc/volatile
prun 1 # resume init

 
If the unmount failed, I'd be able to use pfiles(1) to see if init(1) did, in fact, have something open in /etc/svc/volatile. I was convinced that in trying to observe the problem, I'd chase it away -- a Heisenbug -- but after a short while of running a reboot loop, we hit the problem, and init(1) didn't have anything open in /etc/svc/volatile. What next…

Boot-time DTrace

The problem was that by the time I'd get to the system, the conditions that caused the error had resolved themselves. What I wanted to do was panic the system when tmp_unmount() returned EBUSY so that I could poke around with a debugger. On many systems that would entail compiling in debug logic, but fortunately a DTrace-enabled system has a better option. My former colleague Bryan Cantrill invented anonymous DTrace for looking at boot-time issues -- earlier in boot than when one could execute the dtrace(1M) command-line utility. To use boot-time DTrace, specify the D program as usual, but add the -A option to add the D program to the DTrace kernel module's boot-time configuration. After rebooting, DTrace will enable the program whose output can later be retreived with dtrace -a. In my case, I wanted to drop into the kernel debugger when tmp_unmount() returned EBUSY, so I ran DTrace like this:

dtrace -A -w -n 'fbt:tmpfs:tmp_unmount:return/arg1 == EBUSY/{ panic(); }'

 
Again after many reboots, we hit the problem and dropped into the debugger. Thanks to infrastructure put into the kernel by my colleague Eric Schrock, I was able to quickly see the identity of the file whose reference count prevented us from unmounting tmpfs:

[5]> ffffff0197321b00::print vnode_t v_path
v_path = 0xffffff0197125d60 "/etc/svc/volatile/init-next.state"

 
It's worth noting that v_path isn't guaranteed to be correct, but may reflect some state state. In this case, I examined the directory structure of tmpfs and found that the filename was actually /etc/svc/volatile/init.state -- v_path isn't updated on a rename. But I couldn't for the life of me figure out who had that file open. None of the (few) other processes were touching the file. I looked through the fsflush code which flushes cached data back to disk, but that didn't make a lot of sense, and didn't seem to be causing problems. The pageout thread isn't supposed to run unless the system is low on memory. I used kmdb's ::kgrep command to find places where the vnode_t or its associated page were referenced. There were many, and I quickly got lost in the bowels of the VM system. Rather than groveling through the kernel's structures, I decided to turn back to DTrace. The next question I wanted to answer was this: after tmp_unmount() returns EBUSY, who is it that releases the reference on that tmpfs vnode_t? To answer it, I wrote this D script:

fbt:tmpfs:tmp_unmount:entry
{
        self->vfs = args[0];
}

fbt:tmpfs:tmp_unmount:return
/arg1 == EBUSY/
{
        gotit = self->vfs;
}

fbt:tmpfs:tmp_unmount:return
{
        self->vfs = NULL;
}

fbt:genunix:vn_rele:entry
/gotit != NULL && args[0]->v_vfsp = gotit/
{
        panic();
}

 
I installed that as my anonymous DTrace enabling, rebooted, and waited.

Who dunnit

Like the Mystery Inc. gang unmasking the criminal, helplessly caught by the elaborate trap, I used the kernel debugger to identify the subsystem to find that it was none other than harmless old Mr. Pageout. Gasp! Why was pageout running at all? The system had plenty of memory so it wouldn't normally be running except there's an exception made very very early in boot (it turns out). In the first second after boot, pageout will execute exactly four times in order to fill in certain performance-related parameters that let it predict how long it will take to page out memory in the future. When it executes, pageout will identify unused pages and take a temporary hold on them -- this is exactly the pathology at the root of our problem!

Solution

I'm still working on exactly how to solve this. The simplest approach would be to sleep for a second before trying the unmount. Slightly more complicated would be to try unmounting in a loop until a second had passed (checking $SECONDS in bash). More complicated still would be to do a rethink of pageout -- I still don't fully understand how it works, but it really seems like it's making assumptions that have been invalidated in the past decade and contains this gem of a comment:

For now, this thing is in very rough form.

 
Note that "now" in this case referred to 1987 or possibly earlier -- as Roger Faulkner would say, "it came from New Jersey."

Conclusion

Pageout would have gotten away with it if it hadn't been for these meddling tools! DTrace during boot is awesome -- when you need it, it's a life saver. There are some places so early in boot that DTrace can't help; for that VProbes can give you some DTrace-like functionality. And mature systems can have some musty corners so your tools had better be up to the task.