AGILE DATA MANAGEMENT

Customer Problems

Shortly after joining I was asked to help some colleagues with an issue that was blocking a proof-of-concept deployment at a customer site. The customer was unable to import data from one of their databases into Delphix: each time they tried, the job would fail with the exception “Failed to parse chunk header.” While we’d seen this at several customer sites, we were unable to reproduce the problem locally. At first glance things seemed relatively straightforward. The team already had logs with a full backtrace and a packet trace from when the problem occurred; how hard could it be to figure out what had gone wrong?

Given the data we had, we decided to start by looking at where the exception was coming from. It turned out that the exception was thrown from within Grizzly, an open-source Java HTTP server implementation. We started looking through the Grizzly source code to try to understand what was going on, but nothing obvious jumped out, and it’s a large body of code. The interface being invoked to read the header data could be one of seven different implementations. Unfortunately, at the time we didn’t build the Grizzly source ourselves; we simply included the jar file in our build, so walking through the stack with a debugger wasn’t an easy option. Thankfully we did have DTrace: through the magic of Java support for DTrace (the subject of a future blog post) we were able to write probes that captured a Java stack trace for each method invoked from the troublesome call (a sketch of this style of tracing appears a bit further below). This gave us a much clearer picture of what Grizzly was doing and led us to the following piece of code:

tmpKey = channel.register(readSelector, SelectionKey.OP_READ);
tmpKey.interestOps(tmpKey.interestOps() | SelectionKey.OP_READ);
int code = readSelector.select(readTimeout);
tmpKey.interestOps(tmpKey.interestOps() & (~SelectionKey.OP_READ));

if ( code == 0 ){
    return r; // *** We believed we were hitting this case
}


A call to select returning with no events ready explained the behavior we were seeing, but why was select returning? Had the timeout expired? Had the thread been interrupted? Had we received a signal? Did we have a spurious wakeup? Did we get a wakeup for some unrelated event? We had a lot of theories but nothing concrete, so we decided to look at the packet trace in a bit more detail to see if we could rule anything out. The packet trace clearly showed the client sending us data, a delay of ~30 seconds, and then the connection being closed with the relevant exception. This seemed to point to the timeout expiring, but the configured timeout was much larger than 30 seconds.
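
Some of those theories were at least cheap to check. For instance, a one-liner along the following lines (a sketch, not from the original investigation, built on the stock proc provider) would show any signals being delivered to the JVM, where $1 is the JVM’s pid:

#!/usr/sbin/dtrace -s

/*
 * Sketch: report every signal sent to the JVM ($1 is its pid)
 * and which process sent it.
 */
proc:::signal-send
/args[1]->pr_pid == $1/
{
        printf("%s[%d] sent signal %d to the JVM", execname, pid, arg2);
}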

While we had narrowed down where the problem was likely coming from, we still couldn’t reproduce it in-house. For the first time, however, we had a customer that was hitting it consistently, which gave us an opportunity to gather more data. We decided that the best approach was to create instrumentation to test each of our theories and then gather data from the customer’s system. DTrace was a natural way to add this instrumentation, but unfortunately DTrace support for Java method-entry / method-return probes is very expensive: enabling these probes results in a large spike in CPU utilization for the JVM, so enabling them on a customer system wasn’t an option. To work around this, we decided to look at how readSelector.select was implemented to see if we could instead instrument the underlying system calls to catch the select / wakeups.
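
For context, the Java-level tracing we had used in-house looks roughly like the sketch below (simplified; in practice we restricted it to the thread and call of interest). Every Java method invocation in the JVM fires the hotspot provider’s method-entry probe, which is exactly what makes this approach so expensive. Note too that the class and method name arguments are length-delimited rather than NUL-terminated, hence the copyin() gymnastics:

#!/usr/sbin/dtrace -s

#pragma D option jstackstrsize=4k

/*
 * Sketch: print a Java stack for every method entered by the JVM
 * ($1 is its pid). Requires the JVM to be started with
 * -XX:+ExtendedDTraceProbes, and enabling these probes slows the
 * JVM considerably.
 */
hotspot$1:::method-entry
{
        this->cls = (char *)copyin(arg1, arg2 + 1);
        this->cls[arg2] = '\0';
        this->mth = (char *)copyin(arg3, arg4 + 1);
        this->mth[arg4] = '\0';
        printf("--> %s.%s", stringof(this->cls), stringof(this->mth));
        jstack();
}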

readSelector.select is implemented via the java.nio.channels.Selector class in the JDK. Tracing through the JDK implementation, we eventually arrived at the native C implementation of poll0 for the DevPollArrayWrapper. More specifically, we found the idevpoll function (bonus points if you can spot the unrelated timing bug here):

   gettimeofday(&t, NULL);
   start = t.tv_sec * 1000 + t.tv_usec / 1000;

   for (;;) {
        /*  poll(7d) ioctl does not return remaining count */
        int res = ioctl(wfd, dpctl, &a);
        if (res < 0 && errno == EINTR) {
            if (remaining >= 0) {
                gettimeofday(&t, NULL);
                now = t.tv_sec * 1000 + t.tv_usec / 1000;
                diff = now - start;
                remaining -= diff;
                if (diff < 0 || remaining <= 0) {
                    return 0;
                }
                start = now;
            }
        } else {
            return res;
        }
    }


It doesn’t seem like much, but this really helped limit the scope of what we were searching for. While the implementation of /dev/poll is a bit convoluted (see poll(7d)), if we could trace all of the reasons why a call to the DP_POLL ioctl would wake up, we would have our answer. Writing the D for this was slightly complicated: to limit the noise, we needed to restrict the probes to firing only for the cases that we cared about. The call to DP_POLL doesn’t include the file descriptors that the caller is polling on; those are set via a previous call to write(2). We knew the two possible file descriptors Grizzly would be reading data from, but we needed a way to traverse the hash table of file descriptors to see if they were present. This isn’t straightforward, as DTrace doesn’t allow iteration, so we came up with the following:

#!/usr/sbin/dtrace -Cs

#include <sys/devpoll.h>

#pragma D option jstackstrsize=4k
#pragma D option nspec=30

fbt::dpioctl:entry
{
        this->match = 0;
        this->found = 0;
        this->p0 = (struct polldat *)NULL;
}

fbt::dpioctl:entry
/pid == $1 && arg1 == DP_POLL/
{
        /*
         * Generic logic to check whether a DP_POLL ioctl is being
         * called for a cached fd set with a given fd ($2) present
         * with events != 0.
         */
        this->minor = arg0 & 0x3ffff;

        this->dpep = `devpolltbl[this->minor];
        this->pcp = this->dpep->dpe_pcache;
        this->p0 = this->pcp->pc_hash[$2 % this->pcp->pc_hashsize];
}

/*
 * Use the NEXT macro to generate a set of predicates and probes
 * to walk an entry in the pcache hash chain. Repeat NEXT for
 * as many levels as we want to walk.
 */
#define NEXT(P)                                                         \
fbt::dpioctl:entry                                                      \
/!this->match && this->P != NULL && this->P->pd_fd == $2/               \
{                                                                       \
        this->match = 1;                                                \
}                                                                       \
                                                                        \
fbt::dpioctl:entry                                                      \
/!this->match && this->P != NULL/                                       \
{                                                                       \
        this->P = this->P->pd_next;                                     \
}

NEXT(p0)
NEXT(p0)
NEXT(p0)
NEXT(p0)

fbt::dpioctl:entry
/this->match && this->p0->pd_events != 0/
{
        this->found = 1;
}


The file descriptors being polled on are kept in a hash table (pollcache_t) in a per-device entry (dp_entry_t) that is indexed by the minor number of the device. To check whether the file descriptor we cared about was present, we implemented the same logic that exists in pcache_lookup_fd in our script. Since we couldn’t iterate over the hash table, we used a macro to generate predicates that let us handle a limited number of collisions. This D code limited data collection to the calls that we knew were coming from Grizzly. On top of this we added logic to speculatively trace all wakeups, and to dump the speculative data if we hit any of the error conditions we thought might cause the problem. The rest of the script looked like:

/*
 * Convenience macros for dumping entry and return fd information
 * consistently.
 *
 * Note: These rely on self / this context being set correctly,
 *       so be careful where you use them.
 */
#define PRINT_ENTRY_FD(LABEL)                                           \
        this->elapse = (timestamp - self->ts) / 1000000;                \
        printf("Tid: %d %s(fd=%d,events=%d,count=%d,timeout=%d,elapseMS=%d) ", \
               tid, LABEL, $2, self->events, self->nfds, self->timeout, \
               this->elapse)

#define PRINT_RETURN_FD(LABEL)                                          \
        this->elapse = (timestamp - self->ts) / 1000000;                \
        printf("Tid:%d %s(fd=%d,events=%d,revents=%d,elapseMS=%d) ",    \
               tid, LABEL, this->fds.fd, this->fds.events,              \
               this->fds.revents, this->elapse)

#define DUMP_DATA()                                                     \
        jstack();                                                       \
        self->fail = 1

fbt::dpioctl:entry
/this->found/
{
        self->yes = 1;
        self->events = this->p0->pd_events;
        self->ts = timestamp;

        /*
         * This ioctl's argument is a dvpoll32 structure:
         *
         * typedef struct dvpoll32 {
         *         caddr32_t       dp_fds;          pollfd_t array
         *         uint32_t        dp_nfds;         num of pollfd's in dp_fds[]
         *         int32_t         dp_timeout;      time out in milisec
         * } dvpoll32_t;
         *
         * dp_fds is a pointer to an array of:
         *
         * typedef struct pollfd {
         *      int fd;                         fd
         *      short events;                   in events
         *      short revents;                  out events
         * } pollfd_t
         *
         * This function returns:
         *      > 0:
         *              This indicates an error, the return is an errno value
         *      0:
         *              Success. $arg5 is a pointer to the count of fd's
         *              returned, dvpoll->dp_fds will be densely packed
         *              with the resulting fd's and the associated events
         *              that were polled for (events) and the events that
         *              are now ready (revents).
         */
        this->dvpoll32 = (dvpoll32_t *) copyin(arg2, sizeof (dvpoll32_t));

        /*
         * Save a pointer to the pollfd_t array, the size of the array, and
         * the timeout so that we can examine them in the return probe.
         */
        self->fds = this->dvpoll32->dp_fds;
        self->nfds = this->dvpoll32->dp_nfds;
        self->timeout = this->dvpoll32->dp_timeout;
        self->rvalp = args[5];
}

fbt::dpioctl:entry
/self->yes/
{
        self->spec = speculation();
}

/*
 * Record when the thread of interest goes to sleep.
 */
sched:::sleep
/self->yes/
{
        interested[pid, tid] = self->spec;
}

sched:::wakeup
/interested[args[1]->pr_pid, args[0]->pr_lwpid]/
{
        speculate(interested[args[1]->pr_pid, args[0]->pr_lwpid]);
        stack();
        jstack();
}

/*
 * cv_waituntil_sig tests
 *
 *      cv_waituntil_sig is used by dpioctl to wait for event data or
 *      a timeout. Its semantics are:
 *
 *      0 if a signal was received
 *      -1 if a timeout occurred
 *      >0 if awakened via cv_signal() or cv_broadcast() or spurious wakeup
 *
 *      TEST 1: Check whether cv_waituntil_sig was interrupted by a signal
 *      TEST 2: Check whether cv_waituntil_sig had a timeout expire
 */

/*
 * TEST 1, 2
 */
fbt::cv_waituntil_sig:return
/self->yes && (int)arg1 <= 0/
{
        PRINT_ENTRY_FD("FAIL ");
        printf("returns error: %d", (int)arg1);
        DUMP_DATA();
}

/*
 * TEST 4
 */
fbt::dpioctl:return
/self->yes && arg1 == 0 && *self->rvalp != 1/
{
        PRINT_ENTRY_FD("FAIL ");
        printf("returned with %d ready (expected 1)", *self->rvalp);
        DUMP_DATA();
}

/*
 * Grab the fd that is being returned, the set of events it waited on, and
 * the set of events being returned.
 */
fbt:poll:dpioctl:return
/self->yes && arg1 == 0 && *self->rvalp == 1/
{
        this->fds = *(pollfd_t *) copyin(self->fds, sizeof (pollfd_t));
        self->rmatch = 1;
}

/*
 * TEST 5, 6
 */
fbt:poll:dpioctl:return
/self->rmatch && (this->fds.fd != $2 || self->events != this->fds.revents)/
{
        PRINT_ENTRY_FD("FAIL ");
        PRINT_RETURN_FD("returns ");
        DUMP_DATA();
}

fbt::dpioctl:return
/self->spec && self->fail/
{
        commit(self->spec);
}

fbt::dpioctl:return
/self->spec && !self->fail/
{
        discard(self->spec);
}

fbt::dpioctl:return
/self->yes/
{
        /*
         * Clear all of the dynamic variables we used
         */
        self->yes = 0;
        self->events = 0;
        self->ts = 0;

        self->fds = 0;
        self->nfds = 0;
        self->timeout = 0;
        self->rvalp = 0;
        self->rmatch = 0;
        self->spec = 0;
}
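
For reference, the script takes two macro arguments: $1 is the pid of the JVM and $2 is the file descriptor to look for in the poll cache. A hypothetical invocation (the script name and pid are made up; 3466 was one of the descriptors Grizzly read from) would be:

./devpoll-watch.d 1234 3466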


The script was structured to have a test case for each of the theories that we had come up with. We spent some time walking through each test to determine whether the failure would give us enough information to diagnose the root cause, and what additional data we should try to capture. We felt pretty confident that the data from the script would either lead us to the root cause or give us enough to narrow down what to look at next. All we needed now was a way to reproduce the problem.

Asking to debug on a customer’s system is always a sensitive issue; you don’t want to waste their time or disrupt their environment. We were very fortunate to be working with an incredibly accommodating customer that was genuinely interested in getting to the root cause of the issue. Our customer contact had been a huge champion of Delphix, and he was willing to invite us on-site to try to debug the issue. Several of my colleagues went to visit the customer and brought me along virtually (I’m based in Boston and the customer is in California). We sat down at their system, installed our DTrace script on the Delphix appliance, and started the job to reproduce the problem. In the customer’s environment the bug tended to reproduce about 30 minutes into the transfer, which meant we would run through this code path several thousand times before seeing the issue; speculative tracing was key to avoid generating more data than we could sift through. We got pretty lucky: 20 minutes after we started trying to reproduce the problem we had our first clue, test 2 had failed! The output looked like:

cv_waituntil_sig:return Tid: 303 TIMEOUT(fd=3466,events=1,count=8192,timeout=3600000,elapseMS=84542) returns: (334) 4294967295
dpioctl:return Tid: 303 FAIL (fd=3466,events=1,count=8192,timeout=3600000,elapseMS=84543) returned with 0 ready (expected 1)

3   5917                cv_unsleep:wakeup  1541282395925158
              genunix`setrun_locked+0x96
              genunix`pokelwps+0x12c
              genunix`holdlwps+0x95
              genunix`cfork+0x91
              genunix`forksys+0x63
              unix`sys_syscall32+0xff
              libc.so.1`vfork+0x29


Unfortunately we had drops in our speculative tracing, so we didn’t capture all of the wakeups. cv_waituntil_sig was telling us that the timeout had expired after 85 seconds, which was odd, as the specified timeout was an hour. We had seen the spurious wakeups from vfork in our own testing, but they were handled correctly and were the normal result of other activity in our management stack. We assumed that the actual wakeup from the timeout expiring had been lost to speculative drops. We knew we were getting close, but we wanted a bit more data. The beauty of DTrace is how easy it was to add more instrumentation on the fly: we added some new probes to try to catch the wakeup from within cv_waituntil_sig. This time we nailed it! The tracing showed that -1 was returned from cv_waituntil_sig but not from any of the functions it called! We finally had our answer!
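
The new probes were in the spirit of this sketch (simplified, and heavier-handed than what we actually ran): use DTrace’s flow indentation to show every kernel function called beneath cv_waituntil_sig, along with its return value, so we could see where the -1 first appeared.

#!/usr/sbin/dtrace -Fs

/*
 * Sketch: flow-indent everything called beneath cv_waituntil_sig
 * for the JVM ($1 is its pid), printing each return value so we
 * can see where a -1 first appears.
 */
fbt::cv_waituntil_sig:entry
/pid == $1/
{
        self->trace = 1;
}

fbt:::entry
/self->trace/
{
}

fbt:::return
/self->trace/
{
        printf("rval = %d", (int)arg1);
}

fbt::cv_waituntil_sig:return
/self->trace/
{
        self->trace = 0;
}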

cv_waituntil_sig takes an absolute timeout rather than a relative one as a parameter; that is, it says “wait until 6 pm” instead of “wait for 3 hours.” To avoid cases where a change to the system time would invalidate a future timeout, it explicitly returns -1 if it determines that the time has been changed. The calling thread blocked in cv_waituntil_sig was waking up, after the time had been changed, as a side effect of the process doing a vfork; at that point dpioctl would see that no events were ready and would call back into cv_waituntil_sig to wait for events. As a result of the time change, cv_waituntil_sig would immediately return -1, causing dpioctl to return prematurely. The spurious wakeups from vfork were handled incorrectly only in the case where the system time had changed after the blocking thread had called into cv_waituntil_sig. Using mdb we were able to confirm that NTP was frantically changing the time, as the ESX host’s clock had skewed wildly. The best news was that we could temporarily work around the problem by disabling timeouts in Grizzly, allowing the customer to successfully link their database!
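
(Had we wanted to watch the clock changes with DTrace rather than mdb, a sketch along these lines would do. It assumes NTP was stepping the clock via clock_settime(2); a slewing daemon would show up under adjtime(2) instead.)

#!/usr/sbin/dtrace -s

/*
 * Sketch: report any process setting the system clock. Assumes
 * the clock is being stepped via clock_settime(2).
 */
syscall::clock_settime:entry
{
        printf("%s[%d] is setting the system time", execname, pid);
}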

This was my first experience debugging on a system with DTrace and mdb available; while I was aware of what both tools could do, I’d never really used them before. The ability to dynamically add instrumentation on a customer system is an invaluable resource. In a few hours at the customer site, without making any changes to the installed software, we were able to test our theories, refine the data we gathered, and root-cause the problem. Without DTrace we would have had to build a custom kernel tracing framework to gather the events we cared about, deliver a one-off kernel to the customer, and repeat that process for each iteration; each pass would have taken at least 24 hours rather than 30 minutes. Thanks to DTrace, we’ve filed https://www.illumos.org/issues/1605 to track the problem and are in the process of submitting a fix.

Responses to “Customer Problems”

  1. [...] Matt’s first blog post is a great war story, describing his first use of DTrace on a customer system. It was vindicating to witness first hand, both in how productive an industry vet could be with DTrace after a short period, and in what a great hire Matt was for Delphix. Working with him has been evocative of our collaboration in college — while making all the more apparent the significant distance we’ve both come. Welcome, Matt! [...]

  2. Neil Williams says:

    Hi,

    I just wanted to say I kept up until paragraph 6 but enjoyed it all ;-)

    BUT I am probably only expected to :-) My job is a sys admin, specifically Solaris. Without undergoing a formal qualification in kernel design from Browns I have no chance :'(

    I think that dtrace provides some developer diagnostic tools that have never existed before. Which will benefit me and my customers. So well done the inventors, the implementors and the people who know how to use it!

    thanks for the blog.
    cheers
    Neil

  3. Richard Elling says:

    Thanks Matt, this is an excellent example of how to dig in and troubleshoot systems engineering problems. In many cases, the observability simply doesn’t exist and without the insight demonstrated here, the problem might never be solved.

  4. Great post, and excellent customer service. btw, your D script’s #include line needs to use &lt; and &gt; rather than raw angle brackets – otherwise my browser fails to display the filename :|
