AGILE DATA MANAGEMENT

One good probe deserves another

During my first week at Delphix I was asked to participate in a support call with a customer. The customer had recently upgraded their 4 vCPU Delphix VM (running on VMware ESX) to a new Delphix version and added two additional vCPUs. After the upgrade the VM would no longer boot. Each time the VM booted the OS would start to initialize and then hang. My colleague and I tried to force a crash dump, but unfortunately the VM did not respond to an NMI. We rebooted the VM and used the debugger to walk through early boot, which showed that the system hung sometime after calling start_other_cpus(). start_other_cpus() is called from CPU 0 during early boot to start the other processors in the system, it was not clear from the debugger where or why it was hanging.

The customer was anxious to have their system back, so we took a stab in the dark and changed the vCPU count from 6 to 1. To our surprise, the VM booted. We then changed the vCPU count from 1 back to 6 and the VM still booted. The customer was elated that the system was up, but I was more then a little concerned since we had no idea of the actual root cause.

With the customer temporarily unblocked I came back to Delphix and tried to reproduce the problem in house. I wrote a simple script that matched what was done at the customer site:

1) boot a 4 vCPU VM
2) upgrade the Delphix OS version
3) stop the VM
4) Change the vCPU count from 4 to 6
5) start the VM

It took about a week but I eventually reproduced the problem. While I was time-constrained on the customer system, I could now walk instruction-by-instruction through the boot process. Sadly single-stepping with kmdb was throwing off the timing such that the system no longer hung at boot. I wanted to use DTrace to try to gather some stack backtraces, but couldn’t leverage the post-mortem facility because the hang was so early during boot.

DTrace has made me very accustomed to being able to record the arbitrary data needed to understand and diagnose problems. When problems arise in areas where DTrace can’t shine a light, it makes that lack of information all the more painful. Fortunately my old colleague Keith Adams built a DTrace-like facility into VMware that provides observability into what the guest operating systems is doing. ESX 4.1 didn’t have VProbes, so I migrated the VM to ESX 5.0 and after a bit if work reproduced the problem there.

My first experiment was to write a simple VProbe script to grab a backtrace on the 1Hz timer to see where the various vCPUS were. I came up with the following:


@VM=147
(defstring stack)
(VProbe VMM1Hz
   (gueststack stack)
      (printf "CPUD: %d Stack: %s\n" VCPUID stack))

This generated the following:

CPUD: 1 Stack: GUEST_0xfffffffffb8545c6_0xfffffffffb854a79
CPUD: 3 Stack: GUEST_0xfffffffffb841c8e_0xfffffffffb842595_0xfffffffffb84278e_0xfffffffffb84357e_0x0
CPUD: 0 Stack: GUEST_0xfffffffffb84cd24_0xfffffffffb841e1c_0xfffffffffb841f5c_0xfffffffffb842220_0xfffffffffba724f4_0xfffffffffb8000a0_0x0
CPUD: 2 Stack: GUEST_0xfffffffffb801d94_0xfffffffffb9ec9ed_0xfffffffffb9ed69d_0xfffffffffb9d4b3f_0xfffffffffb9d54db_0xfffffffffb9dc2a3_0xfffffffffb9dc172_0xfffffffff79d376d_0xfffffffffb86e938_0x0

This was huge! Until VProbes I had no idea where the CPUs were hung, now I had locations for all of them! Looking at the data for a while I noticed that the backtraces for CPU 1, 2, and 3 were always identical, the backtrace for CPU 3 varied between two different offsets in the same function. Using another copy of the VM running on the old ESX host, I wrote a script to use mdb to convert the hex addresses from VProbes into a symbolic stack:


CPU 0:
        tsc_sync_master+0x6c:
        mp_start_cpu_common+0x134:
        start_cpu+0x44:
        start_other_cpus+0x198:
        main+0x2ac:
        _locore_start+0x90:

CPU 1:
        xc_serv+0x12e:
        xc_common+0x229:

CPU 2:
        cbe_xcall+0x95:
        cyclic_reprogram_here+0x4d:
        cyclic_reprogram+0x95:
        callout_heap_insert+0x6f:
        timeout_generic+0x333:
        cv_timedwait_hires+0xab:
        cv_timedwait+0x5a:
        arc_reclaim_thread+0x13d:
        thread_start+8:

CPU 3:
        mp_startup_signal+0x6e:
        mp_startup_common+0x1b5:
        mp_startup_boot+0xe:
        0xfffffffffb84357e:

 

The backtraces for CPUs 0 and 3 appeared to be stuck in tight loops:

CPU 0 was stuck in tsc_sync_master():


...
        for (cnt = 0; cnt < SYNC_ITERATIONS; cnt++) {
                while (tsc_sync_go != TSC_SYNC_GO)
                        SMT_PAUSE(); <-- *** STUCK HERE ***
...

CPU 3 was stuck in the second call to mp_startup_signal() from mp_startup_common():


...
        CPUSET_ATOMIC_ADD(*(cpuset_t *)sp, cpuid);
        for (tempset = *sp; CPU_IN_SET(tempset, cpuid);
            tempset = *(volatile cpuset_t *)sp) {
                SMT_PAUSE();  <-- *** STUCK HERE ***
        }
...

At a high level CPU 0 was waiting for CPU 3 to indicate that it was ready to sync its TSC. CPU 3 had already finished syncing its TSC and was waiting for the boot CPU to signal it to finish initialization. Looking a bit more closely at tsc_sync_master() I saw an obvious issue:


...
        hwtype = get_hwenv();
        if (!tsc_master_slave_sync_needed || hwtype == HW_XEN_HVM ||
            hwtype == HW_VMWARE) cpu_id;

        for (cnt = 0; cnt < SYNC_ITERATIONS; cnt++) {
                while (tsc_sync_go != TSC_SYNC_GO)
                        SMT_PAUSE(); <--- *** STUCK HERE ***

...

tsc_sync_master() has explicit logic to _not_ sync the TSC if we are running on VMware, but somehow get_hwenv() had returned something other then HW_VMWARE. Looking at get_hwenv() quickly led me to determine platform():


...
int
get_hwenv(void)
{
        if (platform_type == -1)
                determine_platform();

        return (platform_type);
}

static void
determine_platform()
{
        struct cpuid_regs cp;
        char *xen_str;
        uint32_t xen_signature[4], base;

        platform_type = HW_NATIVE;

        ...
}

determine_platform() is clearly broken if multiple CPUs call it, as it resets the global platform_type each time it’s called. It turns out that we call determine_platform() from cpuid_pass1(), which is called as part of mp_cpu_start_common() for each CPU. Voila, I found the race!

CPU 3 calls determine_platform() as part of startup which sets platform_type to HW_NATIVE briefly before setting it to HW_VMWARE. CPU 0 happens to call get_hwenv() during that window, which causes tsc_sync_master() to get stuck waiting for CPU 3 to set tsc_sync_go to TSC_SYNC_GO, which will never happen.

I’m always amazed at how easy diagnosing complex problems can be once you have the right data. I tried to debug the issue for several days using only mdb and didn’t really make any progress, but within several hours of getting the VProbe data I had a root cause. While DTrace can collect data from nearly anywhere, there are some parts of the system that it can’t reach. VProbes was a huge savior and another great debugging tool to have at your disposal.

Comments are closed.