ZFS fundamentals: transaction groups
I’ve continued to explore ZFS as I try to understand performance pathologies, and improve performance. A particular point of interest has been the ZFS write throttle, the mechanism ZFS uses to avoid filling all of system memory with modified data. I’m eager to write about the strides we’re making in that regard at Delphix, but it’s hard to appreciate without an understanding of how ZFS batches data. Unfortunately that explanation is literally nowhere to be found. Back in 2001 I had not yet started working on DTrace, and was talking to Matt and Jeff, the authors of ZFS, about joining them. They had only been at it for a few months; I was fortunate to be in a conference with them as the ideas around transaction groups formulated. Transaction groups are how ZFS batches up chunks of data to be written to disk (“groups” of “transactions”). Jeff stood at the whiteboard and drew the progression of states for transaction groups, from open, accepting new transactions, to quiescing, allowing transactions to complete, to syncing, writing data out to disk. As far as I can tell, that was both the first time that picture had been drawn and the last. If you search for information on ZFS transaction groups you’ll find mention of those states… and not much else. The header comment in usr/src/uts/common/fs/zfs/txg.c isn’t particularly helpful:
/* * Pool-wide transaction groups. */
I set out to write a proper description of ZFS transaction groups. I’m posting it here first, and I’ll be offering it as a submission to illumos. Many thanks to Matt Ahrens, George Wilson, and Max Bruning for their feedback.
ZFS Transaction Groups
ZFS transaction groups are, as the name implies, groups of transactions that act on persistent state. ZFS asserts consistency at the granularity of these transaction groups. Each successive transaction group (txg) is assigned a 64-bit consecutive identifier. There are three active transaction group states: open, quiescing, or syncing. At any given time, there may be an active txg associated with each state; each active txg may either be processing, or blocked waiting to enter the next state. There may be up to three active txgs, and there is always a txg in the open state (though it may be blocked waiting to enter the quiescing state). In broad strokes, transactions — operations that change in-memory structures — are accepted into the txg in the open state, and are completed while the txg is in the open or quiescing states. The accumulated changes are written to disk in the syncing state.
Open
When a new txg becomes active, it first enters the open state. New transactions — updates to in-memory structures — are assigned to the currently open txg. There is always a txg in the open state so that ZFS can accept new changes (though the txg may refuse new changes if it has hit some limit). ZFS advances the open txg to the next state for a variety of reasons such as it hitting a time or size threshold, or the execution of an administrative action that must be completed in the syncing state.
Quiescing
After a txg exits the open state, it enters the quiescing state. The quiescing state is intended to provide a buffer between accepting new transactions in the open state and writing them out to stable storage in the syncing state. While quiescing, transactions can continue their operation without delaying either of the other states. Typically, a txg is in the quiescing state very briefly since the operations are bounded by software latencies rather than, say, slower I/O latencies. After all transactions complete, the txg is ready to enter the next state.
Syncing
In the syncing state, the in-memory state built up during the open and (to a lesser degree) the quiescing states is written to stable storage. The process of writing out modified data can, in turn modify more data. For example when we write new blocks, we need to allocate space for them; those allocations modify metadata (space maps)… which themselves must be written to stable storage. During the sync state, ZFS iterates, writing out data until it converges and all in-memory changes have been written out. The first such pass is the largest as it encompasses all the modified user data (as opposed to filesystem metadata). Subsequent passes typically have far less data to write as they consist exclusively of filesystem metadata.
To ensure convergence, after a certain number of passes ZFS begins overwriting locations on stable storage that had been allocated earlier in the syncing state (and subsequently freed). ZFS usually allocates new blocks to optimize for large, continuous, writes. For the syncing state to converge however it must complete a pass where no new blocks are allocated since each allocation requires a modification of persistent metadata. Further, to hasten convergence, after a prescribed number of passes, ZFS also defers frees, and stops compressing.
In addition to writing out user data, we must also execute synctasks during the syncing context. A synctask is the mechanism by which some administrative activities work such as creating and destroying snapshots or datasets. Note that when a synctask is initiated it enters the open txg, and ZFS then pushes that txg as quickly as possible to completion of the syncing state in order to reduce the latency of the administrative activity. To complete the syncing state, ZFS writes out a new uberblock, the root of the tree of blocks that comprise all state stored on the ZFS pool. Finally, if there is a quiesced txg waiting, we signal that it can now transition to the syncing state.
What else?
Please let me know if you have suggestions for how to improve the descriptions above. There’s more to be written on the specifics of the implementation, transactions, the DMU, and, well, ZFS in general. One thing that I’d note is that Matt mentioned to me recently that were he starting from scratch, he might eliminate the quiescing state. I didn’t understand fully until I researched the subsystem. Typically transactions take a very brief amount of time to “complete”, time measured by CPU latency as opposed, say, to I/O latency. Had the quiescing phase been merged into the syncing phase, the design would be slightly simpler, and it would eliminate the mostly idle intermediate phase where a bunch of dirty data can sit in memory relatively idle.
Next I’ll write about the ZFS write throttle, it’s various brokenness, and our ideas for how to fix it.
dtrace.conf(12) wrap-up
For the second time in as many quadrennial dtrace.confs, I was impressed at how well the unconference format worked out. Sharing coffee with the DTrace community, it was great to see some of the oldest friends of DTrace — Jarod Jenson, Stephen O’Grady, Jonathan Adams to name a few — and to put faces to names — Scott Fritchie, Dustin Sallings, Blake Irvin, etc — of the many new additions to the DTrace community. You can see all the slides and videos; these are my thoughts and notes on the day.
Bryan provided a typically eloquent review of the state of the community. DTrace development is alive and well — after a lull while Oracle’s acquisition of Sun settled in — with new support for a variety of languages and runtimes, and new products that rely heavily on DTrace as a secret sauce. Bryan laid out some important development goals, areas where many have started straying from the edges of the completed DTrace features into the partially complete or starkly missing. We all then set to work hammering out a loose schedule for the day; I’ll admit that at first I was worried that we’d have too many listeners and not enough presenters, but the schedule quickly filled — and with more topics than we’d end up having time to cover.
User-land CTF and Dynamic Translators
DTrace, from its inception, has been a systemic analysis tool, but the earliest development focused on kernel observability — not a surprise since Bryan, Mike, and I developed it while working in the Solaris kernel development. After its use spread (quickly) beyond the kernel team, use shifted more and more to features focused on understanding C and C++ applications in user-land, and then to applications written in a variety of higher-level languages — Java, Ruby, Perl, Javascript, Erlang, etc. User-land Statically Defined Tracing (USDT) is the DTrace facility that enables rich tracing of higher-level languages. It was a relatively late addition to DTrace (integrated in 2004, well after the initial integration in 2003), and since then we’ve learned a lot about what we got right, what we got wrong, and where it’s rough — in some cases very rough — around the edges.
In his opening remarks, Bryan identified USDT improvements as a key area for the community’s focus. In DTrace development we tried to focus on making the impossible possible rather than making the possible easier. In its current form, some things are still impossible with DTrace, namely consumption of type structures from user-land programs; stable, non-privileged use of DTrace; and support for different runtime versions. Dave Pacheco and I took the first slot on the schedule and spoke (at length — sorry) about solutions to these problems.
While others had the benefit of a bit more time to prepare, I did have the advantage of spending many years idly contemplating the problem space and possible solutions. On the subject of user-land type information (in the form of CTF), I identified the key parts of the code that would would need some work. For the USDT enhancements, we discussed dynamic translators — D code that would be linked and executed at runtime, contrasted with today’s static translators that are compiled into a D program — how they would address the problem, and how these ideas could be extended to the kernel (for once, user-land is actually a bit ahead).
I’ll go into the details of our off the cuff proposals, and delve into the code to firm up those ideas in a future blog post. Beyond the extensive implementation work we laid out, the next step is to gather the most complicated, extant USDT providers and proposals for other providers, and figure out what they should look like in the new, dynamic translator world.
The D Language
Next up, my long-time colleague, DTrace contributor, Eric Schrock led the discussion on D language additions. The format of a D program is heavily tied to DTrace’s implementation: all clauses must trace a fixed amount of data, and infinite loops are forbidden. For this reason, D lacks the backward branches needed for traditional looping, subroutines for common code, and if/else clauses for control flow. Each of these has a work-alike — unrolled loops, macros, and predicates or the ternary operator — but their absence renders D confusing to some — especially those unaware of the motivation. Further, the D language need not necessarily hold the underlying implementation so central.
Eric discussed some proposals for how each might be addressed, and I noted that it would be possible to create a prototype environment where we could try out these “D++” features by compiling into D work-alikes. The next step is to identify the most complicated D scripts, and see what they might look like for various incarnations of those language features.
Work with DTrace
The next few sessions focused not on changes to DTrace, but interesting work done using DTrace:
John Thompson of Sony talked about their port of DTrace to the Playstation Vita (!). Sony developers are given access to DTrace, but found it to be unfamiliar and unapproachable. John spoke his attempts to remedy this by replacing D with a C++-like interface which he implemented by replacing the D compiler with Clang.
My Fishworks colleague, Brendan Gregg, showed some of beautiful visualizations they’ve been developing at Joyent, and talked about the analyses those visualizations enabled. As always, it was fascinating stuff. If you don’t read Brendan’s blog, you really should. Long-time DTrace advocate, Theo Schlossnagle, talked about the visualizations they’re doing in Circonus — also fascinating stuff for anyone thinking about how to present system activity in comprehensible ways. Richard Elling showed the DTrace-based visualizations Nexenta used at VMworld to rave reviews.
Mark Cavage presented Joyent’s work bringing DTrace to node.js; Scott Fritchie talked about DTrace for Erlang. Both were useful sources of ideas for how we could improve USDT.
Ryan Stone presented the state of DTrace on FreeBSD. That DTrace is not enabled in the build by default remains a key obstacle for adoption. I hope that Ryan et al. are able to persuade the FreeBSD leadership that their licensing fears are misguided.
DTrace for OEL
I was delighted that Kris van Hees was able to attend to present the Oracle port to Linux. DTrace for OEL was announced at Oracle Open World 2011, but the initial beta didn’t live up to its billing at OOW. As is often the case, this was more a failure of messaging than of engineering. Kris and his team are making steady progress. While it’s not yet in the public beta, they have the kernel function boundary tracing provider (fbt) implemented. Most heartening of all, Oracle intends to keep DTrace for OEL moving forward as the community evolves and improves DTrace — rather than forking it. How that plays out, and what that means for DTrace on Oracle Solaris will be interesting to see, but it’s great to hear that Kris sees the value of DTrace ubiquity and DTrace compatibility.
As was remarked several times, having DTrace available on the fastest growing deployment platform will be the single most significant accelerator for DTrace adoption. The work Kris and his team at Oracle are doing is probably the most important in the DTrace ecosystem, and I think that I speak for the entire DTrace community in offering to assist in any way possible.
A ZFS DTrace Provider
Matt Ahrens and George Wilson — respectively the co-inventor of ZFS, and the preeminent SPA developer — presented a proposal for a DTrace provider for ZFS. ZFS is a highly sophisticated filesystem, but one that is also difficult to understand. Building in rich instrumentation is going to be a tremendous step forward for anyone using ZFS (for example, our mutual employer, Delphix).
Whither DTrace?
Jarod Jenson — the first DTrace user outside of Sun — took the stage in the final session to talk about DTrace adoption. Jarod has made DTrace a significant part of his business for many years. What continues to amazing him, despite numerous presentations, demonstrations, and lessons, is the relatively low level of DTrace adoption. DTrace is a tool that comes alive in the hands of a skilled, scientific, incisive practitioner — and in all of those, Jarod is superlative — but it can have a high bar of entry. There were many concrete suggestions for how to improve DTrace adoption. Most of them didn’t hold water for me — different avenue for education, further documentation, community outreach, higher level tools, visualizations, etc. — but two were quite compelling: DTrace for Linux, and DTrace on stackoverflow.com (and the like). I don’t know how much room there is to participate in the former, but by all means if there are DTrace one-liners that solve problems (on Mac OS X for example), post them, and get people covertly using DTrace.
The core DTrace community is growing. It was great to see old friends like Steve Peters who worked on porting DTrace to Mac OS X in the same room as Kris van Hees as he spoke about his port to Linux. It was inspiring to see so many new members of the community, eager to use, build and improve DTrace. And personally it inspired me to get back into the code to finish up some projects I had in flight, and to chart out the course for some of the projects we discussed.
Thanks to everyone who attended dtrace.conf in person or online. And thanks especially to Deirdre Straughan who made it happen.
ZFS+10: illumos meetup
ZFS recently celebrated its informal 10th anniversary; to mark the occasion, Delphix hosted a ZFS-themed meetup for the illumos community (sponsored generously by Joyent). Many thanks to Deirdre Straughan, the new illumos community manager, for helping to organize and for filming the event. Three of my colleagues at Delphix presented work they've been doing in the ZFS ecosystem.
Matt Ahrens, who (with Jeff Bonwick) invented ZFS back in 2001, started the program with a discussion of a new stable interface for ZFS. Initially libzfs had been designed as a set of helper functions in support of the zfs(1M) and zpool(1M) commands; since then, it has outgrown those humble ambitions and a new, simple, stable interface is needed for programmatic consumers of ZFS. In Matt's talk and blog post, he lays out a series of guiding principles for the new libzfs_core library; he's already started to implement these ideas for new ZFS features in development at Delphix.
John Kennedy has been working on a relatively neglected part of illumos: automated testing. At the meetup John spoke about the work he's been doing to revitalize the ZFS test suite, and to build a unit testing framework for illumos at large. I found the questions and enthusiasm from the people in the room particularly encouraging -- everyone knows that we need to be doing more testing, but until John stepped up, no one was leading the charge. The ZFS test suite is available on github. Take a look at John's blog post to see how to execute the ZFS test suite, and how you can contribute to illumos by helping him diagnose and fix the 60+ outstanding failures.
Chris Siden has been at Delphix just since he graduated from Brown University this past spring, but he's already made a tremendous impact on ZFS. Chris presented both the work he's done to finish the work started by Basil Crow (also of Brown, and soon full-time at Delphix) on ZFS feature flags (originally presented to the ZFS community by Matt back in May). Previously, ZFS features followed a single, linear versioning; with Chris and Basil's work it's not a land-grab for the next version, rather each feature can be enabled discretely. Chris also implemented the world's first flagged ZFS feature, Async Destroy (also known to ZFS feature flags as com.delphix:async_destroy) which allows datasets to be destroyed asynchronously in the background -- a huge boon when destroying gigantic ZFS datasets. Chris also presented some work he's been doing on backward compatibility testing for ZFS; check out his blog post on both subjects.
The illumos meetup was a great success. Thank you to everyone who attended in person or on the web. To get involved with the ZFS community, join the illumos ZFS mailing list, and for information on the next illumos meetup, join the group.
ZFS 10th anniversary
Exactly 10 years ago today, Jeff Bonwick and Matt Ahrens got their first ZFS prototype working in user-land. Jeff had scrapped his previous attempt at reinventing filesystems, working through the established filesystem management and engineering channels at Sun, and this time started with a clean sheet of paper. Matt had joined Sun that June shortly after graduating from Brown University. Both prodigious coders, the duo, in remarkably short order, showed us a glint of what ZFS would be. A year later, the master and apprentice had ZFS working in the kernel, moving data from end to end. Three years after that, standing in front of a team of a dozen engineers, Matt typed 'putback' to integrate ZFS into Solaris. The distance ZFS has traveled these past 10 years has been monumental, and ZFS has indelibly impacted the industry. ZFS is one of the load-bearing pillars here at Delphix; without it, our task would have been too ambitious to even begin. Congratulations to our own Matt Ahrens on this milestone, as well as to Jeff, and everyone else who has contributed to ZFS over the last 10 years including the growing community building new products around ZFS and illumos.
Update: Check out Matt's blog post on the subject.