The Future of LibZFS

Tuesday, January 17th, 2012 - Posted by matt

Last week we had a very successful Illumos meetup, hosted at Delphix HQ in Menlo Park.  Thanks to all who participated! In honor of the ZFS 10 year anniversary, my colleagues Chris Siden and John Kennedy gave great talks about ZFS Feature Flags and testing strategies.  My contribution was to present plans for a new programmatic interface to ZFS.

I've been running into a lot of problems with libzfs in my work at Delphix, generally falling into two categories: making changes to libzfs, and using libzfs.

The ZFS community has added many new capabilities to libzfs since it was integrated on October 31, 2005.  Unfortunately, some of them do not fit into the original design.  For example, back in the day there were only "normal" properties -- the statically defined ones in zfs_prop_t (e.g. quota, used, compression).  Now we have several flavors of dynamic properties: user properties (e.g. "com.delphix:dbname"), userquota-type props (e.g. "userused@mahrens", "userquota@csiden"), and written props (e.g. "written@prevsnap").  Each of these flavors requires special-case code to be added in several different places.  We need to consolidate all the handling of one property "flavor" in one place.
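
To make the idea of handling each flavor in one place concrete, here is a minimal sketch in C.  It is not code from libzfs: the type prop_flavor_t and the functions has_prefix() and prop_flavor() are hypothetical names invented for illustration, and the matching rules are simplified (a real implementation would also validate the name against zfs_prop_t).  The point is only that a single function decides the flavor, so callers no longer need their own special cases.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical flavors, named after the categories described above. */
typedef enum {
	PROP_FLAVOR_NORMAL,	/* statically defined, e.g. "quota" */
	PROP_FLAVOR_USER,	/* e.g. "com.delphix:dbname" */
	PROP_FLAVOR_USERQUOTA,	/* e.g. "userused@mahrens" */
	PROP_FLAVOR_WRITTEN	/* e.g. "written@prevsnap" */
} prop_flavor_t;

/* Return nonzero if s starts with prefix. */
static int
has_prefix(const char *s, const char *prefix)
{
	return (strncmp(s, prefix, strlen(prefix)) == 0);
}

/*
 * Classify a property name in exactly one place.  The '@' prefixes are
 * checked before the ':' test so that names such as "userquota@..." are
 * never mistaken for user properties.
 */
prop_flavor_t
prop_flavor(const char *name)
{
	assert(name != NULL);

	if (has_prefix(name, "written@"))
		return (PROP_FLAVOR_WRITTEN);
	if (has_prefix(name, "userused@") || has_prefix(name, "userquota@") ||
	    has_prefix(name, "groupused@") || has_prefix(name, "groupquota@"))
		return (PROP_FLAVOR_USERQUOTA);
	if (strchr(name, ':') != NULL)
		return (PROP_FLAVOR_USER);
	return (PROP_FLAVOR_NORMAL);
}
```

With a classifier like this, code that gets or sets a property could switch on the flavor once instead of repeating string checks in several places.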

LibZFS has outgrown its original design, to the point that even simple enhancements are overly complicated and risky.

We are using libzfs from our Java stack, via JNA.  We've run into a number of difficulties: libzfs is not thread safe; the interface is unstable; there are different functions for manipulating each "flavor" of property.

To address these issues and more, I propose that we create a new library, libzfs_core, which will generally be a thin wrapper around the kernel ioctls.  Error handling and thread safety issues will be pushed down into the kernel, and user interface issues will be pushed up into the libzfs_core consumers.  Our goals for libzfs_core are:

  • Thread safe
    • Avoid global data (e.g. caching)
  • Committed interface
    • Consumers will work on future releases
    • Interfaces won't be changed
  • Programmatic error handling
    • Routines return defined error numbers or error nvlists
    • No printing to stdout/stderr
  • Thin layer
    • Generally just marshals arguments to/from kernel ioctls
    • Generally 1:1 libzfs_core function -> ioctl
  • Clear atomicity
    • Generally, every function call will be atomic (because ioctls are generally atomic)
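
As a rough illustration of these goals, here is a sketch of what one such thin wrapper might look like.  Everything in it is an assumption made for the example: demo_snapshot(), demo_cmd_t, DEMO_CMD_SNAPSHOT and demo_kernel_call() are hypothetical names, and demo_kernel_call() is a user-space stand-in for the real ioctl so that the sketch compiles and runs on its own.  It is not the actual libzfs_core interface.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical command number and argument block for the example. */
enum { DEMO_CMD_SNAPSHOT = 1 };

typedef struct demo_cmd {
	char name[256];		/* e.g. "pool/fs@snap" */
} demo_cmd_t;

/*
 * Stand-in for the kernel ioctl.  In the proposed design, validation and
 * error handling live here (in the kernel), not in the library.
 */
static int
demo_kernel_call(int cmd, const demo_cmd_t *args)
{
	if (cmd != DEMO_CMD_SNAPSHOT)
		return (ENOTSUP);
	if (strchr(args->name, '@') == NULL)
		return (EINVAL);	/* not a snapshot name */
	return (0);
}

/*
 * Thin wrapper: marshal the arguments, make exactly one call, and return
 * a defined error number.  It keeps no global state, so it is safe to
 * call from several threads, and it never prints to stdout or stderr.
 */
int
demo_snapshot(const char *name)
{
	demo_cmd_t args;

	assert(name != NULL);
	if (strlen(name) >= sizeof (args.name))
		return (ENAMETOOLONG);
	memset(&args, 0, sizeof (args));
	memcpy(args.name, name, strlen(name) + 1);
	return (demo_kernel_call(DEMO_CMD_SNAPSHOT, &args));
}
```

A consumer would check the return value and decide for itself how to report a failure, which is what pushing user interface issues up into the consumers means in practice.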

For more details, check out my slides, and the video of my presentation.  (In the video I refer to libzfs_core as "libzfs2".  There was some discussion about the name, and we decided to rename it to libzfs_core.)

If you are using libzfs, I'd like to hear from you!  Will libzfs_core meet your needs?  Do you have other ideas for improving libzfs?  Contact me or the zfs@lists.illumos.org mailing list.

 


ZFS 10 year anniversary

Tuesday, November 1st, 2011 - Posted by matt

Halloween has always been a special holiday for ZFS.  We ran our first code 10 years ago in October 2001.  We integrated ZFS into OpenSolaris on October 31, 2005.  It's interesting to look back and remember what we had figured out early on, and which ideas weren't developed until much later.  Ten years ago, we had only been working on ZFS for four months.  I had a freshly minted undergraduate degree, a new apartment on the opposite coast, and a lot of work to do.  The key principles we laid out then still ring true: massive scale, easy administration, fault tolerance, snapshots, a copy-on-write, always-consistent on-disk format.  But the specifics were murky ten years ago.

Pooled storage was a key idea -- one pool composed of many disks, and many filesystems consuming space from the pool.  However, the relationship between pools on a system and filesystems in a pool hadn't been nailed down.  The filesystem namespace was flat -- no nested filesystems, no property inheritance.  We weren't sure if there would be one mountpoint for the entire pool, or one per filesystem.  We hadn't even considered clones, which are now integral to the Delphix product.  We knew we wanted some sort of RAID, but had no idea it would end up looking like RAID-Z.

ZFS send and receive wasn't considered until late in the development cycle.  The idea came to me in 2005.  ZFS was nearing integration, and I was spending a few months working in Sun's new office in Beijing, China.  The network link between Beijing and Menlo Park was low-bandwidth and high-latency, and our NFS-based source code manager was painful to use.  I needed a way to quickly ship incremental changes to a workspace across the Pacific.  A POSIX-based utility (like rsync) would at best have to traverse all the files and directories to find the few that were modified since a specific date, and at worst it would compare the files on each side, incurring many high-latency round trips.  I realized that the block pointers in ZFS already have all the information we need: the birth time allows us to quickly and precisely find the blocks that have changed since a given snapshot.  It was easiest to implement ZFS send at the DMU layer, just below the ZPL.  This allows the semantically-important changes to be transferred exactly, without any special code to handle features like NFSv4-style ACLs, case-insensitivity, and extended attributes.  Storage-specific settings, like compression and RAID type, can be different on the sending and receiving sides.  What began as a workaround for a crappy network link has become one of the pillars of ZFS, and the foundation of several remote replication products, including the one at Delphix.
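
The birth-time trick can be sketched with a toy block tree.  This is an illustration only, not the ZFS implementation: demo_block_t, DEMO_MAX_CHILDREN and demo_changed_leaves() are hypothetical names, and real block pointers carry far more than a birth time.  The sketch relies on one property of copy-on-write: rewriting a block also rewrites every block above it, so a parent is never older than its children.  If a block was born at or before the snapshot we are sending from, nothing beneath it can have changed, and the whole subtree can be skipped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	DEMO_MAX_CHILDREN	4

/* Toy block: a birth time plus child pointers (all NULL for a leaf). */
typedef struct demo_block {
	uint64_t birth_txg;
	const struct demo_block *child[DEMO_MAX_CHILDREN];
} demo_block_t;

/*
 * Count the leaf (data) blocks born after from_txg.  *visited is
 * incremented for every block examined, which shows how much of the
 * tree the pruning lets us skip.
 */
size_t
demo_changed_leaves(const demo_block_t *blk, uint64_t from_txg,
    size_t *visited)
{
	size_t changed = 0;
	int is_leaf = 1;
	int i;

	assert(visited != NULL);
	if (blk == NULL)
		return (0);
	(*visited)++;

	/* Unchanged since the snapshot: skip this entire subtree. */
	if (blk->birth_txg <= from_txg)
		return (0);

	for (i = 0; i < DEMO_MAX_CHILDREN; i++) {
		if (blk->child[i] != NULL) {
			is_leaf = 0;
			changed += demo_changed_leaves(blk->child[i],
			    from_txg, visited);
		}
	}
	/* A changed leaf is one block that would go into the stream. */
	return (is_leaf ? 1 : changed);
}
```

The work done is proportional to the amount of changed data rather than to the size of the filesystem, which is the advantage over a file-by-file comparison.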

ZFS is an evolving product, and ZFS send/receive is a great example of that.  I added "zfs send -R" on Halloween 2007, which added the ability to replicate a whole tree of filesystems, including properties and incremental rename and destroy of filesystems and snapshots.  On Halloween 2009, Tom Erickson implemented "received properties" -- the distinction between properties set locally on the receiving system, vs. properties set by "zfs receive".  For this Halloween, I've been working on estimating the size of the stream generated by "zfs send", so that we can have an accurate progress bar while doing replication.  My coworker Chris Siden is working on resumable ZFS send -- if the send is interrupted by a network outage or one of the machines failing, we will be able to pick up where we left off, without losing any work.

The demands placed on any piece of software change over time.  Software with poorly-designed internal interfaces quickly becomes a minefield of special cases, where every bug fixed introduces one more.  The framework of ZFS has served us well for the past decade, but we have to be ready to re-evaluate as we go.  That's why I've been working on ZFS Feature Flags with Basil Crow and Chris Siden.  This will allow us to evolve the ZFS on-disk format in a flexible way, with multiple independent developers contributing changes.  You can read more about Feature Flags, and the other work we've been doing at Delphix, in the slides for a talk that George Wilson and I gave at the Open Storage Summit last week -- another Halloween milestone in a decade of ZFS development.

Update: video of our presentation on Feature Flags is available here.  Chris Siden talks about the current state of Feature Flags here.
