ZFS 10 year anniversary

Posted by Matthew Ahrens in Engineering

Halloween has always been a special holiday for ZFS.  We ran our first code 10 years ago in October 2001.  We integrated ZFS into OpenSolaris on October 31, 2005.  It’s interesting to look back and remember what we had figured out early on, and which ideas weren’t developed until much later.  Ten years ago, we had only been working on ZFS for 4 months.  I had a freshly minted undergraduate degree, a new apartment on the opposite coast, and a lot of work to do.  The key principles we laid out then still ring true: massive scale, easy administration, fault tolerance, snapshots, and a copy-on-write, always-consistent on-disk format.  But the specifics were murky ten years ago.

Pooled storage was a key idea — one pool composed of many disks, and many filesystems consuming space from the pool.  However, the relationship between pools on a system, and filesystems in a pool hadn’t been nailed down.  The filesystem namespace was flat — no nested filesystems, no property inheritance.  We weren’t sure if there would be one mountpoint for the entire pool, or one per filesystem.  We hadn’t even considered clones, which are now integral to the Delphix product.  We knew we wanted some sort of RAID, but had no idea it would end up looking like RAID-Z.

ZFS send and receive wasn’t considered until late in the development cycle.  The idea came to me in 2005.  ZFS was nearing integration, and I was spending a few months working in Sun’s new office in Beijing, China.  The network link between Beijing and Menlo Park was low-bandwidth and high-latency, and our NFS-based source code manager was painful to use.  I needed a way to quickly ship incremental changes to a workspace across the Pacific.  A POSIX-based utility (like rsync) would at best have to traverse all the files and directories to find the few that were modified since a specific date, and at worst it would compare the files on each side, incurring many high-latency round trips.  I realized that the block pointers in ZFS already have all the information we need: the birth time allows us to quickly and precisely find the blocks that have changed since a given snapshot.  It was easiest to implement ZFS send at the DMU layer, just below the ZPL.  This allows the semantically-important changes to be transferred exactly, without any special code to handle features like NFSv4-style ACLs, case-insensitivity, and extended attributes.  Storage-specific settings, like compression and RAID type, can be different on the sending and receiving sides.  What began as a workaround for a crappy network link has become one of the pillars of ZFS, and the foundation of several remote replication products, including the one at Delphix.
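
The birth-time pruning described above can be sketched in a few lines.  This is illustrative Python, not real ZFS code: the `Block` class and `changed_blocks` function are hypothetical stand-ins for ZFS block pointers and the DMU traversal.  The sketch assumes the copy-on-write invariant that rewriting a child block also rewrites its parent, so a parent’s birth transaction group (“txg”) is at least as new as any modified descendant’s — which is what lets us skip whole subtrees.

```python
class Block:
    """A hypothetical stand-in for a ZFS block pointer."""
    def __init__(self, birth_txg, data=None, children=()):
        self.birth_txg = birth_txg   # txg in which this block was written
        self.data = data             # leaf payload (None for indirect blocks)
        self.children = children     # child block pointers, if any

def changed_blocks(block, since_txg):
    """Yield the leaf blocks modified after the snapshot at since_txg."""
    if block.birth_txg <= since_txg:
        # The whole subtree predates the snapshot; prune it without
        # descending -- this is what makes incremental send fast.
        return
    if not block.children:
        yield block
        return
    for child in block.children:
        yield from changed_blocks(child, since_txg)
```

For example, with a root born in txg 9 holding one leaf from txg 5 and one from txg 9, asking for changes since txg 7 visits only the newer leaf, and asking for changes since txg 9 visits nothing at all.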

ZFS is an evolving product, and ZFS send/receive is a great example of that.  On Halloween 2007, I added “zfs send -R”, which can replicate a whole tree of filesystems, including properties and incremental rename and destroy of filesystems and snapshots.  On Halloween 2009, Tom Erickson implemented “received properties”: the distinction between properties set locally on the receiving system and properties set by “zfs receive”.  For this Halloween, I’ve been working on estimating the size of the stream generated by “zfs send”, so that we can show an accurate progress bar during replication.  My coworker Chris Siden is working on resumable ZFS send: if a send is interrupted by a network outage or a machine failure, it will be able to pick up where it left off, without losing any work.
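
As a concrete illustration, replication using these features might look like the following.  The pool, filesystem, and host names here are hypothetical; the flags (“-R” for recursive replication, “-i” for an incremental stream, “-F” to roll back the receiving side) are the standard ones.

```shell
# Full replication of a filesystem tree, with all snapshots and properties:
zfs snapshot -r tank/data@monday
zfs send -R tank/data@monday | ssh backuphost zfs receive -F backup/data

# Later, send only the blocks changed since the previous snapshot:
zfs snapshot -r tank/data@tuesday
zfs send -R -i @monday tank/data@tuesday | ssh backuphost zfs receive -F backup/data
```

Because the incremental stream is found by walking birth times rather than comparing file contents, the cost of the second send is proportional to what changed, not to the size of the dataset.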

The demands placed on any piece of software change over time.  Software with poorly-designed internal interfaces quickly becomes a minefield of special cases, where every bug fixed introduces one more.  The framework of ZFS has served us well for the past decade, but we have to be ready to re-evaluate as we go.  That’s why I’ve been working on ZFS Feature Flags with Basil Crow and Chris Siden.  This will allow us to evolve the ZFS on-disk format in a flexible way, with multiple independent developers contributing changes.  You can read more about Feature Flags, and the other work we’ve been doing at Delphix, in the slides for a talk that George Wilson and I gave at the Open Storage Summit last week — another Halloween milestone in a decade of ZFS development.

Update: video of our presentation on Feature Flags available here.  Chris Siden talks about the current state of Feature Flags here.


  Comments: 14


  1. Matt, thank you for sharing some of the background history of ZFS.
    And giving us an insight to the future of ZFS.
    It would be good to hear more from you via this blog and via twitter.

  2. Well said Matt. Thank you for this, for your excellent chat last week, and, well, for ZFS itself.

  3. That is why this is such a great software!
    It is the most advanced filesystem in the world, and you are still working on improvements, and are not afraid to re-evaluate it.
    Open mind and great code, well done!

  4. Thank you for these ten years. Thank you.

  5. You have done an awesome job!

When will ZFS get asynchronous replication, from a source to one or more destinations?

    Reading your post about “zfs send”, understanding “zfs diff” and ZIL, it seems trivial… :-)

  6. You guys have done a fantastic job over the years, but I have to ask: why will no one talk about what happened to Block Pointer Rewrite?

      • Matt,
        Everyone has been “staying tuned” on this since 2008. The reason it keeps getting asked is because of the significant need to defrag pools. Right now the only way is to copy and write back a filesystem, but that forces you to clear out past snapshots.

        zfs is outstanding and you guys have done a great job, but BP rewrite is a key feature and everyone has been in the dark waiting for 4 years.

        Maybe it’s time to at least provide some light for us poor users….

        • Ah, thanks for reminding me that I meant to write a blog post about that. To be clear: I meant “stay tuned for the story of what happened to bprewrite”; not “stay tuned for bprewrite”. As I’ve posted elsewhere, I’m not aware of anyone working on it.

  7. Always nice to see people (and companies) working to improve ZFS, despite Oracle’s (lack of) support.
    I was already confident the FreeBSD guys could keep it up, but the more the merrier. 😉

  8. This might be off-topic, but once you have Delphix up and running (which we do at my company), is there any way to monitor it? I’d like to have some tests in xymon (hobbit); is there any interface I can use to get stats on how Delphix is performing?