
Performance of ZFS destroy

How fast is “zfs destroy”?  This is a difficult question to answer, because we can destroy many different kinds of datasets (snapshots, filesystems, volumes, clones), and the data in these datasets can be very different (many small files, large sequentially accessed files, large sparse files).  In this post, I will examine a specific case which is important for Delphix: deleting a Virtual Database (VDB).

A VDB is a clone, and contains the files used to store a database.  Most of the space is in a few large files, typically with recordsize=8k, and the blocks are updated somewhat randomly.  This access pattern is common for zvols (used for serving up iSCSI and FC targets), file-based virtual disks (vmdk files), and (most importantly to Delphix) databases.

To reclaim space, we must free the individual blocks that are no longer needed.  For a filesystem or zvol, we must traverse its tree of metadata (indirect blocks and dnodes) to find all the blocks that need to be freed.  For large files with recordsize=8k, each indirect block read from disk yields about 1MB of space that will be freed.  Since the file is updated randomly, the indirect blocks will not be contiguous on disk, so these reads will go to roughly random offsets on disk.  With a single 7200RPM disk that can do about 100 random IOPS, we will be able to reclaim 1TB of data in a little under 3 hours.
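
To make that estimate concrete, here is the back-of-the-envelope arithmetic as a small C program.  This is only a sketch of the math in the paragraph above, assuming 128 block pointers per 16K indirect block, which is what gives the roughly 1MB-per-read figure:

    /* Rough estimate of the time to free 1TB by reading all of its indirect blocks. */
    #include <stdio.h>

    int
    main(void)
    {
            double data_bytes = 1099511627776.0;    /* 1TB of data to reclaim */
            double recordsize = 8192.0;             /* 8K data blocks */
            double bps_per_indirect = 128.0;        /* block pointers per indirect block */
            double iops = 100.0;                    /* random reads/sec on one 7200RPM disk */

            /* Each indirect block read from disk frees 128 * 8K = 1MB of data. */
            double reads = data_bytes / (recordsize * bps_per_indirect);
            double hours = reads / iops / 3600.0;

            printf("%.0f random reads, about %.1f hours\n", reads, hours);
            return (0);
    }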

There are two main steps to deleting a filesystem or clone.  We must remove it from the namespace (removing linkages to other related snapshots, e.g. the clone origin), and we must reclaim the unused space as described above.  Until recently, we did these steps in the reverse order:  first reclaiming the unused space by deleting each file, and then removing the filesystem from the namespace.

In DelphixOS 2.7.0, Chris Siden introduced the “async_destroy” feature, which was integrated into Illumos in May 2012.  When this feature is enabled, we perform these operations in a better order: we first remove the filesystem or clone from the namespace, and then reclaim the unused space in the background.  To do this, we needed to change the way that we traverse the metadata.  The old algorithm found each modified file and then deleted it.  The new algorithm is a more literal tree traversal, which visits only the blocks modified since the clone was created.  I wanted to measure the performance impact of this algorithmic change, specifically on destroying a VDB (a clone of a database).
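
To illustrate the new approach, here is a minimal sketch in C.  It is not the actual Illumos traversal code, and the types and the free_block() helper are made up for illustration; the point is simply that every ZFS block pointer records the txg in which its block was born, so the traversal can prune, without reading it, any subtree whose root was born at or before the txg in which the clone was created:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical, simplified stand-in for a ZFS block pointer. */
    typedef struct bp_sketch {
            uint64_t birth_txg;             /* txg in which this block was written */
            int nchildren;                  /* 0 for a data (leaf) block */
            struct bp_sketch *children;     /* contents of an indirect block */
    } bp_sketch_t;

    /* Stand-in for adding a block to the list of blocks to be freed. */
    void
    free_block(bp_sketch_t *bp)
    {
            printf("free block born in txg %llu\n",
                (unsigned long long)bp->birth_txg);
    }

    /*
     * Visit only blocks born after min_txg, i.e. blocks written since the
     * clone was created.  A subtree that is shared with the clone's origin
     * is pruned at the highest level where it is detected, so none of its
     * indirect blocks are ever read from disk.
     */
    void
    free_modified(bp_sketch_t *bp, uint64_t min_txg)
    {
            if (bp->birth_txg <= min_txg)
                    return;
            for (int i = 0; i < bp->nchildren; i++)
                    free_modified(&bp->children[i], min_txg);
            free_block(bp);
    }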

What I found was that the new code was insanely faster than the old code — over 100x!  This surprised me, so I investigated which blocks each code path was actually reading on-disk.  I discovered that the old code was actually reading every indirect block of every modified file, even if only a few blocks of that file were modified.  The new code only reads blocks that were modified since the clone was created.  Without the async_destroy feature, destroying a clone is as slow as destroying the original filesystem (up to 3 hours for a clone that references 1TB, but has negligible new data, and thus reclaims negligible space).

With async_destroy, the worst-case scenario is that each indirect block modified in the clone has only one modified data block in it, so each read from disk yields only about 8K of space to be freed, and we will be able to reclaim about 800K/second.  This could still be slow, but at least it is proportional to the amount of space that is being reclaimed (that is, the amount of space modified in the clone).
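
The worst-case arithmetic, as another small sketch under the same single-disk, 100-IOPS assumption as above:

    /* Worst-case reclaim rate for a clone: one modified 8K block per random read. */
    #include <stdio.h>

    int
    main(void)
    {
            double block_size = 8192.0;     /* one modified data block freed per read */
            double iops = 100.0;            /* single 7200RPM disk */

            printf("about %.0fK/second\n", block_size * iops / 1024.0);
            return (0);
    }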

I also discovered that in DelphixOS 2.7.0 – 2.7.2, we only issued one read I/O at a time to the disk when traversing the tree of metadata.  In DelphixOS 2.7.3, I made the traversal code issue several I/Os concurrently, which improves performance considerably when the pool is composed of several physical disks.
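
The change is conceptually simple: keep a bounded number of metadata reads in flight instead of waiting for each read to complete before issuing the next.  Here is a minimal sketch of that idea; issue_read(), wait_one(), and blk_id_t are hypothetical stand-ins, not the actual interfaces in the traversal code:

    #include <stdio.h>

    #define MAX_IN_FLIGHT   32              /* cap on concurrently outstanding reads */

    typedef int blk_id_t;                   /* hypothetical block identifier */

    /* Stand-in for issuing an asynchronous read; returns before the I/O completes. */
    void
    issue_read(blk_id_t blk)
    {
            printf("issue read of block %d\n", blk);
    }

    /* Stand-in for blocking until any one outstanding read completes. */
    void
    wait_one(void)
    {
    }

    void
    prefetch_indirects(blk_id_t *blocks, int nblocks)
    {
            int in_flight = 0;

            for (int i = 0; i < nblocks; i++) {
                    if (in_flight == MAX_IN_FLIGHT) {
                            wait_one();     /* make room for another outstanding read */
                            in_flight--;
                    }
                    issue_read(blocks[i]);
                    in_flight++;
            }
            while (in_flight-- > 0)         /* drain the remaining reads */
                    wait_one();
    }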

We have considerably improved the performance of deleting clones in DelphixOS and Illumos.  It now does its work in the background, and it takes time proportional to the amount of space that is being reclaimed.  This code is available for any open-source ZFS implementation to take advantage of (FreeBSD, Linux, Nexenta, etc.).

5 Responses to “Performance of ZFS destroy”

  1. Tom Shaw says:

    Hi Matt, great to see ZFS thriving in Delphix and the whole community. For those of us still on Solaris, I’m curious about the actual impact of the old snapshot destroy algorithm. In practice, don’t metadata caching and prefetch greatly speed up the metadata traversal?

  2. matt says:

    Metadata caching and prefetch will certainly help with the metadata traversal. However, if you have a clone referencing 1TB of data with 8k recordsize, there is about 16GB of metadata, all of which will be read in on Oracle Solaris. That’s quite a lot of data, even on modern hardware. Even with 90% cache hit rate, 200 IOPS per disk, and 4 disks, it will take 120 seconds to destroy. Compared with less than a second on Illumos, it’s still slow!

  3. Surya says:

    I am sorry to be asking a question on an old blog post, but I am still asking, as the author referred me to this. I have a question and a comment on this.
    My question is about this line of the blog post: “discovered that the old code was actually reading every indirect block of every modified file.” That doesn’t seem to be right. If you refer to the block traversal code in dnode_next_offset_level(), this piece of code prevents exactly what you are saying:
    if (bp[i].blk_fill >= minfill &&
        bp[i].blk_fill <= maxfill &&
        (hole || bp[i].blk_birth > txg))
    See the txg check. For example, while looking at an L4 indirect block of an object, of the 128 L3 bps we will be interested only in those bps which have a birth txg higher than the snapshot txg; the rest are skimmed over. Once it comes across a relevant ‘bp’, the descent starts: read in the L3 block, check which of the 128 L2 bps in there have a higher birth txg, and repeat the process. So it’s not that we read in all the indirect blocks of all the modified files.

    Now the comment is on this line of your reply: “16GB of metadata, all of which will be read in on Oracle Solaris.” This is wrong – it won’t be – at least as of 18 Dec 2013. I am sure Mark Maybe would be surprised to see this remark coming from another ZFS architect.
    -Surya

    • matt says:

      As you later discovered, the problem is that object freeing can’t skip the blocks that are part of snapshots, because they must be added to the dead list.

      This problem existed in all versions of Oracle Solaris as of November 2010. If they have fixed it since then, I’d appreciate a link to documentation of this fix and will update the post if additional information comes to light.

  4. Surya says:

    Matt’s observation is correct – because in dsl_dataset_destroy(), it’s using snap_txg to find the next modified object, but when it comes to freeing the object, it’s not using the snap_txg, which results in passing 0 to dnode_next_offset() and hence reading all the indirects :-(
    -Surya
    PS: Mark?
