AGILE DATA MANAGEMENT

Performance of ZFS destroy

How fast is “zfs destroy”?  This is a difficult question to answer, because we can destroy many different kinds of datasets (snapshots, filesystems, volumes, clones), and the data in these datasets can be very different (many small files, large sequentially accessed files, large sparse files).  In this post, I will examine a specific case which is important for Delphix: deleting a Virtual Database (VDB).

A VDB is a clone, and contains the files used to store a database.  Most of the space is in a few large files, typically with recordsize=8k, and the blocks are updated somewhat randomly.  This access pattern is common for zvols (used for serving up iSCSI and FC targets), file-based virtual disks (vmdk files), and (most importantly to Delphix) databases.

To reclaim space, we must free the individual blocks that are no longer needed.  For a filesystem or zvol, we must traverse its tree of metadata (indirect blocks and dnodes) to find all the blocks that need to be freed.  For large files with recordsize=8k, each indirect block read from disk yields about 1MB of space that will be freed.  Since the file is updated randomly, the indirect blocks will not be contiguous on disk, so these reads will be to roughly random offsets on disk.  With a single 7200RPM disk that can do about 100 random IOPS, we will be able to reclaim 1TB of data in a little under 3 hours.
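As a rough sanity check of that estimate, here is a small Python sketch of the arithmetic.  The 1MB-per-read figure and the 100 IOPS come from the paragraph above; the 16K indirect block and 128-byte block pointer sizes are my assumptions, chosen because they reproduce the 1MB figure, and the function name is purely illustrative.

```python
# Back-of-envelope estimate of reclaim time, using the figures from the text.
KIB = 1024
RECORD_SIZE = 8 * KIB          # recordsize=8k (from the text)
INDIRECT_BLOCK = 16 * KIB      # ASSUMPTION: size of one indirect block
BLOCK_POINTER = 128            # ASSUMPTION: bytes per block pointer
RANDOM_IOPS = 100              # single 7200RPM disk (from the text)
TOTAL_BYTES = 1024 ** 4        # 1TB of data to reclaim

# Data covered by one indirect block: 128 pointers * 8K = 1MB.
bytes_per_read = (INDIRECT_BLOCK // BLOCK_POINTER) * RECORD_SIZE


def reclaim_seconds(total_bytes, bytes_per_read, iops):
    """Seconds to free total_bytes when each random read frees bytes_per_read."""
    reads_needed = total_bytes / bytes_per_read
    return reads_needed / iops


seconds = reclaim_seconds(TOTAL_BYTES, bytes_per_read, RANDOM_IOPS)
hours = seconds / 3600
print(f"{bytes_per_read // KIB}K freed per read, {hours:.1f} hours for 1TB")
```

This ignores caching, prefetch, and the cost of the frees themselves, so treat it as an order-of-magnitude figure only.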

There are two main steps to deleting a filesystem or clone.  We must remove it from the namespace (removing linkages to other related snapshots, e.g. the clone origin), and we must reclaim the free space as described above.  Until recently, we did these steps in the reverse order: first reclaiming the unused space by deleting each file, and then removing the filesystem from the namespace.

In DelphixOS 2.7.0, Chris Siden introduced the “async_destroy” feature, which was integrated into Illumos in May 2012.  When this feature is enabled, we perform these operations in a better order: we first remove the filesystem or clone from the namespace, and then reclaim the unused space in the background.  To do this, we needed to change the way that we traverse the metadata.  The old algorithm found each modified file, and then deleted it.  The new algorithm is a more literal tree traversal, which traverses blocks modified since the clone was created.  I wanted to measure the performance impact of making this algorithmic change, specifically on destroying a VDB (clone of a database).

What I found was that the new code was insanely faster than the old code: over 100x!  This surprised me, so I investigated which blocks each code path was actually reading from disk.  I discovered that the old code was actually reading every indirect block of every modified file, even if only a few blocks of that file were modified.  The new code only reads blocks that were modified since the clone was created.  Without the async_destroy feature, destroying a clone is as slow as destroying the original filesystem (up to 3 hours for a clone that references 1TB, but has negligible new data, and thus reclaims negligible space).
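To make the difference concrete, here is a toy Python model of the two strategies.  It is a heavily simplified sketch, not the actual ZFS code: it assumes a single level of indirect blocks, assumes each block records the transaction in which it was last written (modeled as an integer `birth`), and ignores the reads needed to reach the indirect blocks in the first place.  All names in it are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class Indirect:
    birth: int            # when this indirect block was last (re)written
    children: List[int]   # birth time of each data block it points to


def make_file(n_indirect: int, fanout: int, origin: int,
              modified: Set[Tuple[int, int]]) -> List[Indirect]:
    """Build a file whose blocks all date from `origin`, except the
    (indirect_index, child_index) pairs in `modified`, rewritten later."""
    blocks = []
    for i in range(n_indirect):
        children = [origin] * fanout
        for (ii, c) in modified:
            if ii == i:
                children[c] = origin + 1
        # Copy-on-write: rewriting a child rewrites its parent too.
        blocks.append(Indirect(birth=max(children), children=children))
    return blocks


def reads_old(blocks: List[Indirect], origin: int) -> int:
    """Old behavior as described: a modified file has every indirect block read."""
    if any(b.birth > origin for b in blocks):
        return len(blocks)
    return 0


def reads_new(blocks: List[Indirect], origin: int) -> int:
    """New behavior as described: only blocks written after the origin are read."""
    return sum(1 for b in blocks if b.birth > origin)


# A file with 1000 indirect blocks, of which the clone touched only 3.
f = make_file(1000, 128, origin=10, modified={(0, 5), (400, 7), (999, 0)})
print(reads_old(f, 10), reads_new(f, 10))
```

In this model the old approach costs one read per indirect block in the file, while the new one costs one read per indirect block the clone actually changed.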

With async_destroy, the worst-case scenario would be that each indirect block modified in the clone has only one modified data block in it.  Each read from disk then yields only about 8K of space to be freed, so at 100 random IOPS we will be able to reclaim about 800K/second.  This could still be slow, but at least it is proportional to the amount of space that is being reclaimed (that is, the amount of space modified in the clone).
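The worst-case rate follows directly from the numbers above, as this short Python sketch shows.  The 8K record size and 100 IOPS come from the text; the 10GB of modified data is a hypothetical figure picked only to show how the time scales, and the function names are illustrative.

```python
RECORD_SIZE_KB = 8   # recordsize=8k (from the text)
RANDOM_IOPS = 100    # single 7200RPM disk (from the text)


def worst_case_rate_kb_per_sec(record_kb, iops):
    """Each indirect block read frees exactly one data block."""
    return record_kb * iops


def reclaim_hours(modified_gb, rate_kb_per_sec):
    """Hours to reclaim modified_gb of space at the given rate."""
    modified_kb = modified_gb * 1024 * 1024
    return modified_kb / rate_kb_per_sec / 3600


rate = worst_case_rate_kb_per_sec(RECORD_SIZE_KB, RANDOM_IOPS)
print(rate, "KB/second")

# HYPOTHETICAL example: a clone with 10GB of scattered modifications.
print(round(reclaim_hours(10, rate), 1), "hours")
```

Doubling the amount of modified data doubles the time, which is the proportionality described above.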

I also discovered that in DelphixOS 2.7.0 – 2.7.2, we only issued one read i/o at a time to the disk when traversing the tree of metadata.  In DelphixOS 2.7.3, I made the traversal code issue several i/os concurrently, which improves performance considerably when the pool is composed of several physical disks.

We have considerably improved the performance of deleting clones in DelphixOS and Illumos.  It now does its work in the background, and it takes time proportional to the amount of space that is being reclaimed.  This code is available for any open-source ZFS implementation to take advantage of (FreeBSD, Linux, Nexenta, etc).

2 Responses to “Performance of ZFS destroy”

  1. Tom Shaw says:

    Hi Matt, great to see ZFS thriving in Delphix and the whole community. For those of us still on Solaris, I’m curious about the actual impact of the old snapshot destroy algorithm. In practice, don’t metadata caching and prefetch greatly speed up the metadata traversal?

  2. matt says:

    Metadata caching and prefetch will certainly help with the metadata traversal. However, if you have a clone referencing 1TB of data with 8k recordsize, there is about 16GB of metadata, all of which will be read in on Oracle Solaris. That’s quite a lot of data, even on modern hardware. Even with 90% cache hit rate, 200 IOPS per disk, and 4 disks, it will take 120 seconds to destroy. Compared with less than a second on Illumos, it’s still slow!
