ZFS Day
ZFS Day, the first ever conference dedicated to ZFS, was a great success, thanks to all of our great speakers and attendees. I especially enjoyed the panel discussion with the engineers working on ZFS on Linux, FreeBSD, and MacOS. I opened the day with a talk on what's unique in Open ZFS, and the benefits of platform diversity. My slides and a video of the presentation are available. You can watch the whole conference and find other speakers' slides at zfsday.com.
What is Shared Snapshot Space?
In the Delphix Appliance GUI, the Capacity Management screen shows how much space is used by each dSource, VDB, and snapshot. It also shows how much space is used by "Snapshots" and "Shared Snapshot Space". What are these quantities? And why is the space "used" by each snapshot often so small?
Space used by all snapshots of a filesystem
The space used by "Snapshots" (and the ZFS "used by snapshots" property) is the amount of space that would be recovered if all snapshots were deleted. This is the space that is referenced by (accessible via) any snapshot, but not by the current copy (a.k.a. the filesystem). It answers the question, "How much storage is it costing me to have all these snapshots of this filesystem?" As the filesystem removes or overwrites files, we have to keep the old versions around because of the snapshots, so the amount of space used by snapshots will increase.
Space "Used" by an individual snapshot
The space "Used" by an individual snapshot is the amount of space that would be recovered if that single snapshot was deleted. This is the amount of space unique to the snapshot -- that is, the space that is referenced by only this snapshot, and not by any other snapshot or the filesystem (ignoring clones / VDBs created from this snapshot). As the filesystem removes or overwrites files, the amount of space "used" by the most recent snapshot will increase.
Shared Snapshot Space
The "Shared Snapshot Space" is simply the amount used by "Snapshots" minus the sum of the space "Used" by each individual snapshot. This is the space that is referenced by two or more snapshots, but not by the filesystem.
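The three quantities above can be illustrated with a toy model. This is only a sketch, not how ZFS computes them: it assumes every block is the same size, represents the filesystem and each snapshot as a set of block IDs, and ignores clones. The function name space_report is made up for this example.

```python
def space_report(filesystem, snapshots):
    """Return (used_by_snapshots, unique_per_snapshot, shared_snapshot_space).

    filesystem: set of block IDs referenced by the current copy.
    snapshots:  dict mapping snapshot name -> set of block IDs.
    """
    # Blocks referenced by at least one snapshot but not by the filesystem.
    snapshot_only = set().union(*snapshots.values()) - filesystem

    unique = {}
    for name, blocks in snapshots.items():
        # Everything referenced by any *other* snapshot.
        others = set().union(*(b for n, b in snapshots.items() if n != name))
        unique[name] = len(blocks - others - filesystem)

    used_by_snapshots = len(snapshot_only)
    shared = used_by_snapshots - sum(unique.values())
    return used_by_snapshots, unique, shared


# Two identical snapshots of 4 blocks, then all files are removed.
print(space_report(set(), {"A": {1, 2, 3, 4}, "B": {1, 2, 3, 4}}))
# (4, {'A': 0, 'B': 0}, 4)
```

Neither snapshot "uses" anything on its own, yet together they hold all 4 blocks, which is exactly the shared snapshot space.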
Space "Written" by a snapshot
The ZFS "Written" property tells us how much data was written between the previous snapshot and this one. It gives us an idea of the change rate of the filesystem, and we're considering exposing it in the Delphix GUI. The space "Written" by a snapshot may be shared with many snapshots after it, and by the live ("current copy") filesystem. So if a snapshot has a large amount of space "Written", deleting that snapshot won't necessarily recover that space; you may have to delete many snapshots after it, and even remove data from the live filesystem.
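To see why a large "Written" value does not imply reclaimable space, here is a small sketch under simplifying assumptions: blocks are equal-sized, block IDs are never reused (as with copy-on-write), and "written" is approximated as the blocks a snapshot references that the previous snapshot did not. The function name written is hypothetical.

```python
def written(history):
    """history: list of (name, set_of_block_ids), oldest snapshot first.

    Returns {name: number of blocks new since the previous snapshot}.
    """
    result = {}
    previous = set()
    for name, blocks in history:
        result[name] = len(blocks - previous)
        previous = blocks
    return result


# Blocks 3 and 4 are overwritten (becoming 5 and 6) between A and B;
# nothing changes between B and C, and the live filesystem matches C.
history = [("A", {1, 2, 3, 4}), ("B", {1, 2, 5, 6}), ("C", {1, 2, 5, 6})]
print(written(history))  # {'A': 4, 'B': 2, 'C': 0}
```

Snapshot B "wrote" 2 blocks, but those blocks are also referenced by C and by the live filesystem, so destroying B alone would free nothing.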
Destroying Snapshots
When we destroy a snapshot, the snapshot may have been sharing space with the adjacent (previous and next) snapshots. If the shared space now becomes unique to one of the adjacent snapshots, that snapshot's "used" space will increase. So when we delete a snapshot, we will recover that snapshot's "used" space, and some of the "Shared Snapshot Space" may become unique to the adjacent snapshots, and be transferred to their "used" space.
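The transfer from shared space to a neighbor's "used" space can be seen in a toy model. As a sketch only, it assumes equal-sized blocks and represents each snapshot as a set of block IDs; unique_space is a made-up helper, not a ZFS function.

```python
def unique_space(filesystem, snapshots):
    """Return {snapshot name: blocks referenced by that snapshot alone}."""
    unique = {}
    for name, blocks in snapshots.items():
        others = set(filesystem)
        for other_name, other_blocks in snapshots.items():
            if other_name != name:
                others |= other_blocks
        unique[name] = len(blocks - others)
    return unique


filesystem = {5, 6}
snapshots = {"A": {1, 2, 3}, "B": {2, 3, 4}, "C": {3, 4, 5}}

before = unique_space(filesystem, snapshots)
remaining = {n: b for n, b in snapshots.items() if n != "B"}  # destroy B
after = unique_space(filesystem, remaining)

print(before)  # {'A': 1, 'B': 0, 'C': 0}
print(after)   # {'A': 2, 'C': 1}
```

Destroying B frees nothing (its "used" was zero), but block 2 becomes unique to A and block 4 becomes unique to C, so both neighbors' "used" values grow.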
Examples
Let's imagine that we have a filesystem with 1 TB of data in it, but no snapshots. Now we take some snapshots and manipulate the data in them. What will these space accounting values look like?
Initial snapshot creation
When a snapshot is initially created, it has the same contents as the filesystem, sharing all of its space with the filesystem. It didn't take any space to create, and no space will be reclaimed if it is destroyed. It references the same 1 TB as the filesystem, but has no unique space, so its "Used" is zero. The space used by all snapshots, and the shared snapshot space, are also zero. In the Capacity Management screen of the Delphix GUI, we'll see:
Current Copy Size: 1 TB
Used By All Snapshots: Zero
Shared Snapshot Space: Zero
Snapshot A Used: Zero
And from the command line:
$ zfs get referenced,usedbysnapshots domain0/.../datafile
NAME                  PROPERTY         VALUE  (GUI TERMINOLOGY)
domain0/.../datafile  referenced       1.0T   ("Current Copy Size")
domain0/.../datafile  usedbysnapshots  0      ("Used By All Snapshots")
$ zfs get -r used domain0/.../datafile
NAME                         PROPERTY  VALUE
domain0/.../datafile         used      1.0T   (part of DB's space used)
domain0/.../datafile@snap-1  used      0      (snapshot "used")
If we remove all files
If we were to remove (or overwrite) all the files from the filesystem, what would happen to these quantities? We can't actually recover the space used by the deleted/overwritten files, because they are still referenced by the snapshot. The snapshot would continue to have the same contents, but now it wouldn't be sharing any space with the filesystem. The 1 TB of space referenced by the snapshot would only be accessible via the snapshot, and not from any other snapshot (because there are no other snapshots) and not by the filesystem (because we've deleted or overwritten all the files). Now all of the snapshot's space is unique, so its space "Used" will be the same as its space "Referenced": 1 TB. The space used by all snapshots will also be 1 TB. The shared snapshot space will still be zero, because there's only one snapshot.
Current Copy Size: Zero
Used By All Snapshots: 1 TB
Shared Snapshot Space: Zero
Snapshot A Used: 1 TB
And from the command line:
NAME                                    PROPERTY         VALUE  SOURCE
domain0/.../datafile                    referenced       0      -
domain0/.../datafile                    usedbysnapshots  1.0T   -
domain0/.../datafile                    used             1.0T   -
domain0/.../datafile@oracle_snapshot-1  used             1.0T   -
What if we took two snapshots?
What if we took two snapshots before removing (or overwriting) all the files from the filesystem? These two snapshots have the same contents -- they reference the exact same 1 TB of data. After deleting the files, the snapshots don't share any space with the current copy. We're keeping their blocks around just for the snapshots, so if we deleted all the snapshots, we'd recover that space. Therefore the space "used by all snapshots" will be 1 TB.
However, if we delete either one of the snapshots, we can't reclaim any space, because the other snapshot will still reference it. Each snapshot does not have any unique space (it's all shared with the other snapshot), so each snapshot's space "Used" is zero! If we were to delete both snapshots, we'd recover the 1 TB, so the "Shared Snapshot Space" is 1 TB. In this situation, we know that the snapshots are taking up space, but there is no one snapshot that is responsible for the space, so we don't know which snapshots are to blame.
Current Copy Size: Zero
Used By All Snapshots: 1 TB
Shared Snapshot Space: 1 TB
Snapshot A Used: Zero
Snapshot B Used: Zero
And from the command line:
NAME                                    PROPERTY         VALUE  SOURCE
domain0/.../datafile                    referenced       0      -
domain0/.../datafile                    usedbysnapshots  1.0T   -
domain0/.../datafile                    used             1.0T   -
domain0/.../datafile@oracle_snapshot-1  used             0      -
domain0/.../datafile@oracle_snapshot-2  used             0      -
What about 3 snapshots?
In the previous example, there were only 2 snapshots, so it's obvious that the shared snapshot space is shared between those two snapshots. But if there is a third snapshot, things get more complicated. If two or three of the snapshots are sharing space, we won't be able to tell which of them are sharing how much:
Current Copy Size: Zero
Used By All Snapshots: 1 TB
Shared Snapshot Space: 1 TB
Snapshot A Used: Zero
Snapshot B Used: Zero
Snapshot C Used: Zero
And from the command line:
NAME                                    PROPERTY         VALUE  SOURCE
domain0/.../datafile                    referenced       0      -
domain0/.../datafile                    usedbysnapshots  1.0T   -
domain0/.../datafile                    used             1.0T   -
domain0/.../datafile@oracle_snapshot-1  used             0      -
domain0/.../datafile@oracle_snapshot-2  used             0      -
domain0/.../datafile@oracle_snapshot-3  used             0      -
It could be that I created the three snapshots, then deleted the files, so I will have to destroy all three snapshots to recover the shared space. Or it could be that I deleted the files before taking the third snapshot, so it really has no consequence and it's only snapshots A and B that are holding onto all that space. Lastly, it could be that I took snapshot A before writing the files, so it's snapshots B and C that are holding onto the space. The situation is increasingly complex with more snapshots, and when considering more than just one chunk of space (e.g. overwriting some parts of some files before each snapshot).
To mitigate this complexity, I implemented a new feature in ZFS and the Delphix management stack that allows us to determine how much space would be reclaimed if several snapshots were destroyed, taking into account the space that is actually shared by those snapshots. In the Delphix Capacity Management screen, you can select snapshots and see the "Total capacity of objects selected for deletion". In our three-snapshot case, this would allow you to experimentally determine which two (or three) snapshots are actually holding onto the space.
This feature is based on the new "zfs destroy -nv <list of snapshots>" feature in ZFS:
$ zfs destroy -nv domain0/.../datafile@oracle_snapshot-1%oracle_snapshot-2
would destroy domain0/.../datafile@oracle_snapshot-1
would destroy domain0/.../datafile@oracle_snapshot-2
would reclaim 0
$ zfs destroy -nv domain0/.../datafile@oracle_snapshot-2%oracle_snapshot-3
would destroy domain0/.../datafile@oracle_snapshot-2
would destroy domain0/.../datafile@oracle_snapshot-3
would reclaim 1.0T
Aha! We need to destroy snapshots 2 and 3 to reclaim the space.
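The idea behind this dry run can be sketched with a toy model. This is not the ZFS implementation: it assumes equal-sized blocks, represents each snapshot as a set of block IDs, and uses made-up names (would_reclaim, snap-1 through snap-3). It models the third history described above, where the first snapshot was taken before the data was written.

```python
def would_reclaim(filesystem, snapshots, to_destroy):
    """Number of blocks freed if every snapshot in to_destroy were destroyed."""
    # Blocks that stay referenced by the filesystem or a surviving snapshot.
    survivors = set(filesystem)
    for name, blocks in snapshots.items():
        if name not in to_destroy:
            survivors |= blocks

    # Blocks referenced by any snapshot we plan to destroy.
    doomed = set()
    for name in to_destroy:
        doomed |= snapshots[name]

    return len(doomed - survivors)


data = {1, 2, 3, 4}
filesystem = set()  # all files have been removed
snapshots = {"snap-1": set(), "snap-2": data, "snap-3": data}

for candidates in (["snap-1", "snap-2"], ["snap-2", "snap-3"]):
    print(candidates, would_reclaim(filesystem, snapshots, candidates))
# ['snap-1', 'snap-2'] 0
# ['snap-2', 'snap-3'] 4
```

The per-snapshot "used" values are all zero in this model too, so only by evaluating a group of snapshots together can we tell which group is holding the space.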
We're also exploring other mechanisms for graphically displaying snapshot space usage information, with the goal of letting you see at a glance which snapshots are using space.
Future work on the performance of clone deletion
I recently posted about some improvements Delphix has made to the performance of destroying a ZFS clone. Even with these improvements, the worst case could see us recovering only about 800K/second (when each indirect block points to only one, 8KB block that needs to be freed, and the pool contains a single 7200RPM disk). What could be done to improve this further?
The current prefetching mechanism for issuing multiple concurrent i/os while traversing the metadata is somewhat primitive. It concurrently issues the i/o for the children of a single block. This can be anywhere from 1 to 128 concurrent i/os. In my tests I averaged 6 concurrent i/os, meaning that we should see approximately linear scaling as we add more disks, up to 6x with 6 disks. Additional disks would have diminishing returns. A more sophisticated algorithm should be able to issue an almost unlimited number of i/os, for unlimited scaling with additional physical disks. (Of course, a hard limit would be necessary to avoid running out of memory!)
In the worst case, each indirect block points to only one block that we need to free. This is a very low density of relevant information -- we read a 16K indirect block and only use a single, 128-byte block pointer (less than 1% of the data read). Ideally, we would store this set of blocks compactly -- a "livelist", analogous to snapshots' deadlists. However, maintaining this structure would create a performance penalty when writing to the clone (e.g. locating entries to be removed when blocks are freed, and compacting the structure). A simplistic implementation (using the ZAP, an on-disk hash table) would have a performance penalty similar to that of dedup.
Performance of ZFS destroy
How fast is "zfs destroy"? This is a difficult question to answer, because we can destroy many different kinds of datasets (snapshots, filesystems, volumes, clones), and the data in these datasets can be very different (many small files, large sequentially accessed files, large sparse files). In this post, I will examine a specific case which is important for Delphix: deleting a Virtual Database (VDB).
A VDB is a clone, and contains the files used to store a database. Most of the space is in a few large files, typically with recordsize=8k, and the blocks are updated somewhat randomly. This access pattern is common for zvols (used for serving up iSCSI and FC targets), file-based virtual disks (vmdk files), and (most importantly to Delphix) databases.
To reclaim space, we must free the individual blocks that are no longer needed. For a filesystem or zvol, we must traverse its tree of metadata (indirect blocks and dnodes) to find all the blocks that need to be freed. For large files with recordsize=8k, each indirect block read from disk yields about 1MB of space that will be freed. Since the file is updated randomly, the indirect blocks will not be contiguous on disk, so these reads will be to roughly random offsets on disk. With a single 7200RPM disk that can do about 100 random IOPS, we will be able to reclaim about 100MB/second, or 1TB of data in less than 3 hours.
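The estimate above can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch: it assumes 16K indirect blocks holding 128-byte block pointers (the sizes quoted elsewhere in these posts), counts only the lowest level of indirect blocks, and treats 1TB as 1024^4 bytes.

```python
INDIRECT_BLOCK_SIZE = 16 * 1024  # bytes per indirect block (assumed)
BLOCK_POINTER_SIZE = 128         # bytes per block pointer (assumed)
RECORD_SIZE = 8 * 1024           # recordsize=8k
RANDOM_IOPS = 100                # one 7200RPM disk

pointers_per_indirect = INDIRECT_BLOCK_SIZE // BLOCK_POINTER_SIZE
bytes_freed_per_read = pointers_per_indirect * RECORD_SIZE
bytes_per_second = bytes_freed_per_read * RANDOM_IOPS
hours_per_tib = (1024 ** 4) / bytes_per_second / 3600

print(pointers_per_indirect)            # 128
print(bytes_freed_per_read // 1024)     # 1024 (KiB, i.e. about 1MB per read)
print(f"{hours_per_tib:.1f} hours")     # 2.9 hours
```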
There are two main steps to deleting a filesystem or clone. We must remove it from the namespace (removing linkages to other related snapshots e.g. the clone origin), and we reclaim the free space as described above. Until recently, we did these steps in the reverse order: first reclaiming the unused space by deleting each file, and then removing the filesystem from the namespace.
In DelphixOS 2.7.0, Chris Siden introduced the "async_destroy" feature, which was integrated into Illumos in May 2012. When this feature is enabled, we perform these operations in a better order: we first remove the filesystem or clone from the namespace, and then reclaim the unused space in the background. To do this, we needed to change the way that we traverse the metadata. The old algorithm found each modified file, and then deleted it. The new algorithm is a more literal tree traversal, which traverses blocks modified since the clone was created. I wanted to measure the performance impact of making this algorithmic change, specifically on destroying a VDB (a clone of a database).
What I found was that the new code was insanely faster than the old code -- over 100x! This surprised me, so I investigated which blocks each code path was actually reading on-disk. I discovered that the old code was actually reading every indirect block of every modified file, even if only a few blocks of that file were modified. The new code only reads blocks that were modified since the clone was created. Without the async_destroy feature, destroying a clone is as slow as destroying the original filesystem (up to 3 hours for a clone that references 1TB, but has negligible new data, and thus reclaims negligible space).
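The difference between the two code paths can be sketched with a toy block tree. This is an illustration, not the ZFS traversal code: it assumes each block records the transaction group (txg) in which it was born, and that copy-on-write makes a parent at least as new as its newest child, so a subtree whose root is no newer than the clone's origin can be skipped. Block, make_file and the two counting functions are made-up names.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    birth: int                                    # txg in which the block was written
    children: list = field(default_factory=list)  # empty for data blocks


def count_reads_full(block):
    """Indirect blocks read when every indirect block of the file is visited."""
    if not block.children:
        return 0
    return 1 + sum(count_reads_full(c) for c in block.children)


def count_reads_pruned(block, origin_txg):
    """Indirect blocks read when subtrees not newer than the origin are skipped."""
    if not block.children or block.birth <= origin_txg:
        return 0
    return 1 + sum(count_reads_pruned(c, origin_txg) for c in block.children)


def make_file(fanout, birth):
    """A root indirect block over `fanout` indirect blocks of `fanout` data blocks."""
    return Block(birth, [Block(birth, [Block(birth) for _ in range(fanout)])
                         for _ in range(fanout)])


root = make_file(fanout=4, birth=10)  # file as of the clone origin (txg 10)

# The clone overwrites one data block at txg 20; its ancestors are rewritten too.
root.birth = 20
root.children[0].birth = 20
root.children[0].children[0].birth = 20

print(count_reads_full(root))        # 5
print(count_reads_pruned(root, 10))  # 2
```

With a realistic fanout of 128 and a large file, the gap between the two counts grows by orders of magnitude, which is consistent with the speedup described above.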
With async_destroy, the worst-case scenario would be that each indirect block modified in the clone has only one modified data block in it, so each read from disk yields about 8K of space that will be freed, so we will be able to reclaim about 800K/second. This could still be slow, but at least it is proportional to the amount of space that is being reclaimed (that is, the amount of space modified in the clone).
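As a rough sketch of how the reclaim rate depends on how many modified data blocks each indirect block points to, assuming one random read per indirect block, 8K records, and a single disk doing about 100 random IOPS (reclaim_rate is a made-up helper):

```python
def reclaim_rate(modified_blocks_per_indirect, record_size=8 * 1024, iops=100):
    """Bytes reclaimed per second when each indirect block read frees this many records."""
    return modified_blocks_per_indirect * record_size * iops


for density in (1, 16, 128):
    print(density, reclaim_rate(density) // 1024, "KiB/s")
# 1 800 KiB/s
# 16 12800 KiB/s
# 128 102400 KiB/s
```

The first row is the worst case described above; the last row corresponds to a fully populated indirect block.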
I also discovered that in DelphixOS 2.7.0 - 2.7.2, we only issued one read i/o at a time to the disk, when traversing the tree of metadata. In DelphixOS 2.7.3, I made the traversal code issue several i/os concurrently, which improves performance considerably when the pool is composed of several physical disks.
We have considerably improved the performance of deleting clones in DelphixOS and Illumos. It now does its work in the background, and it takes time proportional to the amount of space that is being reclaimed. This code is available for any open-source ZFS implementation to take advantage of (FreeBSD, Linux, Nexenta, etc).
The Future of LibZFS
Last week we had a very successful Illumos meetup, hosted at Delphix HQ in Menlo Park. Thanks to all who participated! In honor of the ZFS 10 year anniversary, my colleagues Chris Siden and John Kennedy gave great talks about ZFS Feature Flags, and testing strategies. My contribution was to present plans for a new programmatic interface to ZFS.
I've been running into a lot of problems with libzfs in my work at Delphix, generally falling into 2 categories: making changes to libzfs, and using libzfs.
The ZFS community has added many new capabilities to libzfs since it was integrated on October 31, 2005. Unfortunately, some of them do not fit into the original design. For example, back in the day there were only "normal" properties -- the statically defined ones in zfs_prop_t (e.g. quota, used, compression). Now we have several flavors of dynamic properties: user properties (e.g. "com.delphix:dbname"), userquota-type props (e.g. "userused@mahrens", "userquota@csiden"), and written props (e.g. "written@prevsnap"). Each of these flavors requires special-case code to be added in several different places. We need to consolidate all the handling of one property "flavor" in one place.
LibZFS has outgrown its original design, to the point that even simple enhancements are overly complicated and risky.
We are using libzfs from our Java stack, via JNA. We've run into a number of difficulties: libzfs is not thread safe; the interface is unstable; there are different functions for manipulating each "flavor" of property.
To address these issues and more, I propose that we create a new library, libzfs_core, which will generally be a thin wrapper around the kernel ioctls. Error handling and thread safety issues will be pushed down into the kernel, and user interface issues will be pushed up into the libzfs_core consumers. Our goals for libzfs_core are:
- Thread safe
  - Avoid global data (e.g. caching)
- Committed interface
  - Consumers will work on future releases
  - Interfaces won't be changed
- Programmatic error handling
  - Routines return defined error numbers or error nvlists
  - No printing to stdout/stderr
- Thin layer
  - Generally just marshals arguments to/from kernel ioctls
  - Generally 1:1 libzfs_core function -> ioctl
- Clear atomicity
  - Generally, every function call will be atomic (because ioctls are generally atomic)
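To make these goals concrete, here is a toy illustration of the calling style they imply. It is purely hypothetical and is not the libzfs_core interface: create_snapshots is a made-up function, a Python set stands in for kernel state, and a dict stands in for an error nvlist. The point is only the shape of the call: no global data, no printing, all-or-nothing behavior, and errors returned as numbers.

```python
import errno


def create_snapshots(existing, names):
    """Hypothetical call: create every snapshot in `names`, or none of them.

    Returns (error_number, per_name_errors) instead of printing anything.
    All state is passed in by the caller; nothing is cached globally.
    """
    errors = {name: errno.EEXIST for name in names if name in existing}
    if errors:
        return errno.EEXIST, errors  # nothing was created
    existing.update(names)           # stand-in for a single atomic ioctl
    return 0, {}


pool = {"fs@monday"}
print(create_snapshots(pool, ["fs@tuesday", "fs@monday"]))     # fails, pool unchanged
print(create_snapshots(pool, ["fs@tuesday", "fs@wednesday"]))  # (0, {})
print(sorted(pool))
```

The caller decides how to present the failure to the user, which is the "user interface issues pushed up into the consumers" part of the proposal.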
For more details, check out my slides, and the video of my presentation. (In the video I refer to libzfs_core as "libzfs2". There was some discussion about the name, and we decided to rename it to libzfs_core.)
If you are using libzfs, I'd like to hear from you! Will libzfs_core meet your needs? Do you have other ideas for improving libzfs? Contact me or the zfs@lists.illumos.org mailing list.
ZFS 10 year anniversary
Halloween has always been a special holiday for ZFS. We ran our first code 10 years ago in October 2001. We integrated ZFS into OpenSolaris on October 31, 2005. It's interesting to look back and remember what we had figured out early on, and which ideas weren't developed until much later. Ten years ago, we had only been working on ZFS for 4 months. I had a freshly minted undergraduate degree, a new apartment on the opposite coast, and a lot of work to do. The key principles we laid out then still ring true: massive scale, easy administration, fault tolerance, snapshots, a copy-on-write always-consistent on-disk format. But the specifics were murky ten years ago.
Pooled storage was a key idea -- one pool composed of many disks, and many filesystems consuming space from the pool. However, the relationship between pools on a system, and filesystems in a pool hadn't been nailed down. The filesystem namespace was flat -- no nested filesystems, no property inheritance. We weren't sure if there would be one mountpoint for the entire pool, or one per filesystem. We hadn't even considered clones, which are now integral to the Delphix product. We knew we wanted some sort of RAID, but had no idea it would end up looking like RAID-Z.
ZFS send and receive wasn't considered until late in the development cycle. The idea came to me in 2005. ZFS was nearing integration, and I was spending a few months working in Sun's new office in Beijing, China. The network link between Beijing and Menlo Park was low-bandwidth and high-latency, and our NFS-based source code manager was painful to use. I needed a way to quickly ship incremental changes to a workspace across the Pacific. A POSIX-based utility (like rsync) would at best have to traverse all the files and directories to find the few that were modified since a specific date, and at worst it would compare the files on each side, incurring many high-latency round trips. I realized that the block pointers in ZFS already have all the information we need: the birth time allows us to quickly and precisely find the blocks that are changed since a given snapshot. It was easiest to implement ZFS send at the DMU layer, just below the ZPL. This allows the semantically-important changes to be transferred exactly, without any special code to handle features like NFSv4 style ACLs, case-insensitivity, and extended attributes. Storage-specific settings, like compression and RAID type, can be different on the sending and receiving sides. What began as a workaround for a crappy network link has become one of the pillars of ZFS, and the foundation of several remote replication products, including the one at Delphix.
ZFS is an evolving product, and ZFS send/receive is a great example of that. I added "zfs send -R" on Halloween 2007, which added the ability to replicate a whole tree of filesystems, including properties and incremental rename and destroy of filesystems and snapshots. Halloween 2009, Tom Erickson implemented "received properties" -- the distinction between properties set locally on the receiving system, vs properties set by "zfs receive". For this Halloween, I've been working on estimating the size of the stream generated by "zfs send", so that we can have an accurate progress bar while doing replication. My coworker Chris Siden is working on resumable ZFS send -- if the send is interrupted by a network outage or one of the machines failing, we will be able to pick up where we left off, without losing any work.
The demands placed on any piece of software change over time. Software with poorly-designed internal interfaces quickly becomes a minefield of special cases, where every bug fixed introduces one more. The framework of ZFS has served us well for the past decade, but we have to be ready to re-evaluate as we go. That's why I've been working on ZFS Feature Flags with Basil Crow and Chris Siden. This will allow us to evolve the ZFS on-disk format in a flexible way, with multiple independent developers contributing changes. You can read more about Feature Flags, and the other work we've been doing at Delphix, in the slides for a talk that George Wilson and I gave at the Open Storage Summit last week -- another Halloween milestone in a decade of ZFS development.
Update: video of our presentation on Feature Flags available here. Chris Siden talks about the current state of Feature Flags here.