I recently posted about some improvements Delphix has made to the performance of destroying a ZFS clone. Even with these improvements, the worst case could see us recovering only about 800K/second (when each indirect block points to only one, 8KB block that needs to be freed, and the pool contains a single 7200RPM disk). What could be done to improve this further?
The current prefetching mechanism for issuing multiple concurrent i/os while traversing the metadata is somewhat primitive. It concurrently issues the i/o for the children of a single block. This can be anywhere from 1 to 128 concurrent i/os. In my tests I averaged 6 concurrent i/os, meaning that we should see approximately linear scaling as we add more disks, up to 6x with 6 disks. Additional disks would have diminishing returns. A more sophisticated algorithm should be able to issue an almost unlimited number of i/os, for unlimited scaling with additional physical disks. (Of course, a hard limit would be necessary to avoid running out of memory!)
In the worst case, each indirect block points to only one block that we need to free. This is very low density of relavent information — we read a 16K indirect block and only use a single, 128-byte, block pointer (less than 1% of the data read). Ideally, we would store this set of blocks compactly — a “livelist”, analogous to snapshots’ deadlists. However, maintaining this structure would create a performance penalty when writing to the clone (e.g. locating entries to be removed when blocks are freed, and compacting the structure). A simplistic implementation (using the ZAP, an on-disk hash table) would have similar performance penalty to dedup.