ZFS Write Performance (Impact of fragmentation)

Posted by Uday Vallamsetty in Benchmarking, Performance

At Delphix we are constantly trying to improve performance. Some of the enhancements come from customers where we see performance pathologies that need to be addressed, while others come from our understanding of the bottlenecks in Delphix OS. In the past few months, George Wilson and Adam Leventhal made significant improvements to how writes are handled in ZFS. In this multi-part blog post, I will talk about a benchmark we created to measure improvements in ZFS write performance as we make changes to the OS. In this post, I will describe the benchmark setup and run, and show some of the results from this benchmark on Delphix OS. In part two, I will present data and analysis on the bottlenecks discovered and how they are going to be addressed.

The Benchmark

The benchmark is a simple program that writes random data to random offsets in a file. I was pleasantly surprised at how much I was able to learn from this simple 20-line program; hopefully there is information here that will help some of you as well. The program was thought up by Adam and George as a way to observe the behavior of ZFS when a zpool is fragmented.


  randFd = open("/dev/urandom", O_RDONLY);
  fd = open(argv[2], O_CREAT|O_WRONLY|O_APPEND, 0777);
  while (1) {
    read(randFd, buf, sizeof(char)*BLOCKSIZE);
    pwrite(fd, buf, BLOCKSIZE, (lrand48() % (test_size / BLOCKSIZE)) * BLOCKSIZE);
  }

I ran the program on a clean zpool and observed the performance of the system using DTrace. The program fragments the pool as it writes to random offsets. I wanted to measure the steady state throughput achieved as a function of the zpool usage (10% full, 50% full etc.). This will demonstrate interesting ZFS pathologies and track our progress from release to release.

Benchmark Setup

Before each run, I created a new zpool and a filesystem with record size of 8k, using the following commands.

zpool create domain0 c3t1d0
zfs create -o recordsize=8k -o compression=on domain0/tests


I used an 8 vCPU VM with 48GB of RAM and a 32GB pool for the Delphix server. I used a small pool so I could fill it quickly. While 32GB is far smaller than a typical customer’s pool, I believe that the results are still representative. I ran eight concurrent instances of the program to generate load. (I also tested the benchmark with fewer than eight writers; it showed similar characteristics but took longer to reach steady state.)
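The launch step can be sketched as a shell loop. This is a hedged sketch, not the actual harness: the binary name "randwrite", its argument order (test size, then target file), and the file names are placeholder assumptions, and the "echo" prefix makes it a dry run that prints the commands rather than executing them.

```shell
# Dry-run launcher for eight concurrent writers. "randwrite" is a
# placeholder name for the compiled benchmark; remove the "echo" prefix
# (and append "&" to each command) to actually run the writers in parallel.
LAUNCH="echo ./randwrite"
for i in $(seq 1 8); do
    $LAUNCH 4g "/domain0/tests/testfile$i.dat"
done
```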

Benchmark Run

Once I created a zpool, I populated it with random data using “dd”.

dd if=/dev/urandom of=/domain0/tests/testfileX.dat bs=8k count=Y
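The count=Y placeholder is just the number of 8k blocks needed for the target fill level. As a worked example, assuming the 32GB pool and 8k block size described above, a 25% fill works out to:

```shell
# Number of 8k dd blocks needed to fill 25% of a 32GB pool.
POOL_BYTES=$((32 * 1024 * 1024 * 1024))
TARGET_PCT=25
BLOCKSIZE=8192
COUNT=$((POOL_BYTES * TARGET_PCT / 100 / BLOCKSIZE))
echo "$COUNT"    # prints 1048576 (1M blocks of 8k = 8GB)
```

so the "dd" command above would use count=1048576 to fill a quarter of the pool.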

I monitored the file system using DTrace as the pool was filling up. The output below shows a histogram of IOs, by size, performed at the backend storage while the above “dd” command was running. The majority of writes are 64K and 128K in size, even though I issued 8k writes at the user level. This aggregation is performed at the IO layer in ZFS.

2013 Feb 16 18:45:42 (14.999s elapsed) /engineering/uday/scripts/io.d

  bytes                                               write                                             
           value  ------------- Distribution ------------- count    
             256 |                                         0        
             512 |@                                        309      
            1024 |                                         217      
            2048 |@                                        281      
            4096 |@                                        230      
            8192 |@                                        231      
           16384 |                                         176      
           32768 |                                         119      
           65536 |@@@@@                                    2352     
          131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@          14321    
          262144 |                                         0        

                                avg latency         iops   throughput
write                                7311us       1302/s    142394k/s

total                                             1302/s    142394k/s


ZFS divides each IO device into a few hundred regions called “metaslabs”. Figure 1 below shows a visualization of the zpool after it is filled to 25% using the “dd” command above. Each cell represents one metaslab; the pool here has 256 metaslabs of 128MB each. The percentage value is the amount of free space in each metaslab. As you can see, around 32 metaslabs are full, accounting for 8GB of data.
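A quick arithmetic check on the geometry, assuming the 32GB device and the metaslab count stated above:

```shell
# A 32GB device divided into 256 metaslabs yields 128MB per metaslab.
POOL_MB=$((32 * 1024))
METASLABS=256
echo "$((POOL_MB / METASLABS))MB per metaslab"    # prints "128MB per metaslab"
```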

Fig 1: Metaslab utilization after 25% of Disk was filled using Sequential Writes

At this point, the pool is compact: most metaslabs are either full or empty, and the data consumes minimal space on disk. When I started my program with the zpool in the above state, I saw an initial spurt of high write throughput followed by a steep fall, stabilizing after some time. Figure 2 below depicts the zpool after the program ran for 30 minutes; note that the pool is still only 25% full. This image did not change even after running the program for a few hours longer: the benchmark had reached steady state.

Fig 2: Metaslab utilization after zpool is written to using random writes, still only 25% full

As you can see, the pool is fairly fragmented at this point. The pool is no longer compact: 2.5x as many metaslabs are in use, and the majority of them have regions of free space. So what is going on here? When the program started writing to random blocks in the zpool, ZFS tried to place the random blocks in contiguous regions of the pool by allocating new metaslabs. Due to copy-on-write, every time a block is overwritten, a hole is left somewhere in the pool. New data blocks were allocated to new metaslabs until ZFS found free regions within currently allocated metaslabs. At that point, ZFS started filling those holes before allocating new metaslabs, which is why the number of metaslabs used did not increase past a point in the test. The chart in Fig 3 below shows write throughput during the benchmark at different usage levels of the zpool.

Fig 3: Random write throughput as the zpool gets fragmented

After the initial high write throughput, there is a precipitous drop, with the benchmark settling at a steady-state throughput. Steady-state throughput is computed as the average of the final 50 throughput samples in the experiment. The steady-state throughput is proportional to the amount of free space in the zpool at the beginning of the benchmark run. The chart in Fig 4 below plots this throughput against zpool occupancy.
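The steady-state figure can be reproduced with standard tools. A minimal sketch using synthetic sample values (the real per-interval measurements are not included in the post): generate a decaying series of throughput readings and average the last 50.

```shell
# Synthetic per-interval throughput samples (k/s), one per line, standing
# in for real measurements; steady state is the mean of the final 50.
seq 200 -2 2 > samples.txt    # 100 samples decaying from 200 down to 2
tail -n 50 samples.txt | awk '{ sum += $1 } END { printf "%.1f\n", sum / NR }'
# prints 51.0
```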

Fig.4: Steady state write throughput under various levels of zpool occupancy


The experiments show an interesting performance pathology as ZFS pools become increasingly full and fragmented. Visualizing the metaslabs and their usage through the benchmark run helped us understand the reasons behind the drop in performance. Fig 4 above shows a tight correlation between the amount of free space in the pool and the random write throughput achieved. This test provides an interesting benchmark for future ZFS performance work: we can both push the curve to the right, delaying the point at which performance degrades, and push the curve up, improving the worst case. The simple, 20-line C program proved to be a fountain of useful information. In my next blog post I’ll dissect the results in more detail, and provide some insight into what ZFS is doing and how we might improve it.


  Comments: 22


  1. Great article. I’m seeing exactly the same performance degradation over time on my file servers. We start noticing issues when we approach 30% capacity, which looks exactly like what you were able to determine.

    I cannot wait till your next article, especially the ‘what can be done about it’ part 😉

  2. Kristoffer Sheather

    Nice article, I guess the obvious question is how to fix the terrible performance degradation that you’ve observed once the pool has had a large quantity of random writes and fragmentation?

  3. Thanks for posting your test methodology and results.

    Just curious why you open the file with O_APPEND in the code snippet, when your intention is to write at random offsets within the file? Some implementations of pwrite() (i.e. on Linux) ignore the offset on file descriptors opened with O_APPEND.

    • Ned,

      Thanks for your comments. You make a good point. I verified the offsets during my testing. But I will address this to make the test more portable.


  4. A couple more observations.

    It’s not clear from Figure 3 whether the drop in performance is due to cache effects or fragmentation. Adding units to the time axis would be helpful. It would also help to provide a baseline graph of random write speeds on a completely unfragmented pool (i.e. perhaps writing to random blocks in a sparse file but never overwriting a block).

    The write speeds of your benchmark may be limited by the rate at which /dev/urandom can supply random bytes. Your results still show that performance decays with fullness, but I suspect the curves would be much higher if you eliminated that bottleneck. If you just want to make the data uncompressible, you could fill a small buffer from /dev/urandom and reuse random bytes from there.

    Thanks again for sharing your work.

    • Ned,

      For Fig.3, the units for time are 15 sec samples. I think cache effects are part of the cause for the initial drop in performance.

      I looked at the speeds for /dev/urandom and you do bring up a good point. But it looks like we should be able to read more than we are driving out to the IO in this case. Also urandom will not block on account of entropy of the random numbers generated, so it will run faster compared to random.

      I will talk about some of the reasons for drop in performance in my next blog which I am working on. I hope to have it out soon.

      Thanks again for your comments.

  5. Hi Uday,

    can you share your script /engineering/uday/scripts/io.d


    • Thanks for looking at my post.

      That is a simple DTrace script that generates information at the IO layer. You should be able to find something very similar in the DTrace toolkit. Let me know if you cannot, and I can point you to it. This specific script needs an internal wrapper that we use; I will figure out if/how/where I can post that.

  6. Nice article. Enough food for thought, surely.

    How did you get the numbers for the metaslab utilization table?

    And what are the characteristics of device c3t1d0 that the pool was created on?

    Keep up the good work. Thanks for sharing.

    • Matthias,

      Thanks for your comments. I used “zdb -mm” to get a dump of the metaslab utilization and then post-processed that data into a heat map.

      I have data on the characteristics of the device which I will include in my next post.


  7. Great post! I would love to acquire the full code in order to understand the full benchmark.

    Also, is there a linear or curvilinear correlation between the number of drives within the zpool and the performance effects here – namely, if you increase the number of drives, will the same curve still apply?

    Lastly, I would be interested on the effects of caching on performance in this case – another interesting experiment.

    Great work!


    SB Nelson

    • Nelson,

      Thanks for your comment. There is nothing more to the code than what I have in the blog. It really is a 20-line program causing all this trouble :)

      I am evaluating the sensitivity on a few other dimensions (number of LUNs, caching etc.). I will have some of that data in an upcoming post, hopefully soon.

      Uday Vallamsetty.

      • Hi Uday

        I cannot see your 20 line code, and it’s not quite clear when you say “The figure 1 below shows a visualization of the zpool after it is filled to 25% using “dd” command above.”, as the dd command is for random writes, but in figure 1 the disk is filled using Sequential Writes (per caption)
        Do you first do a sequential dd, followed by the random dd mentioned in the post?

  8. It might also be interesting to do the same test on an uncompressed dataset for comparison, and to try using a separate zil. I might also say that for repeatability, perhaps drawing your “random” data from a pre-populated source file might be helpful. I would put that on a different drive using a different fs.

  9. Been fighting this degenerate performance issue on every ZFS system I have deployed. Sometimes I long for UFS and raw devices, especially for RDBMS.

    From what I read, BPR would solve it, but it doesn’t seem like anyone will add this; maybe it is just too complex. For now, copy things to a fresh zpool/spindles and recycle the old.

    The bottom line: toss 70% of your storage to limit the performance drop to just “50%” in a high random IO write environment. Your chart, although modeling real life, is very discouraging.

  10. “…significant improvements to how writes are handled in ZFS.”

    Is Delphix passing along the improvements it makes to ZFS by committing these code changes to the illumos ZFS code base? The updates to the ZFS test suite are greatly appreciated…was just curious if any of the ZFS performance or functionality improvements mentioned in some of these blog posts are being passed along.

  11. Sure, it operates on one disc. The housekeeping doesn’t scale on a single disc; extra IOs are required.

    Try it with 4 vdevs (a 4-disk stripe). Do you see the same performance drops? Also: what happens if you add an SSD ZIL?

    At this point one realizes: ZFS is not for single-disk usage, not even for a single mirror. It’s a storage OS (hence the “zetta”) and should be set up with the right properties.

    It’s like benchmarking a socket without a ratchet.

  12. Great article. I wish I had done more research before deploying zfs as the filesystem on my home NAS server. It’s amazing that even on a simple home NAS, the performance degradation is monstrous. I just crossed 50% utilization, and it’s so bad I get read and write timeouts periodically.

  13. I could not find a dd command with count=Y; usually it is how many records to run and it does not accept “Y”
    dd: bad numeric argument: “Y”
