ZFS RAIDZ stripe width, or: How I Learned to Stop Worrying and Love RAIDZ

Posted by Matthew Ahrens in Engineering

The popularity of OpenZFS has spawned a great community of users, sysadmins, architects and developers, contributing a wealth of advice, tips and tricks, and rules of thumb on how to configure ZFS. In general, this is a great aspect of the ZFS community, but I’d like to take the opportunity to address one piece of misinformed advice about how many disks to put in each RAID-Z group (terminology: “zpool create tank raidz1 A1 A2 A3 A4 raidz1 B1 B2 B3 B4” has 2 RAIDZ groups or “vdevs”, each of which has 4 disks or is “4-wide”). To do so, let’s start by looking at what concerns play into choice of group width.

TL;DR: Choose a RAID-Z stripe width based on your IOPS needs and the amount of space you are willing to devote to parity information. If you need more IOPS, use fewer disks per stripe. If you need more usable space, use more disks per stripe. Trying to optimize your RAID-Z stripe width based on exact numbers is irrelevant in nearly all cases.

For best performance on random IOPS, use a small number of disks in each RAID-Z group. E.g., 3-wide RAIDZ1, 6-wide RAIDZ2, or 9-wide RAIDZ3 (all of which use ⅓ of total storage for parity, in the ideal case of using large blocks). This is because RAID-Z spreads each logical block across all the devices (similar to RAID-3, in contrast with RAID-4/5/6). For even better performance, consider using mirroring.

For best reliability, use more parity (e.g. RAIDZ3 instead of RAIDZ1), and architect your groups to match your storage hardware. E.g., if you have 10 shelves of 24 disks each, you could use 24 RAIDZ3 groups, each with 10 disks, one from each shelf. This can tolerate any 3 whole shelves dying (or any 1 whole shelf dying plus any 2 other disks dying).

Space used by parity information for RAIDZ1

For best space efficiency, use a large number of disks in each RAID-Z group. Wider stripes never hurt space efficiency. (In the exceptional cases described below, use at least 5, 6, or 11 disks for RAIDZ1, 2, or 3 respectively.) When trading off between these concerns, it is useful to know how much it helps to vary each parameter.

For performance on random IOPS, each RAID-Z group has approximately the performance of a single disk in the group. To double your write IOPS, you would need to halve the number of disks in the RAID-Z group. To double your read IOPS, you would need to halve the number of “data” disks in the RAID-Z group (e.g. with RAIDZ-2, go from 12 to 7 disks). Note that streaming read performance is independent of RAIDZ configuration, because only the data is read. Streaming write performance is proportional to space efficiency.

For space efficiency, typically doubling the number of “data” disks will halve the amount of parity per MB of data (e.g. with RAIDZ-2, going from 7 to 12 disks will reduce the amount of parity information from 40% to 20%).
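
A quick way to see that arithmetic (a minimal sketch in Python, my own illustration rather than anything from ZFS itself):

    # Parity as a fraction of data for RAIDZ2 (p = 2), ignoring per-block padding.
    p = 2
    for width in (7, 12):
        data_disks = width - p
        print(f"{width}-wide RAIDZ2: parity = {p / data_disks:.0%} of data")
    # 7-wide RAIDZ2: parity = 40% of data
    # 12-wide RAIDZ2: parity = 20% of data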

RAID-Z block layout

RAID-Z parity information is associated with each block, rather than with specific stripes as in RAID-4/5/6. Take for example a 5-wide RAIDZ-1. A 3-sector block will use one sector of parity plus 3 sectors of data (e.g. the yellow block at left in row 2). An 11-sector block will use 1 parity + 4 data + 1 parity + 4 data + 1 parity + 3 data (e.g. the blue block at left in rows 9-12). Note that if there are several blocks sharing what would traditionally be thought of as a single “stripe”, there will be multiple parity sectors in that “stripe”. RAID-Z also requires that each allocation be a multiple of (p+1), so that when it is freed it does not leave a free segment which is too small to be used (i.e. too small to fit even a single sector of data plus p parity sectors; e.g. the light blue block at left in rows 8-9 with 1 parity + 2 data + 1 padding). Therefore, RAID-Z requires a bit more space for parity and overhead than RAID-4/5/6.
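
To make the space accounting concrete, here is a rough sketch in Python of the per-block allocation described above (my own illustration, not the actual ZFS code; it only models the sector counts, not the on-disk layout):

    def raidz_alloc_sectors(width, p, data_sectors):
        """Sectors (data + parity + padding) allocated for one block on RAID-Z."""
        data_per_row = width - p                   # data sectors per logical row
        full_rows, rest = divmod(data_sectors, data_per_row)
        sectors = full_rows * width + (rest + p if rest else 0)
        return sectors + (-sectors % (p + 1))      # pad to a multiple of p+1

    # The two examples from the text, on a 5-wide RAIDZ1:
    print(raidz_alloc_sectors(5, 1, 3))    # -> 4  (1 parity + 3 data)
    print(raidz_alloc_sectors(5, 1, 11))   # -> 14 (3 parity + 11 data)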

A misunderstanding of this overhead has caused some people to recommend using “(2^n)+p” disks, where p is the number of parity “disks” (i.e. 2 for RAIDZ-2), and n is an integer. These people claim that, for example, a 9-wide (2^3+1) RAIDZ1 is better than 8-wide or 10-wide. This is not generally true. The primary flaw in this recommendation is that it assumes you are using small blocks whose size is a power of 2. While some workloads (e.g. databases) do use 4KB or 8KB logical block sizes (i.e. recordsize=4K or 8K), these workloads benefit greatly from compression. At Delphix, we store Oracle, MS SQL Server, and PostgreSQL databases with LZ4 compression and typically see a 2-3x compression ratio. This compression is more beneficial than any RAID-Z sizing. Because of compression, the physical (allocated) block sizes are not powers of two; they are odd sizes like 3.5KB or 6KB. This means that we cannot rely on any exact fit of (compressed) block size to the RAID-Z group width.

To help understand where these (generally incorrect) recommendations come from, and what the hypothetical benefit would be if you were to use recordsize=8K and compression=off with various RAID-Z group widths, I have created a spreadsheet which shows how much space is used for parity+padding given various block sizes and RAID-Z group widths, for RAIDZ1, 2, or 3. You can see that there are a few cases where, with a small recordsize on 512b-sector disks and no compression, using (2^n+p) disks takes substantially less space than using one fewer disk. However, more disks in the RAID-Z group is never worse for space efficiency.

Space used by parity information for RAIDZ1, with varying number of disks (columns) and number of sectors per block (rows)
(click for the full Google Docs spreadsheet, which includes RAIDZ2 and RAIDZ3)

Note that setting a small recordsize with 4KB sector devices results in universally poor space efficiency — RAIDZ-p is no better than p-way mirrors for recordsize=4K or 8K.

The strongest valid recommendation based on exact fitting of blocks into stripes is the following: if you are using RAID-Z with 512-byte sector devices, recordsize=4K or 8K, and compression=off (but you probably want compression=lz4), then use at least 5 disks with RAIDZ1, at least 6 disks with RAIDZ2, and at least 11 disks with RAIDZ3.
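
The jump at those widths is easy to reproduce with the same kind of accounting sketch (again, illustrative Python rather than ZFS code), for an 8-sector block (recordsize=4K on 512-byte sector devices, no compression):

    def alloc(width, p, data):                    # same accounting sketch as above
        rows, rest = divmod(data, width - p)
        s = rows * width + (rest + p if rest else 0)
        return s + (-s % (p + 1))                 # pad to a multiple of p+1

    for p, width in [(1, 4), (1, 5), (2, 5), (2, 6), (3, 10), (3, 11)]:
        a = alloc(width, p, 8)
        print(f"RAIDZ{p}, {width}-wide: {a} sectors, overhead {(a - 8) / 8:.0%}")
    # RAIDZ1: 4-wide costs 50%, 5-wide costs 25%
    # RAIDZ2: 5-wide costs 88%, 6-wide costs 50%
    # RAIDZ3: 10-wide costs 100%, 11-wide costs 50%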

To summarize: Use RAID-Z. Not too wide. Enable compression.

 

Further reading on RAID-Z:

Feedback

  1. raphael schitz


    Hi Matt, thanks a lot for those very useful details on RAIDZ. I guess when you state “use at least 5 disks with RAIDZ1; use at least 6 disks with RAIDZ2; and use at least 11 disks with RAIDZ3” it would depend a bit on disk type (sata, sas, fc, ssd, etc…) or MTBF basically, right?


    • This recommendation is independent of disk type, because it has only to do with space usage. Please don’t forget the context: this only applies if you are using recordsize=4K or 8K with compression=off and 512-byte sector devices.

      In terms of reliability or MTBF, less reliable drives would call for higher amounts of parity (e.g. RAIDZ-2 instead of RAIDZ-1), and (to a lesser degree) smaller stripe widths (so that there is more parity per data). MTTR (mean time to repair) would also come into play. For example, if you have sufficient spares and resilver is instant, you never need more than RAIDZ1, because you never have two drives dead at once.


      • Research has been done and it pretty much all agrees that hard drives die in batches. When one drive dies, there is a very high probability of a second hard drive dying within a small window. This even accounts for mixing drives from different batches, models, or manufacturers.


  2. Hi Matt, thanks for this article.

    “[…] the performance of a single disk in the group.”

    This seems a little slow, but true. We’re experimenting with 8 SAS disks in a RAIDZ-1 array. It really has the performance of about a single disk of that array.

    Excluding a stripe over 4 2-way mirrors (which should be the fastest setup according to your post), would increasing the parity (i.e. moving from a Z1 to a Z2 RAID) increase write performance?

    Would a stripe over 4 2-way mirrors with lz4 compression enabled be the best-performing setup? Is there a RAIDZ setup which would (theoretically) have comparable performance?

    Our load is MySQL with about 95% MyISAM tables.


    • “[…] the performance of a single disk in the group.” – Note that this is specifically for random IOPS on RAIDZ; sequential read and write on RAIDZ are fine and can utilize the bandwidth of all disks.

      Switching from RAIDZ1 to RAIDZ2 won’t change this.

      Mirrors would perform best for random IOPS (e.g. most database workloads). Compression (e.g. lz4) is most likely a win — whether using mirroring or RAID-Z.


  3. Very helpful article, thank you!

    Any recommendations for 4K sector drives?
    In my experience there are greater space overheads for ashift=12 than ashift=9.
    Some people prefer formatting 4K drives with ashift=9 to eliminate space waste but I’m not sure it is correct.
    What do you think about it?

    Is it a good idea to change default record size (128K) to minimize space overhead for general purpose file storage (without RDBMS)?
    Or would it be better to leave it at 128K and hope ZFS will use dynamic striping correctly with compression?


    • Using 512-byte aligned writes on a native-4K sector drive typically results in very poor write performance.

      The default record size (128K) does minimize space overhead — don’t change it for general-purpose uses. The only reason to change the recordsize is if your application does smaller random writes to large files (e.g. databases or disk images).


  4. Thanks Matt. Those spreadsheets have a lot of useful reference material.

  5. Jeffrey ‘jf’ Lim


    Hi Matt, when you say that parity is associated with each block rather than with each stripe, are you saying that any multiple parity sectors within each block are duplicates of each other?


    • No. As with RAID-4/5, RAIDZ1 needs one part parity for every (stripe width – 1) parts of data (and similarly for RAIDZ2 and 3).

      To dissect one of the examples I gave even further: on the 5-wide RAIDZ1, the 11-sector block will use (1 parity + 4 data) + (1 parity + 4 data) + (1 parity + 3 data), i.e. the blue block in rows 9-12. The parenthesized (parity+data) groupings indicate that each logical “row” of 5 sectors is up to 4 sectors of data plus 1 sector of parity. The parity sector is the parity of the associated data sectors (i.e. for RAIDZ1, each bit of the parity is the XOR of the corresponding data bits, one from each data sector).
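
      In code form, that XOR relationship looks roughly like this (a small illustration, not the actual ZFS implementation):

        import os

        SECTOR = 512
        data = [os.urandom(SECTOR) for _ in range(4)]     # 4 data sectors in one row

        # RAIDZ1 parity: bytewise XOR of the data sectors.
        parity = bytes(a ^ b ^ c ^ d for a, b, c, d in zip(*data))

        # "Lose" data sector 2 and rebuild it from parity + the surviving sectors.
        survivors = [parity] + [d for i, d in enumerate(data) if i != 2]
        rebuilt = bytes(w ^ x ^ y ^ z for w, x, y, z in zip(*survivors))
        assert rebuilt == data[2]

      (RAIDZ2 and RAIDZ3 add further parity sectors computed over a Galois field rather than a plain XOR.)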

  6. Jeffrey ‘jf’ Lim


    Thanks, Matt!

    I have some other questions (hope this isn't too much! I’m just trying to understand things here) that I’m still trying to figure out, and am not having much luck with. These relate to the spreadsheet in one way or another.

    1. I get how the parity cost works (first 3 tabs), but I still can’t for the life of me figure out the calculation for the overhead. Is it possible to walk me through a calculation? Taking an example: for a 4-sector block being written to a 5-disk RAIDZ1, wouldn't the overhead be “ideal”? And yet somehow, the calculated figure comes out to 20%. How does that work?

    2. How do you derive the figures of at least 5, 6, 11 disks for RAIDZ1/2/3? If this comes out from the spreadsheet somehow, I don’t see those numbers jumping out at me.

    Thanks!


    • 1. A 4-sector block on a 5-disk RAIDZ1 uses: 1 sector parity + 4 sectors data + 1 sector padding (because we must allocate a multiple of P+1=2). The padding overhead is 1/5=20%. This is also reflected in the “parity cost” tabs: parity cost is 50%, i.e. (1 parity + 1 pad) / (4 data).

      2. As to the stripe width recommendations, please do not take them out of context. I will repeat my recommendation here:

      Use RAID-Z. Not too wide. Enable compression. IF AND ONLY IF you are using RAID-Z with 512-byte sector devices with recordsize=4K or 8K and compression=off (but you probably want compression=lz4): use at least 5 disks with RAIDZ1; use at least 6 disks with RAIDZ2; and use at least 11 disks with RAIDZ3.

      So we are talking about exactly 8 or 16 sectors. For these exact sizes, the spreadsheet shows that the parity cost is much less with 5/6/11 disks (for RAIDZ 1/2/3) than with one less disk.

      • Jeffrey ‘jf’ Lim


        Thanks, Matt! I get it now. Didn’t mean to take things out of context, btw. Just a lot of words to get through for this brain here. Thanks!


  7. So, when you say that you want to use at least 5/6/11 disks (for raidz1/2/3) ONLY on 512b disks, don’t you mean ONLY on 4k disks? And also when you say only when you’re using recordsize=4k or 8k, don’t you also mean when you have a large amount of files under 8k? And when you say compression is disabled, don’t you also mean if your data is incompressible (jpegs, mpegs, etc.)?

    That or I’ve missed a basic principle here, which is why I’m asking.

    Thanks!!!


    • No, I mean what I said.

      If you are using 4K disks with recordsize=4K or 8K, you have 1 or 2 sectors per block, for which the parity cost is constant regardless of how many disks are in the RAID-Z group. It’s the same cost as if you used mirroring (which would provide better performance). As shown by the spreadsheet, blocks of only 1 or 2 sectors will take up the same space as if they were (P+1)-way mirrored (i.e. RAID-Z2 uses the same space as a 3-way mirror).

      If you have a large number of small files, their sizes are most likely not predictable, so you will not be in the sectors=8 or 16 case. For most block sizes (even those with fewer than 16 sectors), there is not a dramatic difference between 5/6/11 disks and one less or one more disk. (Again see the spreadsheet for details.)

      If your data is incompressible (e.g. jpegs, mpegs), then it is almost certainly not accessed record-wise, so you would not use recordsize=4k or 8k; you would use the default (and largest) recordsize of 128KB. In this case, the parity cost is very close to the ideal of P disks being used for parity and (N-P) disks being used entirely for data, so you can simply consider this ideal case as it would apply to any RAID system. (E.g. with RAIDZ2 on 5 disks, the parity cost of 68% or 69% (depending on whether the sector size is 512 or 4KB) is nearly the ideal of 66.6%.)

      Hopefully all this is clear from the spreadsheet. But I wonder if it would have been better to show the parity cost as a percent of total storage, rather than as a percent of the data size?
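
      A quick sanity check of the 1- and 2-sector case, in the same sketch style as the spreadsheet's accounting (illustrative only, not ZFS code):

        def alloc(width, p, data):
            rows, rest = divmod(data, width - p)
            s = rows * width + (rest + p if rest else 0)
            return s + (-s % (p + 1))             # pad to a multiple of p+1

        for p in (1, 2, 3):
            for width in (5, 8, 12):
                # 1- and 2-sector blocks cost the same as a (p+1)-way mirror,
                # regardless of how wide the RAID-Z group is.
                assert alloc(width, p, 1) == (p + 1)
                assert alloc(width, p, 2) == (p + 1) * 2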


  8. Hi, very interesting article. I’m currently building a NAS with ZFS for file storage and your spreadsheet states a 19% storage space cost for RAIDZ3.

    I’m using 24 x 4 TB disks (3.6 TiB) and when I create a 24 drive RAIDZ3 ashift=12 I get 69 TiB of storage, which seems to match the 19% prediction.

    However, if I format the array with RAIDZ3 and ashift=9 I get more storage space: 74 TiB. How can I explain this from your spread sheet?


    • The spreadsheet (just updated to include 24-wide RAIDZ3) shows that parity cost for 128KB blocks will be 25% (see cell T23) of data for 4KB sector size (ashift=12), or 16% (see cell T26) for 512b sectors. As a percentage of the total storage, the parity is 20% (25/(25+100)) or 14% (16/(16+100)) (depending on sector size). This matches the usable space you mentioned.

      I’ve added 3 new sheets to the spreadsheet, showing the parity size as a percent of total storage (the original sheets showed parity size as a percent of the data size). See e.g. https://docs.google.com/a/delphix.com/spreadsheets/d/1tf4qx1aMJp8Lo_R6gpT689wTjHv6CGVElrPqTA0w_ZY/edit?pli=1#gid=1353835485


      • Thanks Matt. I have difficulty reproducing the results; I’m afraid I have difficulty understanding the calculation.

        I’ve been calculating the overhead for my RAIDZ3 24 drive 4K array.

        https://docs.google.com/spreadsheets/d/1pwmlpsVWV5wvlJoMDaUZTJdZD0ua8gDRrr5Z3KsGKwE/edit?usp=sharing

        Although my calculation is spot on for my 24-drive setup, there’s a large discrepancy for other numbers of drives. I’ve created a second tab with manually corrected padding that resembles the real-life observed measurements as closely as possible. But I’m not sure this makes any sense or is even possible (strange stripe padding numbers).

        The real-world ZFS measurements are created by just creating the array and doing a zfs list.

        Hope you have some thoughts on this.


        • You should be able to see the (complicated) math in my spreadsheet.

          I don’t understand your calculation for “sectors per block/stripe”. Is this supposed to be the count of data + parity sectors (excluding padding sectors) for a block with 32 sectors of data? If so, your calculation for this is not correct.

          Take for example the 6-disk case: I think you are calculating that there are 64 sectors (32 data + 32 parity). But that is not the case; there will be ten stripes of 3 data + 3 parity, and one stripe of 2 data + 3 parity, for a total of (10*(3+3)+(2+3)) = 65 sectors. This will then be rounded up to a multiple of 4, so there will be 3 pad sectors, for a total allocation of 68 sectors.
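
          That arithmetic can be reproduced in a few lines (a sketch of the accounting, not the spreadsheet's exact formula):

            width, p, data = 6, 3, 32
            rows, rest = divmod(data, width - p)                # 10 full rows, 2 left over
            sectors = rows * width + (rest + p if rest else 0)  # 10*6 + (2+3) = 65
            sectors += -sectors % (p + 1)                       # 3 pad sectors
            print(sectors)                                      # -> 68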


  9. Hi,

    first of all thank you very much for this insightful information! Especially the spreadsheets are great. Could you please reupload the original file where the overhead of parity and padding was listed? This was very convenient as one could easily see how many disks at which blocksize and RAID level had a capacity penalty compared to an ideal layout (e.g. 4KB sector, 32-sector block, 6-disk RAID-Z2 with 0% overhead because of the 2/3 data/parity ratio and no padding).

    Thanks and keep up the good work! Looking forward to 1MB recordsize.


  10. Hi Matt,
    Thank you for the great spreadsheets, they make what is happening with space usage for various block sizes as simple as … reading a spreadsheet :).

    I am interested in RAIDZx on large disks with 4KB sectors for mostly large files, which would consist of 128KB/4KB = 32-sector blocks. Considering that smaller width is better for safety (“not too wide”) and larger width is better for space efficiency, I might conclude from your spreadsheets that the following widths give optimal space efficiency on 4KB sectors for a given level of safety:

    Raid Level: #disks~%cost of total storage, …
    RAIDZ1: 3~33%, 4~27%, 5~20%, 7~16%, 9~11%, 17~6%
    RAIDZ2: 4~52%, 5~41%, 6~33%, 8~29%, 9~24%, 13~18%, 18~11%
    RAIDZ3: 6~53%, 7~43%, 9~38%, 10~33%, 11~27%, 19~20%

    In other words, a 7 disk RAIDZ2 is no more space efficient than a 6 disk RAIDZ2 for large files (except for a few odd sized smaller files) while slightly lowering safety by being wider.

    For a more general scenario of mixed file sizes on 4KB sector disks where there could be a significant number of small 1 or 2 sector (4KB or 8KB) blocks, you wrote that “RAIDZ-p is no better than p-way mirrors.” Considering that mirrors offer better performance than RAIDZ, would it be beneficial to actually have a Hybrid RAIDZ-p where all 1 or 2 sector blocks are actually sector mirrors (larger blocks of course are written with p parity as before). Then a bunch of small 1 or 2 block reads should perform 2 (RAIDZ1) to 4 (RAIDZ3) times better as they could be read from any of the mirrored sectors? I wouldn’t think that this would increase the complexity outrageously for the performance increase of small reads?

    (Completely off topic, but while I’m writing …)
    Block Pointer Rewrite. I imagine that this is about as difficult as trying to repot a highly cross-linked growing root-bound pot plant into a different shaped pot without damaging the root system. I just watched your OpenZFS Office Hours of 11 Oct 2013 and at 44:52 you discuss Device Removal & Block Pointer Rewrite.

    At 55:38 you propose using an indirect vdev as a solution for device removal with the trade-off of the indirection being permanent (except as files are rewritten) and recursively compounding.

    I wonder if there is another solution by turning the indirection idea upside down. Firstly, BPR is generally cited to solve problems of defragmentation, rebalancing storage, removing devices, adding data or parity devices to a RAIDZx besides other more localised problems of making changes to a few files or folders. If we restrict ourselves to the larger pool wide problems, then a typical solution has been to copy the pool somewhere else, make the desired changes (add an additional data disk to a RAIDZx) and then copy it back. Why can’t this be done live, in place?

    Turning the indirection idea upside-down. When a pool-wide BPR is initiated, place an indirection on the entire pool or the topmost (old) dataset, or place the current topmost (old) dataset as a child of a new parent dataset. Then just proceed to copy / send from the old dataset to the new one. Some extra free space will be required during the copy for any data that is duplicated in-flight by being built in the new dataset before it is deleted from the old one. This should be able to be done without modifying ZFS code across half the codebase in an intricate manner, as the indirection at the top just needs to keep track of whether blocks we are looking for are still in the old, or now in the new dataset (some kind of log file, like the one used to track free space?). (Alternatively, indirection at the bottom of the old dataset to the bottom of the new one could be used to allow deletion.) The procedure should be pausable for heavy load times or run at a low priority like resilvering. Additionally at all live times it should be as robust as snapshots and moving files around between datasets. It would take some effort to correctly copy blocks with many incoming references. A significant proportion of the “copying” is just metadata bookkeeping, as for many scenarios (defragmentation, storage rebalancing) much of the data can physically remain in place. To support this, the extra feature of allowing an old child read-only/delete-only dataset to have a different RAIDZx/MIRROR composition to the new parent/top-level/vdev one would need to be added (or not, as we are still just reading data at the end of block pointer references).

    In summary: Many scenarios requiring BPR are pool wide and currently require offline copying elsewhere AND BACK again to solve the problem. Solve this live/online by placing an indirection on the current pool/top-level dataset and copying once to a new top-level dataset in situ while deleting (indirecting from the bottom) the old dataset. Hopefully should only require some major coding at one or separate levels of ZFS with some sparse interacting changes elsewhere.

    If this last section is too off-topic, please do a BPR and put it elsewhere ;).


  11. Hi Matt,

    I have recently set up a RAIDZ2 with 8 x 4TB drives and an ashift of 12. I noticed that when the pool is empty, the write performance is about 690MB/sec. If I add about 2.8TB of data and test again, it drops to 282MB/sec; after adding more data (about 20GB), it increases to 485MB/sec. I noticed that with ashift=9, the initial performance on the empty pool was similar, but as data was added, the performance dropped quite significantly.

    Is this kind of varying performance normal? Is this because I do not have the ‘recommended’ number of drives for RAIDZ2? (6, 10, 18, etc.)

    I’m willing to put up with these figures if that’s just how it is, but I am concerned that I may have something configured incorrectly here.

    Thanks :)

    • Matthew Ahrens


      I don’t know exactly why that would happen. I don’t see how it could be related to the # drives in the RAIDZ group. It could be related to allocation / data placement, and/or what physical locations are being written (on mechanical drives, the outer tracks have more bandwidth).


      • Ok, thanks for the clarification. Just one more question – are ashifts 9 and 12 the only options or are there others that may be worth trying for 4K drives?

  12. Michael Monnerie


    Great article!
    What I still don’t understand: say we use compression=on (as everyone should these days). Does that RAID sizing/arithmetic/padding still require attention? Or does compression throw away all the planning? Is padding still used then?
    Also, what if we have compression=on, ashift=12 and 4K drives? Does that make a difference compared to 512b drives?
    And I’m concerned about performance with 4K drives and databases: do I understand correctly that DBs will hit a severe performance degradation with RAID-Z and 4K drives, compared to 512b disks? Because even if you only have 4 sectors, that makes 16KB, while the DB usually does 4k or 8k I/O.

    • Matthew Ahrens


      The TL;DR of the article is that you shouldn’t worry about getting an exact RAIDZ stripe width (e.g. due to padding). This is especially the case with compression (because the blocks will be unpredictable sizes).

      If you are using databases, you probably have changed the zfs recordsize to 4k or 8k. In this case (recordsize <=8K) if you are using 4K drives, there is no benefit to using RAID-Z compared to mirroring. This is because RAID-Z stores the parity on a per block basis, and your blocks are a very small number of sectors (just 1 or 2). Once you add the parity block/s, you are doubling the number of sectors so you might as well use mirroring. (Note: it doesn’t matter whether you have compression on or off.)

      • Michael Monnerie


        Does that mean when I use a 6-disk vdev with RAID-Z2, recordsize=4K and 4K disks, and within a file a single 4KB record is changed (as databases do), the write will be data+padding*3+parity*2?

        My idea is to make a primary server with mirrored disks, and a backup with RAID-Z2, and to copy snapshots over there. As far as I understood, the snapshots will be copied as a stream. Does that mean that if on source say 4x4KB records in that file got changed at different times, and are therefore on different positions on disk, they are made one 16KB write during snapshot copy?

        • Matthew Ahrens


          RAID-Z2 with recordsize=4K and 4K disks, every block of that file will be represented as 1 sector (4K) data + 2 sectors (8K) parity. (As shown in the linked spreadsheet.) It doesn’t matter whether you are writing to the blocks sequentially or randomly. You will get the same space usage as a 3-wide mirror.

          For backup, you will presumably not be doing random writes to the files, so you would like to use recordsize=128k. But I don’t think there is an out-of-the-box way to do that with zfs send/receive. You could do it by copying the files (with cp, rsync, tar, etc).

          If 4 adjacent blocks are modified (in any order) between the 2 snapshots specified to “zfs send”, those 4 blocks will be written in order by “zfs receive”.


  13. What would be the recommended record size for a setup of:

    1. 3 4k 4TB drives in raidz1
    2. Used for mixed storage, mostly media files 1-4GB (this will be under a separate volume\partition in the pool)
    3. A separate volume\partition under the pool to store regular files with compression turned on.
    4. A third volume\partition to store virtual box vms when testing or development

    Currently, I created it with ashift=12. However, after reading this I’m trying to figure out whether to re-create the pool with a different record size than ashift=12 (4K). I’m using it mainly for storing media, dumping of files and VMs, and want to get the most out of it.

    What are you thoughts?

    • Matthew Ahrens


      The recordsize property (controlled by “zfs set recordsize=” or “zfs create -o recordsize”) is a different thing than the sector size (ashift). Since your drives are 4k, you should use 4k sector size (aka ashift=12).

      For use cases 2 and 3, you should use the default recordsize of 128K, or consider going larger. For use case 4 (virtual disk images) you could consider a smaller recordsize (e.g. 4k) which might get better performance at the cost of much more space used (approximately double). But the default will probably be fine there too, for casual use.

      If you need better performance, I would recommend you get a 4th drive and use 2x 2-way mirrors.


      • Ahh, truth is all I did was use the command below to create the pool:

        sudo zpool create -f -o ashift=12 tank raidz1 /dev/disk/by-id/wwn-0x50014ee2b637806b /dev/disk/by-id/wwn-0x50014ee2b6387867 /dev/disk/by-id/wwn-0x50014ee20ba7f948

        I then did “sudo zfs create tank/PARTITION” for each partition I needed and “sudo zfs set compression=on tank/junker” to turn on compression for the partition I’ll back up and dump regular files to.

        I just created it last weekend, so I don’t mind redoing it to correct things. I’m only learning about the recordsize option since reading this post :). Also, what would be the options and commands you recommend to set this up?


        • Decided I’ll add a 4th disk and create 2x 2-way mirrors as suggested:

          sudo zpool create -f -o ashift=12 tank mirror disk1 disk2 mirror disk3 disk4

          I’ll set a record size of 4k (sudo zfs set recordsize=4k tank/vms) on the partition\filesystem, as suggested, that will house virtual machines. I guess, depending on the performance of the VMs, this can be done later, since more space may be used. I’ll also use lz4 compression on the partition where I’ll dump files to back up, etc.
