Revision as of 14:26, 1 June 2023

Notes on ZFS

Home setup

On OS X I'm running a bunch of 12 TB disks in a raidz2 config. My intent is to migrate to a zpool that includes special devices.

The plan is 20 12 TB disks in two raidz2 vdevs, with 3.2 TB SSDs in a mirror for the special vdev. I'll use the M.2 SSD in the server for the ZIL (SLOG) and L2ARC.

This should give about 174.56 TiB of raw space (16 data disks x 10.91 TiB each), before ZFS overhead.
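The capacity figure can be reproduced with a few lines of arithmetic. This is a rough sketch, assuming the 20 disks are split evenly into two 10-disk raidz2 vdevs (the split is an assumption) and ignoring ZFS metadata, padding, and slop space. The helper name `raidz_data_tib` is made up for illustration.

```python
# Rough raw-capacity estimate for the planned layout. Real usable space
# will be lower because ZFS metadata, padding, and slop space are ignored.
TIB = 2**40

def raidz_data_tib(vdevs, disks_per_vdev, parity, disk_bytes):
    """Return raw data capacity in TiB for identical raidz vdevs."""
    data_disks = vdevs * (disks_per_vdev - parity)
    return data_disks * disk_bytes / TIB

# Assumption: 2 raidz2 vdevs of 10 disks each; 12 TB = 12 * 10**12 bytes.
estimate = raidz_data_tib(vdevs=2, disks_per_vdev=10, parity=2,
                          disk_bytes=12 * 10**12)
print(f"{estimate:.2f} TiB")
```

Using the unrounded 12 TB value gives a slightly higher number than multiplying the rounded 10.91 TiB by 16, which is where the small difference from 174.56 comes from.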

Block Size Histogram

  block   psize                lsize                asize
   size   Count   Size   Cum.  Count   Size   Cum.  Count   Size   Cum.
    512:   350K   175M   175M   350K   175M   175M      0      0      0
     1K:   348K   413M   589M   348K   413M   589M      0      0      0
     2K:   273K   722M  1.28G   273K   722M  1.28G      0      0      0
     4K:   669K  2.65G  3.93G   221K  1.17G  2.45G      0      0      0
     8K:   925K  8.50G  12.4G   176K  1.91G  4.36G  1.23M  14.7G  14.7G
    16K:   620M  9.69T  9.70T   621M  9.70T  9.70T   621M  14.6T  14.6T
    32K:  1.39M  62.8G  9.76T  82.2K  3.57G  9.70T   410K  19.0G  14.6T
    64K:   548K  47.3G  9.81T  47.2K  4.06G  9.71T  1.58M   153G  14.7T
   128K:   825K   150G  9.95T  1014K   128G  9.83T   699K   133G  14.9T
   256K:  66.3M  16.6T  26.5T  68.4M  17.1T  26.9T  66.6M  20.3T  35.1T
   512K:      0      0  26.5T      0      0  26.9T      0      0  35.1T
     1M:      0      0  26.5T      0      0  26.9T      0      0  35.1T
     2M:      0      0  26.5T      0      0  26.9T      0      0  35.1T
     4M:      0      0  26.5T      0      0  26.9T      0      0  35.1T
     8M:      0      0  26.5T      0      0  26.9T      0      0  35.1T
    16M:      0      0  26.5T      0      0  26.9T      0      0  35.1T
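One use of this histogram is estimating how much data sits in blocks at or below a candidate special_small_blocks threshold. The sketch below copies the cumulative lsize column from the table. It assumes lsize is the column of interest and takes the bucket labels at face value, so treat the result as a ballpark only. The names `CUM_LSIZE`, `to_bytes`, and `fraction_at_or_below` are made up for illustration.

```python
# Cumulative lsize column copied from the histogram (block size -> Cum.).
CUM_LSIZE = {
    "512": "175M", "1K": "589M", "2K": "1.28G", "4K": "2.45G",
    "8K": "4.36G", "16K": "9.70T", "32K": "9.70T", "64K": "9.71T",
    "128K": "9.83T", "256K": "26.9T",
}

UNITS = {"K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}

def to_bytes(text):
    """Convert a zdb-style size such as '9.83T' or '512' to bytes."""
    if text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)

def fraction_at_or_below(threshold):
    """Share of logical data in buckets up to and including `threshold`."""
    total = to_bytes(CUM_LSIZE["256K"])  # largest populated bucket
    return to_bytes(CUM_LSIZE[threshold]) / total

for size in ("8K", "16K", "128K"):
    print(size, f"{fraction_at_or_below(size):.2%}")
```

On this data the 16K bucket dominates, so any threshold of 16K or higher covers roughly 9.7T of logical data. That is worth comparing against the size of the special mirror before picking a value.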

Optimization

All disks should be updated to the latest firmware:

./SeaChest_Firmware_x86_64-redhat-linux --downloadFW /root/MobulaExosX12SAS-STD-5xxE-E004.LOD  -d /dev/sg7

All disks should use 4K (4096-byte) sectors. The spinning disks should get a long format to detect bad blocks.

./SeaChest_Lite_x86_64-redhat-linux  --setSectorSize 4096 --confirm this-will-erase-data -d /dev/sg8

Write cache should be enabled:

# sdparm --get=WCE /dev/sg5
    /dev/sg5: SEAGATE   ST12000NM0027     E004
WCE           0  [cha: y, def:  1, sav:  0]

# sdparm --set=WCE --save /dev/sg5
    /dev/sg5: SEAGATE   ST12000NM0027     E004

# sdparm --get=WCE --save /dev/sg5
    /dev/sg5: SEAGATE   ST12000NM0027     E004
WCE           1  [cha: y, def:  1, sav:  1]
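With many disks it can help to check the WCE state in a script rather than by eye. This is a minimal sketch that parses text in the same shape as the sdparm output shown above. The function name `parse_wce` is made up, and the regular expression has only been matched against these two sample outputs, so other sdparm versions may format the line differently.

```python
import re

def parse_wce(sdparm_output):
    """Return (current, saved) WCE values from `sdparm --get=WCE` output."""
    match = re.search(r"^WCE\s+(\d+)\s+\[.*sav:\s*(\d+)\]",
                      sdparm_output, re.MULTILINE)
    if match is None:
        raise ValueError("no WCE line found")
    return int(match.group(1)), int(match.group(2))

# Sample outputs copied from the transcript above.
before = """    /dev/sg5: SEAGATE   ST12000NM0027     E004
WCE           0  [cha: y, def:  1, sav:  0]"""
after = """    /dev/sg5: SEAGATE   ST12000NM0027     E004
WCE           1  [cha: y, def:  1, sav:  1]"""

print(parse_wce(before))
print(parse_wce(after))
```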

ashift=13 means 2^13 = 8192 bytes per I/O. Note that the zpool create command below uses ashift=12 (4096 bytes).
recordsize             256K
compression            lz4
casesensitivity        insensitive
special_small_blocks   128K
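Since ashift is a base-2 exponent, the matching I/O size is a single shift. A tiny sketch (the function name `ashift_to_bytes` is made up for illustration):

```python
def ashift_to_bytes(ashift):
    """ashift is the base-2 logarithm of the sector size ZFS assumes."""
    return 1 << ashift

for ashift in (9, 12, 13):
    print(f"ashift={ashift} -> {ashift_to_bytes(ashift)} bytes")
```

For disks formatted with 4096-byte sectors, ashift=12 matches the sector size exactly, while ashift=13 rounds every allocation up to 8192 bytes.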

The block size histogram above comes from:

zdb -Lbbb PoolName

zpool create -f -o ashift=12 -O casesensitivity=insensitive -O normalization=formD -O compression=lz4 -O atime=off -O recordsize=256k ZfsMediaPool \
raidz2 /var/run/disk/by-path/PCI0@0-SAT0@17-PRT5@5-PMP@0-@0:0 /var/run/disk/by-path/PCI0@0-SAT0@17-PRT4@4-PMP@0-@0:0 \
/var/run/disk/by-path/PCI0@0-RP21@1B,4-PXSX@0-PRT31@1f-PMP@0-@0:0 /var/run/disk/by-path/PCI0@0-SAT0@17-PRT3@3-PMP@0-@0:0 \
/var/run/disk/by-path/PCI0@0-SAT0@17-PRT2@2-PMP@0-@0:0 /var/run/disk/by-path/PCI0@0-SAT0@17-PRT1@1-PMP@0-@0:0 \
/var/run/disk/by-path/PCI0@0-SAT0@17-PRT0@0-PMP@0-@0:0 /var/run/disk/by-path/PCI0@0-RP21@1B,4-PXSX@0-PRT2@2-PMP@0-@0:0 \
/var/run/disk/by-path/PCI0@0-RP21@1B,4-PXSX@0-PRT3@3-PMP@0-@0:0 /var/run/disk/by-path/PCI0@0-RP21@1B,4-PXSX@0-PRT28@1c-PMP@0-@0:0 \
/var/run/disk/by-path/PCI0@0-RP21@1B,4-PXSX@0-PRT4@4-PMP@0-@0:0 /var/run/disk/by-path/PCI0@0-RP21@1B,4-PXSX@0-PRT29@1d-PMP@0-@0:0

zpool add ZfsMediaPool log /dev/disk5s3
zpool add ZfsMediaPool cache /dev/disk5s4


https://github.com/openzfs/zfs/discussions/12769

Disk Notes

I've run into an issue where the reported disk size (max LBA) differs between two drives of the same model, even after formatting:

0:17:0     SEAGATE  ST12000NM0027    E004 Disk    10.91 TB    50:00:C5:00:A6:F0:A0:79 sdn  sg15                    
                    SN:ZJV2GV4B0000C908373F 
0:18:0     SEAGATE  ST12000NM0027    E004 Disk    10.84 TB    50:00:C5:00:A6:F0:CC:89 sdj  sg11                   
                    SN:ZJV2GTFT0000C9069US9

It is strange that the sizes differ. I ran the info command on both drives:

SeaChest_Basics_x86_64-redhat-linux -d /dev/sg11 -i
SeaChest_Basics_x86_64-redhat-linux -d /dev/sg15 -i

The two outputs differ in the following fields (sg15 first, then sg11 in each pair):

Drive Capacity (TB/TiB): 12.00/10.91 
Drive Capacity (TB/TiB): 11.92/10.84

Protection Type 2
Protection Type 2 [Enabled]

Informational Exceptions [Mode 4]
Informational Exceptions [Mode 0]

# sg_readcap -l /dev/sg15
Read Capacity results:
   Protection: prot_en=0, p_type=0, p_i_exponent=0
   Logical block provisioning: lbpme=0, lbprz=0
   Last LBA=2929721343 (0xae9fffff), Number of logical blocks=2929721344
   Logical block length=4096 bytes
   Logical blocks per physical block exponent=0
   Lowest aligned LBA=0
Hence:
   Device size: 12000138625024 bytes, 11444224.0 MiB, 12000.14 GB, 12.00 TB

# sg_readcap -l /dev/sg11
Read Capacity results:
   Protection: prot_en=1, p_type=1, p_i_exponent=0 [type 2 protection]
   Logical block provisioning: lbpme=0, lbprz=0
   Last LBA=2909274111 (0xad67ffff), Number of logical blocks=2909274112
   Logical block length=4096 bytes
   Logical blocks per physical block exponent=0
   Lowest aligned LBA=0
Hence:
   Device size: 11916386762752 bytes, 11364352.0 MiB, 11916.39 GB, 11.92 TB

Thus the smaller drive has Protection Type 2 enabled. I had no idea what this was. Some searching turned up this explanation on another website:

Not knowing what this was, I then went down a seemingly never-ending spiral of T10 Protection Information [PDF] standards. It's pretty neat. As I understand it, the disk controller formats the platters to 520-byte sectors instead of the more traditional 512-byte sectors. The 8 extra bytes per sector let the controller verify that the data read from a sector is the same data that was written to it. The disk controller then presents the system (HBA or RAID card) with the normal 512 bytes of data per sector, and any SCSI-compatible controller should be able to read and write to it just fine.

Essentially this is a holdover from 520-byte sectors, where 8 extra bytes per sector hold a checksum. ZFS does its own checksumming, so this isn't needed. It also wastes 79,872 MiB (78 GiB, about 83.75 GB) of space on this disk!
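The lost space follows directly from the two sg_readcap results: block count times logical block length gives the device size, and the difference between the two drives is the space given up with protection enabled. A short sketch using only the numbers reported above (the variable names are made up for illustration):

```python
BLOCK = 4096  # logical block length reported by sg_readcap for both drives

blocks_plain = 2929721344      # /dev/sg15, prot_en=0
blocks_protected = 2909274112  # /dev/sg11, prot_en=1

size_plain = blocks_plain * BLOCK
size_protected = blocks_protected * BLOCK
lost = size_plain - size_protected

print(f"plain:     {size_plain} bytes")
print(f"protected: {size_protected} bytes")
print(f"lost:      {lost / 2**20:.0f} MiB = {lost / 2**30:.0f} GiB "
      f"= {lost / 10**9:.2f} GB")
```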
