ZFS
Notes on ZFS
Home setup
On macOS I'm running a bunch of 12 TB disks in a raidz2 configuration. My intent is to migrate to a zpool with special devices in it.
The plan is 20 12 TB disks in 2 raidz2 vdevs, with 3.2 TB SSDs in a mirror. I'll use the M.2 SSD on the server for ZIL and L2ARC.
This should give about 174.56 TiB of space (16 data disks × 10.91 TiB each, before ZFS overhead).
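The 174.56 TiB figure can be reproduced with a quick calculation. This is only a sketch: the helper name `raidz_capacity_tib` is made up for illustration, it assumes the 20 disks are split evenly into two 10-disk vdevs, and it ignores ZFS metadata and padding overhead.

```shell
# Rough raidz capacity estimate, before ZFS metadata/padding overhead.
# Usage: raidz_capacity_tib VDEVS DISKS_PER_VDEV PARITY_DISKS DISK_TIB
# (hypothetical helper, not part of any ZFS tool)
raidz_capacity_tib() {
    awk -v v="$1" -v d="$2" -v p="$3" -v t="$4" \
        'BEGIN { printf "%.2f\n", v * (d - p) * t }'
}

# 2 vdevs x 10 disks, raidz2 (2 parity disks per vdev), 10.91 TiB per 12 TB disk
raidz_capacity_tib 2 10 2 10.91
```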
Block Size Histogram
block size | psize Count | psize Size | psize Cum. | lsize Count | lsize Size | lsize Cum. | asize Count | asize Size | asize Cum. |
---|---|---|---|---|---|---|---|---|---|
512 | 350K | 175M | 175M | 350K | 175M | 175M | 0 | 0 | 0 |
1K | 348K | 413M | 589M | 348K | 413M | 589M | 0 | 0 | 0 |
2K | 273K | 722M | 1.28G | 273K | 722M | 1.28G | 0 | 0 | 0 |
4K | 669K | 2.65G | 3.93G | 221K | 1.17G | 2.45G | 0 | 0 | 0 |
8K | 925K | 8.50G | 12.4G | 176K | 1.91G | 4.36G | 1.23M | 14.7G | 14.7G |
16K | 620M | 9.69T | 9.70T | 621M | 9.70T | 9.70T | 621M | 14.6T | 14.6T |
32K | 1.39M | 62.8G | 9.76T | 82.2K | 3.57G | 9.70T | 410K | 19.0G | 14.6T |
64K | 548K | 47.3G | 9.81T | 47.2K | 4.06G | 9.71T | 1.58M | 153G | 14.7T |
128K | 825K | 150G | 9.95T | 1014K | 128G | 9.83T | 699K | 133G | 14.9T |
256K | 66.3M | 16.6T | 26.5T | 68.4M | 17.1T | 26.9T | 66.6M | 20.3T | 35.1T |
512K | 0 | 0 | 26.5T | 0 | 0 | 26.9T | 0 | 0 | 35.1T |
1M | 0 | 0 | 26.5T | 0 | 0 | 26.9T | 0 | 0 | 35.1T |
2M | 0 | 0 | 26.5T | 0 | 0 | 26.9T | 0 | 0 | 35.1T |
4M | 0 | 0 | 26.5T | 0 | 0 | 26.9T | 0 | 0 | 35.1T |
8M | 0 | 0 | 26.5T | 0 | 0 | 26.9T | 0 | 0 | 35.1T |
16M | 0 | 0 | 26.5T | 0 | 0 | 26.9T | 0 | 0 | 35.1T |
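One use for this histogram is sizing the special vdev: the cumulative size up to a given bucket hints at how much data a `special_small_blocks` threshold could send there. The sketch below pulls the cumulative psize for one bucket out of histogram rows in zdb's whitespace-separated layout. The function name `cum_psize_at` is invented here, and treating psize (rather than lsize or asize) as the relevant column is my assumption.

```shell
# Reads zdb block-size histogram rows on stdin and prints the cumulative
# psize (4th whitespace-separated field) for the bucket given as $1,
# e.g. "16K:". Hypothetical helper; assumes psize is the column of interest.
cum_psize_at() {
    awk -v bucket="$1" '$1 == bucket { print $4 }'
}

# Example using the 16K row from the histogram above
printf '16K: 620M 9.69T 9.70T 621M 9.70T 9.70T 621M 14.6T 14.6T\n' | cum_psize_at 16K:
```

Under that assumption, the 16K row alone already accounts for 9.70T cumulative, which is worth comparing against the 3.2 TB special mirror before settling on a threshold.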
Things to do
- Enable the write cache (WCE)
- Set MRIE = 4
- Low-level format
- Record the address and power-on hours
One-liner to enable the write cache and set MRIE on every disk:
for i in `seq 2 25` ; do SG="/dev/sg$i" ; echo $SG ;sdparm --set=WCE --save $SG ; sdparm --get=WCE $SG; sdparm --set=MRIE=4 --save $SG; sdparm --get=MRIE $SG; done
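The last to-do item (recording the address and hours) can be scripted in the same spirit. This is a sketch under assumptions: it takes "address" to mean the World Wide Name shown by the SeaChest info output, the function name `wwn_and_hours` is made up, and it expects the tool's normal one-field-per-line output on stdin.

```shell
# Reads SeaChest "-i" output on stdin and prints "WWN,power-on-hours".
# Hypothetical helper; "address" is assumed to mean the World Wide Name.
wwn_and_hours() {
    awk -F': *' '
        /World Wide Name:/ { wwn = $2 }
        /Power On Hours:/  { hours = $2 }
        END { print wwn "," hours }
    '
}

# Example with two lines taken from the info output further down this page
printf '\tWorld Wide Name: 5000C500953FDF87\n\tPower On Hours: 31511.43\n' | wwn_and_hours
```

In practice the input would come from something like `./SeaChest_Basics_x86_64-redhat-linux -d /dev/sg11 -i | wwn_and_hours`, run once per disk.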
Optimization
All disks should be updated to the latest firmware:
./SeaChest_Firmware_x86_64-redhat-linux --downloadFW /root/MobulaExosX12SAS-STD-5xxE-E004.LOD -d /dev/sg7
All disks should use 4K sectors. The spinning disks should get a long (full) format to detect bad blocks.
./SeaChest_Lite_x86_64-redhat-linux --setSectorSize 4096 --confirm this-will-erase-data -d /dev/sg8
Write cache should be enabled:
# sdparm --get=WCE /dev/sg5
/dev/sg5: SEAGATE ST12000NM0027 E004
WCE 0 [cha: y, def: 1, sav: 0]
# sdparm --set=WCE --save /dev/sg5
/dev/sg5: SEAGATE ST12000NM0027 E004
# sdparm --get=WCE --save /dev/sg5
/dev/sg5: SEAGATE ST12000NM0027 E004
WCE 1 [cha: y, def: 1, sav: 1]
Pool settings:
- ashift=13 = 8192 bytes per I/O (note that the create command below uses ashift=12 = 4096 bytes, which matches the 4K sectors)
- recordsize 256K
- compression lz4
- casesensitivity insensitive
- special_small_blocks 128K
zdb -Lbbb PoolName
zpool create -f -o ashift=12 -O casesensitivity=insensitive -O normalization=formD -O compression=lz4 -O atime=off -O recordsize=256k ZfsMediaPool \
raidz2 /var/run/disk/by-path/PCI0@0-SAT0@17-PRT5@5-PMP@0-@0:0 /var/run/disk/by-path/PCI0@0-SAT0@17-PRT4@4-PMP@0-@0:0 \
/var/run/disk/by-path/PCI0@0-RP21@1B,4-PXSX@0-PRT31@1f-PMP@0-@0:0 /var/run/disk/by-path/PCI0@0-SAT0@17-PRT3@3-PMP@0-@0:0 \
/var/run/disk/by-path/PCI0@0-SAT0@17-PRT2@2-PMP@0-@0:0 /var/run/disk/by-path/PCI0@0-SAT0@17-PRT1@1-PMP@0-@0:0 \
/var/run/disk/by-path/PCI0@0-SAT0@17-PRT0@0-PMP@0-@0:0 /var/run/disk/by-path/PCI0@0-RP21@1B,4-PXSX@0-PRT2@2-PMP@0-@0:0 \
/var/run/disk/by-path/PCI0@0-RP21@1B,4-PXSX@0-PRT3@3-PMP@0-@0:0 /var/run/disk/by-path/PCI0@0-RP21@1B,4-PXSX@0-PRT28@1c-PMP@0-@0:0 \
/var/run/disk/by-path/PCI0@0-RP21@1B,4-PXSX@0-PRT4@4-PMP@0-@0:0 /var/run/disk/by-path/PCI0@0-RP21@1B,4-PXSX@0-PRT29@1d-PMP@0-@0:0
zpool add ZfsMediaPool log /dev/disk5s3
zpool add ZfsMediaPool cache /dev/disk5s4
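Since ashift is just a base-2 exponent, the byte size for any value is a one-line shift. A tiny sketch (the function name `ashift_bytes` is invented for illustration):

```shell
# ashift is log2 of the smallest I/O size ZFS will issue to the vdev.
# Hypothetical helper: prints the size in bytes for a given ashift.
ashift_bytes() {
    echo $(( 1 << $1 ))
}

ashift_bytes 12   # 4096, the value used in the zpool create command
ashift_bytes 13   # 8192
```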
https://github.com/openzfs/zfs/discussions/12769
Disk Notes
I've run into an issue where the disk size ("MAX LBA") differs between drives of the same model, even after formatting:
0:17:0 SEAGATE ST12000NM0027 E004 Disk 10.91 TB 50:00:C5:00:A6:F0:A0:79 sdn sg15 SN:ZJV2GV4B0000C908373F
0:18:0 SEAGATE ST12000NM0027 E004 Disk 10.84 TB 50:00:C5:00:A6:F0:CC:89 sdj sg11 SN:ZJV2GTFT0000C9069US9
This is strange: the sizes are different. I ran the info command on both:
SeaChest_Basics_x86_64-redhat-linux -d /dev/sg11 -i
SeaChest_Basics_x86_64-redhat-linux -d /dev/sg15 -i
This shows the following differences:
/dev/sg15 | /dev/sg11 |
---|---|
Drive Capacity (TB/TiB): 12.00/10.91 | Drive Capacity (TB/TiB): 11.92/10.84 |
Protection Type 2 | Protection Type 2 [Enabled] |
Informational Exceptions [Mode 4] | Informational Exceptions [Mode 0] |
sg_readcap -l /dev/sg15
Read Capacity results:
   Protection: prot_en=0, p_type=0, p_i_exponent=0
   Logical block provisioning: lbpme=0, lbprz=0
   Last LBA=2929721343 (0xae9fffff), Number of logical blocks=2929721344
   Logical block length=4096 bytes
   Logical blocks per physical block exponent=0
   Lowest aligned LBA=0
Hence:
   Device size: 12000138625024 bytes, 11444224.0 MiB, 12000.14 GB, 12.00 TB
# sg_readcap -l /dev/sg11
Read Capacity results:
   Protection: prot_en=1, p_type=1, p_i_exponent=0 [type 2 protection]
   Logical block provisioning: lbpme=0, lbprz=0
   Last LBA=2909274111 (0xad67ffff), Number of logical blocks=2909274112
   Logical block length=4096 bytes
   Logical blocks per physical block exponent=0
   Lowest aligned LBA=0
Hence:
   Device size: 11916386762752 bytes, 11364352.0 MiB, 11916.39 GB, 11.92 TB
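The two results above differ in `prot_en` and in the block count. As a rough sketch, a filter like the following pulls out just those two values so a whole shelf of drives can be compared at a glance. The function name `readcap_summary` is made up, and it assumes `sg_readcap -l` prints its usual one-field-per-line layout.

```shell
# Reads `sg_readcap -l` output on stdin and prints "prot_en=<0|1> blocks=<n>".
# Hypothetical helper for spotting drives with protection information enabled.
readcap_summary() {
    awk '
        match($0, /prot_en=[01]/) {
            # "prot_en=" is 8 characters long; the digit follows it
            prot = substr($0, RSTART + 8, 1)
        }
        match($0, /Number of logical blocks=[0-9]+/) {
            # "Number of logical blocks=" is 25 characters long
            blocks = substr($0, RSTART + 25, RLENGTH - 25)
        }
        END { print "prot_en=" prot " blocks=" blocks }
    '
}

# Example with the two relevant lines from the /dev/sg11 result above
printf '   Protection: prot_en=1, p_type=1, p_i_exponent=0 [type 2 protection]\n   Last LBA=2909274111 (0xad67ffff), Number of logical blocks=2909274112\n' | readcap_summary
```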
Thus the smaller drive has something called Protection Type 2 enabled. I had no idea what this was. Some searching turned up this website:
Not knowing what this was, I then went down a seemingly never ending spiral of T10 Protection Information [PDF] standards. It's pretty neat. As I understand it, the disk controller formats the platters to 520 byte sectors instead of the more traditional 512 byte sectors. These 8 extra bytes per sector are there for the controller to make sure that the data written to that sector is the same data that is read from it, sort of like data verification. The disk controller can then present the system (HBA controller or RAID card) with the normal 512 bytes of data per sector, and any SCSI compatible controller should be able to read and write to it just fine.
Essentially this is a holdover from 520-byte sectors, with 8 bytes per sector used for a checksum/parity. This isn't needed with ZFS, which does its own checksumming. It also costs 79,872 MiB (78 GiB) of space on this disk!
Note: I'm not sure the protection information accounts for all of the lost space on this disk. It could be something the OEM does differently, as these were originally 520-byte-sector disks.
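The size of the gap can be checked directly from the two `sg_readcap` block counts. A minimal sketch (the function name `lost_mib` is invented here; the numbers are the ones reported for /dev/sg15 and /dev/sg11):

```shell
# Capacity difference between two drives, from their logical block counts.
# Usage: lost_mib BLOCKS_LARGER BLOCKS_SMALLER BLOCK_SIZE_BYTES
# Hypothetical helper; prints whole MiB.
lost_mib() {
    echo $(( ($1 - $2) * $3 / 1048576 ))
}

mib=$(lost_mib 2929721344 2909274112 4096)
echo "$mib MiB"                 # 79872 MiB
echo "$(( mib / 1024 )) GiB"    # 78 GiB
```

That is the same 79,872 MiB as the difference between the two "Device size" lines (11444224.0 MiB minus 11364352.0 MiB).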
I tried
# SeaChest_Basics_x86_64-redhat-linux -d /dev/sg11 --restoreMaxLBA
And it didn't increase the max LBA.
So let's look at the supported formats for this drive:
# ./SeaChest_Format_x86_64-redhat-linux -d /dev/sg11 --showSupportedFormats
==========================================================================================
SeaChest_Format - Seagate drive utilities - NVMe Enabled
Copyright (c) 2014-2021 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
SeaChest_Format Version: 2.3.1-2_2_3 X86_64
Build Date: Jun 17 2021
Today: Thu Jun 1 23:15:34 2023 User: root
==========================================================================================
/dev/sg12 - ST12000NM0027 - ZJV1HW2T0000C8496S1T - SCSI
Supported Logical Block Sizes and Protection Types:
---------------------------------------------------
* - current device format
PI Key:
    Y - protection type supported at specified block size
    N - protection type not supported at specified block size
    ? - unable to determine support for protection type at specified block size
Relative performance key:
    N/A - relative performance not available.
    Best
    Better
    Good
    Degraded
--------------------------------------------------------------------------------
  Logical Block Size  PI-0  PI-1  PI-2  PI-3  Relative Performance  Metadata Size
--------------------------------------------------------------------------------
                 512     Y     ?     ?     N                   N/A            N/A
                 520     Y     ?     ?     N                   N/A            N/A
                 528     Y     ?     ?     N                   N/A            N/A
*               4096     Y     ?     ?     N                   N/A            N/A
                4112     Y     ?     ?     N                   N/A            N/A
                4160     Y     ?     ?     N                   N/A            N/A
--------------------------------------------------------------------------------
NOTE: Device is not capable of showing all sizes it supports. Only common sizes are listed. Please consult the product manual for all supported combinations.
NOTE: This device supports protection information (PI) (a.k.a. End to End protection).
    Type 0 - No protection beyond transport protocol
    Type 1 - Logical Block Guard and Logical Block Reference Tag
    Type 2 - Logical Block Guard and Logical Block Reference Tag (except first block) 32byte read/write CDBs allowed
    Not all forms of PI are supported on all sector sizes unless otherwise indicated in the device product manual.
NOTE: This device supports Fast Format. Fast format is not instantaneous and is used for switching between 5xx and 4xxx sector sizes. A fast format may take a few minutes or longer but may take longer depending on the size of the drive. Fast format support does not necessarily mean switching sector sizes AND changing PI at the same time is supported. In most cases, a switch of PI type will require a full device format. Fast format mode 1 is typically used to switch from 512 to 4096 block sizes with the current PI scheme.
Well, that was a bust. But going by the last note in that output, it looks like we'll have to try a full format of the disk (again):
# ./SeaChest_Format_x86_64-redhat-linux --protectionType 0 --formatUnit 4096 --confirm this-will-erase-data --poll -d /dev/sg11
==========================================================================================
SeaChest_Format - Seagate drive utilities - NVMe Enabled
Copyright (c) 2014-2021 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
SeaChest_Format Version: 2.3.1-2_2_3 X86_64
Build Date: Jun 17 2021
Today: Thu Jun 1 23:34:40 2023 User: root
==========================================================================================
/dev/sg11 - ST12000NM0027 - ZJV2GTFT0000C9069US9 - SCSI
Format Unit
Performing SCSI drive format.
Depending on the format request, this could take minutes to hours or days.
Do not remove power or attempt other access as interrupting it may make the drive unusable or require performing this command again!!
Progress will be updated every 5 minutes
Percent Complete: 0.00%
After about 48 hours the low-level format finished. I was able to access the disk directly without needing to power cycle it, as I'd needed to do in the past. I'm unsure whether that will always be the case.
Here's the output from the info command. Note that the format reset MRIE to 0!
# ./SeaChest_Basics_x86_64-redhat-linux -d /dev/sg11 -i
==========================================================================================
SeaChest_Basics - Seagate drive utilities - NVMe Enabled
Copyright (c) 2014-2021 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
SeaChest_Basics Version: 3.1.0-2_2_3 X86_64
Build Date: Jun 17 2021
Today: Sun Jun 4 16:00:45 2023 User: root
==========================================================================================
/dev/sg11 - ST12000NM0027 - ZJV0FC790000R8168TJ9 - SCSI
    Vendor ID: SEAGATE
    Model Number: ST12000NM0027
    Serial Number: ZJV0FC79
    PCBA Serial Number: 0000R8168TJ9
    Firmware Revision: E004
    World Wide Name: 5000C500953FDF87
    Copyright: Copyright (c) 2020 Seagate All rights reserved
    Drive Capacity (TB/TiB): 12.00/10.91
    Temperature Data:
        Current Temperature (C): 40
        Highest Temperature (C): Not Reported
        Lowest Temperature (C): Not Reported
    Power On Time: 3 years 217 days 23 hours 26 minutes
    Power On Hours: 31511.43
    MaxLBA: 2929721343
    Native MaxLBA: Not Reported
    Logical Sector Size (B): 4096
    Physical Sector Size (B): 4096
    Sector Alignment: 0
    Rotation Rate (RPM): 7200
    Form Factor: 3.5"
    Last DST information:
        DST has never been run
    Long Drive Self Test Time: 19 hours 8 minutes
    Interface speed:
        Port 0 (Current Port)
            Max Speed (GB/s): 12.0
            Negotiated Speed (Gb/s): 12.0
        Port 1
            Max Speed (GB/s): 12.0
            Negotiated Speed (Gb/s): Not Reported
    Annualized Workload Rate (TB/yr): 10.31
    Total Bytes Read (TB): 27.92
    Total Bytes Written (TB): 9.18
    Encryption Support: Not Supported
    Cache Size (MiB): Not Reported
    Read Look-Ahead: Enabled
    Non-Volatile Cache: Enabled
    Write Cache: Enabled
    SMART Status: Good
    ATA Security Information: Not Supported
    Firmware Download Support: Full, Segmented, Deferred
    Number of Logical Units: 1
    Specifications Supported:
        SPC-4
        SAM-5
        SAS-3
        SPL-3
        SPC-4
        SBC-3
    Features Supported:
        Protection Type 1
        Protection Type 2
        Application Client Logging
        Self Test
        Automatic Write Reassignment [Enabled]
        Automatic Read Reassignment [Enabled]
        EPC [Enabled]
        Informational Exceptions [Mode 0]
        Translate Address
        Rebuild Assist
        Seagate Remanufacture
        Seagate In Drive Diagnostics (IDD)
        Format Unit
        Fast Format
        Sanitize
    Adapter Information:
        Vendor ID: 117Ch
        Product ID: 8072h
        Revision: 0006h
Looking at the size and blocks now, it matches the others, so this worked!
# sg_readcap -l /dev/sg11
Read Capacity results:
   Protection: prot_en=0, p_type=0, p_i_exponent=0
   Logical block provisioning: lbpme=0, lbprz=0
   Last LBA=2929721343 (0xae9fffff), Number of logical blocks=2929721344
   Logical block length=4096 bytes
   Logical blocks per physical block exponent=0
   Lowest aligned LBA=0
Hence:
   Device size: 12000138625024 bytes, 11444224.0 MiB, 12000.14 GB, 12.00 TB
root@Lab-ASL:~# atdevinfo -i all
However, the following confirms that the format reset MRIE to 0. The write cache is still enabled, which is good.
# sdparm /dev/sg11 --get=MRIE
/dev/sg11: SEAGATE ST12000NM0027 E004
MRIE 0 [cha: y, def: 0, sav: 0]
# sdparm --get=WCE /dev/sg11
/dev/sg11: SEAGATE ST12000NM0027 E004
WCE 1 [cha: y, def: 1, sav: 1]
Below is the info from the ATTO utilities (HBA) for this disk:
********************************************************************
Target 3 Unit 0 (Channel 1, Port 0)
********************************************************************
Bus:Target:Unit: 0:3:0
OS Target ID: 25
Vendor: SEAGATE
Product: ST12000NM0027
Firmware Revision: E004
Port Address: 50:00:C5:00:95:3F:DF:85
Node Address: N/A
OS Device Name: /dev/sdj
Device Type: Disk
Serial Number: ZJV0FC790000R8168TJ9
Status: Ready
SES Enclosure: Target 20, LUN 0
SES Slot: 23
SSD: No
Capacity: 10.91 TB
Sector Size: 4096 B
T10-PI: Disabled (Types 1 and 2 supported)
====================================================================
SAS Protocol Information
====================================================================
Initiator Flags: None
Target Flags: SSP
Negotiated Rate: 12 Gb/s
SAS Depth: 1
Slot Number: 20
SAS Port ID: 0
Topology: Expander
Expander PHY ID: 16
====================================================================
Supported Vital Product Data Pages
====================================================================
Device Identification: Supported
Extended Inquiry Data: Supported
Power Condition: Supported
Unit Serial Number: Supported
ATA Information: Unsupported
Block Device Characteristics: Supported
Block Limits: Supported
Logical Block Provisioning: Supported
====================================================================
Block Device Characteristics Information
====================================================================
Medium Rotation Rate: 7200
Form Factor: 3.5 in.
Background Operation Control: Unsupported
====================================================================
Supported Log Pages
====================================================================
Informational Exceptions: Supported
Protocol Specific Port: Supported
Self Test Results: Supported
Temperature: Supported
====================================================================
Temperature Information
====================================================================
Current Temperature: 40 C
Reference Temperature: 60 C
====================================================================
Mode Parameters
====================================================================
Write Caching: Enabled (Default)
Read Ahead: Enabled (Default)
IT Nexus Loss Time: 53.255 s (Default)
Initiator Response Timeout: 53.255 s (Default)
Reject To Open Limit: Vendor Specific (Default)
Maximum Allowed XFER RDY: Unlimited (Read-Only)
Transport Layer Retries: Disabled (Read-Only)
Informational Exceptions
MRIE (Method Of Reporting Informational Exceptions) field
This defines how the disk reports informational exception conditions (for example, predicted-failure warnings) to the host at the SCSI/SAS level.
Page 417, table 391, of this document shows the following explanations:
MRIE | Description |
---|---|
0 | No reporting of informational exception condition: The device server shall not report information exception conditions. |
1 | Asynchronous event reporting: Obsolete |
2 | Generate unit attention: The device server shall report informational exception conditions by establishing a unit attention condition (see SAM-5) for the initiator port associated with every I_T nexus, with the additional sense code set to indicate the cause of the informational exception condition. As defined in SAM-5, the command that has the CHECK CONDITION status with the sense key set to UNIT ATTENTION is not processed before the informational exception condition is reported |
3 | Conditionally generate recovered error: The device server shall report informational exception conditions, if the reporting of recovered errors is allowed, by returning a CHECK CONDITION status. If the TEST bit is set to zero, the status may be returned after the informational exception condition occurs on any command for which GOOD status or INTERMEDIATE status would have been returned. If the TEST bit is set to one, the status shall be returned on the next command received on any I_T nexus that is normally capable of returning an informational exception condition when the test bit is set to zero. The sense key shall be set to RECOVERED ERROR and the additional sense code shall indicate the cause of the informational exception condition. The command that returns the CHECK CONDITION for the informational exception shall complete without error before any informational exception condition may be reported. |
4 | Unconditionally generate recovered error: The device server shall report informational exception conditions, regardless of whether the reporting of recovered errors is allowed, by returning a CHECK CONDITION status. If the TEST bit is set to zero, the status may be returned after the informational exception condition occurs on any command for which GOOD status or INTERMEDIATE status would have been returned. If the TEST bit is set to one, the status shall be returned on the next command received on any I_T nexus that is normally capable of returning an informational exception condition when the TEST bit is set to zero. The sense key shall be set to RECOVERED ERROR and the additional sense code shall indicate the cause of the informational exception condition. The command that returns the CHECK CONDITION for the informational exception shall complete without error before any informational exception condition may be reported. |
5 | Generate no sense: The device server shall report informational exception conditions by returning a CHECK CONDITION status. If the TEST bit is set to zero, the status may be returned after the informational exception condition occurs on any command for which GOOD status or INTERMEDIATE status would have been returned. If the TEST bit is set to one, the status shall be returned on the next command received on any I_T nexus that is normally capable of returning an informational exception condition when the TEST bit is set to zero. The sense key shall be set to NO SENSE and the additional sense code shall indicate the cause of the informational exception condition. The command that returns the CHECK CONDITION for the informational exception shall complete without error before any informational exception condition may be reported. |
6 | Only report informational exception condition on request: The device server shall preserve the informational exception(s) information. To find out about information exception conditions the application client polls the device server by issuing a REQUEST SENSE command. In the REQUEST SENSE parameter data that contains the sense data, the sense key shall be set to NO SENSE and the additional sense code shall indicate the cause of the informational exception condition. |
7-B | Reserved |
C-F | Vendor specific |
I can find no suggested setting of this for Linux or ZFS use. From my reading I will choose to set this to 4.
There are two ways to set this.
SeaChest_SMART
This is the command to set it via SeaChest
# ./SeaChest_SMART_x86_64-redhat-linux -d /dev/sg11 --setMRIE 4
==========================================================================================
SeaChest_SMART - Seagate drive utilities - NVMe Enabled
Copyright (c) 2014-2021 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
SeaChest_SMART Version: 2.0.1-2_2_3 X86_64
Build Date: Jun 17 2021
Today: Thu Jun 1 21:29:34 2023 User: root
==========================================================================================
/dev/sg11 - ST12000NM0027 - ZJV2GTFT0000C9069US9 - SCSI
Successfully set MRIE mode to 4
# ./SeaChest_SMART_x86_64-redhat-linux -d /dev/sg11 -i
==========================================================================================
SeaChest_SMART - Seagate drive utilities - NVMe Enabled
Copyright (c) 2014-2021 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
SeaChest_SMART Version: 2.0.1-2_2_3 X86_64
Build Date: Jun 17 2021
Today: Thu Jun 1 21:29:37 2023 User: root
==========================================================================================
/dev/sg11 - ST12000NM0027 - ZJV2GTFT0000C9069US9 - SCSI
    Vendor ID: SEAGATE
    Model Number: ST12000NM0027
    Serial Number: ZJV2GTFT
    PCBA Serial Number: 0000C9069US9
    Firmware Revision: E004
    World Wide Name: 5000C500A6F0CC8B
    Copyright: Copyright (c) 2020 Seagate All rights reserved
    Drive Capacity (TB/TiB): 11.92/10.84
    Temperature Data:
        Current Temperature (C): 36
        Highest Temperature (C): Not Reported
        Lowest Temperature (C): Not Reported
    Power On Time: 2 years 100 days 15 hours 26 minutes
    Power On Hours: 19935.43
    MaxLBA: 2909274111
    Native MaxLBA: Not Reported
    Logical Sector Size (B): 4096
    Physical Sector Size (B): 4096
    Sector Alignment: 0
    Rotation Rate (RPM): 7200
    Form Factor: 3.5"
    Last DST information:
        DST has never been run
    Long Drive Self Test Time: 19 hours 1 minute
    Interface speed:
        Port 0 (Current Port)
            Max Speed (GB/s): 12.0
            Negotiated Speed (Gb/s): 12.0
        Port 1
            Max Speed (GB/s): 12.0
            Negotiated Speed (Gb/s): Not Reported
    Annualized Workload Rate (TB/yr): 12.36
    Total Bytes Read (TB): 23.82
    Total Bytes Written (TB): 4.30
    Encryption Support: Not Supported
    Cache Size (MiB): Not Reported
    Read Look-Ahead: Enabled
    Non-Volatile Cache: Enabled
    Write Cache: Enabled
    SMART Status: Good
    ATA Security Information: Not Supported
    Firmware Download Support: Full, Segmented, Deferred
    Number of Logical Units: 1
    Specifications Supported:
        SPC-4
        SAM-5
        SAS-3
        SPL-3
        SPC-4
        SBC-3
    Features Supported:
        Protection Type 1
        Protection Type 2 [Enabled]
        Application Client Logging
        Self Test
        Automatic Write Reassignment [Enabled]
        Automatic Read Reassignment [Enabled]
        EPC [Enabled]
        Informational Exceptions [Mode 4]
        Translate Address
        Rebuild Assist
        Seagate Remanufacture
        Seagate In Drive Diagnostics (IDD)
        Format Unit
        Fast Format
        Sanitize
    Adapter Information:
        Vendor ID: 117Ch
        Product ID: 8072h
        Revision: 0006h
sg utils
# sdparm /dev/sg11 --set=MRIE=4
/dev/sg11: SEAGATE ST12000NM0027 E004
# sdparm /dev/sg11 --get=MRIE
/dev/sg11: SEAGATE ST12000NM0027 E004
MRIE 4 [cha: y, def: 0, sav: 4]
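Because a format resets MRIE to 0, it is worth re-checking the value afterwards. The sketch below extracts the current and saved MRIE values from `sdparm --get=MRIE` output so a script can decide whether the setting needs to be applied again. The function name `mrie_values` is made up, and it assumes the output format shown above.

```shell
# Reads `sdparm --get=MRIE` output on stdin and prints "<current> <saved>".
# Hypothetical helper; relies on the "MRIE n [cha: y, def: n, sav: n]" line.
mrie_values() {
    awk '$1 == "MRIE" {
        cur = $2
        sav = $NF
        sub(/\]$/, "", sav)   # strip the closing bracket from "n]"
        print cur, sav
    }'
}

# Example with the output captured right after the format
printf '/dev/sg11: SEAGATE ST12000NM0027 E004\nMRIE 0 [cha: y, def: 0, sav: 0]\n' | mrie_values
```

A result other than `4 4` would mean the disk needs `sdparm --set=MRIE=4 --save` again, as in the one-liner near the top of this page.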