Difference between revisions of "ZFS"

From W9CR

Revision as of 17:33, 1 June 2023

Notes on ZFS

Home setup

On macOS (OS X) I'm running a bunch of 12 TB disks in a raidz2 config. My intent is to migrate to a zpool with special devices in it.

The plan is 20 12 TB disks in 2 raidz2 vdevs, with 3.2 TB SSDs in a mirror. I'll use the M.2 SSD on the server for the ZIL and L2ARC.

This should give about 174.56 TiB of space.
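
The figure above can be reproduced with a quick back-of-the-envelope calculation. This is only a sketch: it assumes the 20 disks are split evenly into two 10-disk raidz2 vdevs, takes 10.91 TiB per disk (the TiB capacity SeaChest reports for these drives in the Disk Notes below), and ignores ZFS metadata, padding and slop-space overhead, so the real usable space will be lower. The function name is made up for illustration.

```python
# Rough data capacity for the planned pool (parity disks excluded).
DISKS = 20            # total 12 TB disks in the plan
VDEVS = 2             # number of raidz2 vdevs
PARITY_PER_VDEV = 2   # raidz2 spends two disks' worth of space on parity per vdev
DISK_TIB = 10.91      # one "12 TB" disk expressed in TiB


def raidz_data_tib(disks, vdevs, parity, disk_tib):
    """Capacity of the non-parity disks across all vdevs, in TiB."""
    data_disks = disks - vdevs * parity
    return data_disks * disk_tib


print(f"{raidz_data_tib(DISKS, VDEVS, PARITY_PER_VDEV, DISK_TIB):.2f} TiB")
# prints "174.56 TiB"
```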


Block Size Histogram

  block   psize                lsize                asize
   size   Count   Size   Cum.  Count   Size   Cum.  Count   Size   Cum.
    512:   350K   175M   175M   350K   175M   175M      0      0      0
     1K:   348K   413M   589M   348K   413M   589M      0      0      0
     2K:   273K   722M  1.28G   273K   722M  1.28G      0      0      0
     4K:   669K  2.65G  3.93G   221K  1.17G  2.45G      0      0      0
     8K:   925K  8.50G  12.4G   176K  1.91G  4.36G  1.23M  14.7G  14.7G
    16K:   620M  9.69T  9.70T   621M  9.70T  9.70T   621M  14.6T  14.6T
    32K:  1.39M  62.8G  9.76T  82.2K  3.57G  9.70T   410K  19.0G  14.6T
    64K:   548K  47.3G  9.81T  47.2K  4.06G  9.71T  1.58M   153G  14.7T
   128K:   825K   150G  9.95T  1014K   128G  9.83T   699K   133G  14.9T
   256K:  66.3M  16.6T  26.5T  68.4M  17.1T  26.9T  66.6M  20.3T  35.1T
   512K:      0      0  26.5T      0      0  26.9T      0      0  35.1T
     1M:      0      0  26.5T      0      0  26.9T      0      0  35.1T
     2M:      0      0  26.5T      0      0  26.9T      0      0  35.1T
     4M:      0      0  26.5T      0      0  26.9T      0      0  35.1T
     8M:      0      0  26.5T      0      0  26.9T      0      0  35.1T
    16M:      0      0  26.5T      0      0  26.9T      0      0  35.1T

Optimization

All disks should be updated to the same, current firmware:

./SeaChest_Firmware_x86_64-redhat-linux --downloadFW /root/MobulaExosX12SAS-STD-5xxE-E004.LOD  -d /dev/sg7

All disks should be 4k sectors. The spinning disks should be long formatted to detect bad blocks.

./SeaChest_Lite_x86_64-redhat-linux  --setSectorSize 4096 --confirm this-will-erase-data -d /dev/sg8

Write cache should be enabled:

# sdparm --get=WCE /dev/sg5
    /dev/sg5: SEAGATE   ST12000NM0027     E004
WCE           0  [cha: y, def:  1, sav:  0]

# sdparm --set=WCE --save /dev/sg5
    /dev/sg5: SEAGATE   ST12000NM0027     E004

# sdparm --get=WCE --save /dev/sg5
    /dev/sg5: SEAGATE   ST12000NM0027     E004
WCE           1  [cha: y, def:  1, sav:  1]
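
In the sdparm output, the number after the field name is the current value, and the bracket holds whether the field is changeable (cha), its default value (def) and its saved value (sav). When checking many disks it can help to parse that line rather than read it by eye. The following is a minimal sketch, assuming the output format shown above; the function and regex names are made up for illustration.

```python
import re

# Matches one field line of `sdparm --get` output, for example:
#   WCE           1  [cha: y, def:  1, sav:  1]
FIELD_RE = re.compile(
    r"^(?P<name>\S+)\s+(?P<cur>-?\d+)\s+"
    r"\[cha:\s*(?P<cha>[yn]),\s*def:\s*(?P<dflt>-?\d+),\s*sav:\s*(?P<sav>-?\d+)\]"
)


def parse_sdparm_field(line):
    """Return the field values as a dict, or None for non-field lines."""
    m = FIELD_RE.match(line.strip())
    if m is None:
        return None  # e.g. the "/dev/sg5: SEAGATE ..." device header line
    return {
        "name": m.group("name"),
        "current": int(m.group("cur")),
        "changeable": m.group("cha") == "y",
        "default": int(m.group("dflt")),
        "saved": int(m.group("sav")),
    }


before = parse_sdparm_field("WCE           0  [cha: y, def:  1, sav:  0]")
after = parse_sdparm_field("WCE           1  [cha: y, def:  1, sav:  1]")
print(before["current"], "->", after["current"])
```

A field whose current and saved values differ was set without --save and will revert after a power cycle.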


ashift                 13 (2^13 = 8192 bytes per I/O)
recordsize             256K
compression            lz4
casesensitivity        insensitive
special_small_blocks   128K
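
ashift is the base-2 logarithm of the smallest I/O size ZFS will issue to a vdev, so the byte size is a simple shift. The sketch below is illustrative only (the helper names are not from any ZFS tool); it compares ashift=13 from the list above with ashift=12, which is what the zpool create command further down actually passes.

```python
def ashift_to_bytes(ashift):
    """Smallest I/O size, in bytes, for a given ashift."""
    return 1 << ashift


def sectors_per_record(recordsize_bytes, ashift):
    """How many ashift-sized sectors one full record occupies."""
    return recordsize_bytes // ashift_to_bytes(ashift)


RECORDSIZE = 256 * 1024  # recordsize=256K from the settings above

for ashift in (12, 13):
    print(ashift, ashift_to_bytes(ashift), sectors_per_record(RECORDSIZE, ashift))
# prints:
# 12 4096 64
# 13 8192 32
```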

zdb -Lbbb  PoolName

zpool create -f -o ashift=12 -O casesensitivity=insensitive -O normalization=formD -O compression=lz4 -O atime=off -O recordsize=256k ZfsMediaPool \
raidz2 /var/run/disk/by-path/PCI0@0-SAT0@17-PRT5@5-PMP@0-@0:0 /var/run/disk/by-path/PCI0@0-SAT0@17-PRT4@4-PMP@0-@0:0 \
/var/run/disk/by-path/PCI0@0-RP21@1B,4-PXSX@0-PRT31@1f-PMP@0-@0:0 /var/run/disk/by-path/PCI0@0-SAT0@17-PRT3@3-PMP@0-@0:0 \
/var/run/disk/by-path/PCI0@0-SAT0@17-PRT2@2-PMP@0-@0:0 /var/run/disk/by-path/PCI0@0-SAT0@17-PRT1@1-PMP@0-@0:0 \
/var/run/disk/by-path/PCI0@0-SAT0@17-PRT0@0-PMP@0-@0:0 /var/run/disk/by-path/PCI0@0-RP21@1B,4-PXSX@0-PRT2@2-PMP@0-@0:0 \
/var/run/disk/by-path/PCI0@0-RP21@1B,4-PXSX@0-PRT3@3-PMP@0-@0:0 /var/run/disk/by-path/PCI0@0-RP21@1B,4-PXSX@0-PRT28@1c-PMP@0-@0:0 \
/var/run/disk/by-path/PCI0@0-RP21@1B,4-PXSX@0-PRT4@4-PMP@0-@0:0 /var/run/disk/by-path/PCI0@0-RP21@1B,4-PXSX@0-PRT29@1d-PMP@0-@0:0

zpool add ZfsMediaPool log /dev/disk5s3
zpool add ZfsMediaPool cache /dev/disk5s4

https://github.com/openzfs/zfs/discussions/12769

Disk Notes

I've run into an issue where the reported disk size ("Max LBA") differs between identical drives, even after formatting:

0:17:0     SEAGATE  ST12000NM0027    E004 Disk    10.91 TB    50:00:C5:00:A6:F0:A0:79 sdn  sg15                    
                    SN:ZJV2GV4B0000C908373F 
0:18:0     SEAGATE  ST12000NM0027    E004 Disk    10.84 TB    50:00:C5:00:A6:F0:CC:89 sdj  sg11                   
                    SN:ZJV2GTFT0000C9069US9

It is strange that the sizes differ. I ran the info command on both drives:

SeaChest_Basics_x86_64-redhat-linux -d /dev/sg11 -i
SeaChest_Basics_x86_64-redhat-linux -d /dev/sg15 -i

This shows the following differences:

Drive Capacity (TB/TiB): 12.00/10.91 
Drive Capacity (TB/TiB): 11.92/10.84

Protection Type 2
Protection Type 2 [Enabled]

Informational Exceptions [Mode 4]
Informational Exceptions [Mode 0]

# sg_readcap -l /dev/sg15
Read Capacity results:
   Protection: prot_en=0, p_type=0, p_i_exponent=0
   Logical block provisioning: lbpme=0, lbprz=0
   Last LBA=2929721343 (0xae9fffff), Number of logical blocks=2929721344
   Logical block length=4096 bytes
   Logical blocks per physical block exponent=0
   Lowest aligned LBA=0
Hence:
   Device size: 12000138625024 bytes, 11444224.0 MiB, 12000.14 GB, 12.00 TB

# sg_readcap -l /dev/sg11
Read Capacity results:
   Protection: prot_en=1, p_type=1, p_i_exponent=0 [type 2 protection]
   Logical block provisioning: lbpme=0, lbprz=0
   Last LBA=2909274111 (0xad67ffff), Number of logical blocks=2909274112
   Logical block length=4096 bytes
   Logical blocks per physical block exponent=0
   Lowest aligned LBA=0
Hence:
   Device size: 11916386762752 bytes, 11364352.0 MiB, 11916.39 GB, 11.92 TB

Thus the smaller drive has something called Protection Type 2 enabled. I had no idea what this was. Some searching turned up this explanation on another website:

Not knowing what this was, I then went down a seemingly never ending spiral of T10 Protection Information [PDF] standards. It's pretty neat. As I understand it, the disk controller formats the platters to 520 byte sectors instead of the more traditional 512 byte sectors. These 8 extra bytes per sector are there for the controller to make sure that the data written to that sector is the same data that is read from it, sort of like data verification. The disk controller can then present the system (HBA controller or RAID card) with the normal 512 bytes of data per sector, and any SCSI compatible controller should be able to read and write to it just fine.

Essentially this is a holdover from 520 byte sectors, where the 8 extra bytes are used for a checksum/parity. This isn't needed with ZFS, which checksums every block itself. It also wastes 79,872 MiB (78 GiB) of space on the disk!
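
The size of the loss can be worked out directly from the two sg_readcap results above: multiply each block count by the 4096 byte block length and subtract. A small sketch using only those reported numbers (the variable and function names are just for illustration); it comes to 79,872 MiB, which is 78 GiB:

```python
BLOCK_BYTES = 4096          # "Logical block length" on both drives
BLOCKS_SG15 = 2929721344    # sg15: prot_en=0, no protection information
BLOCKS_SG11 = 2909274112    # sg11: Protection Type 2 enabled


def device_bytes(blocks, block_bytes=BLOCK_BYTES):
    """Device size in bytes, as sg_readcap derives it."""
    return blocks * block_bytes


lost_bytes = device_bytes(BLOCKS_SG15) - device_bytes(BLOCKS_SG11)

print(device_bytes(BLOCKS_SG15))   # 12000138625024
print(device_bytes(BLOCKS_SG11))   # 11916386762752
print(lost_bytes / 2**20, "MiB")   # 79872.0 MiB
print(lost_bytes / 2**30, "GiB")   # 78.0 GiB
```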

Informational Exceptions

MRIE (Method Of Reporting Informational Exceptions) field

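This field defines how the disk reports informational exception conditions (such as failure predictions) at the SCSI/SAS level.

Page 417, table 391, of this Seagate document (https://www.seagate.com/files/staticfiles/support/docs/manual/Interface%20manuals/100293068j.pdf) gives the following explanations: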

MRIE  Description
0     No reporting of informational exception condition: The device server shall not report informational exception conditions.
1     Asynchronous event reporting: Obsolete.
2     Generate unit attention: The device server shall report informational exception conditions by establishing a unit attention condition (see SAM-5) for the initiator port associated with every I_T nexus, with the additional sense code set to indicate the cause of the informational exception condition. As defined in SAM-5, the command that has the CHECK CONDITION status with the sense key set to UNIT ATTENTION is not processed before the informational exception condition is reported.
3     Conditionally generate recovered error: The device server shall report informational exception conditions, if the reporting of recovered errors is allowed, by returning a CHECK CONDITION status. If the TEST bit is set to zero, the status may be returned after the informational exception condition occurs on any command for which GOOD status or INTERMEDIATE status would have been returned. If the TEST bit is set to one, the status shall be returned on the next command received on any I_T nexus that is normally capable of returning an informational exception condition when the TEST bit is set to zero. The sense key shall be set to RECOVERED ERROR and the additional sense code shall indicate the cause of the informational exception condition. The command that returns the CHECK CONDITION for the informational exception shall complete without error before any informational exception condition may be reported.
4     Unconditionally generate recovered error: The device server shall report informational exception conditions, regardless of whether the reporting of recovered errors is allowed, by returning a CHECK CONDITION status. If the TEST bit is set to zero, the status may be returned after the informational exception condition occurs on any command for which GOOD status or INTERMEDIATE status would have been returned. If the TEST bit is set to one, the status shall be returned on the next command received on any I_T nexus that is normally capable of returning an informational exception condition when the TEST bit is set to zero. The sense key shall be set to RECOVERED ERROR and the additional sense code shall indicate the cause of the informational exception condition. The command that returns the CHECK CONDITION for the informational exception shall complete without error before any informational exception condition may be reported.
5     Generate no sense: The device server shall report informational exception conditions by returning a CHECK CONDITION status. If the TEST bit is set to zero, the status may be returned after the informational exception condition occurs on any command for which GOOD status or INTERMEDIATE status would have been returned. If the TEST bit is set to one, the status shall be returned on the next command received on any I_T nexus that is normally capable of returning an informational exception condition when the TEST bit is set to zero. The sense key shall be set to NO SENSE and the additional sense code shall indicate the cause of the informational exception condition. The command that returns the CHECK CONDITION for the informational exception shall complete without error before any informational exception condition may be reported.
6     Only report informational exception condition on request: The device server shall preserve the informational exception(s) information. To find out about informational exception conditions the application client polls the device server by issuing a REQUEST SENSE command. In the REQUEST SENSE parameter data that contains the sense data, the sense key shall be set to NO SENSE and the additional sense code shall indicate the cause of the informational exception condition.
7-B   Reserved
C-F   Vendor specific
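
For scripting against many disks, the table above boils down to a small lookup. The sketch below is only an illustration of that mapping: the dictionary and function names are made up, and the short names are the row titles from the table, not anything a tool prints.

```python
# Short names for the MRIE values, taken from the table above.
MRIE_METHODS = {
    0x0: "No reporting of informational exception condition",
    0x1: "Asynchronous event reporting (obsolete)",
    0x2: "Generate unit attention",
    0x3: "Conditionally generate recovered error",
    0x4: "Unconditionally generate recovered error",
    0x5: "Generate no sense",
    0x6: "Only report informational exception condition on request",
}


def describe_mrie(code):
    """Map an MRIE value (0x0 to 0xF) to its row title from the table."""
    if not 0x0 <= code <= 0xF:
        raise ValueError("MRIE values in the table run from 0 to F (hex)")
    if code in MRIE_METHODS:
        return MRIE_METHODS[code]
    # 7-B are reserved, C-F are vendor specific.
    return "Reserved" if code <= 0xB else "Vendor specific"


print(describe_mrie(4))    # Unconditionally generate recovered error
print(describe_mrie(0xC))  # Vendor specific
```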

I can find no recommended setting for Linux or ZFS use. From my reading, I will set this to 4.

There are two ways to set this.

SeaChest_SMART

This is the command to set it via SeaChest, followed by the info command to confirm the new mode:

# ./SeaChest_SMART_x86_64-redhat-linux -d /dev/sg11 --setMRIE 4
==========================================================================================
 SeaChest_SMART - Seagate drive utilities - NVMe Enabled
 Copyright (c) 2014-2021 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
 SeaChest_SMART Version: 2.0.1-2_2_3 X86_64
 Build Date: Jun 17 2021
 Today: Thu Jun  1 21:29:34 2023        User: root
==========================================================================================

/dev/sg11 - ST12000NM0027 - ZJV2GTFT0000C9069US9 - SCSI
Successfully set MRIE mode to 4

# ./SeaChest_SMART_x86_64-redhat-linux -d /dev/sg11 -i
==========================================================================================
 SeaChest_SMART - Seagate drive utilities - NVMe Enabled
 Copyright (c) 2014-2021 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
 SeaChest_SMART Version: 2.0.1-2_2_3 X86_64
 Build Date: Jun 17 2021
 Today: Thu Jun  1 21:29:37 2023        User: root
==========================================================================================

/dev/sg11 - ST12000NM0027 - ZJV2GTFT0000C9069US9 - SCSI
        Vendor ID: SEAGATE
        Model Number: ST12000NM0027
        Serial Number: ZJV2GTFT
        PCBA Serial Number: 0000C9069US9
        Firmware Revision: E004
        World Wide Name: 5000C500A6F0CC8B
        Copyright: Copyright (c) 2020 Seagate All rights reserved
        Drive Capacity (TB/TiB): 11.92/10.84
        Temperature Data:
                Current Temperature (C): 36
                Highest Temperature (C): Not Reported
                Lowest Temperature (C): Not Reported
        Power On Time:  2 years 100 days 15 hours 26 minutes
        Power On Hours: 19935.43
        MaxLBA: 2909274111
        Native MaxLBA: Not Reported
        Logical Sector Size (B): 4096
        Physical Sector Size (B): 4096
        Sector Alignment: 0
        Rotation Rate (RPM): 7200
        Form Factor: 3.5"
        Last DST information:
                DST has never been run
        Long Drive Self Test Time:  19 hours 1 minute
        Interface speed:
                Port 0 (Current Port)
                        Max Speed (GB/s): 12.0
                        Negotiated Speed (Gb/s): 12.0
                Port 1
                        Max Speed (GB/s): 12.0
                        Negotiated Speed (Gb/s): Not Reported
        Annualized Workload Rate (TB/yr): 12.36
        Total Bytes Read (TB): 23.82
        Total Bytes Written (TB): 4.30
        Encryption Support: Not Supported
        Cache Size (MiB): Not Reported
        Read Look-Ahead: Enabled
        Non-Volatile Cache: Enabled
        Write Cache: Enabled
        SMART Status: Good
        ATA Security Information: Not Supported
        Firmware Download Support: Full, Segmented, Deferred
        Number of Logical Units: 1
        Specifications Supported:
                SPC-4
                SAM-5
                SAS-3
                SPL-3
                SPC-4
                SBC-3
        Features Supported:
                Protection Type 1
                Protection Type 2 [Enabled]
                Application Client Logging
                Self Test
                Automatic Write Reassignment [Enabled]
                Automatic Read Reassignment [Enabled]
                EPC [Enabled]
                Informational Exceptions [Mode 4]
                Translate Address
                Rebuild Assist
                Seagate Remanufacture
                Seagate In Drive Diagnostics (IDD)
                Format Unit
                Fast Format
                Sanitize
        Adapter Information:
                Vendor ID: 117Ch
                Product ID: 8072h
                Revision: 0006h

sg utils

# sdparm  /dev/sg11  --set=MRIE=4
    /dev/sg11: SEAGATE   ST12000NM0027     E004

# sdparm  /dev/sg11  --get=MRIE
    /dev/sg11: SEAGATE   ST12000NM0027     E004
MRIE          4  [cha: y, def:  0, sav:  4]