Storage Server

My documentation on building a storage server and Proxmox host.

Parts

- Chassis Manual

Disk Layout

localDataStore

  • ZFS raidz2
Zpool layout:
4 × 6-disk raidz2
1 × 5-disk raidz2
Special - 3 × M.2 flash - 4 TB drives, provisioned at 768 GiB to start
ZIL - 16 GiB - Optane
L2ARC - 333 GiB - Optane
Note: the Optane drive is shared between the ZIL and L2ARC.
  • special device layout
# sgdisk -p /dev/disk/by-enclosure-slot/nvme-red
Disk /dev/disk/by-enclosure-slot/nvme-red: 7814037168 sectors, 3.6 TiB
Sector size (logical/physical): 512/512 bytes
Disk identifier (GUID): 8C156829-68A0-4EE4-AE89-BDFA828CC5A1
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 7814037134
Partitions will be aligned on 2048-sector boundaries
Total free space is 6203424365 sectors (2.9 TiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048      1610614783   768.0 GiB   BF01  Solaris /usr & Mac ZFS
  • Copy the red drive's partition table to the others (and see the GUID note below)
sgdisk /dev/disk/by-enclosure-slot/nvme-red -R /dev/disk/by-enclosure-slot/nvme-blue
sgdisk /dev/disk/by-enclosure-slot/nvme-red -R /dev/disk/by-enclosure-slot/nvme-green
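Note that replicating the table also copies the disk and partition GUIDs verbatim, so it's a good idea to randomize them on the copies afterwards (sgdisk's -G option does exactly that):

sgdisk -G /dev/disk/by-enclosure-slot/nvme-blue
sgdisk -G /dev/disk/by-enclosure-slot/nvme-green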
  • optane setup (orange)
# sgdisk -p /dev/disk/by-enclosure-slot/nvme-orange
Disk /dev/disk/by-enclosure-slot/nvme-orange: 91573146 sectors, 349.3 GiB
Sector size (logical/physical): 4096/4096 bytes
Disk identifier (GUID): 70C34BA9-1175-42A5-B00E-16CFC022861F
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 5
First usable sector is 6, last usable sector is 91573140
Partitions will be aligned on 256-sector boundaries
Total free space is 1395599 sectors (5.3 GiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1             256         8388863   32.0 GiB    BF01  Solaris /usr & Mac ZFS
   2         8388864        90177791   312.0 GiB   BF01  Solaris /usr & Mac ZFS
  • make the array
zpool create -f -o ashift=12 -O compression=lz4 -O atime=off -O xattr=sa \
-O recordsize=128k localDataStore \
raidz2 \
/dev/disk/by-enclosure-slot/rear-slot000 \
/dev/disk/by-enclosure-slot/rear-slot001 \
/dev/disk/by-enclosure-slot/rear-slot002 \
/dev/disk/by-enclosure-slot/rear-slot003 \
/dev/disk/by-enclosure-slot/rear-slot004 \
/dev/disk/by-enclosure-slot/rear-slot005 \
raidz2 \
/dev/disk/by-enclosure-slot/rear-slot006 \
/dev/disk/by-enclosure-slot/rear-slot007 \
/dev/disk/by-enclosure-slot/rear-slot008 \
/dev/disk/by-enclosure-slot/rear-slot009 \
/dev/disk/by-enclosure-slot/rear-slot010 \
/dev/disk/by-enclosure-slot/rear-slot011 \
raidz2 \
/dev/disk/by-enclosure-slot/front-slot000 \
/dev/disk/by-enclosure-slot/front-slot001 \
/dev/disk/by-enclosure-slot/front-slot002 \
/dev/disk/by-enclosure-slot/front-slot003 \
/dev/disk/by-enclosure-slot/front-slot004 \
/dev/disk/by-enclosure-slot/front-slot005 \
raidz2 \
/dev/disk/by-enclosure-slot/front-slot006 \
/dev/disk/by-enclosure-slot/front-slot007 \
/dev/disk/by-enclosure-slot/front-slot008 \
/dev/disk/by-enclosure-slot/front-slot009 \
/dev/disk/by-enclosure-slot/front-slot010 \
/dev/disk/by-enclosure-slot/front-slot011 \
special mirror \
/dev/disk/by-enclosure-slot/nvme-red-part1 \
/dev/disk/by-enclosure-slot/nvme-green-part1 \
/dev/disk/by-enclosure-slot/nvme-blue-part1 \
cache /dev/disk/by-enclosure-slot/nvme-orange-part2 \
log /dev/disk/by-enclosure-slot/nvme-orange-part1 
  • check that it's online
zpool status localDataStore
  pool: localDataStore
 state: ONLINE
config:

        NAME                                         STATE     READ WRITE CKSUM
        localDataStore                               ONLINE       0     0     0
          raidz2-0                                   ONLINE       0     0     0
            disk/by-enclosure-slot/rear-slot000      ONLINE       0     0     0
            disk/by-enclosure-slot/rear-slot001      ONLINE       0     0     0
            disk/by-enclosure-slot/rear-slot002      ONLINE       0     0     0
            disk/by-enclosure-slot/rear-slot003      ONLINE       0     0     0
            disk/by-enclosure-slot/rear-slot004      ONLINE       0     0     0
            disk/by-enclosure-slot/rear-slot005      ONLINE       0     0     0
          raidz2-1                                   ONLINE       0     0     0
            disk/by-enclosure-slot/rear-slot006      ONLINE       0     0     0
            disk/by-enclosure-slot/rear-slot007      ONLINE       0     0     0
            disk/by-enclosure-slot/rear-slot008      ONLINE       0     0     0
            disk/by-enclosure-slot/rear-slot009      ONLINE       0     0     0
            disk/by-enclosure-slot/rear-slot010      ONLINE       0     0     0
            disk/by-enclosure-slot/rear-slot011      ONLINE       0     0     0
          raidz2-2                                   ONLINE       0     0     0
            disk/by-enclosure-slot/front-slot000     ONLINE       0     0     0
            disk/by-enclosure-slot/front-slot001     ONLINE       0     0     0
            disk/by-enclosure-slot/front-slot002     ONLINE       0     0     0
            disk/by-enclosure-slot/front-slot003     ONLINE       0     0     0
            disk/by-enclosure-slot/front-slot004     ONLINE       0     0     0
            disk/by-enclosure-slot/front-slot005     ONLINE       0     0     0
          raidz2-3                                   ONLINE       0     0     0
            disk/by-enclosure-slot/front-slot006     ONLINE       0     0     0
            disk/by-enclosure-slot/front-slot007     ONLINE       0     0     0
            disk/by-enclosure-slot/front-slot008     ONLINE       0     0     0
            disk/by-enclosure-slot/front-slot009     ONLINE       0     0     0
            disk/by-enclosure-slot/front-slot010     ONLINE       0     0     0
            disk/by-enclosure-slot/front-slot011     ONLINE       0     0     0
        special
          mirror-4                                   ONLINE       0     0     0
            disk/by-enclosure-slot/nvme-red-part1    ONLINE       0     0     0
            disk/by-enclosure-slot/nvme-green-part1  ONLINE       0     0     0
            disk/by-enclosure-slot/nvme-blue-part1   ONLINE       0     0     0
        logs
          disk/by-enclosure-slot/nvme-orange-part1   ONLINE       0     0     0
        cache
          disk/by-enclosure-slot/nvme-orange-part2   ONLINE       0     0     0
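Optionally, double-check that ashift=12 actually applied to every vdev; a quick way (assuming the default zpool.cache is still present at this point) is to grep it out of zdb's config dump:

zdb -C localDataStore | grep ashift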


Boot Disks

rpool: the SATA disks will be Linux boot disks in a standard Linux RAID (maybe a ZFS mirror?).

  • underprovision the SATA boot disks
hdparm -Np976762584 --yes-i-know-what-i-am-doing /dev/disk/by-enclosure-slot/sata-bottom

Note that you must power cycle the box (a hard power-off, not a reset) for the disks to accept this.
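After the power cycle, hdparm -N with no value just reads the setting back, which is an easy way to confirm the new max sector count took:

hdparm -N /dev/disk/by-enclosure-slot/sata-bottom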

SAS controller

The built-in controller supports HBA mode, per Supermicro.

https://docs.broadcom.com/doc/pub-005110

I had to pick up an AOM-S3008M-L8 HBA, which fits in place of the included RAID controller, as even in JBOD mode the RAID controller will not talk to the expanders, so ledon can't be used to identify failed disks.


https://www.reddit.com/r/homelab/comments/iqz7xc/supermicro_s3108_in_jbod_mode/

JBOD mode doesn't work for this; everything is proxied through the RAID controller.

Updating Disks

Firmware

The current firmware for these disks is E004, but they shipped with E002.

for i in `seq 2 25` ; do SeaChest_Firmware --downloadFW ./EvansExosX16SAS-STD-512E-E004.LOD -d /dev/sg$i; done

# SeaChest_Firmware -s
==========================================================================================
 SeaChest_Firmware - Seagate drive utilities - NVMe Enabled
 Copyright (c) 2014-2021 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
 SeaChest_Firmware Version: 3.0.0-2_2_1 X86_64
 Build Date: Apr 27 2021
 Today: Tue Oct 10 22:14:38 2023        User: root
==========================================================================================
nvme_ioctl_id: Inappropriate ioctl for device
Vendor   Handle       Model Number            Serial Number          FwRev
LSI      /dev/sg0     SAS3x28                                        0705
LSI      /dev/sg1     SAS3x28                                        0705
SEAGATE  /dev/sg10    ST16000NM002G           ZL20AJ3P               E004
SEAGATE  /dev/sg11    ST16000NM002G           ZL231860               E004
SEAGATE  /dev/sg12    ST16000NM002G           ZL21T6Q1               E004
SEAGATE  /dev/sg13    ST16000NM002G           ZL231PV0               E004
SEAGATE  /dev/sg14    ST16000NM002G           ZL21TDHH               E004
SEAGATE  /dev/sg15    ST16000NM002G           ZL21Y85R               E004
SEAGATE  /dev/sg16    ST16000NM002G           ZL21S9FQ               E004
SEAGATE  /dev/sg17    ST16000NM002G           ZL21T38M               E004
SEAGATE  /dev/sg18    ST16000NM002G           ZL22C3MJ               E004
SEAGATE  /dev/sg19    ST16000NM002G           ZL21S9J3               E004
SEAGATE  /dev/sg2     ST16000NM002G           ZL21S9WK               E004
SEAGATE  /dev/sg20    ST16000NM002G           ZL21S9AW               E004
SEAGATE  /dev/sg21    ST16000NM002G           ZL21TGY1               E004
SEAGATE  /dev/sg22    ST16000NM002G           ZL20CRL7               E004
SEAGATE  /dev/sg23    ST16000NM002G           ZL21RP9E               E004
SEAGATE  /dev/sg24    ST16000NM002G           ZL21RNZW               E004
SEAGATE  /dev/sg25    ST16000NM002G           ZL21JYXF               E004
ATA      /dev/sg26    Samsung SSD 870 EVO 1TB S75BNL0W812633P        SVT03B6Q
ATA      /dev/sg27    Samsung SSD 870 EVO 1TB S75BNS0W642820L        SVT03B6Q
SEAGATE  /dev/sg3     ST16000NM002G           ZL21V48W               E004
SEAGATE  /dev/sg4     ST16000NM002G           ZL21T7XK               E004
SEAGATE  /dev/sg5     ST16000NM002G           ZL21T8HS               E004
SEAGATE  /dev/sg6     ST16000NM002G           ZL21SBMS               E004
SEAGATE  /dev/sg7     ST16000NM002G           ZL21SRYP               E004
SEAGATE  /dev/sg8     ST16000NM002G           ZL21LVPQ               E004
SEAGATE  /dev/sg9     ST16000NM002G           ZL21TB40               E004

NVMe     /dev/nvme0n1 Samsung SSD 990 PRO 4TB S7KGNJ0W912464T        0B2QJXG7

Low Level Format

It's necessary to low-level format these disks, as we need to turn off Protection Type 2, test the sectors of the drive, and change it to a 4096-byte sector size.

 # SeaChest_Format --protectionType 0 --formatUnit 4096 --confirm this-will-erase-data --poll -d /dev/sg2
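To format all of the SAS drives, the firmware loop above can be reused; this is a sketch assuming the drives are still /dev/sg2 through /dev/sg25, and with --poll each drive is formatted to completion before the next one starts:

for i in `seq 2 25` ; do SeaChest_Format --protectionType 0 --formatUnit 4096 --confirm this-will-erase-data --poll -d /dev/sg$i; done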

NVME Config

This was the most difficult part. The onboard NVMe ports under the power supply will take a standard NVMe cable, but they do not supply any power. The M.2-to-SFF-8612 adapters don't have a power connector, so I added 3.3 V feeds on the filter cap. This was kind of hacky, but it works.

Optane

The Optane SSD needs to be changed to 4k sectors, but the nvme command doesn't work on it. Intel has an intelmas program that does: https://community.intel.com/t5/Intel-Optane-Solid-State-Drives/4K-format-on-Optane-SSD-P1600X/m-p/1477181
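For reference, the stock nvme-cli way to select LBA format 3 (4k) would be the command below; it's what fails on this drive:

nvme format /dev/nvme2n1 --lbaf=3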


   root@pve01:~# nvme id-ns /dev/nvme2n1 -H -n 1
   <snip>
   LBA Format  0 : Metadata Size: 0   bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good (in use)
   LBA Format  1 : Metadata Size: 8   bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good
   LBA Format  2 : Metadata Size: 16  bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good
   LBA Format  3 : Metadata Size: 0   bytes - Data Size: 4096 bytes - Relative Performance: 0 Best
   LBA Format  4 : Metadata Size: 8   bytes - Data Size: 4096 bytes - Relative Performance: 0 Best
   LBA Format  5 : Metadata Size: 64  bytes - Data Size: 4096 bytes - Relative Performance: 0 Best
   LBA Format  6 : Metadata Size: 128 bytes - Data Size: 4096 bytes - Relative Performance: 0 Best
    
root@pve01:~# intelmas start -intelssd 2 -nvmeformat LBAFormat=3
WARNING! You have selected to format the drive!
Proceed with the format? (Y|N): y
Formatting...(This can take several minutes to complete)

- Intel Optane(TM) SSD DC P4801X Series PHKM2051009H375A -

Status : NVMeFormat successful.

root@pve01:~# nvme list
Node                  Generic               SN                   Model                                    Namespace Usage                      Format           FW Rev
--------------------- --------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme3n1          /dev/ng3n1            S7KGNJ0W912446D      Samsung SSD 990 PRO 4TB                  1         449.25  GB /   4.00  TB    512   B +  0 B   0B2QJXG7
/dev/nvme2n1          /dev/ng2n1            PHKM2051009H375A     INTEL SSDPEL1K375GA                      1         375.08  GB / 375.08  GB      4 KiB +  0 B   E2010600
/dev/nvme1n1          /dev/ng1n1            S7KGNJ0W912452X      Samsung SSD 990 PRO 4TB                  1         449.27  GB /   4.00  TB    512   B +  0 B   0B2QJXG7
/dev/nvme0n1          /dev/ng0n1            S7KGNJ0W912464T      Samsung SSD 990 PRO 4TB                  1         449.27  GB /   4.00  TB    512   B +  0 B   0B2QJXG7

Partition the disk

Disk /dev/nvme2n1: 91573146 sectors, 349.3 GiB
Model: INTEL SSDPEL1K375GA
Sector size (logical/physical): 4096/4096 bytes
Disk identifier (GUID): 70C34BA9-1175-42A5-B00E-16CFC022861F
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 5
First usable sector is 6, last usable sector is 91573140
Partitions will be aligned on 256-sector boundaries
Total free space is 1395599 sectors (5.3 GiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1             256         8388863   32.0 GiB    BF01  Solaris /usr & Mac ZFS
   2         8388864        90177791   312.0 GiB   BF01  Solaris /usr & Mac ZFS

Command (? for help): w

Final checks complete. About to write GPT data. THIS WILL OVERWRITE EXISTING
PARTITIONS!!

Do you want to proceed? (Y/N): y
OK; writing new GUID partition table (GPT) to /dev/nvme2n1.
The operation has completed successfully.
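The same two partitions can be created non-interactively with sgdisk instead of gdisk, using the start/end sectors from the table above:

sgdisk -n 1:256:8388863 -t 1:BF01 -n 2:8388864:90177791 -t 2:BF01 /dev/nvme2n1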

Physical layout

Mandatory config

The disks don't deal well with the old NVMe support on this box (it's listed as OCuLink 0.91, not 1.0), and there are some bugs in Linux around this and power handling.

A setting must be added to the kernel command line:

   vim /etc/kernel/cmdline 

add this to it:

   nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

then

   proxmox-boot-tool refresh
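After rebooting, confirm the options made it onto the running kernel command line:

   cat /proc/cmdline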

udev rules

https://www.reactivated.net/writing_udev_rules.html#strsubst

https://github.com/bkus/by-enclosure-slot

Updated code for the supermicro:

https://github.com/W9CR/by-enclosure-slot/tree/Supermicro-server

To get this working on boot with Proxmox, you need to do the following:

In /etc/default/zfs set:
ZPOOL_IMPORT_PATH="/dev/disk/by-enclosure-slot"
rm /etc/zfs/zpool.cache
update-initramfs -u

Then power cycle it and confirm the new names are in use. Once that's done, run update-initramfs -u again to store the ZFS pool cache in the initramfs (optional, but it makes boot faster).
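A quick way to confirm is to list the enclosure-slot names and check that the pool imported with those paths (zpool status -P prints full device paths):

ls -l /dev/disk/by-enclosure-slot/
zpool status -P localDataStore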

iscsi target

https://forum.level1techs.com/t/has-anyone-here-tried-to-create-an-iscsi-target-in-proxmox/193862

https://deepdoc-at.translate.goog/dokuwiki/doku.php?id=virtualisierung:proxmox_kvm_und_lxc:proxmox_debian_als_zfs-over-iscsi_server_verwenden&_x_tr_sl=en&_x_tr_tl=es&_x_tr_hl=en&_x_tr_pto=wapp

https://www.reddit.com/r/homelab/comments/ih374t/poor_linux_iscsi_target_performance_tips/

mtu!

sudo ip link set eth1 mtu 9000
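That only lasts until reboot; to make it persistent on Proxmox/Debian, add an mtu line to the interface stanza in /etc/network/interfaces (sketch only, adjust to however eth1 is actually configured there):

iface eth1 inet manual
        mtu 9000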


"Backstore name is too long for "INQUIRY_MODEL" iscsi"


Initiator = client, target = server.

Provision FatTony as the iSCSI Target

  • create the iSCSI dataset on the datastore
zfs create localDataStore/iscsi
  • Install the server on FatTony
apt install targetcli-fb 
  • start the service
systemctl enable --now targetclid.service
  • fix the missing /dev/<zpool-name>.

Per the latest Debian, zvols are only located under /dev/zvol/<zpool-name>. This trips a bug in Proxmox, as the /dev/<zpool-name> path is hard-coded. A forum post on it is here: https://forum.proxmox.com/threads/missing-path-to-zfs-pool-in-dev-dir-after-recent-updates.139371/


echo 'KERNEL=="zd*", SUBSYSTEM=="block", ACTION=="add|change", SYMLINK+="%c"' >/etc/udev/rules.d/99-zvol.rules
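Reload udev and retrigger the block devices so the new symlinks appear without a reboot:

udevadm control --reload-rules
udevadm trigger --subsystem-match=block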

Prepare the Proxmox Client Nodes

Now, on the clients, we need to configure access to the iSCSI target so PVE can SSH to it. It does this to run ZFS commands and create the block devices that are exported.

The IP of the target in the key name is important; this is what PVE uses by default. IPv6 is equally possible.

mkdir /etc/pve/priv/zfs
ssh-keygen -f /etc/pve/priv/zfs/192.168.8.187_id_rsa
ssh-copy-id -i /etc/pve/priv/zfs/192.168.8.187_id_rsa.pub root@192.168.8.187
  • verify this key works
ssh -i /etc/pve/priv/zfs/192.168.8.187_id_rsa FatTony
  • Read out the iSCSI initiator name from each node
for i in carbonrod moleman fink spiderpig fattony ; do ssh $i 'cat /etc/iscsi/initiatorname.iscsi | grep ^InitiatorName' ; done
InitiatorName=iqn.1993-08.org.debian:01:5f09136632
InitiatorName=iqn.1993-08.org.debian:01:28601263f0ff
InitiatorName=iqn.1993-08.org.debian:01:e45a7fa9d6ec
InitiatorName=iqn.1993-08.org.debian:01:cf88d296de8d
InitiatorName=iqn.1993-08.org.debian:01:51de7eb4a092

Configure the target on FatTony

# targetcli
/iscsi> create
Created target iqn.2003-01.org.linux-iscsi.fattony.x8664:sn.5381d8313ebf.
Created TPG 1.
Global pref auto_add_default_portal=true
Created default portal listening on all IPs (0.0.0.0), port 3260.
  • restrict it to the 192.168.8.187 interface
/iscsi> cd iqn.2003-01.org.linux-iscsi.fattony.x8664:sn.5381d8313ebf/tpg1/portals
/iscsi/iqn.20.../tpg1/portals> delete 0.0.0.0 3260
Deleted network portal 0.0.0.0:3260
/iscsi/iqn.20.../tpg1/portals> create 192.168.8.187 3260
Using default IP port 3260
Created network portal 192.168.8.187:3260.


  • add an ACL for each node's initiator
/iscsi/iqn.20...ebf/tpg1/acls> create iqn.1993-08.org.debian:01:5f09136632
Created Node ACL for iqn.1993-08.org.debian:01:5f09136632
/iscsi/iqn.20...ebf/tpg1/acls> create iqn.1993-08.org.debian:01:28601263f0ff
Created Node ACL for iqn.1993-08.org.debian:01:28601263f0ff
/iscsi/iqn.20...ebf/tpg1/acls> create iqn.1993-08.org.debian:01:e45a7fa9d6ec
Created Node ACL for iqn.1993-08.org.debian:01:e45a7fa9d6ec
/iscsi/iqn.20...ebf/tpg1/acls> create iqn.1993-08.org.debian:01:cf88d296de8d
Created Node ACL for iqn.1993-08.org.debian:01:cf88d296de8d
/iscsi/iqn.20...ebf/tpg1/acls> create iqn.1993-08.org.debian:01:51de7eb4a092
Created Node ACL for iqn.1993-08.org.debian:01:51de7eb4a092
/iscsi/iqn.20...ebf/tpg1/acls> ls
o- acls .................................................................................................................. [ACLs: 5]
  o- iqn.1993-08.org.debian:01:28601263f0ff ....................................................................... [Mapped LUNs: 0]
  o- iqn.1993-08.org.debian:01:51de7eb4a092 ....................................................................... [Mapped LUNs: 0]
  o- iqn.1993-08.org.debian:01:5f09136632 ......................................................................... [Mapped LUNs: 0]
  o- iqn.1993-08.org.debian:01:cf88d296de8d ....................................................................... [Mapped LUNs: 0]
  o- iqn.1993-08.org.debian:01:e45a7fa9d6ec ....................................................................... [Mapped LUNs: 0]
/iscsi/iqn.20...ebf/tpg1/acls> cd /iscsi/
/iscsi> ls
o- iscsi .............................................................................................................. [Targets: 1]
  o- iqn.2003-01.org.linux-iscsi.fattony.x8664:sn.5381d8313ebf ........................................................... [TPGs: 1]
    o- tpg1 ................................................................................................. [no-gen-acls, no-auth]
      o- acls ............................................................................................................ [ACLs: 5]
      | o- iqn.1993-08.org.debian:01:28601263f0ff ................................................................. [Mapped LUNs: 0]
      | o- iqn.1993-08.org.debian:01:51de7eb4a092 ................................................................. [Mapped LUNs: 0]
      | o- iqn.1993-08.org.debian:01:5f09136632 ................................................................... [Mapped LUNs: 0]
      | o- iqn.1993-08.org.debian:01:cf88d296de8d ................................................................. [Mapped LUNs: 0]
      | o- iqn.1993-08.org.debian:01:e45a7fa9d6ec ................................................................. [Mapped LUNs: 0]
      o- luns ............................................................................................................ [LUNs: 0]
      o- portals ...................................................................................................... [Portals: 1]
        o- 192.168.8.187:3260 ................................................................................................. [OK]
  • set auth to disabled (it's a private network)
/> /iscsi/iqn.2003-01.org.linux-iscsi.fattony.x8664:sn.5381d8313ebf/tpg1
/iscsi/iqn.20...d8313ebf/tpg1> set attribute authentication=0
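targetcli normally auto-saves on exit, but the running config can also be written out explicitly so it survives a reboot:

/> saveconfig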

  • configure the storage to use it.
vim /etc/pve/storage.cfg

zfs: iscsi-zfs
        blocksize 32k
        iscsiprovider LIO
        pool localDataStore/iscsi
        portal 192.168.8.187
        target iqn.2003-01.org.linux-iscsi.fattony.x8664:sn.5381d8313ebf
        content images
        lio_tpg tpg1
        nowritecache 1
        sparse 1

Verify that you can see iscsi-zfs from all nodes
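One way to do this from a single shell is to reuse the node list from earlier and query each node's storage status with pvesm (assumes the same node names):

for i in carbonrod moleman fink spiderpig fattony ; do ssh $i 'pvesm status | grep iscsi-zfs' ; done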

If you can't, or you get an error, it's probably an SSH host key issue. From the affected node, run:

ssh -i /etc/pve/priv/zfs/192.168.8.187_id_rsa root@192.168.8.187  

and accept the host key (yes).

Software

  • Proxmox 8 server
  • Proxmox Backup Server
  • SNMP