Storage Server
My documentation on building a storage server and Proxmox host.
Parts
- X10DSC+ Motherboard
- AOM-S3008M-L8 HBA (replaces the AOM-S3108M-H8L RAID/HBA, which doesn't work 100%; see SAS controller below)
- AOC-MTG-I4S quad SFP+ NIC
- 24x 32 GB RAM (768 GB total)
- 24x Seagate ST16000NM002G Exos X16 16TB 12Gb/s SAS
- 375 GB Optane SSD (SSDPEL1K375GA01) for L2ARC/ZIL
- 3x 4 TB M.2 SSDs for the special device
- 2x Intel Xeon CPU E5-2690v4 2.60GHz
Disk Layout
localDataStore
- ZFS raidz2
- Zpool:
  - 4x 6-disk raidz2
  - 1x 5-disk raidz2
- Special: 3x M.2 flash, 4 TB but provisioned at 768 GB to start
- ZIL: 16 GiB, on the Optane
- L2ARC: 333 GiB, on the Optane

Note: the Optane is shared between the ZIL and L2ARC.
- special device layout
```
# sgdisk -p /dev/disk/by-enclosure-slot/nvme-red
Disk /dev/disk/by-enclosure-slot/nvme-red: 7814037168 sectors, 3.6 TiB
Sector size (logical/physical): 512/512 bytes
Disk identifier (GUID): 8C156829-68A0-4EE4-AE89-BDFA828CC5A1
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 7814037134
Partitions will be aligned on 2048-sector boundaries
Total free space is 6203424365 sectors (2.9 TiB)

Number  Start (sector)    End (sector)  Size        Code  Name
   1            2048      1610614783   768.0 GiB   BF01  Solaris /usr & Mac ZFS
```
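For reference, a non-interactive sgdisk sketch that would produce the 768 GiB partition shown above (assuming a blank disk; BF01 is the Solaris/ZFS type code from the table):

```
# create partition 1, 768 GiB, type BF01, on the red special-device disk
sgdisk -n 1:0:+768G -t 1:BF01 /dev/disk/by-enclosure-slot/nvme-red
```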
- Copy the red to the others
```
sgdisk /dev/disk/by-enclosure-slot/nvme-red -R /dev/disk/by-enclosure-slot/nvme-blue
sgdisk /dev/disk/by-enclosure-slot/nvme-red -R /dev/disk/by-enclosure-slot/nvme-green
```
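After a -R copy, blue and green carry the red disk's GUIDs; sgdisk's -G flag randomizes them, which is probably worth doing before building the pool (not in my original notes):

```
# give each copy its own disk/partition GUIDs
sgdisk -G /dev/disk/by-enclosure-slot/nvme-blue
sgdisk -G /dev/disk/by-enclosure-slot/nvme-green
```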
- optane setup (orange)
```
# sgdisk -p /dev/disk/by-enclosure-slot/nvme-orange
Disk /dev/disk/by-enclosure-slot/nvme-orange: 91573146 sectors, 349.3 GiB
Sector size (logical/physical): 4096/4096 bytes
Disk identifier (GUID): 70C34BA9-1175-42A5-B00E-16CFC022861F
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 5
First usable sector is 6, last usable sector is 91573140
Partitions will be aligned on 256-sector boundaries
Total free space is 1395599 sectors (5.3 GiB)

Number  Start (sector)    End (sector)  Size        Code  Name
   1             256         8388863   32.0 GiB    BF01  Solaris /usr & Mac ZFS
   2         8388864        90177791   312.0 GiB   BF01  Solaris /usr & Mac ZFS
```
- make the array
```
zpool create -f -o ashift=12 -O compression=lz4 -O atime=off -O xattr=sa \
  -O recordsize=128k localDataStore \
  raidz2 \
    /dev/disk/by-enclosure-slot/rear-slot000 \
    /dev/disk/by-enclosure-slot/rear-slot001 \
    /dev/disk/by-enclosure-slot/rear-slot002 \
    /dev/disk/by-enclosure-slot/rear-slot003 \
    /dev/disk/by-enclosure-slot/rear-slot004 \
    /dev/disk/by-enclosure-slot/rear-slot005 \
  raidz2 \
    /dev/disk/by-enclosure-slot/rear-slot006 \
    /dev/disk/by-enclosure-slot/rear-slot007 \
    /dev/disk/by-enclosure-slot/rear-slot008 \
    /dev/disk/by-enclosure-slot/rear-slot009 \
    /dev/disk/by-enclosure-slot/rear-slot010 \
    /dev/disk/by-enclosure-slot/rear-slot011 \
  raidz2 \
    /dev/disk/by-enclosure-slot/front-slot000 \
    /dev/disk/by-enclosure-slot/front-slot001 \
    /dev/disk/by-enclosure-slot/front-slot002 \
    /dev/disk/by-enclosure-slot/front-slot003 \
    /dev/disk/by-enclosure-slot/front-slot004 \
    /dev/disk/by-enclosure-slot/front-slot005 \
  raidz2 \
    /dev/disk/by-enclosure-slot/front-slot006 \
    /dev/disk/by-enclosure-slot/front-slot007 \
    /dev/disk/by-enclosure-slot/front-slot008 \
    /dev/disk/by-enclosure-slot/front-slot009 \
    /dev/disk/by-enclosure-slot/front-slot010 \
    /dev/disk/by-enclosure-slot/front-slot011 \
  special mirror \
    /dev/disk/by-enclosure-slot/nvme-red-part1 \
    /dev/disk/by-enclosure-slot/nvme-green-part1 \
    /dev/disk/by-enclosure-slot/nvme-blue-part1 \
  cache /dev/disk/by-enclosure-slot/nvme-orange-part2 \
  log /dev/disk/by-enclosure-slot/nvme-orange-part1
```
- check that it's online
```
zpool status localDataStore
  pool: localDataStore
 state: ONLINE
config:

        NAME                                         STATE     READ WRITE CKSUM
        localDataStore                               ONLINE       0     0     0
          raidz2-0                                   ONLINE       0     0     0
            disk/by-enclosure-slot/rear-slot000      ONLINE       0     0     0
            disk/by-enclosure-slot/rear-slot001      ONLINE       0     0     0
            disk/by-enclosure-slot/rear-slot002      ONLINE       0     0     0
            disk/by-enclosure-slot/rear-slot003      ONLINE       0     0     0
            disk/by-enclosure-slot/rear-slot004      ONLINE       0     0     0
            disk/by-enclosure-slot/rear-slot005      ONLINE       0     0     0
          raidz2-1                                   ONLINE       0     0     0
            disk/by-enclosure-slot/rear-slot006      ONLINE       0     0     0
            disk/by-enclosure-slot/rear-slot007      ONLINE       0     0     0
            disk/by-enclosure-slot/rear-slot008      ONLINE       0     0     0
            disk/by-enclosure-slot/rear-slot009      ONLINE       0     0     0
            disk/by-enclosure-slot/rear-slot010      ONLINE       0     0     0
            disk/by-enclosure-slot/rear-slot011      ONLINE       0     0     0
          raidz2-2                                   ONLINE       0     0     0
            disk/by-enclosure-slot/front-slot000     ONLINE       0     0     0
            disk/by-enclosure-slot/front-slot001     ONLINE       0     0     0
            disk/by-enclosure-slot/front-slot002     ONLINE       0     0     0
            disk/by-enclosure-slot/front-slot003     ONLINE       0     0     0
            disk/by-enclosure-slot/front-slot004     ONLINE       0     0     0
            disk/by-enclosure-slot/front-slot005     ONLINE       0     0     0
          raidz2-3                                   ONLINE       0     0     0
            disk/by-enclosure-slot/front-slot006     ONLINE       0     0     0
            disk/by-enclosure-slot/front-slot007     ONLINE       0     0     0
            disk/by-enclosure-slot/front-slot008     ONLINE       0     0     0
            disk/by-enclosure-slot/front-slot009     ONLINE       0     0     0
            disk/by-enclosure-slot/front-slot010     ONLINE       0     0     0
            disk/by-enclosure-slot/front-slot011     ONLINE       0     0     0
        special
          mirror-4                                   ONLINE       0     0     0
            disk/by-enclosure-slot/nvme-red-part1    ONLINE       0     0     0
            disk/by-enclosure-slot/nvme-green-part1  ONLINE       0     0     0
            disk/by-enclosure-slot/nvme-blue-part1   ONLINE       0     0     0
        logs
          disk/by-enclosure-slot/nvme-orange-part1   ONLINE       0     0     0
        cache
          disk/by-enclosure-slot/nvme-orange-part2   ONLINE       0     0     0
```
Boot Disks
rpool

The SATA disks will be Linux boot disks in a standard Linux RAID (maybe a ZFS mirror?).

- underprovision the SATA boot disks
hdparm -Np976762584 --yes-i-know-what-i-am-doing /dev/disk/by-enclosure-slot/sata-bottom
Note that you must power cycle the box (a hard power-off, not a reset) to get the disks to accept this.
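To check that the new limit stuck after the power cycle, hdparm can read the HPA setting back:

```
# shows "max sectors = <current>/<native>" and whether the HPA is enabled
hdparm -N /dev/disk/by-enclosure-slot/sata-bottom
```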
SAS controller
The built-in controller supports HBA mode, per Supermicro:
https://docs.broadcom.com/doc/pub-005110
I had to pick up an AOM-S3008M-L8 HBA, which fits in place of the included RAID controller, because even in JBOD mode the RAID controller will not talk to the expanders, so ledon can't be used to identify failed disks.
https://www.reddit.com/r/homelab/comments/iqz7xc/supermicro_s3108_in_jbod_mode/
It doesn't work in JBOD mode either; everything is proxied through the controller.
Updating Disks
Firmware
The current firmware for these disks is E004, but the disks shipped with E002 on them.
```
for i in `seq 2 25` ; do SeaChest_Firmware --downloadFW ./EvansExosX16SAS-STD-512E-E004.LOD -d /dev/sg$i; done

# SeaChest_Firmware -s
==========================================================================================
 SeaChest_Firmware - Seagate drive utilities - NVMe Enabled
 Copyright (c) 2014-2021 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
 SeaChest_Firmware Version: 3.0.0-2_2_1 X86_64
 Build Date: Apr 27 2021
 Today: Tue Oct 10 22:14:38 2023        User: root
==========================================================================================
nvme_ioctl_id: Inappropriate ioctl for device
Vendor   Handle        Model Number              Serial Number      FwRev
LSI      /dev/sg0      SAS3x28                                      0705
LSI      /dev/sg1      SAS3x28                                      0705
SEAGATE  /dev/sg10     ST16000NM002G             ZL20AJ3P           E004
SEAGATE  /dev/sg11     ST16000NM002G             ZL231860           E004
SEAGATE  /dev/sg12     ST16000NM002G             ZL21T6Q1           E004
SEAGATE  /dev/sg13     ST16000NM002G             ZL231PV0           E004
SEAGATE  /dev/sg14     ST16000NM002G             ZL21TDHH           E004
SEAGATE  /dev/sg15     ST16000NM002G             ZL21Y85R           E004
SEAGATE  /dev/sg16     ST16000NM002G             ZL21S9FQ           E004
SEAGATE  /dev/sg17     ST16000NM002G             ZL21T38M           E004
SEAGATE  /dev/sg18     ST16000NM002G             ZL22C3MJ           E004
SEAGATE  /dev/sg19     ST16000NM002G             ZL21S9J3           E004
SEAGATE  /dev/sg2      ST16000NM002G             ZL21S9WK           E004
SEAGATE  /dev/sg20     ST16000NM002G             ZL21S9AW           E004
SEAGATE  /dev/sg21     ST16000NM002G             ZL21TGY1           E004
SEAGATE  /dev/sg22     ST16000NM002G             ZL20CRL7           E004
SEAGATE  /dev/sg23     ST16000NM002G             ZL21RP9E           E004
SEAGATE  /dev/sg24     ST16000NM002G             ZL21RNZW           E004
SEAGATE  /dev/sg25     ST16000NM002G             ZL21JYXF           E004
ATA      /dev/sg26     Samsung SSD 870 EVO 1TB   S75BNL0W812633P    SVT03B6Q
ATA      /dev/sg27     Samsung SSD 870 EVO 1TB   S75BNS0W642820L    SVT03B6Q
SEAGATE  /dev/sg3      ST16000NM002G             ZL21V48W           E004
SEAGATE  /dev/sg4      ST16000NM002G             ZL21T7XK           E004
SEAGATE  /dev/sg5      ST16000NM002G             ZL21T8HS           E004
SEAGATE  /dev/sg6      ST16000NM002G             ZL21SBMS           E004
SEAGATE  /dev/sg7      ST16000NM002G             ZL21SRYP           E004
SEAGATE  /dev/sg8      ST16000NM002G             ZL21LVPQ           E004
SEAGATE  /dev/sg9      ST16000NM002G             ZL21TB40           E004
NVMe     /dev/nvme0n1  Samsung SSD 990 PRO 4TB   S7KGNJ0W912464T    0B2QJXG7
```
Low Level Format
It's necessary to low-level format these disks to turn off Protection Type 2, test the drive's sectors, and change them to a 4096-byte sector size.
# SeaChest_Format --protectionType 0 --formatUnit 4096 --confirm this-will-erase-data --poll -d /dev/sg2
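The same format needs to hit all 24 SAS disks. A sketch, assuming the same /dev/sg2-/dev/sg25 range as the firmware loop above; this erases the drives, and --poll makes it run one drive at a time:

```
# WARNING: destroys data. sg2-sg25 assumed to be the 24 Exos drives, as in the firmware loop.
for i in `seq 2 25` ; do
  SeaChest_Format --protectionType 0 --formatUnit 4096 --confirm this-will-erase-data --poll -d /dev/sg$i
done
```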
NVMe Config

This was the most difficult part. The onboard NVMe ports under the power supply will take a standard NVMe cable, but do not supply any power. The M.2 to SFF-8612 adapters don't have a power connector, so I added 3.3 V feeds at the filter cap. This was kinda hacky, but it works.
Optane
The Optane SSD needs to be changed to 4k sectors, but formatting it with the nvme command doesn't work. Intel has an intelmas program that does: https://community.intel.com/t5/Intel-Optane-Solid-State-Drives/4K-format-on-Optane-SSD-P1600X/m-p/1477181
```
root@pve01:~# nvme id-ns /dev/nvme2n1 -H -n 1
<snip>
LBA Format  0 : Metadata Size: 0   bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good (in use)
LBA Format  1 : Metadata Size: 8   bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good
LBA Format  2 : Metadata Size: 16  bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good
LBA Format  3 : Metadata Size: 0   bytes - Data Size: 4096 bytes - Relative Performance: 0 Best
LBA Format  4 : Metadata Size: 8   bytes - Data Size: 4096 bytes - Relative Performance: 0 Best
LBA Format  5 : Metadata Size: 64  bytes - Data Size: 4096 bytes - Relative Performance: 0 Best
LBA Format  6 : Metadata Size: 128 bytes - Data Size: 4096 bytes - Relative Performance: 0 Best
```
```
root@pve01:~# intelmas start -intelssd 2 -nvmeformat LBAFormat=3
WARNING! You have selected to format the drive!
Proceed with the format? (Y|N): y
Formatting...(This can take several minutes to complete)

- Intel Optane(TM) SSD DC P4801X Series PHKM2051009H375A -

Status : NVMeFormat successful.

root@pve01:~# nvme list
Node          Generic     SN                Model                     Namespace  Usage                  Format        FW Rev
------------- ----------- ----------------- ------------------------- ---------- ---------------------- ------------- --------
/dev/nvme3n1  /dev/ng3n1  S7KGNJ0W912446D   Samsung SSD 990 PRO 4TB   1          449.25 GB /   4.00 TB  512 B + 0 B   0B2QJXG7
/dev/nvme2n1  /dev/ng2n1  PHKM2051009H375A  INTEL SSDPEL1K375GA       1          375.08 GB / 375.08 GB  4 KiB + 0 B   E2010600
/dev/nvme1n1  /dev/ng1n1  S7KGNJ0W912452X   Samsung SSD 990 PRO 4TB   1          449.27 GB /   4.00 TB  512 B + 0 B   0B2QJXG7
/dev/nvme0n1  /dev/ng0n1  S7KGNJ0W912464T   Samsung SSD 990 PRO 4TB   1          449.27 GB /   4.00 TB  512 B + 0 B   0B2QJXG7
```
Partition the disk
```
Disk /dev/nvme2n1: 91573146 sectors, 349.3 GiB
Model: INTEL SSDPEL1K375GA
Sector size (logical/physical): 4096/4096 bytes
Disk identifier (GUID): 70C34BA9-1175-42A5-B00E-16CFC022861F
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 5
First usable sector is 6, last usable sector is 91573140
Partitions will be aligned on 256-sector boundaries
Total free space is 1395599 sectors (5.3 GiB)

Number  Start (sector)    End (sector)  Size        Code  Name
   1             256         8388863   32.0 GiB    BF01  Solaris /usr & Mac ZFS
   2         8388864        90177791   312.0 GiB   BF01  Solaris /usr & Mac ZFS

Command (? for help): w

Final checks complete. About to write GPT data. THIS WILL OVERWRITE EXISTING PARTITIONS!!

Do you want to proceed? (Y/N): y
OK; writing new GUID partition table (GPT) to /dev/nvme2n1.
The operation has completed successfully.
```
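A non-interactive equivalent of the gdisk session above, assuming the same 32 GiB SLOG / 312 GiB L2ARC split on the Optane:

```
# sizes match the partition table shown above
sgdisk -n 1:0:+32G  -t 1:BF01 /dev/nvme2n1   # ZIL / SLOG
sgdisk -n 2:0:+312G -t 2:BF01 /dev/nvme2n1   # L2ARC
```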
Physical layout
Mandatory config
The disks don't deal well with the old NVMe support on this box (it's listed as OCuLink 0.91, not 1.0), and there are some bugs in Linux around this and power handling.

Config options must be added to the kernel cmdline:
vim /etc/kernel/cmdline
add this to it:
nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
then
proxmox-boot-tool refresh
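After the next reboot you can confirm the options actually made it onto the kernel command line:

```
# should include nvme_core.default_ps_max_latency_us=0 and pcie_aspm=off
cat /proc/cmdline
```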
udev rules
https://www.reactivated.net/writing_udev_rules.html#strsubst
https://github.com/bkus/by-enclosure-slot
Updated code for the supermicro:
https://github.com/W9CR/by-enclosure-slot/tree/Supermicro-server
To get this working on boot with Proxmox you need to do the following:

In /etc/default/zfs set:

```
ZPOOL_IMPORT_PATH="/dev/disk/by-enclosure-slot"
```
```
rm /etc/zfs/zpool.cache
update-initramfs -u
```
Then power cycle it and confirm the new names are in use. Once that's done, run update-initramfs -u again to store the ZFS cache in the initramfs (optional, but it boots faster).
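A quick way to confirm the pool imported with the enclosure-slot names (just a sanity check, not from the original notes):

```
# every data/special/log/cache device should list as disk/by-enclosure-slot/...
zpool status localDataStore | grep enclosure-slot
```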
iSCSI target
https://forum.level1techs.com/t/has-anyone-here-tried-to-create-an-iscsi-target-in-proxmox/193862
https://www.reddit.com/r/homelab/comments/ih374t/poor_linux_iscsi_target_performance_tips/
mtu!
sudo ip link set eth1 mtu 9000
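The ip link change is lost on reboot. A minimal sketch of making it persistent in /etc/network/interfaces, assuming eth1 is the storage NIC (adjust for your actual bridge/bond layout):

```
# /etc/network/interfaces fragment (illustrative; Proxmox uses ifupdown2)
iface eth1 inet manual
        mtu 9000
```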
"Backstore name is too long for "INQUIRY_MODEL" iscsi"
Initiator = client
Target = server
Provision FatTony as the iSCSI Target
- create the iSCSI dataset on the datastore
zfs create localDataStore/iscsi
- Install the server on FatTony
apt install targetcli-fb
- start the service
systemctl enable --now targetclid.service
- fix the missing /dev/<zpool-name>.
Per the latest Debian, zvols are only created under /dev/zvol/<zpool-name>.
This breaks Proxmox, as the /dev/<zpool-name> path is hard-coded.
A forum post on it is here https://forum.proxmox.com/threads/missing-path-to-zfs-pool-in-dev-dir-after-recent-updates.139371/
KERNEL=="zd*", SUBSYSTEM=="block", ACTION=="add|change", SYMLINK+="%c" >/etc/udev/rules.d/99-zvol.rules
Prepare the Proxmox client nodes

Now, on the clients, we need to configure access to the iSCSI target so PVE can SSH to it. It does this to run ZFS commands and create the block devices that are exported.
The target's IP in the key name is important: PVE uses it as the default. IPv6 is equally possible.
```
mkdir /etc/pve/priv/zfs
ssh-keygen -f /etc/pve/priv/zfs/192.168.8.187_id_rsa
ssh-copy-id -i /etc/pve/priv/zfs/192.168.8.187_id_rsa.pub root@192.168.8.187
```
- verify this key works
ssh -i /etc/pve/priv/zfs/192.168.8.187_id_rsa FatTony
- read out the iSCSI initiator name from each node
```
for i in carbonrod moleman fink spiderpig fattony ; do ssh $i 'cat /etc/iscsi/initiatorname.iscsi | grep ^InitiatorName' ; done
InitiatorName=iqn.1993-08.org.debian:01:5f09136632
InitiatorName=iqn.1993-08.org.debian:01:28601263f0ff
InitiatorName=iqn.1993-08.org.debian:01:e45a7fa9d6ec
InitiatorName=iqn.1993-08.org.debian:01:cf88d296de8d
InitiatorName=iqn.1993-08.org.debian:01:51de7eb4a092
```
- configure the target on FatTony
```
# targetcli
/iscsi> create
Created target iqn.2003-01.org.linux-iscsi.fattony.x8664:sn.5381d8313ebf.
Created TPG 1.
Global pref auto_add_default_portal=true
Created default portal listening on all IPs (0.0.0.0), port 3260.
```
- restrict it to the 192.168.8.187 interface
```
/iscsi> cd iqn.2003-01.org.linux-iscsi.fattony.x8664:sn.5381d8313ebf/tpg1/portals
/iscsi/iqn.20.../tpg1/portals> delete 0.0.0.0 3260
Deleted network portal 0.0.0.0:3260
/iscsi/iqn.20.../tpg1/portals> create 192.168.8.187 3260
Using default IP port 3260
Created network portal 192.168.8.187:3260.
```
- add the initiators from each node to the ACL
```
/iscsi/iqn.20...ebf/tpg1/acls> create iqn.1993-08.org.debian:01:5f09136632
Created Node ACL for iqn.1993-08.org.debian:01:5f09136632
/iscsi/iqn.20...ebf/tpg1/acls> create iqn.1993-08.org.debian:01:28601263f0ff
Created Node ACL for iqn.1993-08.org.debian:01:28601263f0ff
/iscsi/iqn.20...ebf/tpg1/acls> create iqn.1993-08.org.debian:01:e45a7fa9d6ec
Created Node ACL for iqn.1993-08.org.debian:01:e45a7fa9d6ec
/iscsi/iqn.20...ebf/tpg1/acls> create iqn.1993-08.org.debian:01:cf88d296de8d
Created Node ACL for iqn.1993-08.org.debian:01:cf88d296de8d
/iscsi/iqn.20...ebf/tpg1/acls> create iqn.1993-08.org.debian:01:51de7eb4a092
Created Node ACL for iqn.1993-08.org.debian:01:51de7eb4a092
/iscsi/iqn.20...ebf/tpg1/acls> ls
o- acls .......................................................................... [ACLs: 5]
  o- iqn.1993-08.org.debian:01:28601263f0ff ............................... [Mapped LUNs: 0]
  o- iqn.1993-08.org.debian:01:51de7eb4a092 ............................... [Mapped LUNs: 0]
  o- iqn.1993-08.org.debian:01:5f09136632 ................................. [Mapped LUNs: 0]
  o- iqn.1993-08.org.debian:01:cf88d296de8d ............................... [Mapped LUNs: 0]
  o- iqn.1993-08.org.debian:01:e45a7fa9d6ec ............................... [Mapped LUNs: 0]
/iscsi/iqn.20...ebf/tpg1/acls> cd /iscsi/
/iscsi> ls
o- iscsi ...................................................................... [Targets: 1]
  o- iqn.2003-01.org.linux-iscsi.fattony.x8664:sn.5381d8313ebf ................... [TPGs: 1]
    o- tpg1 ......................................................... [no-gen-acls, no-auth]
      o- acls .................................................................... [ACLs: 5]
      | o- iqn.1993-08.org.debian:01:28601263f0ff ......................... [Mapped LUNs: 0]
      | o- iqn.1993-08.org.debian:01:51de7eb4a092 ......................... [Mapped LUNs: 0]
      | o- iqn.1993-08.org.debian:01:5f09136632 ........................... [Mapped LUNs: 0]
      | o- iqn.1993-08.org.debian:01:cf88d296de8d ......................... [Mapped LUNs: 0]
      | o- iqn.1993-08.org.debian:01:e45a7fa9d6ec ......................... [Mapped LUNs: 0]
      o- luns .................................................................... [LUNs: 0]
      o- portals .............................................................. [Portals: 1]
        o- 192.168.8.187:3260 ......................................................... [OK]
```
- set auth to disabled (it's a private network)
```
/> cd /iscsi/iqn.2003-01.org.linux-iscsi.fattony.x8664:sn.5381d8313ebf/tpg1
/iscsi/iqn.20...d8313ebf/tpg1> set attribute authentication=0
```
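targetcli keeps changes in the running config only; presumably you also want to save them before exiting (standard targetcli saveconfig, not shown in the session above):

```
/iscsi/iqn.20...d8313ebf/tpg1> cd /
/> saveconfig
```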
- configure the storage to use it.
vim /etc/pve/storage.cfg

```
zfs: iscsi-zfs
        blocksize 32k
        iscsiprovider LIO
        pool localDataStore/iscsi
        portal 192.168.8.187
        target iqn.2003-01.org.linux-iscsi.fattony.x8664:sn.5381d8313ebf
        content images
        lio_tpg tpg1
        nowritecache 1
        sparse 1
```
Verify that you can see iscsi-zfs from all nodes
If you can't, or you get an error, it's probably an SSH key issue. From the affected node do:
ssh -i /etc/pve/priv/zfs/192.168.8.187_id_rsa root@192.168.8.187
and accept the host key (yes).
Software
- Proxmox 8 server
- Proxmox Backup
- SNMP