Storage Server

My notes on building a storage server and Proxmox host.

Parts

  • Chassis Manual
  • AOM-S3008M-L8 HBA, as the RAID one doesn't work 100% (https://www.supermicro.com/en/products/accessories/addon/AOM-S3008M-L8.php)
  • AOC-MTG-I4S quad SFP+ NIC (https://www.supermicro.com/en/products/accessories/addon/AOC-MTG-i4S.php)
  • 24x 32 GB RAM (768 GB)
  • 24x Seagate ST16000NM002G Exos X16 16TB 12Gb/s SAS (https://www.seagate.com/files/www-content/datasheets/pdfs/exos-x16-DS2011-1-1904US-en_US.pdf)
  • 375 GB Optane SSD for L2ARC/ZIL, SSDPEL1K375GA01 (https://ark.intel.com/content/www/us/en/ark/products/149366/intel-optane-ssd-dc-p4801x-series-375gb-m-2-110mm-pcie-x4-3d-xpoint.html)

Disk Layout

localDataStore

  • ZFS raidz2
Zpool - 24x 16 TB SAS disks
4x 6-disk raidz2 vdevs
Special - 3x M.2 flash - 4 TB drives, provisioned at 768 GB to start
ZIL - 16 GiB - Optane
L2ARC - 333 GiB - Optane
Note: the Optane is shared between the ZIL and L2ARC
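For reference, four 6-disk raidz2 vdevs leave 4 x (6 - 2) = 16 data disks, so raw data capacity works out to 16 x 16 TB = 256 TB (about 233 TiB) before ZFS overhead. That's my arithmetic, not a figure from the original notes.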
  • special device layout
# sgdisk -p /dev/disk/by-enclosure-slot/nvme-red
Disk /dev/disk/by-enclosure-slot/nvme-red: 7814037168 sectors, 3.6 TiB
Sector size (logical/physical): 512/512 bytes
Disk identifier (GUID): 8C156829-68A0-4EE4-AE89-BDFA828CC5A1
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 7814037134
Partitions will be aligned on 2048-sector boundaries
Total free space is 6203424365 sectors (2.9 TiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048      1610614783   768.0 GiB   BF01  Solaris /usr & Mac ZFS
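The 768 GiB partition above could have been made with something like this (a reconstruction; the exact sgdisk invocation isn't in the notes):

sgdisk -n1:0:+768G -t1:BF01 /dev/disk/by-enclosure-slot/nvme-red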
  • Copy the red to the others
sgdisk /dev/disk/by-enclosure-slot/nvme-red -R /dev/disk/by-enclosure-slot/nvme-blue
sgdisk /dev/disk/by-enclosure-slot/nvme-red -R /dev/disk/by-enclosure-slot/nvme-green
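Since -R clones the partition table byte-for-byte, GUIDs included, it's worth randomizing them on the copies afterwards (my suggestion, not a step from the original notes):

sgdisk -G /dev/disk/by-enclosure-slot/nvme-blue
sgdisk -G /dev/disk/by-enclosure-slot/nvme-green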
  • optane setup (orange)
# sgdisk -p /dev/disk/by-enclosure-slot/nvme-orange
Disk /dev/disk/by-enclosure-slot/nvme-orange: 91573146 sectors, 349.3 GiB
Sector size (logical/physical): 4096/4096 bytes
Disk identifier (GUID): 70C34BA9-1175-42A5-B00E-16CFC022861F
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 5
First usable sector is 6, last usable sector is 91573140
Partitions will be aligned on 256-sector boundaries
Total free space is 1395599 sectors (5.3 GiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1             256         8388863   32.0 GiB    BF01  Solaris /usr & Mac ZFS
   2         8388864        90177791   312.0 GiB   BF01  Solaris /usr & Mac ZFS
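The two Optane partitions shown above could be created along these lines (a reconstruction, since the notes only show the finished table):

sgdisk -n1:0:+32G -t1:BF01 /dev/disk/by-enclosure-slot/nvme-orange
sgdisk -n2:0:+312G -t2:BF01 /dev/disk/by-enclosure-slot/nvme-orange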
  • make the array
zpool create -f -o ashift=12 -O compression=lz4 -O atime=off -O xattr=sa -O \
recordsize=128k localDataStore \
raidz2 \
/dev/disk/by-enclosure-slot/rear-slot000 \
/dev/disk/by-enclosure-slot/rear-slot001 \
/dev/disk/by-enclosure-slot/rear-slot002 \
/dev/disk/by-enclosure-slot/rear-slot003 \
/dev/disk/by-enclosure-slot/rear-slot004 \
/dev/disk/by-enclosure-slot/rear-slot005 \
raidz2 \
/dev/disk/by-enclosure-slot/rear-slot006 \
/dev/disk/by-enclosure-slot/rear-slot007 \
/dev/disk/by-enclosure-slot/rear-slot008 \
/dev/disk/by-enclosure-slot/rear-slot009 \
/dev/disk/by-enclosure-slot/rear-slot010 \
/dev/disk/by-enclosure-slot/rear-slot011 \
raidz2 \
/dev/disk/by-enclosure-slot/front-slot000 \
/dev/disk/by-enclosure-slot/front-slot001 \
/dev/disk/by-enclosure-slot/front-slot002 \
/dev/disk/by-enclosure-slot/front-slot003 \
/dev/disk/by-enclosure-slot/front-slot004 \
/dev/disk/by-enclosure-slot/front-slot005 \
raidz2 \
/dev/disk/by-enclosure-slot/front-slot006 \
/dev/disk/by-enclosure-slot/front-slot007 \
/dev/disk/by-enclosure-slot/front-slot008 \
/dev/disk/by-enclosure-slot/front-slot009 \
/dev/disk/by-enclosure-slot/front-slot010 \
/dev/disk/by-enclosure-slot/front-slot011 \
special mirror \
/dev/disk/by-enclosure-slot/nvme-red-part1 \
/dev/disk/by-enclosure-slot/nvme-green-part1 \
/dev/disk/by-enclosure-slot/nvme-blue-part1 \
cache /dev/disk/by-enclosure-slot/nvme-orange-part2 \
log /dev/disk/by-enclosure-slot/nvme-orange-part1 
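Optionally, small data blocks can be steered to the special vdev along with the metadata; something like the following would do it (my addition, not part of the original build, and the 32K cutoff is only an example):

zfs set special_small_blocks=32K localDataStore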
  • check that it's online
zpool status localDataStore
  pool: localDataStore
 state: ONLINE
config:

        NAME                                         STATE     READ WRITE CKSUM
        localDataStore                               ONLINE       0     0     0
          raidz2-0                                   ONLINE       0     0     0
            disk/by-enclosure-slot/rear-slot000      ONLINE       0     0     0
            disk/by-enclosure-slot/rear-slot001      ONLINE       0     0     0
            disk/by-enclosure-slot/rear-slot002      ONLINE       0     0     0
            disk/by-enclosure-slot/rear-slot003      ONLINE       0     0     0
            disk/by-enclosure-slot/rear-slot004      ONLINE       0     0     0
            disk/by-enclosure-slot/rear-slot005      ONLINE       0     0     0
          raidz2-1                                   ONLINE       0     0     0
            disk/by-enclosure-slot/rear-slot006      ONLINE       0     0     0
            disk/by-enclosure-slot/rear-slot007      ONLINE       0     0     0
            disk/by-enclosure-slot/rear-slot008      ONLINE       0     0     0
            disk/by-enclosure-slot/rear-slot009      ONLINE       0     0     0
            disk/by-enclosure-slot/rear-slot010      ONLINE       0     0     0
            disk/by-enclosure-slot/rear-slot011      ONLINE       0     0     0
          raidz2-2                                   ONLINE       0     0     0
            disk/by-enclosure-slot/front-slot000     ONLINE       0     0     0
            disk/by-enclosure-slot/front-slot001     ONLINE       0     0     0
            disk/by-enclosure-slot/front-slot002     ONLINE       0     0     0
            disk/by-enclosure-slot/front-slot003     ONLINE       0     0     0
            disk/by-enclosure-slot/front-slot004     ONLINE       0     0     0
            disk/by-enclosure-slot/front-slot005     ONLINE       0     0     0
          raidz2-3                                   ONLINE       0     0     0
            disk/by-enclosure-slot/front-slot006     ONLINE       0     0     0
            disk/by-enclosure-slot/front-slot007     ONLINE       0     0     0
            disk/by-enclosure-slot/front-slot008     ONLINE       0     0     0
            disk/by-enclosure-slot/front-slot009     ONLINE       0     0     0
            disk/by-enclosure-slot/front-slot010     ONLINE       0     0     0
            disk/by-enclosure-slot/front-slot011     ONLINE       0     0     0
        special
          mirror-4                                   ONLINE       0     0     0
            disk/by-enclosure-slot/nvme-red-part1    ONLINE       0     0     0
            disk/by-enclosure-slot/nvme-green-part1  ONLINE       0     0     0
            disk/by-enclosure-slot/nvme-blue-part1   ONLINE       0     0     0
        logs
          disk/by-enclosure-slot/nvme-orange-part1   ONLINE       0     0     0
        cache
          disk/by-enclosure-slot/nvme-orange-part2   ONLINE       0     0     0

Boot Disks

rpool: the SATA disks will be Linux boot disks in a standard Linux RAID (maybe a ZFS mirror?)

  • underprovision the sata boot disks
hdparm -Np976762584 --yes-i-know-what-i-am-doing /dev/disk/by-enclosure-slot/sata-bottom

Note that you must power cycle the box (a hard power-off, not a reset) for the disks to accept this.
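After the power cycle, the new capacity can be checked with (a verification step I'd add, not in the original notes):

hdparm -N /dev/disk/by-enclosure-slot/sata-bottom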

SAS controller

The built-in controller supports HBA mode, per Supermicro.

https://docs.broadcom.com/doc/pub-005110

I had to pick up an AOM-S3008M-L8 HBA, which fits in place of the included RAID controller, as even in JBOD mode the RAID controller will not talk to the expanders, so ledon can't identify the failed disks.


https://www.reddit.com/r/homelab/comments/iqz7xc/supermicro_s3108_in_jbod_mode/

It doesn't work in JBOD mode; everything is proxied.

Updating Disks

Firmware

The current firmware release for these disks is E004, but they shipped with E002.

for i in `seq 2 25` ; do SeaChest_Firmware --downloadFW ./EvansExosX16SAS-STD-512E-E004.LOD -d /dev/sg$i; done

#SeaChest_Firmware -s
==========================================================================================
 SeaChest_Firmware - Seagate drive utilities - NVMe Enabled
 Copyright (c) 2014-2021 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
 SeaChest_Firmware Version: 3.0.0-2_2_1 X86_64
 Build Date: Apr 27 2021
 Today: Tue Oct 10 22:14:38 2023        User: root
==========================================================================================
nvme_ioctl_id: Inappropriate ioctl for device
Vendor   Handle       Model Number            Serial Number          FwRev
LSI      /dev/sg0     SAS3x28                                        0705
LSI      /dev/sg1     SAS3x28                                        0705
SEAGATE  /dev/sg10    ST16000NM002G           ZL20AJ3P               E004
SEAGATE  /dev/sg11    ST16000NM002G           ZL231860               E004
SEAGATE  /dev/sg12    ST16000NM002G           ZL21T6Q1               E004
SEAGATE  /dev/sg13    ST16000NM002G           ZL231PV0               E004
SEAGATE  /dev/sg14    ST16000NM002G           ZL21TDHH               E004
SEAGATE  /dev/sg15    ST16000NM002G           ZL21Y85R               E004
SEAGATE  /dev/sg16    ST16000NM002G           ZL21S9FQ               E004
SEAGATE  /dev/sg17    ST16000NM002G           ZL21T38M               E004
SEAGATE  /dev/sg18    ST16000NM002G           ZL22C3MJ               E004
SEAGATE  /dev/sg19    ST16000NM002G           ZL21S9J3               E004
SEAGATE  /dev/sg2     ST16000NM002G           ZL21S9WK               E004
SEAGATE  /dev/sg20    ST16000NM002G           ZL21S9AW               E004
SEAGATE  /dev/sg21    ST16000NM002G           ZL21TGY1               E004
SEAGATE  /dev/sg22    ST16000NM002G           ZL20CRL7               E004
SEAGATE  /dev/sg23    ST16000NM002G           ZL21RP9E               E004
SEAGATE  /dev/sg24    ST16000NM002G           ZL21RNZW               E004
SEAGATE  /dev/sg25    ST16000NM002G           ZL21JYXF               E004
ATA      /dev/sg26    Samsung SSD 870 EVO 1TB S75BNL0W812633P        SVT03B6Q
ATA      /dev/sg27    Samsung SSD 870 EVO 1TB S75BNS0W642820L        SVT03B6Q
SEAGATE  /dev/sg3     ST16000NM002G           ZL21V48W               E004
SEAGATE  /dev/sg4     ST16000NM002G           ZL21T7XK               E004
SEAGATE  /dev/sg5     ST16000NM002G           ZL21T8HS               E004
SEAGATE  /dev/sg6     ST16000NM002G           ZL21SBMS               E004
SEAGATE  /dev/sg7     ST16000NM002G           ZL21SRYP               E004
SEAGATE  /dev/sg8     ST16000NM002G           ZL21LVPQ               E004
SEAGATE  /dev/sg9     ST16000NM002G           ZL21TB40               E004

NVMe     /dev/nvme0n1 Samsung SSD 990 PRO 4TB S7KGNJ0W912464T        0B2QJXG7

Low Level Format

It's necessary to low-level format these disks to turn off Protection Type 2, test the drive's sectors, and switch to a 4096-byte sector size.

 # SeaChest_Format --protectionType 0 --formatUnit 4096 --confirm this-will-erase-data --poll -d /dev/sg2
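To format all 24 SAS drives, the same sort of loop used for the firmware update should work (a sketch; it assumes the /dev/sg numbering matches the listing above, and with --poll each format runs to completion before the next one starts):

for i in `seq 2 25` ; do SeaChest_Format --protectionType 0 --formatUnit 4096 --confirm this-will-erase-data --poll -d /dev/sg$i; done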

NVMe Config

This was the most difficult part. The onboard NVMe ports under the power supply will take a standard NVMe cable, but do not supply any power. The M.2 to SFF-8612 adapters don't have a power connector, so I added 3.3 V feeds at the filter cap. This was kinda hacky, but it works.

Optane

The Optane SSD needs to be changed to 4k sectors, but the nvme command doesn't work. Intel has an intelmas program that does; see https://community.intel.com/t5/Intel-Optane-Solid-State-Drives/4K-format-on-Optane-SSD-P1600X/m-p/1477181


   root@pve01:~# nvme id-ns /dev/nvme2n1 -H -n 1
   <snip>
   LBA Format  0 : Metadata Size: 0   bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good (in use)
   LBA Format  1 : Metadata Size: 8   bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good
   LBA Format  2 : Metadata Size: 16  bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good
   LBA Format  3 : Metadata Size: 0   bytes - Data Size: 4096 bytes - Relative Performance: 0 Best
   LBA Format  4 : Metadata Size: 8   bytes - Data Size: 4096 bytes - Relative Performance: 0 Best
   LBA Format  5 : Metadata Size: 64  bytes - Data Size: 4096 bytes - Relative Performance: 0 Best
   LBA Format  6 : Metadata Size: 128 bytes - Data Size: 4096 bytes - Relative Performance: 0 Best
    
root@pve01:~# intelmas start -intelssd 2 -nvmeformat LBAFormat=3
WARNING! You have selected to format the drive!
Proceed with the format? (Y|N): y
Formatting...(This can take several minutes to complete)

- Intel Optane(TM) SSD DC P4801X Series PHKM2051009H375A -

Status : NVMeFormat successful.

root@pve01:~# nvme list
Node                  Generic               SN                   Model                                    Namespace Usage                      Format           FW Rev
--------------------- --------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme3n1          /dev/ng3n1            S7KGNJ0W912446D      Samsung SSD 990 PRO 4TB                  1         449.25  GB /   4.00  TB    512   B +  0 B   0B2QJXG7
/dev/nvme2n1          /dev/ng2n1            PHKM2051009H375A     INTEL SSDPEL1K375GA                      1         375.08  GB / 375.08  GB      4 KiB +  0 B   E2010600
/dev/nvme1n1          /dev/ng1n1            S7KGNJ0W912452X      Samsung SSD 990 PRO 4TB                  1         449.27  GB /   4.00  TB    512   B +  0 B   0B2QJXG7
/dev/nvme0n1          /dev/ng0n1            S7KGNJ0W912464T      Samsung SSD 990 PRO 4TB                  1         449.27  GB /   4.00  TB    512   B +  0 B   0B2QJXG7

Partition the disk

Disk /dev/nvme2n1: 91573146 sectors, 349.3 GiB
Model: INTEL SSDPEL1K375GA
Sector size (logical/physical): 4096/4096 bytes
Disk identifier (GUID): 70C34BA9-1175-42A5-B00E-16CFC022861F
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 5
First usable sector is 6, last usable sector is 91573140
Partitions will be aligned on 256-sector boundaries
Total free space is 1395599 sectors (5.3 GiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1             256         8388863   32.0 GiB    BF01  Solaris /usr & Mac ZFS
   2         8388864        90177791   312.0 GiB   BF01  Solaris /usr & Mac ZFS

Command (? for help): w

Final checks complete. About to write GPT data. THIS WILL OVERWRITE EXISTING
PARTITIONS!!

Do you want to proceed? (Y/N): y
OK; writing new GUID partition table (GPT) to /dev/nvme2n1.
The operation has completed successfully.

Physical layout

Mandatory config

The disks don't deal well with the old NVMe support on this box (it's listed as OCuLink 0.91, not 1.0), and there are some bugs in Linux around this and power handling.

Options must be added to the kernel cmdline:

   vim /etc/kernel/cmdline 

add this to it:

   nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

then

   proxmox-boot-tool refresh
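After the next boot, it's easy to confirm the options actually took effect (my check, not from the notes):

cat /proc/cmdline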

udev rules

https://www.reactivated.net/writing_udev_rules.html#strsubst

https://github.com/bkus/by-enclosure-slot

Updated code for the supermicro:

https://github.com/W9CR/by-enclosure-slot/tree/Supermicro-server

To get this working on boot with Proxmox you need to do the following:

In /etc/default/zfs, set:
ZPOOL_IMPORT_PATH="/dev/disk/by-enclosure-slot"
  • create /etc/initramfs-tools/hooks/enclosure-slot
#!/bin/sh -e

PREREQS="udev"

prereqs() { echo "$PREREQS"; }

case "$1" in
    prereqs)
    prereqs
    exit 0
    ;;
esac

. /usr/share/initramfs-tools/hook-functions

for program in get_enclosure_slot; do
  copy_exec /lib/udev/$program /lib/udev
done

#cuz the above needs bash
for program in readlink sg_ses flock bash ; do
        copy_exec /usr/bin/$program /usr/bin
done
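initramfs-tools only runs hooks that are executable, so the new hook needs the execute bit (implied above but not written out):

chmod +x /etc/initramfs-tools/hooks/enclosure-slot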
rm /etc/zfs/zpool.cache
update-initramfs -u


Then power cycle it and confirm the new names are in use. Once that's done, run update-initramfs -u again to store the ZFS cache in the initramfs (optional, but it boots faster).

iscsi target

https://forum.level1techs.com/t/has-anyone-here-tried-to-create-an-iscsi-target-in-proxmox/193862

https://deepdoc-at.translate.goog/dokuwiki/doku.php?id=virtualisierung:proxmox_kvm_und_lxc:proxmox_debian_als_zfs-over-iscsi_server_verwenden&_x_tr_sl=en&_x_tr_tl=es&_x_tr_hl=en&_x_tr_pto=wapp

https://www.reddit.com/r/homelab/comments/ih374t/poor_linux_iscsi_target_performance_tips/

mtu!

sudo ip link set eth1 mtu 9000
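That only lasts until reboot; to make it stick on Proxmox/Debian the MTU goes in /etc/network/interfaces, roughly like this (a sketch — the interface name and addressing here are assumptions, not from the notes):

iface eth1 inet static
        address 192.168.8.187/24
        mtu 9000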


"Backstore name is too long for "INQUIRY_MODEL" iscsi"


Initiator = client; target = server.

Provision FatTony as the iSCSI Target

  • create the iSCSI dataset on the datastore
zfs create localDataStore/iscsi
  • Install the server on FatTony
apt install targetcli-fb 
  • start the service
systemctl enable --now targetclid.service
  • fix the missing /dev/<zpool-name> device nodes

Per the latest Debian, zvols are only located under /dev/zvol/<zpool-name>.

This hits a bug in Proxmox, as the /dev/<zpool-name> path is hard coded.

A forum post on it is here https://forum.proxmox.com/threads/missing-path-to-zfs-pool-in-dev-dir-after-recent-updates.139371/


echo 'KERNEL=="zd*", SUBSYSTEM=="block", ACTION=="add|change", SYMLINK+="%c"' > /etc/udev/rules.d/99-zvol.rules
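As written, %c will never expand to anything because udev only fills it in from a PROGRAM= helper; the stock OpenZFS zvol rule uses /lib/udev/zvol_id for that, so part of the rule has likely been lost here — the linked forum thread has the complete workaround. A sketch of the general shape, not the author's exact rule:

echo 'KERNEL=="zd*", SUBSYSTEM=="block", ACTION=="add|change", PROGRAM="/lib/udev/zvol_id /dev/%k", SYMLINK+="%c"' > /etc/udev/rules.d/99-zvol.rules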

Prepare the Proxmox Client Nodes

Now on the clients we need to configure access to the iSCSI target so PVE can ssh to it. It does this to run ZFS commands and create the block devices that get exported.

The IP of the target in the key name is important; it is used by PVE as the default. IPv6 is equally possible.

mkdir /etc/pve/priv/zfs
ssh-keygen -f /etc/pve/priv/zfs/192.168.8.187_id_rsa
ssh-copy-id -i /etc/pve/priv/zfs/192.168.8.187_id_rsa.pub root@192.168.8.187
  • verify this key works
ssh -i /etc/pve/priv/zfs/192.168.8.187_id_rsa FatTony
  • Read out the iSCSI initiator name from each node
for i in carbonrod moleman fink spiderpig fattony ; do ssh $i 'cat /etc/iscsi/initiatorname.iscsi | grep ^InitiatorName' ; done
InitiatorName=iqn.1993-08.org.debian:01:5f09136632
InitiatorName=iqn.1993-08.org.debian:01:28601263f0ff
InitiatorName=iqn.1993-08.org.debian:01:e45a7fa9d6ec
InitiatorName=iqn.1993-08.org.debian:01:cf88d296de8d
InitiatorName=iqn.1993-08.org.debian:01:51de7eb4a092

Configure the target on FatTony

# targetcli
/iscsi> create
Created target iqn.2003-01.org.linux-iscsi.fattony.x8664:sn.5381d8313ebf.
Created TPG 1.
Global pref auto_add_default_portal=true
Created default portal listening on all IPs (0.0.0.0), port 3260.
  • restrict it to the 192.168.8.187 interface
/iscsi> cd iqn.2003-01.org.linux-iscsi.fattony.x8664:sn.5381d8313ebf/tpg1/portals
/iscsi/iqn.20.../tpg1/portals> delete 0.0.0.0 3260
Deleted network portal 0.0.0.0:3260
/iscsi/iqn.20.../tpg1/portals> create 192.168.8.187 3260
Using default IP port 3260
Created network portal 192.168.8.187:3260.


  • add each node's initiator to the ACL
/iscsi/iqn.20...ebf/tpg1/acls> create iqn.1993-08.org.debian:01:5f09136632
Created Node ACL for iqn.1993-08.org.debian:01:5f09136632
/iscsi/iqn.20...ebf/tpg1/acls> create iqn.1993-08.org.debian:01:28601263f0ff
Created Node ACL for iqn.1993-08.org.debian:01:28601263f0ff
/iscsi/iqn.20...ebf/tpg1/acls> create iqn.1993-08.org.debian:01:e45a7fa9d6ec
Created Node ACL for iqn.1993-08.org.debian:01:e45a7fa9d6ec
/iscsi/iqn.20...ebf/tpg1/acls> create iqn.1993-08.org.debian:01:cf88d296de8d
Created Node ACL for iqn.1993-08.org.debian:01:cf88d296de8d
/iscsi/iqn.20...ebf/tpg1/acls> create iqn.1993-08.org.debian:01:51de7eb4a092
Created Node ACL for iqn.1993-08.org.debian:01:51de7eb4a092
/iscsi/iqn.20...ebf/tpg1/acls> ls
o- acls .................................................................................................................. [ACLs: 5]
  o- iqn.1993-08.org.debian:01:28601263f0ff ....................................................................... [Mapped LUNs: 0]
  o- iqn.1993-08.org.debian:01:51de7eb4a092 ....................................................................... [Mapped LUNs: 0]
  o- iqn.1993-08.org.debian:01:5f09136632 ......................................................................... [Mapped LUNs: 0]
  o- iqn.1993-08.org.debian:01:cf88d296de8d ....................................................................... [Mapped LUNs: 0]
  o- iqn.1993-08.org.debian:01:e45a7fa9d6ec ....................................................................... [Mapped LUNs: 0]
/iscsi/iqn.20...ebf/tpg1/acls> cd /iscsi/
/iscsi> ls
o- iscsi .............................................................................................................. [Targets: 1]
  o- iqn.2003-01.org.linux-iscsi.fattony.x8664:sn.5381d8313ebf ........................................................... [TPGs: 1]
    o- tpg1 ................................................................................................. [no-gen-acls, no-auth]
      o- acls ............................................................................................................ [ACLs: 5]
      | o- iqn.1993-08.org.debian:01:28601263f0ff ................................................................. [Mapped LUNs: 0]
      | o- iqn.1993-08.org.debian:01:51de7eb4a092 ................................................................. [Mapped LUNs: 0]
      | o- iqn.1993-08.org.debian:01:5f09136632 ................................................................... [Mapped LUNs: 0]
      | o- iqn.1993-08.org.debian:01:cf88d296de8d ................................................................. [Mapped LUNs: 0]
      | o- iqn.1993-08.org.debian:01:e45a7fa9d6ec ................................................................. [Mapped LUNs: 0]
      o- luns ............................................................................................................ [LUNs: 0]
      o- portals ...................................................................................................... [Portals: 1]
        o- 192.168.8.187:3260 ................................................................................................. [OK]
  • set auth to disabled (it's a private network)
/> /iscsi/iqn.2003-01.org.linux-iscsi.fattony.x8664:sn.5381d8313ebf/tpg1
/iscsi/iqn.20...d8313ebf/tpg1> set attribute authentication=0

  • configure the storage to use it.
vim /etc/pve/storage.cfg

zfs: iscsi-zfs
        blocksize 32k
        iscsiprovider LIO
        pool localDataStore/iscsi
        portal 192.168.8.187
        target iqn.2003-01.org.linux-iscsi.fattony.x8664:sn.5381d8313ebf
        content images
        lio_tpg tpg1
        nowritecache 1
        sparse 1

Verify that you can see iscsi-zfs from all nodes

If you can't, or you get an error, it's probably an SSH key issue. From the affected node do:

ssh -i /etc/pve/priv/zfs/192.168.8.187_id_rsa root@192.168.8.187  

and accept the host key (yes).
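A quick way to check from each node is the storage manager CLI (my suggestion, not from the original notes); iscsi-zfs should show up as active:

pvesm status | grep iscsi-zfs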

NFSv4

As iSCSI doesn't work with containers, we use NFS for them and for holding ISOs. Note this doesn't store the raw files in the filesystem; it just makes a qcow image on the NFS target, which will be the size of your entire disk. It kinda blows.

Since we have ZFS on the pool, it's trivial to set a few parameters and let ZFS handle the exporting itself.

Host Config

  • install the nfs kernel software
apt-get install nfs-kernel-server
  • make a nfs dataset
zfs create localDataStore/testnfs
  • set zfs nfs share permissions
root@FatTony:~# zfs set sharenfs=rw=@192.168.8.0/24,no_root_squash,crossmnt localDataStore/testnfs
root@FatTony:~# zfs get sharenfs localDataStore/testnfs
NAME                    PROPERTY  VALUE                                       SOURCE
localDataStore/testnfs  sharenfs  rw=@192.168.8.0/24,no_root_squash,crossmnt  local
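To confirm the ZFS-managed export is actually live on the host, exportfs will list it (a check I'd add):

exportfs -v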

Client config

The NFS mount is configured to be a soft mount, so if something happens to it you can force unmount. If you don't do this, you will have to reboot the hypervisor to clear a hung NFS mount.

printf "
nfs: testnfs
        export /localDataStore/testnfs
        path /mnt/pve/testnfsFatTony
        server 192.168.8.187
        options soft
        content images,vztmpl,rootdir,iso
        prune-backups keep-all=1" >>/etc/pve/storage.cfg
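If the mount does hang anyway, it can usually be cleared from the node with a forced or lazy unmount instead of a reboot (an example, not a command from the original notes):

umount -f -l /mnt/pve/testnfsFatTony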

Software

  • Proxmox 8 server
  • Proxmox Backup
  • SNMP