ZFS
Documentation
- https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html ZFS module tunable parameters
Files
- /proc/spl/kstat/zfs/arcstats
- /sys/module/zfs/parameters
Important commands
- zfs set acltype=posixacl pool ### do this for all ZFS pools, otherwise all files appear with world-writable permissions (with NFS 4.2, Ubuntu LTS 18.04, 20.04), see https://github.com/openzfs/zfs/issues/10504
- zfs set relatime=on pool ### ensure relatime is enabled, otherwise each file access generates a write to the filesystem (to update the "last accessed" timestamp); a loop that applies both settings to every pool is sketched below
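A minimal sketch, assuming every imported pool should get both properties ("zpool list -H -o name" prints just the pool names):
for p in $(zpool list -H -o name); do
    zfs set acltype=posixacl "$p"
    zfs set relatime=on "$p"
done
zfs get -s local acltype,relatime    ### verify both properties are now set locally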
isdaq00 tuning
Increase zfs cache to allow "cd /zssd/home1; du -ks *" to run completely from cache without any disk access.
echo 20000000000 > /sys/module/zfs/parameters/zfs_arc_max
echo 50 > /sys/module/zfs/parameters/zfs_arc_dnode_limit_percent
echo 90 > /sys/module/zfs/parameters/zfs_arc_meta_limit_percent
echo 20000000000 > /sys/module/zfs/parameters/zfs_arc_max
echo 2 > /proc/sys/vm/drop_caches
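These echo settings do not survive a reboot. A hedged sketch of making them persistent via module options (the file name /etc/modprobe.d/zfs.conf is the usual convention; values copied from above):
cat > /etc/modprobe.d/zfs.conf <<'EOF'
options zfs zfs_arc_max=20000000000
options zfs zfs_arc_dnode_limit_percent=50
options zfs zfs_arc_meta_limit_percent=90
EOF
dracut -vf    ### rebuild the initramfs if the zfs module is loaded from it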
Note:
- "memory_free_bytes" is same as free memory reported by "top"
- "memory_available_bytes" minus adjustable safety margin ("avail" in arcstat)
- "arc_meta_max" is "arc_meta_used" + "memory_available_bytes"
- "arc_meta_limit" should be set much bigger than that, set by zfs_arc_max and zfs_arc_meta_limit_percent
- "arc_meta_used" is "size" in arcstat
- "arc_dnode_limit" should be set much bigger than "dnode_size", set by zfs_arc_dnode_limit_percent
- all the data should end up in the MFU (not the MRU): "mfu_size" should be large and "mru_size" much smaller (a grep for watching these counters is sketched after this list).
- isdaq00 with 24 GB of RAM is just about big enough to fit all of /zssd/home1, arc_meta_used is about 10 GB, arc_meta_max is about 12 GB.
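A hedged one-liner for watching the counters discussed in the notes above (the exact set of fields in arcstats varies between ZFS versions):
grep -E '^(memory_free_bytes|memory_available_bytes|arc_meta_max|arc_meta_limit|arc_meta_used|arc_dnode_limit|dnode_size|mfu_size|mru_size|size) ' /proc/spl/kstat/zfs/arcstats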
Misc commands
- zpool status
- zpool get all
- zpool iostat 1
- zpool iostat -v 1
- zpool history
- zpool scrub data14
- zpool events
- arcstat.py 1
- cat /proc/spl/kstat/zfs/arcstats
- echo 30000000000 > /sys/module/zfs/parameters/zfs_arc_meta_limit
- echo 32000000000 > /sys/module/zfs/parameters/zfs_arc_max
- zfs get all
- zfs set dedup=verify zssd/nfsroot
- zpool create data14 raidz2 /dev/sd[b-h]1
- zfs create z8tb/data
- zfs destroy z8tb/data
- zpool add z10tb cache /dev/disk/by-id/ata-ADATA_SP550_2F4320041688
- parted /dev/sdx mklabel GPT
- blkid
- zpool iostat -v -q 1
- watch -d -n 1 "cat /proc/spl/kstat/zfs/arcstats | grep l2"
- zfs set primarycache=metadata tank/datab
- zfs set secondarycache=metadata tank/datab
- zfs userspace -p -H zssd/home1
- zfs groupspace ...
- zdb -vvv -O pool/gobackup/titan00__home1 data/home1/titan/packages/elog/logbooks/titan/2017
- zdb -C pool | grep ashift ### find the real value of ashift
- zfs snapshot -r pool_A@migrate
- zfs send -R pool_A@migrate | zfs receive -F pool_B
- echo 1 > /sys/module/zfs/parameters/zfs_send_corrupt_data # zfs send should not stop on i/o errors
- zpool create test raidz2 `ls -1 /dev/disk/by-id/ata-WDC_WD40EZRX-00SPEB0_WD* | grep -v part`
- zpool add -f test special mirror /dev/disk/by-id/ata-WDC_WDS120G2G0A-00JH30_1843A2802212 /dev/disk/by-id/ata-KINGSTON_SV300S37A120G_50026B77630CCB2C
invalid vdev specification
use '-f' to override the following errors:
mismatched replication level: pool and new vdev with different redundancy, raidz and mirror vdevs, 2 vs. 1 (2-way)
zpool iostat -l 1       ### queue latencies
zpool iostat -w 1       ### latency distribution
zpool iostat -q -v 1    ### active and queued requests
zpool iostat -r 1       ### size of IO
- echo 2 > /proc/sys/vm/drop_caches ### clear zfs cache and system cache, all memory is free after this
Create RAID1 (mirror) volume
echo USE_DISK_BY_ID=\'yes\' >> /etc/default/zfs
dracut -vf
zpool create zssd mirror /dev/sdaX /dev/sdbX
zpool set cachefile=none zssd
zpool set failmode=continue zssd
zpool status
zpool events
zpool get all
df /zssd
ls -l /zssd
Use whole disk for ZFS mirror (RAID1)
echo USE_DISK_BY_ID=\'yes\' >> /etc/default/zfs
[root@daq13 ~]# parted /dev/sdb
(parted) mklabel GPT
(parted) q
[root@daq13 ~]# parted /dev/sdc
(parted) mklabel GPT
(parted) q
[root@daq13 ~]# blkid
/dev/sda1: UUID="ab920e4b-40ae-4551-aab8-f3e893d38830" TYPE="xfs"
/dev/sdb: PTTYPE="gpt"
/dev/sdc: PTTYPE="gpt"
[root@daq13 ~]# zpool create z10tb mirror /dev/sdb /dev/sdc
[root@daq13 ~]# zpool status
  pool: z10tb
 state: ONLINE
  scan: none requested
config:
        NAME        STATE     READ WRITE CKSUM
        z10tb       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
errors: No known data errors
[root@daq13 ~]# zfs create z10tb/emma
[root@daq13 ~]# df -kl
Filesystem     1K-blocks   Used   Available Use% Mounted on
pool           9426697856     0  9426697856   0% /pool
pool/daqstore  9426697856     0  9426697856   0% /pool/daqstore
[root@daq13 ~]#
Enable ZFS at boot
systemctl enable zfs-import-cache
systemctl enable zfs-import-scan
systemctl enable zfs-mount
systemctl enable zfs-import.target
systemctl enable zfs.target
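A quick hedged check that the units ended up enabled (is-enabled and the unit-file listing are standard systemctl commands):
systemctl is-enabled zfs-import-cache zfs-import-scan zfs-mount zfs-import.target zfs.target
systemctl list-unit-files 'zfs*'    ### overview of all ZFS units and their enable state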
Replace failed disk
- pull failed disk out
- zpool status # identify the zfs label of the failed disk (it should be marked FAULTED or OFFLINE)
- safe to reboot here
- install new disk
- partition new disk, i.e. "gdisk /dev/sdh", use "o" to create new partition table, use "n" to create new partition, accept all default answers, use "w" to save and exit
- safe to reboot here
- run tests on the new disk (SMART, diskscrub); if unhappy with the results, go back to "install new disk"
- safe to reboot here
- identify serial number of new disk, i.e. "smartctl -a /dev/sdh | grep -i serial" yields "Serial Number: WD-WCAVY0893313"
- identify linux id of new disk by "ls -l /dev/disk/by-id | grep -i WD-WCAVY0893313" yields "ata-WDC_WD2002FYPS-01U1B0_WD-WCAVY0893313-part1"
- zpool replace data11 zfs-label-of-failed-disk ata-WDC_WD2002FYPS-01U1B0_WD-WCAVY0893313-part1
- zpool status should look like this:
[root@daq11 ~]# zpool status
  pool: data11
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Apr 29 11:51:03 2016
        24.7G scanned out of 795G at 32.3M/s, 6h46m to go
        3.00G resilvered, 3.11% done
config:
        NAME                                                  STATE     READ WRITE CKSUM
        data11                                                DEGRADED     0     0     0
          raidz2-0                                            DEGRADED     0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WCAZA3872943-part1    ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WCAZA1973466-part1    ONLINE       0     0     0
            replacing-2                                       DEGRADED     0     0     0
              17494865033746374811                            FAULTED      0     0     0  was /dev/sdi1
              ata-WDC_WD2002FYPS-01U1B0_WD-WCAVY0893313-part1 ONLINE       0     0     0  (resilvering)
            ata-WDC_WD20EARS-00MVWB0_WD-WCAZA1973369-part1    ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WMAZA0858733-part1    ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WMAZA0819555-part1    ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WMAZA0857075-part1    ONLINE       0     0     0
            ata-WDC_WD2002FYPS-01U1B0_WD-WCAVY0347413-part1   ONLINE       0     0     0

errors: No known data errors
- wait for the raid rebuild ("resilvering") to complete; a small polling sketch is given after the final status output below
- zpool status should look like this:
[root@daq11 ~]# zpool status
  pool: data11
 state: ONLINE
  scan: resilvered 96.2G in 1h44m with 0 errors on Fri Apr 29 13:35:40 2016
config:
        NAME                                                STATE     READ WRITE CKSUM
        data11                                              ONLINE       0     0     0
          raidz2-0                                          ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WCAZA3872943-part1  ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WCAZA1973466-part1  ONLINE       0     0     0
            ata-WDC_WD2002FYPS-01U1B0_WD-WCAVY0893313-part1 ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WCAZA1973369-part1  ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WMAZA0858733-part1  ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WMAZA0819555-part1  ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WMAZA0857075-part1  ONLINE       0     0     0
            ata-WDC_WD2002FYPS-01U1B0_WD-WCAVY0347413-part1 ONLINE       0     0     0

errors: No known data errors
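A small hedged helper for the "wait for resilvering" step above: it polls zpool status until the resilver message disappears (pool name data11 as in the example):
while zpool status data11 | grep -q 'resilver in progress'; do
    sleep 60
done
zpool status data11    ### final check: expect state ONLINE and "resilvered ... with 0 errors"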
Expand zfs pool
replacing 250GB mirrored SSDs with 1TB mirrored SSDs:
zpool scrub zssd    ### ensure both mirror halves are consistent and have good data
# confirm there are backups of the pool contents (amanda and daqbackup)
# pull one 250GB SSD
# insert the replacement 1TB SSD
# follow the instructions for replacing a failed disk:
parted /dev/sda ...
ls -l /dev/disk/by-id/...
zpool replace zssd sda1 ata-WDC_WDS100T2B0A_192872803056
# wait for resilvering to complete
zpool scrub zssd
# confirm the resilver was ok
# do the same with the second 1TB disk
parted /dev/sdb
ls -l /dev/disk/by-id/...
zpool replace zssd sdb1 ata-WDC_WDS100T2B0A_192872802193
zpool online -e zssd ata-WDC_WDS100T2B0A_192872803056
zpool list -v    ### observe EXPANDSZ is now non-zero
# wait for resilver to finish
zpool online -e zssd ata-WDC_WDS100T2B0A_192872803056
zpool list -v    ### observe EXPANDSZ is now zero, but SIZE and FREE have changed (an autoexpand alternative is sketched after the outputs below)
[root@alpha00 ~]# zpool list -v zssd
NAME                                    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH    ALTROOT
zssd                                    222G   202G  20.1G      706G    56%    90%  1.00x  DEGRADED  -
  mirror                                222G   202G  20.1G      708G    56%    90%
    ata-WDC_WDS100T2B0A_192872803056       -      -      -         -      -      -
    replacing                              -      -      -      708G      -      -
      sdb1                                 -      -      -      708G      -      -
      ata-WDC_WDS100T2B0A_192872802193     -      -      -         -      -      -
[root@alpha00 ~]# zpool list -v zssd
NAME                                    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
zssd                                    930G   202G   728G         -    13%    21%  1.00x  ONLINE  -
  mirror                                930G   202G   728G         -    13%    21%
    ata-WDC_WDS100T2B0A_192872803056       -      -      -         -      -      -
    ata-WDC_WDS100T2B0A_192872802193       -      -      -         -      -      -
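Instead of running "zpool online -e" by hand, the pool can be told to grow by itself once every device in the vdev has been replaced with a larger one; autoexpand is a standard pool property, shown here as a hedged alternative:
zpool set autoexpand=on zssd
zpool get autoexpand zssd    ### verify; with autoexpand=on the extra space appears after the last replace finishes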
Convert pool from single to mirror
- we will convert a single-disk pool to a mirrored pool
- initial state:
root@daq13:~# zpool status
  pool: bpool
 state: ONLINE
  scan: none requested
config:
        NAME                                    STATE     READ WRITE CKSUM
        bpool                                   ONLINE       0     0     0
          489bdda8-989a-f748-95b2-c1041aceed65  ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
  scan: none requested
config:
        NAME                                    STATE     READ WRITE CKSUM
        rpool                                   ONLINE       0     0     0
          d870d08b-5bba-f441-b486-6e4975a384f2  ONLINE       0     0     0

errors: No known data errors
root@daq13:~# zpool attach rpool d870d08b-5bba-f441-b486-6e4975a384f2 /dev/sda2
- status while the resilver is running:
root@daq13:~# zpool status
  pool: bpool
 state: ONLINE
  scan: none requested
config:
        NAME                                    STATE     READ WRITE CKSUM
        bpool                                   ONLINE       0     0     0
          489bdda8-989a-f748-95b2-c1041aceed65  ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Jan  3 16:17:48 2021
        8.94G scanned at 2.98G/s, 620M issued at 207M/s, 8.94G total
        637M resilvered, 6.78% done, 0 days 00:00:41 to go
config:
        NAME                                      STATE     READ WRITE CKSUM
        rpool                                     ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            d870d08b-5bba-f441-b486-6e4975a384f2  ONLINE       0     0     0
            sda2                                  ONLINE       0     0     0  (resilvering)

errors: No known data errors
Rename zfs pool
zpool export oldname
zpool import oldname z6tb
Quotas and disk use
- zfs userspace zssd/home1 -s used
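Related hedged examples for actually setting per-user quotas (userquota@ is a standard zfs property; the user name and size below are made up):
zfs set userquota@someuser=500G zssd/home1    ### hypothetical user and limit
zfs get userquota@someuser zssd/home1
zfs userspace -o name,used,quota zssd/home1   ### show usage against the quotas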
Misc
ZFS tunable parameters for hopefully speeding up resilvering: https://www.reddit.com/r/zfs/comments/4192js/resilvering_raidz_why_so_incredibly_slow/
echo 0 > /sys/module/zfs/parameters/zfs_resilver_delay
echo 512 > /sys/module/zfs/parameters/zfs_top_maxinflight
echo 5000 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms
Enable periodic scrub:
cd ~/git/scripts
git pull
cd zfs
make install
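The make install above presumably installs a scrub cron job; a hedged sketch of an equivalent manual cron entry (pool names and schedule are assumptions):
# /etc/cron.d/zfs-scrub (sketch): scrub each pool every Sunday at 03:00
0 3 * * 0  root  /sbin/zpool scrub zssd
0 3 * * 0  root  /sbin/zpool scrub z10tb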
Working with ZFS snapshots:
- zfs list -t snapshot
- cd ~/git; git clone https://github.com/zfsonlinux/zfs-auto-snapshot.git; cd zfs-auto-snapshot; make install
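A few hedged examples of day-to-day snapshot handling (dataset and snapshot names are made up):
zfs snapshot zssd/home1@before-upgrade           ### create a named snapshot
zfs list -t snapshot -r zssd/home1               ### list snapshots of one dataset
ls /zssd/home1/.zfs/snapshot/before-upgrade/     ### read-only access to the old file versions
zfs rollback zssd/home1@before-upgrade           ### revert the dataset (destroys newer data!)
zfs destroy zssd/home1@before-upgrade            ### delete the snapshot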
If ZFS becomes 100% full, "rm" will stop working, but space can still be freed by truncating a large file with "echo > bigfile"; afterwards "rm" works again.
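A hedged example (the file name is hypothetical):
echo > /zssd/home1/some-big-file.dat    ### truncate in place: frees blocks without needing new space
rm /zssd/home1/some-big-file.dat        ### now rm works again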