ZFS
Documentation
- https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html ZFS module tunable parameters
Files
- /proc/spl/kstat/zfs/arcstats
- /sys/module/zfs/parameters
Important commands
- zfs set acltype=posixacl pool ### do this for all ZFS pools, otherwise all files appear with world-writable permissions (with NFS 4.2, Ubuntu LTS 18.04, 20.04), see https://github.com/openzfs/zfs/issues/10504
- zfs set relatime=on pool ### ensure relatime is enabled, otherwise each file access generates a write to the filesystem (to update the "last accessed" timestamp); a loop that applies both settings to every pool is sketched below
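A minimal sketch, assuming every imported pool should get both properties ("zpool list -H -o name" prints just the pool names):
for p in $(zpool list -H -o name); do
    zfs set acltype=posixacl "$p"
    zfs set relatime=on "$p"
done
zfs get -s local acltype,relatime    ### verify both properties are now set locally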
isdaq00 tuning
Increase zfs cache to allow "cd /zssd/home1; du -ks *" to run completely from cache without any disk access.
echo 20000000000 > /sys/module/zfs/parameters/zfs_arc_max
echo 50 > /sys/module/zfs/parameters/zfs_arc_dnode_limit_percent
echo 90 > /sys/module/zfs/parameters/zfs_arc_meta_limit_percent
echo 20000000000 > /sys/module/zfs/parameters/zfs_arc_max
echo 2 > /proc/sys/vm/drop_caches
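These echo settings do not survive a reboot. A hedged sketch of making them persistent via module options (the file name /etc/modprobe.d/zfs.conf is the usual convention; values copied from above):
cat > /etc/modprobe.d/zfs.conf <<'EOF'
options zfs zfs_arc_max=20000000000
options zfs zfs_arc_dnode_limit_percent=50
options zfs zfs_arc_meta_limit_percent=90
EOF
dracut -vf    ### rebuild the initramfs if the zfs module is loaded from it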
Note:
- "memory_free_bytes" is same as free memory reported by "top"
- "memory_available_bytes" minus adjustable safety margin ("avail" in arcstat)
- "arc_meta_max" is "arc_meta_used" + "memory_available_bytes"
- "arc_meta_limit" should be set much bigger than that, set by zfs_arc_max and zfs_arc_meta_limit_percent
- "arc_meta_used" is "size" in arcstat
- "arc_dnode_limit" should be set much bigger than "dnode_size", set by zfs_arc_dnode_limit_percent
- all the data should end up in the MFU (not the MRU): "mfu_size" should be large and "mru_size" much smaller (a grep for watching these counters is sketched after this list).
- isdaq00 with 24 GB of RAM is just about big enough to fit all of /zssd/home1, arc_meta_used is about 10 GB, arc_meta_max is about 12 GB.
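A hedged one-liner for watching the counters discussed in the notes above (the exact set of fields in arcstats varies between ZFS versions):
grep -E '^(memory_free_bytes|memory_available_bytes|arc_meta_max|arc_meta_limit|arc_meta_used|arc_dnode_limit|dnode_size|mfu_size|mru_size|size) ' /proc/spl/kstat/zfs/arcstats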
Misc commands
- zpool status
- zpool get all
- zpool iostat 1
- zpool iostat -v 1
- zpool history
- zpool scrub data14
- zpool events
- arcstat.py 1
- cat /proc/spl/kstat/zfs/arcstats
- echo 30000000000 > /sys/module/zfs/parameters/zfs_arc_meta_limit
- echo 32000000000 > /sys/module/zfs/parameters/zfs_arc_max
- zfs get all
- zfs set dedup=verify zssd/nfsroot
- zpool create data14 raidz2 /dev/sd[b-h]1
- zfs create z8tb/data
- zfs destroy z8tb/data
- zpool add z10tb cache /dev/disk/by-id/ata-ADATA_SP550_2F4320041688
- parted /dev/sdx mklabel GPT
- blkid
- zpool iostat -v -q 1
- watch -d -n 1 "cat /proc/spl/kstat/zfs/arcstats | grep l2"
- zfs set primarycache=metadata tank/datab
- zfs set secondarycache=metadata tank/datab
- zfs userspace -p -H zssd/home1
- zfs groupspace ...
- zdb -vvv -O pool/gobackup/titan00__home1 data/home1/titan/packages/elog/logbooks/titan/2017
- zdb -C pool | grep ashift ### find the real value of ashift
- zfs snapshot -r pool_A@migrate
- zfs send -R pool_A@migrate | zfs receive -F pool_B
- echo 1 > /sys/module/zfs/parameters/zfs_send_corrupt_data # zfs send should not stop on i/o errors
- zpool create test raidz2 `ls -1 /dev/disk/by-id/ata-WDC_WD40EZRX-00SPEB0_WD* | grep -v part`
- zpool add -f test special mirror /dev/disk/by-id/ata-WDC_WDS120G2G0A-00JH30_1843A2802212 /dev/disk/by-id/ata-KINGSTON_SV300S37A120G_50026B77630CCB2C
invalid vdev specification
use '-f' to override the following errors:
mismatched replication level: pool and new vdev with different redundancy, raidz and mirror vdevs, 2 vs. 1 (2-way)
zpool iostat -l 1       ### queue latencies
zpool iostat -w 1       ### latency distribution
zpool iostat -q -v 1    ### active and queued requests
zpool iostat -r 1       ### size of IO
- echo 2 > /proc/sys/vm/drop_caches ### clear zfs cache and system cache, all memory is free after this
Create RAID1 (mirror) volume
echo USE_DISK_BY_ID=\'yes\' >> /etc/default/zfs
dracut -vf
zpool create zssd mirror /dev/sdaX /dev/sdbX
zpool set cachefile=none zssd
zpool set failmode=continue zssd
zpool status
zpool events
zpool get all
df /zssd
ls -l /zssd
Use whole disk for ZFS mirror (RAID1)
echo USE_DISK_BY_ID=\'yes\' >> /etc/default/zfs
[root@daq13 ~]# parted /dev/sdb
(parted) mklabel GPT
(parted) q
[root@daq13 ~]# parted /dev/sdc
(parted) mklabel GPT
(parted) q
[root@daq13 ~]# blkid
/dev/sda1: UUID="ab920e4b-40ae-4551-aab8-f3e893d38830" TYPE="xfs"
/dev/sdb: PTTYPE="gpt"
/dev/sdc: PTTYPE="gpt"
[root@daq13 ~]# zpool create z10tb mirror /dev/sdb /dev/sdc
[root@daq13 ~]# zpool status
  pool: z10tb
 state: ONLINE
  scan: none requested
config:
        NAME        STATE     READ WRITE CKSUM
        z10tb       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
errors: No known data errors
[root@daq13 ~]# zfs create z10tb/emma
[root@daq13 ~]# df -kl
Filesystem     1K-blocks   Used   Available Use% Mounted on
pool           9426697856     0  9426697856   0% /pool
pool/daqstore  9426697856     0  9426697856   0% /pool/daqstore
[root@daq13 ~]#
Enable ZFS at boot
systemctl enable zfs-import-cache
systemctl enable zfs-import-scan
systemctl enable zfs-mount
systemctl enable zfs-import.target
systemctl enable zfs.target
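A quick hedged check that the units ended up enabled (is-enabled and the unit-file listing are standard systemctl commands):
systemctl is-enabled zfs-import-cache zfs-import-scan zfs-mount zfs-import.target zfs.target
systemctl list-unit-files 'zfs*'    ### overview of all ZFS units and their enable state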
Replace failed disk
- pull failed disk out
- zpool status # identify the zfs label of the failed disk (it should be marked FAULTED or OFFLINE)
- safe to reboot here
- install new disk
- partition new disk, i.e. "gdisk /dev/sdh", use "o" to create new partition table, use "n" to create new partition, accept all default answers, use "w" to save and exit
- safe to reboot here
- run tests on the new disk (SMART, diskscrub); if unhappy with the results, go back to "install new disk"
- safe to reboot here
- identify serial number of new disk, i.e. "smartctl -a /dev/sdh | grep -i serial" yields "Serial Number: WD-WCAVY0893313"
- identify linux id of new disk by "ls -l /dev/disk/by-id | grep -i WD-WCAVY0893313" yields "ata-WDC_WD2002FYPS-01U1B0_WD-WCAVY0893313-part1"
- zpool replace data11 zfs-label-of-failed-disk ata-WDC_WD2002FYPS-01U1B0_WD-WCAVY0893313-part1
- zpool status should look like this:
[root@daq11 ~]# zpool status
  pool: data11
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Apr 29 11:51:03 2016
        24.7G scanned out of 795G at 32.3M/s, 6h46m to go
        3.00G resilvered, 3.11% done
config:
        NAME                                                  STATE     READ WRITE CKSUM
        data11                                                DEGRADED     0     0     0
          raidz2-0                                            DEGRADED     0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WCAZA3872943-part1    ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WCAZA1973466-part1    ONLINE       0     0     0
            replacing-2                                       DEGRADED     0     0     0
              17494865033746374811                            FAULTED      0     0     0  was /dev/sdi1
              ata-WDC_WD2002FYPS-01U1B0_WD-WCAVY0893313-part1 ONLINE       0     0     0  (resilvering)
            ata-WDC_WD20EARS-00MVWB0_WD-WCAZA1973369-part1    ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WMAZA0858733-part1    ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WMAZA0819555-part1    ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WMAZA0857075-part1    ONLINE       0     0     0
            ata-WDC_WD2002FYPS-01U1B0_WD-WCAVY0347413-part1   ONLINE       0     0     0

errors: No known data errors
- wait for the raid rebuild ("resilvering") to complete; a small polling sketch is given after the final status output below
- zpool status should look like this:
[root@daq11 ~]# zpool status
  pool: data11
 state: ONLINE
  scan: resilvered 96.2G in 1h44m with 0 errors on Fri Apr 29 13:35:40 2016
config:
        NAME                                                STATE     READ WRITE CKSUM
        data11                                              ONLINE       0     0     0
          raidz2-0                                          ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WCAZA3872943-part1  ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WCAZA1973466-part1  ONLINE       0     0     0
            ata-WDC_WD2002FYPS-01U1B0_WD-WCAVY0893313-part1 ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WCAZA1973369-part1  ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WMAZA0858733-part1  ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WMAZA0819555-part1  ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WMAZA0857075-part1  ONLINE       0     0     0
            ata-WDC_WD2002FYPS-01U1B0_WD-WCAVY0347413-part1 ONLINE       0     0     0

errors: No known data errors
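A small hedged helper for the "wait for resilvering" step above: it polls zpool status until the resilver message disappears (pool name data11 as in the example):
while zpool status data11 | grep -q 'resilver in progress'; do
    sleep 60
done
zpool status data11    ### final check: expect state ONLINE and "resilvered ... with 0 errors"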
Expand zfs pool
replacing 250GB mirrored SSDs with 1TB mirrored SSDs:
zpool scrub zssd    ### ensure both mirror halves are consistent and have good data
# confirm there are backups of the pool contents (amanda and daqbackup)
# pull one 250GB SSD
# insert the replacement 1TB SSD
# follow the instructions for replacing a failed disk:
parted /dev/sda ...
ls -l /dev/disk/by-id/...
zpool replace zssd sda1 ata-WDC_WDS100T2B0A_192872803056
# wait for resilvering to complete
zpool scrub zssd
# confirm the resilver was ok
# do the same with the second 1TB disk
parted /dev/sdb
ls -l /dev/disk/by-id/...
zpool replace zssd sdb1 ata-WDC_WDS100T2B0A_192872802193
zpool online -e zssd ata-WDC_WDS100T2B0A_192872803056
zpool list -v    ### observe EXPANDSZ is now non-zero
# wait for resilver to finish
zpool online -e zssd ata-WDC_WDS100T2B0A_192872803056
zpool list -v    ### observe EXPANDSZ is now zero, but SIZE and FREE have changed (an autoexpand alternative is sketched after the outputs below)
[root@alpha00 ~]# zpool list -v zssd
NAME                                    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH    ALTROOT
zssd                                    222G   202G  20.1G      706G    56%    90%  1.00x  DEGRADED  -
  mirror                                222G   202G  20.1G      708G    56%    90%
    ata-WDC_WDS100T2B0A_192872803056       -      -      -         -      -      -
    replacing                              -      -      -      708G      -      -
      sdb1                                 -      -      -      708G      -      -
      ata-WDC_WDS100T2B0A_192872802193     -      -      -         -      -      -
[root@alpha00 ~]# zpool list -v zssd
NAME                                    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
zssd                                    930G   202G   728G         -    13%    21%  1.00x  ONLINE  -
  mirror                                930G   202G   728G         -    13%    21%
    ata-WDC_WDS100T2B0A_192872803056       -      -      -         -      -      -
    ata-WDC_WDS100T2B0A_192872802193       -      -      -         -      -      -
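Instead of running "zpool online -e" by hand, the pool can be told to grow by itself once every device in the vdev has been replaced with a larger one; autoexpand is a standard pool property, shown here as a hedged alternative:
zpool set autoexpand=on zssd
zpool get autoexpand zssd    ### verify; with autoexpand=on the extra space appears after the last replace finishes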
Convert pool from single to mirror
- we will convert a single-disk pool to a mirrored pool
- initial state:
root@daq13:~# zpool status
  pool: bpool
 state: ONLINE
  scan: none requested
config:
        NAME                                    STATE     READ WRITE CKSUM
        bpool                                   ONLINE       0     0     0
          489bdda8-989a-f748-95b2-c1041aceed65  ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
  scan: none requested
config:
        NAME                                    STATE     READ WRITE CKSUM
        rpool                                   ONLINE       0     0     0
          d870d08b-5bba-f441-b486-6e4975a384f2  ONLINE       0     0     0

errors: No known data errors
root@daq13:~# zpool attach rpool d870d08b-5bba-f441-b486-6e4975a384f2 /dev/sda2
- status while the resilver is running:
root@daq13:~# zpool status
  pool: bpool
 state: ONLINE
  scan: none requested
config:
        NAME                                    STATE     READ WRITE CKSUM
        bpool                                   ONLINE       0     0     0
          489bdda8-989a-f748-95b2-c1041aceed65  ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Jan  3 16:17:48 2021
        8.94G scanned at 2.98G/s, 620M issued at 207M/s, 8.94G total
        637M resilvered, 6.78% done, 0 days 00:00:41 to go
config:
        NAME                                      STATE     READ WRITE CKSUM
        rpool                                     ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            d870d08b-5bba-f441-b486-6e4975a384f2  ONLINE       0     0     0
            sda2                                  ONLINE       0     0     0  (resilvering)

errors: No known data errors
Rename zfs pool
zpool export oldname
zpool import oldname z6tb
Quotas and disk use
- zfs userspace zssd/home1 -s used
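Related hedged examples for actually setting per-user quotas (userquota@ is a standard zfs property; the user name and size below are made up):
zfs set userquota@someuser=500G zssd/home1    ### hypothetical user and limit
zfs get userquota@someuser zssd/home1
zfs userspace -o name,used,quota zssd/home1   ### show usage against the quotas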
Misc
ZFS tunable parameters for hopefully speeding up resilvering: https://www.reddit.com/r/zfs/comments/4192js/resilvering_raidz_why_so_incredibly_slow/
echo 0 > /sys/module/zfs/parameters/zfs_resilver_delay
echo 512 > /sys/module/zfs/parameters/zfs_top_maxinflight
echo 5000 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms
Enable periodic scrub:
cd ~/git/scripts
git pull
cd zfs
make install
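The make install above presumably installs a scrub cron job; a hedged sketch of an equivalent manual cron entry (pool names and schedule are assumptions):
# /etc/cron.d/zfs-scrub (sketch): scrub each pool every Sunday at 03:00
0 3 * * 0  root  /sbin/zpool scrub zssd
0 3 * * 0  root  /sbin/zpool scrub z10tb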
Working with ZFS snapshots:
- zfs list -t snapshot
- cd ~/git; git clone https://github.com/zfsonlinux/zfs-auto-snapshot.git; cd zfs-auto-snapshot; make install
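A few hedged examples of day-to-day snapshot handling (dataset and snapshot names are made up):
zfs snapshot zssd/home1@before-upgrade           ### create a named snapshot
zfs list -t snapshot -r zssd/home1               ### list snapshots of one dataset
ls /zssd/home1/.zfs/snapshot/before-upgrade/     ### read-only access to the old file versions
zfs rollback zssd/home1@before-upgrade           ### revert the dataset (destroys newer data!)
zfs destroy zssd/home1@before-upgrade            ### delete the snapshot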
If ZFS becomes 100% full, "rm" will stop working, but space can still be freed by truncating a large file with "echo > bigfile"; afterwards "rm" works again.
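A hedged example (the file name is hypothetical):
echo > /zssd/home1/some-big-file.dat    ### truncate in place: frees blocks without needing new space
rm /zssd/home1/some-big-file.dat        ### now rm works again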