ZFS
Revision as of 23:28, 26 February 2025
Documentation
- https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html ZFS module tunable parameters
Files
- /proc/spl/kstat/zfs/arcstats
- /sys/module/zfs/parameters
Important commands
- zfs set acltype=posixacl pool ### do this for all ZFS pools otherwise all files have world-writable permissions (with nfs4.2, ubuntu LTS 18.04, 20.04), see https://github.com/openzfs/zfs/issues/10504
- zfs set relatime=on pool ### ensure relatime is enabled otherwise each file access generates a write to the filesystem (to update the "last accessed" timestamp).
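Both properties have to be set on every pool. A minimal sketch, assuming the pool names come from "zpool list -H -o name" (one name per line). The helper name zfs_baseline_cmds is hypothetical, and it only prints the commands so they can be reviewed before anything is changed:

```shell
# Hypothetical helper: read pool names on stdin (one per line) and print
# the two "zfs set" commands for each. It prints only, it changes nothing.
zfs_baseline_cmds() {
    while IFS= read -r pool; do
        [ -n "$pool" ] || continue          # skip empty lines
        printf 'zfs set acltype=posixacl %s\n' "$pool"
        printf 'zfs set relatime=on %s\n' "$pool"
    done
}

# Usage (review the output first, then pipe it to "sh" to apply):
#   zpool list -H -o name | zfs_baseline_cmds
#   zpool list -H -o name | zfs_baseline_cmds | sh
```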
isdaq00 tuning
Increase zfs cache to allow "cd /zssd/home1; du -ks *" to run completely from cache without any disk access.
echo 20000000000 > /sys/module/zfs/parameters/zfs_arc_max
echo 50 > /sys/module/zfs/parameters/zfs_arc_dnode_limit_percent
echo 90 > /sys/module/zfs/parameters/zfs_arc_meta_limit_percent
echo 20000000000 > /sys/module/zfs/parameters/zfs_arc_max
echo 2 > /proc/sys/vm/drop_caches
Note:
- "memory_free_bytes" is same as free memory reported by "top"
- "memory_available_bytes" is "memory_free_bytes" minus an adjustable safety margin (shown as "avail" in arcstat)
- "arc_meta_max" is "arc_meta_used" + "memory_available_bytes"
- "arc_meta_limit" should be set much bigger than that, set by zfs_arc_max and zfs_arc_meta_limit_percent
- "arc_meta_used" is "size" in arcstat
- "arc_dnode_limit" should be set much bigger than "dnode_size", set by zfs_arc_dnode_limit_percent
- all the data should end up in the MFU (not MRU), "mfu_size" should be huge, "mru_size" much smaller.
- isdaq00 with 24 GB of RAM is just about big enough to fit all of /zssd/home1, arc_meta_used is about 10 GB, arc_meta_max is about 12 GB.
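The counters in the notes above can be read straight from /proc/spl/kstat/zfs/arcstats, where each data line has three columns (name, type, value). A minimal sketch follows. The helper names arcstat_value and arcstat_gib are hypothetical, and the counter names in the usage comment are the ones mentioned in the notes (other ZFS versions may name them differently):

```shell
# Print one counter from an arcstats-style file.
# $1 = counter name, $2 = file (defaults to the live kstat file).
arcstat_value() {
    awk -v name="$1" '$1 == name { print $3 }' "${2:-/proc/spl/kstat/zfs/arcstats}"
}

# Same counter, converted from bytes to GiB with one decimal place.
arcstat_gib() {
    arcstat_value "$1" "$2" | awk '{ printf "%.1f\n", $1 / (1024 * 1024 * 1024) }'
}

# Usage on a live system:
#   for f in size arc_meta_used arc_meta_limit dnode_size arc_dnode_limit mfu_size mru_size; do
#       printf '%-16s %s GiB\n' "$f" "$(arcstat_gib "$f")"
#   done
```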
Misc commands
- zpool status
- zpool get all
- zpool iostat 1
- zpool iostat -v 1
- zpool history
- zpool scrub data14
- zpool events
- arcstat.py 1
- cat /proc/spl/kstat/zfs/arcstats
- echo 30000000000 > /sys/module/zfs/parameters/zfs_arc_meta_limit
- echo 32000000000 > /sys/module/zfs/parameters/zfs_arc_max
- zfs get all
- zfs set dedup=verify zssd/nfsroot
- zpool create data14 raidz2 /dev/sd[b-h]1
- zfs create z8tb/data
- zfs destroy z8tb/data
- zpool add z10tb cache /dev/disk/by-id/ata-ADATA_SP550_2F4320041688
- parted /dev/sdx mklabel GPT
- blkid
- zpool iostat -v -q 1
- watch -d -n 1 "cat /proc/spl/kstat/zfs/arcstats | grep l2"
- zfs set primarycache=metadata tank/datab
- zfs set secondarycache=metadata tank/datab
- zfs userspace -p -H zssd/home1
- zfs groupspace ...
- zdb -vvv -O pool/gobackup/titan00__home1 data/home1/titan/packages/elog/logbooks/titan/2017
- zdb -C pool | grep ashift ### find the real value of ashift
- zfs snapshot -r pool_A@migrate
- zfs send -R pool_A@migrate | zfs receive -F pool_B
- echo 1 > /sys/module/zfs/parameters/zfs_send_corrupt_data # zfs send should not stop on i/o errors
- zpool create test raidz2 `ls -1 /dev/disk/by-id/ata-WDC_WD40EZRX-00SPEB0_WD* | grep -v part`
- zpool add -f test special mirror /dev/disk/by-id/ata-WDC_WDS120G2G0A-00JH30_1843A2802212 /dev/disk/by-id/ata-KINGSTON_SV300S37A120G_50026B77630CCB2C
invalid vdev specification
use '-f' to override the following errors:
mismatched replication level: pool and new vdev with different redundancy, raidz and mirror vdevs, 2 vs. 1 (2-way)
- zpool iostat -l 1 ### queue latencies
- zpool iostat -w 1 ### distribution
- zpool iostat -q -v 1 ### active and queued requests
- zpool iostat -r 1 ### size of IO
- echo 2 > /proc/sys/vm/drop_caches ### clear zfs cache and system cache, all memory is free after this
- watch -n1 -d "cat /proc/spl/kstat/zfs/arcstats | grep -v l2 | tail -52"
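Values written with "echo" into /sys/module/zfs/parameters are lost at reboot. One common way to keep them is an "options zfs ..." line in a modprobe configuration file. A minimal sketch, assuming the file is /etc/modprobe.d/zfs.conf (the file name is a convention, not something this page specifies). The helper name zfs_modprobe_line is hypothetical, and it only prints the line:

```shell
# Hypothetical helper: build a modprobe "options" line for the zfs module
# from name=value arguments. It prints only, it writes no file.
zfs_modprobe_line() {
    printf 'options zfs'
    for kv in "$@"; do
        printf ' %s' "$kv"
    done
    printf '\n'
}

# Usage (values taken from the echo commands in the list above):
#   zfs_modprobe_line zfs_arc_max=32000000000 zfs_arc_meta_limit=30000000000 \
#       > /etc/modprobe.d/zfs.conf
```

If the zfs module is loaded from the initramfs, the initramfs probably has to be rebuilt as well (for example with "dracut -vf", as used elsewhere on this page) before the new values are picked up at boot.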
Create raid1 (mirror) volume
echo USE_DISK_BY_ID=\'yes\' >> /etc/default/zfs
dracut -vf
zpool create zssd mirror /dev/sdaX /dev/sdbX
zpool set cachefile=none zssd
zpool set failmode=continue zssd
zpool status
zpool events
zpool get all
df /zssd
ls -l /zssd
Use whole disk for zfs mirror (RAID1)
echo USE_DISK_BY_ID=\'yes\' >> /etc/default/zfs
[root@daq13 ~]# parted /dev/sdb
(parted) mklabel GPT
(parted) q                                                                
[root@daq13 ~]# parted /dev/sdc
(parted) mklabel GPT                                                      
(parted) q                                                                
[root@daq13 ~]# blkid                                                     
/dev/sda1: UUID="ab920e4b-40ae-4551-aab8-f3e893d38830" TYPE="xfs" 
/dev/sdb: PTTYPE="gpt" 
/dev/sdc: PTTYPE="gpt" 
[root@daq13 ~]# zpool create z10tb mirror /dev/sdb /dev/sdc
[root@daq13 ~]# zpool status
  pool: z10tb
 state: ONLINE
  scan: none requested
config:
        NAME        STATE     READ WRITE CKSUM
        z10tb       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
errors: No known data errors
[root@daq13 ~]# 
[root@daq13 ~]# zfs create z10tb/emma
[root@daq13 ~]# df -kl
Filesystem      1K-blocks     Used  Available Use% Mounted on
pool           9426697856        0 9426697856   0% /pool
pool/daqstore  9426697856        0 9426697856   0% /pool/daqstore
[root@daq13 ~]# 
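The transcript above creates the pool from /dev/sdb and /dev/sdc, while other sections of this page use the stable names under /dev/disk/by-id. A minimal sketch for looking up the by-id names of a device node. The helper name by_id_names is hypothetical, and it assumes GNU "readlink -f" is available:

```shell
# Print every entry in a by-id style directory that is a symlink to the
# given device node.
# $1 = device (e.g. /dev/sdb), $2 = directory (defaults to /dev/disk/by-id).
by_id_names() {
    target=$(readlink -f "$1") || return 1
    for link in "${2:-/dev/disk/by-id}"/*; do
        [ -L "$link" ] || continue          # only look at symlinks
        if [ "$(readlink -f "$link")" = "$target" ]; then
            printf '%s\n' "$link"
        fi
    done
}

# Usage:
#   by_id_names /dev/sdb
#   by_id_names /dev/sdc
```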
Use whole disk for zfs raidz2
AAA
Enable ZFS at boot
systemctl enable zfs-import-cache
systemctl enable zfs-import-scan
systemctl enable zfs-mount
systemctl enable zfs-import.target
systemctl enable zfs.target
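The same five units can be handled in a loop, which also makes it easy to check their state afterwards. A minimal sketch: the helper name zfs_unit_cmds is hypothetical, and it only prints "systemctl" command lines for the units named above:

```shell
# Units named in the commands above.
zfs_boot_units="zfs-import-cache zfs-import-scan zfs-mount zfs-import.target zfs.target"

# Hypothetical helper: print one "systemctl <action> <unit>" line per unit.
# $1 = systemctl action, e.g. "enable" or "is-enabled".
zfs_unit_cmds() {
    for unit in $zfs_boot_units; do
        printf 'systemctl %s %s\n' "$1" "$unit"
    done
}

# Usage:
#   zfs_unit_cmds enable | sh        # enable all five units
#   zfs_unit_cmds is-enabled | sh    # report the state of each unit
```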
Replace failed disk
- pull failed disk out
- zpool status # identify failed disk zfs label (it should be labeled FAULTED or OFFLINE)
- safe to reboot here
- install new disk
- partition new disk, e.g. "gdisk /dev/sdh": use "o" to create new partition table, use "n" to create new partition, accept all default answers, use "w" to save and exit
- safe to reboot here
- run tests on new disk (smart, diskscrub), if unhappy go back to "install new disk"
- safe to reboot here
- identify serial number of new disk, e.g. "smartctl -a /dev/sdh | grep -i serial" yields "Serial Number: WD-WCAVY0893313"
- identify linux id of new disk, e.g. "ls -l /dev/disk/by-id | grep -i WD-WCAVY0893313" yields "ata-WDC_WD2002FYPS-01U1B0_WD-WCAVY0893313-part1"
- zpool replace data11 zfs-label-of-failed-disk ata-WDC_WD2002FYPS-01U1B0_WD-WCAVY0893313-part1
- zpool status should look like this:
[root@daq11 ~]# zpool status
  pool: data11
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Apr 29 11:51:03 2016
    24.7G scanned out of 795G at 32.3M/s, 6h46m to go
    3.00G resilvered, 3.11% done
config:
        NAME                                                   STATE     READ WRITE CKSUM
        data11                                                 DEGRADED     0     0     0
          raidz2-0                                             DEGRADED     0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WCAZA3872943-part1     ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WCAZA1973466-part1     ONLINE       0     0     0
            replacing-2                                        DEGRADED     0     0     0
              17494865033746374811                             FAULTED      0     0     0  was /dev/sdi1
              ata-WDC_WD2002FYPS-01U1B0_WD-WCAVY0893313-part1  ONLINE       0     0     0  (resilvering)
            ata-WDC_WD20EARS-00MVWB0_WD-WCAZA1973369-part1     ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WMAZA0858733-part1     ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WMAZA0819555-part1     ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WMAZA0857075-part1     ONLINE       0     0     0
            ata-WDC_WD2002FYPS-01U1B0_WD-WCAVY0347413-part1    ONLINE       0     0     0
errors: No known data errors
- wait for raid rebuild ("resilvering") to complete
- zpool status should look like this:
[root@daq11 ~]# zpool status
  pool: data11
 state: ONLINE
  scan: resilvered 96.2G in 1h44m with 0 errors on Fri Apr 29 13:35:40 2016
config:
        NAME                                                 STATE     READ WRITE CKSUM
        data11                                               ONLINE       0     0     0
          raidz2-0                                           ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WCAZA3872943-part1   ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WCAZA1973466-part1   ONLINE       0     0     0
            ata-WDC_WD2002FYPS-01U1B0_WD-WCAVY0893313-part1  ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WCAZA1973369-part1   ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WMAZA0858733-part1   ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WMAZA0819555-part1   ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WMAZA0857075-part1   ONLINE       0     0     0
            ata-WDC_WD2002FYPS-01U1B0_WD-WCAVY0347413-part1  ONLINE       0     0     0
errors: No known data errors
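The "identify failed disk" step can be scripted by filtering the STATE column of "zpool status". A minimal sketch: the helper name zpool_failed_devices is hypothetical, and the list of states it matches (FAULTED, UNAVAIL, OFFLINE, REMOVED) is an assumption about which states count as failed:

```shell
# Read "zpool status" text on stdin and print the name (first column) of
# every device whose STATE column is FAULTED, UNAVAIL, OFFLINE or REMOVED.
zpool_failed_devices() {
    awk '$2 ~ /^(FAULTED|UNAVAIL|OFFLINE|REMOVED)$/ { print $1 }'
}

# Usage:
#   zpool status data11 | zpool_failed_devices
```

The printed name is what goes in the "zfs-label-of-failed-disk" position of "zpool replace".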
Replace failed disk (whole disk zfs)
- roughly same as above
- parted /dev/sdi, mklabel GPT, q
- zpool replace pool 5050168421842479357 /dev/disk/by-id/ata-WDC_WD60EFRX-68MYMN1_WD-WX31D944C2AP
- here "pool" is the zfs pool name
- first number is the failed "was" disk from "zpool status"
- second /dev/disk/by-id is the replacement disk from ./smart-status.perl
- zpool status
[root@tigstore01 ~]# zpool status
  pool: pool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Sep 8 16:06:41 2022
        9.36T scanned at 829M/s, 7.27T issued at 644M/s, 18.0T total
        625G resilvered, 40.48% done, 04:50:08 to go
config:
        NAME                                            STATE     READ WRITE CKSUM
        pool                                            DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            ata-WDC_WD60EFRX-68MYMN1_WD-WX21D9421VHA    ONLINE       0     0     0  (resilvering)
            ata-WDC_WD60EFRX-68MYMN1_WD-WX31D944C8Y6    ONLINE       0     0     0
            ata-WDC_WD60EFRX-68MYMN1_WD-WX31D944CKDK    ONLINE       0     0     0
            replacing-3                                 DEGRADED     0     0     0
              5050168421842479357                       UNAVAIL      0     0     0  was /dev/disk/by-id/ata-WDC_WD60EFRX-68MYMN1_WD-WX31D944CPDP-part1
              ata-WDC_WD60EFRX-68L0BN1_WD-WX11DA5NDTPS  UNAVAIL      3     0     0  (resilvering)
              ata-WDC_WD60EFRX-68MYMN1_WD-WX31D944C2AP  ONLINE       0     0     0  (resilvering)
            ata-WDC_WD60EFRX-68L0BN1_WD-WX21DA5FNE28    ONLINE       0     0     0
            ata-WDC_WD60EFRX-68MYMN1_WD-WX41D94RNHS4    ONLINE       0     0     0  (resilvering)
            ata-WDC_WD60EFRX-68MYMN1_WD-WX41D94RNT2A    ONLINE       0     0     0
            ata-WDC_WD60EFRX-68MYMN1_WD-WX41D94RNZJ0    ONLINE       0     0     0
errors: No known data errors
- wait for resilver to complete
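Waiting for the resilver can be automated by polling "zpool status" and pulling out the "% done" figure. A minimal sketch: the helper name resilver_percent is hypothetical, and the polling loop in the usage comment assumes the status text contains "resilver in progress" while the rebuild is running, as in the transcripts on this page:

```shell
# Read "zpool status" text on stdin and print the resilver completion
# percentage (the number in front of "% done"), if there is one.
resilver_percent() {
    sed -n 's/.*[ ,]\([0-9][0-9.]*\)% done.*/\1/p'
}

# Usage (poll once a minute until the resilver has finished):
#   while zpool status pool | grep -q 'resilver in progress'; do
#       zpool status pool | resilver_percent
#       sleep 60
#   done
```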
Expand zfs pool
replacing 250GB mirrored SSDs with 1TB mirrored SSDs:
zpool scrub zssd ### ensure both mirror halves are consistent and have good data
# confirm have backups of pool contents (amanda and daqbackup)
# pull one 250GB SSD
# insert replacement 1TB SSD
# follow instructions for replacing failed disk:
parted /dev/sda ...
ls -l /dev/disk/by-id/...
zpool replace zssd sda1 ata-WDC_WDS100T2B0A_192872803056
# wait for resilvering to complete
zpool scrub zssd # confirm resilver was ok
# do the same with the second 1TB disk
parted /dev/sdb
ls -l /dev/disk/by-id/...
zpool replace zssd sdb1 ata-WDC_WDS100T2B0A_192872802193
zpool online -e zssd ata-WDC_WDS100T2B0A_192872803056
zpool list -v ### observe EXPANDSZ is now non-zero
# wait for resilver to finish
zpool online -e zssd ata-WDC_WDS100T2B0A_192872803056
zpool list -v ### observe EXPANDSZ is now zero, but SIZE and FREE have changed
[root@alpha00 ~]# zpool list -v zssd
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
zssd   222G   202G  20.1G      706G    56%    90%  1.00x  DEGRADED  -
  mirror   222G   202G  20.1G      708G    56%    90%
    ata-WDC_WDS100T2B0A_192872803056      -      -      -         -      -      -
    replacing      -      -      -      708G      -      -
      sdb1      -      -      -      708G      -      -
      ata-WDC_WDS100T2B0A_192872802193      -      -      -         -      -      -
[root@alpha00 ~]# zpool list -v zssd
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
zssd   930G   202G   728G         -    13%    21%  1.00x  ONLINE  -
  mirror   930G   202G   728G         -    13%    21%
    ata-WDC_WDS100T2B0A_192872803056      -      -      -         -      -      -
    ata-WDC_WDS100T2B0A_192872802193      -      -      -         -      -      -
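The EXPANDSZ column shown above is also available as the pool property "expandsize", which is easier to check from a script. A minimal sketch, assuming the input is the tab-separated output of "zpool list -H -o name,expandsize", where an unset value is printed as "-". The helper name pools_with_expand_space is hypothetical:

```shell
# Read tab-separated "name<TAB>expandsize" lines on stdin and print the
# pools whose expandsize is set (anything other than "-").
pools_with_expand_space() {
    awk -F '\t' '$2 != "-" && $2 != "" { print $1 ": " $2 " available for expansion" }'
}

# Usage:
#   zpool list -H -o name,expandsize | pools_with_expand_space
```

A pool listed here still needs "zpool online -e" on its devices, as in the steps above, before SIZE and FREE grow.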
Convert pool from single to mirror
- we will convert a single-disk pool to a mirrored pool
- initial state:
root@daq13:~# zpool status
  pool: bpool
 state: ONLINE
  scan: none requested
config:
        NAME                                    STATE     READ WRITE CKSUM
        bpool                                   ONLINE       0     0     0
          489bdda8-989a-f748-95b2-c1041aceed65  ONLINE       0     0     0
errors: No known data errors
  pool: rpool
 state: ONLINE
  scan: none requested
config:
        NAME                                    STATE     READ WRITE CKSUM
        rpool                                   ONLINE       0     0     0
          d870d08b-5bba-f441-b486-6e4975a384f2  ONLINE       0     0     0
errors: No known data errors
root@daq13:~# zpool attach rpool d870d08b-5bba-f441-b486-6e4975a384f2 /dev/sda2
- status
root@daq13:~# zpool status
  pool: bpool
 state: ONLINE
  scan: none requested
config:
        NAME                                    STATE     READ WRITE CKSUM
        bpool                                   ONLINE       0     0     0
          489bdda8-989a-f748-95b2-c1041aceed65  ONLINE       0     0     0
errors: No known data errors
  pool: rpool
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Jan 3 16:17:48 2021
        8.94G scanned at 2.98G/s, 620M issued at 207M/s, 8.94G total
        637M resilvered, 6.78% done, 0 days 00:00:41 to go
config:
        NAME                                      STATE     READ WRITE CKSUM
        rpool                                     ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            d870d08b-5bba-f441-b486-6e4975a384f2  ONLINE       0     0     0
            sda2                                  ONLINE       0     0     0  (resilvering)
errors: No known data errors
Rename zfs pool
zpool export oldname
zpool import oldname z6tb
Quotas and disk use
- zfs userspace -s used zssd/home1
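For a per-user report in GiB, the parsable output of "zfs userspace -H -p" can be post-processed. A minimal sketch, assuming the default column order (type, name, used bytes, quota) with tab separators. The helper name userspace_gib is hypothetical:

```shell
# Read "zfs userspace -H -p" style lines on stdin (tab-separated:
# type, name, used bytes, ...) and print "name GiB", largest user first.
userspace_gib() {
    sort -t "$(printf '\t')" -k3,3nr |
        awk -F '\t' '{ printf "%s %.1f\n", $2, $3 / (1024 * 1024 * 1024) }'
}

# Usage (ten biggest users of zssd/home1):
#   zfs userspace -H -p zssd/home1 | userspace_gib | head
```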
Misc
ZFS tunable parameters for hopefully speeding up resilvering: https://www.reddit.com/r/zfs/comments/4192js/resilvering_raidz_why_so_incredibly_slow/
echo 0 > /sys/module/zfs/parameters/zfs_resilver_delay
echo 512 > /sys/module/zfs/parameters/zfs_top_maxinflight
echo 5000 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms
Enable periodic scrub:
cd ~/git/scripts
git pull
cd zfs
make install
Working with ZFS snapshots:
- zfs list -t snapshot
- cd ~/git; git clone https://github.com/zfsonlinux/zfs-auto-snapshot.git; cd zfs-auto-snapshot; make install
If ZFS becomes 100% full, "rm" will stop working, but space can still be freed by using "echo > bigfile", afterwards "rm" works again.
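The truncation trick can be wrapped in a small function. A minimal sketch: the helper name truncate_in_place is hypothetical, and it uses ": > file" instead of "echo > file", which empties the file the same way but leaves it at zero bytes rather than one newline. Whether this frees space also depends on the file's blocks not being held by a snapshot:

```shell
# Empty a file in place by truncating it to zero length.
# ":" is the shell no-op command; the redirection does the truncation.
truncate_in_place() {
    : > "$1"
}

# Usage on a full pool:
#   truncate_in_place /zssd/home1/bigfile
#   rm /zssd/home1/bigfile
```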