SLinstall

Notes

these instructions are periodically updated to include items needed for older/newer versions of Linux. They are marked like this: (SL4.2+) means Scientific Linux 4.2 and newer; (SL4 is equivalent to FC3). (FC5 only) means Fedora Core 5; etc.
obsolete items are marked by the "#" sign at the beginning of the line and sometimes have a comment about the reason for removal.
typically, we do not "upgrade" machines using the Red Hat "upgrade" function. Instead, we save critical files from the old installation and do a "fresh install" from scratch
starting with RHEL7, the recommended OS is CentOS7 (instead of SL7).

Disk configurations

The year is 2019 and SSDs are used exclusively, except for bulk data storage, where one uses 6, 8, 10 or 12 TB HDDs.

For reliability, home directories and data disks must use redundant storage - mdadm raid1 or ZFS raid1/raid6.

For non-critical machines, a single SSD seems to be reliable enough to use as a boot and OS disk. But since any storage device can fail at any time without warning, home directories and data disks should use redundant storage.

Note: for data disks bigger than 4-6TB, mdadm raid1/raid6 is no longer recommended because raid rebuild, verification and repair time has become unreasonably long. Instead, use ZFS raid1/raid6 which implements online verification, repair and disk replacement without requiring machine shutdown or OS down time.

single SSD - 120GB min - single partition for "/", no swap partition (create a swap file if swap is needed) - for non-critical machine with no local data storage (OS only)
dual SSD - 2x240GB min - all partitions mirrored (RAID1), 30GB "/", rest for /home1 - for daq station with local user home directories and no bulk data storage
single SSD + 2x6-8-10-12TB HDD - SSD partition: all "/", HDD partition as ZFS raid1 (mirrored) - for daq station with small local bulk data storage
single SSD + 6-8x6-8-10-12TB HDD - for small storage server machines - for daq station with local home directories and large bulk data storage.
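
The single-SSD layout above has no swap partition and suggests a swap file instead. A minimal sketch of creating one follows. The path /swapfile and the size are illustrative assumptions, not values from these instructions. dd is used rather than fallocate because swapon may reject preallocated files on XFS with older kernels. With DRY_RUN set, the function only prints the commands.

```shell
# Sketch only: the swap file path and size are illustrative assumptions.
run() {
    # print the command in dry-run mode, execute it otherwise
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

make_swapfile() {
    path=$1
    size_mb=$2
    run dd if=/dev/zero of="$path" bs=1M count="$size_mb"
    run chmod 600 "$path"    # swapon warns about world-readable swap files
    run mkswap "$path"
    run swapon "$path"
    # reminder only: the fstab line is not written automatically
    echo "# to make it permanent, add to /etc/fstab: $path swap swap defaults 0 0"
}

# usage (as root): make_swapfile /swapfile 4096
```

Run it once with DRY_RUN=1 to review the commands, then again without it.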

For VME processors:

network boot - VME-CPU#Network_boot - only option for V7648/V7750, do not use for V7805 (no netboot from GigE), optional for V7865/XVB-602
USB boot - 8GB USB for V7805, 16GB USB for V7865/XVB-602

Preparation

save /etc, /var, /root, /opt, (if needed: /usr/local, /tftpboot) by rsync to some data disk (/ladd/data0/root)
check that "/" partition (it will be overwritten) is different from /home1 and /data partitions
note the MAC addresses of all network interfaces, add them to ladd00 dhcpd.conf to enable PXE boot into the SL "network installer"
shutdown
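
The rsync backup in the first step can be scripted. The sketch below assumes the destination named above (/ladd/data0/root) is mounted and has enough space. The options -aHAX (archive mode, hard links, ACLs, extended attributes) and --relative are suggestions, not part of the original instructions. The function prints the commands so they can be reviewed before running.

```shell
# Print one rsync command per directory to save.
# --relative keeps the full path, so /usr/local lands in DEST/usr/local.
backup_commands() {
    dest=$1
    shift
    for dir in "$@"; do
        # skip directories that do not exist on this machine (e.g. /tftpboot)
        [ -d "$dir" ] || continue
        echo "rsync -aHAX --relative $dir $dest/"
    done
}

# usage: backup_commands /ladd/data0/root /etc /var /root /opt /usr/local /tftpboot
```

Pipe the output to "sh" once the list looks right.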

Running installer (CentOS7)

CentOS7 can be installed from vanilla CentOS7 installation media or from a custom USB key built per these instructions: https://daqshare.triumf.ca/~olchansk/linux/CentOS7/

The custom installer makes it easy to use a custom kickstart file (ks.cfg).

Instructions for using the usb-installer:

disconnect machine from network
plug the usb-installer into a usb3 port (blue colour)
reboot machine, select booting from usb (press F8 on ASUS motherboards)
usb-installer boot menu offers to install CentOS7, go there
CentOS7 should boot (many messages scroll on screen)
into graphical mode
into installer main menu
all installer options should be "happy" except for the "installation destination"
go to the "installation destination" menu
- unselect all disks except for the SSD where the OS will be installed
- (MOST IMPORTANT: unselect the USB installer disk!)
- select "I will configure..."
- say "done"
- the "manual partitioning" menu will open
  - use the "-" button to delete all existing partitions
  - select "standard partition"
  - click on the "+" button
  - in the "Add new partition" dialog, set mount point "/", capacity blank, click "add mount point"
  - check capacity (should be full size of SSD), check filesystem type (should be XFS)
  - say "done", there will be a warning about absent swap partition, say "done" again.
  - in the big useless dialog, say "accept changes"
  - should be back to the "installation summary" screen, "installation destination" should be happy now
after everything is happy, say "begin installation"
as the installation proceeds, set the password for the root user
after installation is complete, reboot the machine
unplug the usb-installer, CentOS7 should boot from SSD into the login screen
click on "not listed?", login as root
setup network connection:
- open a terminal
- start "nm-connection-editor"
- click on "+" to create a new connection profile
- select "wired ethernet"
- select "add profile..."
- in "Identity", set "name" to "static"
- in "Identity", check that "Connect automatically" and "Make available..." are enabled
- in "IPv4", set "Addresses" to "manual" instead of "dhcp"
- enter IP address, netmask 255.255.224.0, gateway 142.90.100.18, dns 142.90.100.19, search triumf.ca
- say "Add", then close/quit the network settings
connect network cable
network should be up, ping ladd00 should work
run: yum update -y
check new kernel is installed: ls -l /boot
logout and restart (good luck finding these buttons in the gui!)
confirm correct linux kernel is selected during boot (-229.20, not the original installer kernel)
login as root, confirm network is up, proceed with the rest of these instructions
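
The static network profile can also be created from the command line with nmcli, which expects the netmask as a prefix length. In the sketch below, the interface name and IP address are placeholders to be replaced with real values. The gateway, DNS server and search domain are the ones listed above. The helper converts the dotted netmask and prints the nmcli commands for review.

```shell
# Convert a dotted netmask to a prefix length, e.g. 255.255.224.0 -> 19.
# Note: octets are checked individually; contiguity of the mask is not verified.
netmask_to_prefix() {
    bits=0
    oldifs=$IFS
    IFS=.
    set -- $1
    IFS=$oldifs
    for octet in "$@"; do
        case $octet in
            255) bits=$((bits + 8)) ;;
            254) bits=$((bits + 7)) ;;
            252) bits=$((bits + 6)) ;;
            248) bits=$((bits + 5)) ;;
            240) bits=$((bits + 4)) ;;
            224) bits=$((bits + 3)) ;;
            192) bits=$((bits + 2)) ;;
            128) bits=$((bits + 1)) ;;
            0) ;;
            *) echo "bad netmask octet: $octet" >&2; return 1 ;;
        esac
    done
    echo "$bits"
}

# $1 = interface name (placeholder), $2 = IP address, $3 = dotted netmask
print_static_profile() {
    prefix=$(netmask_to_prefix "$3") || return 1
    echo "nmcli connection add type ethernet con-name static ifname $1 ip4 $2/$prefix gw4 142.90.100.18"
    echo "nmcli connection modify static ipv4.dns 142.90.100.19 ipv4.dns-search triumf.ca connection.autoconnect yes"
}

# usage: print_static_profile eth0 <ip-address> 255.255.224.0
```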

Configure SSH

(+CentOS7)

Login from the console
restore the SSH keys from backup (/etc/ssh/*key*)
service sshd restart
ssh into the new machine as root
ssh root@localhost, ctrl-C
### this is done later from Konstantin's git repository - scp root@ladd00:/root/authorized_keys ~root/.ssh/
(not needed for SL5.5 kickstart) check that /etc/ssh/ssh_config contains "ForwardX11 yes" and "ForwardX11Trusted yes":

echo "  ForwardX11 yes" >> /etc/ssh/ssh_config
echo "  ForwardX11Trusted yes" >> /etc/ssh/ssh_config
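
Appending with "echo >>" adds a duplicate line every time the step is repeated, for instance when these instructions are followed a second time on the same machine. Below is a small sketch of an idempotent append. The helper name append_once is hypothetical and not part of Konstantin's scripts.

```shell
# Append a line to a file only if that exact line is not already present.
append_once() {
    line=$1
    file=$2
    grep -qxF -e "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# usage:
# append_once "  ForwardX11 yes" /etc/ssh/ssh_config
# append_once "  ForwardX11Trusted yes" /etc/ssh/ssh_config
```

The same helper fits the other "echo ... >> file" steps on this page (/etc/sysconfig/network, /etc/chrony.conf, /etc/rc.local).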

Post installation CentOS7

Set hostname: (use full name, i.e. daq11.triumf.ca)

emacs -nw /etc/hostname

echo "olchansk@triumf.ca amaudruz@triumf.ca lindner@triumf.ca" >> ~root/.forward
chmod a+r /var/log/messages
chmod a+r /var/log/yum.log

Activate rc.local:

chmod a+x /etc/rc.local
chmod a+x /etc/rc.d/rc.local  # TL edit
systemctl start rc-local
systemctl status rc-local

Disable "persistent network names" (DO NOT DO THIS)

/bin/touch /etc/udev/rules.d/75-persistent-net-generator.rules
/bin/rm /etc/udev/rules.d/70-persistent-net.rules
#shutdown -r now

Configure NIS client (CentOS7)

yum -y install ypbind authconfig
echo "NISTIMEOUT=5" >> /etc/sysconfig/network
echo "NETWORKWAIT=yes" >> /etc/sysconfig/network
authconfig --enablenis --enablepreferdns --nisdomain LADD-NIS --nisserver ladd00.triumf.ca --update
ypwhich
ypcat -k passwd
systemctl restart autofs

On the master NIS node (ladd00), add this new node to /etc/netgroup, and update NIS maps (cd /var/yp; make)
Use "system-config-users" to add local user accounts
enable selinux ssh key login to nfs mounted home directories:

setsebool -P use_nfs_home_dirs 1

Configure NIS secondary server (CentOS7)

Enable local NIS server, make local machine use it:

yum -y install ypserv
/usr/lib64/yp/ypinit -s ladd00 ### (/usr/lib/yp/ypinit on 32-bit machines)
systemctl enable rpcbind ypserv ypxfrd yppasswdd
systemctl start rpcbind ypserv ypxfrd yppasswdd
emacs -nw /etc/yp.conf # change "domain XXX server YYY.triumf.ca" to read "domain XXX server localhost"
systemctl restart ypbind
ypwhich # should say "localhost"
ypcat -k auto.master # should work

Punch hole in the firewall: (or "make" on NIS master will complain)

echo YPSERV_ARGS=\"-p 800\" >> /etc/sysconfig/network
systemctl restart ypserv
firewall-cmd --get-services
firewall-cmd --add-service rpc-bind --permanent
firewall-cmd --add-port=800/tcp --add-port=800/udp --permanent
firewall-cmd --reload
firewall-cmd --list-all

on the NIS master:
- add the new machine to /var/yp/ypservers, run "make -C /var/yp" and also "cd /var/yp; yppush -h newmachine ypservers"
- if using /var/yp/securenets, copy it from NIS master to new NIS secondary server

Enable hourly NIS update cron job (DO THIS AFTER git pull scripts, see below)

cd ~/git/scripts
git pull
cd etc
cd ~/git/scripts/etc; ln -s $PWD/ypxfr-cron-hourly /etc/cron.hourly

Configure AUTOFS (CentOS7)

yum -y install autofs
systemctl enable autofs
systemctl start autofs
ls -l /daq/daqshare

Label Selinux labels

When upgrading non-selinux machines (el6) to el7 (selinux enforcing) the existing user home directories will not have the correct selinux labels and many things will not work, including ssh logins (sshd cannot access ~user/.ssh files).

semanage fcontext -a -e /home /home1 ### selinux has special rules for /home, assign them to /home1
restorecon -R -v /home1 ### apply the new rules to files in /home1
ls -Zd /home1/alpha/.ssh
# should say: drwx------. alpha users system_u:object_r:ssh_home_t:s0  /home1/alpha/.ssh

Configure time (CentOS7)

Time server ntpd was replaced by chronyd.

yum -y install chrony
echo server time1 iburst >> /etc/chrony.conf
echo server time2 iburst >> /etc/chrony.conf
echo server time3 iburst >> /etc/chrony.conf
systemctl enable chronyd
systemctl restart chronyd
chronyc sources
chronyc tracking

if desired, edit /etc/chrony.conf, remove non-triumf time servers
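
Removing the non-TRIUMF servers can be done with sed instead of an editor. The sketch below comments out every "server" or "pool" line except time1, time2 and time3. It assumes those entries start at the beginning of the line, as in the stock chrony.conf. A copy of the original is kept as chrony.conf.bak.

```shell
# Comment out all "server"/"pool" lines except the TRIUMF time servers.
keep_triumf_servers() {
    conf=$1
    cp -p "$conf" "$conf.bak" || return 1
    # "b" with no label skips the remaining expressions for matching lines
    sed -e '/^server time[123] /b' -e '/^server time[123]$/b' \
        -e 's/^server /#&/' -e 's/^pool /#&/' "$conf.bak" > "$conf"
}

# usage: keep_triumf_servers /etc/chrony.conf && systemctl restart chronyd
```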

Enable automatic system updates (CentOS7)

Disable yum-cron:

rpm --erase yum-cron
/bin/rm -v /var/lock/subsys/yum-cron
/bin/rm -v /etc/cron.daily/0yum-daily.cron
/bin/rm -v /etc/cron.hourly/0yum-hourly.cron

Enable yum-autoupdate:

yum install -y epel-release
yum install -y yum-changelog yum-protectbase yum-tsflags yum-versionlock
rpm -vh --install http://linuxsoft.cern.ch/cern/centos/7.2/cern/x86_64/Packages/yum-kernel-module-1-5.el7.cern.noarch.rpm
rpm -vh --install http://linuxsoft.cern.ch/cern/centos/7.2/cern/x86_64/Packages/yum-autoupdate-4.4.2-1.el7.cern.noarch.rpm
#rpm -vh --install https://daqshare.triumf.ca/~olchansk/linux/yum-autoupdate-4.4.2-1.el7.cern.noarch.rpm https://daqshare.triumf.ca/~olchansk/linux/yum-kernel-module-1-5.el7.cern.noarch.rpm
systemctl enable yum-autoupdate
systemctl start yum-autoupdate
systemctl status yum-autoupdate

Disable automatic system updates (CentOS7)

yum -y erase yum-autoupdate
/bin/rm -f /etc/sysconfig/yum-autoupdate.rpmsave
/bin/rm -f /var/lock/subsys/yum-autoupdate

Configure system services (CentOS7)

systemctl list-unit-files | grep enabled | sort ### (to see enabled services)
disable unwanted services:

systemctl disable bluetooth
systemctl disable dm-event
systemctl disable dmraid-activation
systemctl disable iscsid
systemctl disable iscsi
systemctl disable iscsiuio
systemctl disable libvirtd
systemctl disable lvm2-lvmetad
systemctl disable lvm2-monitor
systemctl disable ModemManager
systemctl disable multipathd
systemctl disable netcf-transaction
systemctl disable lvm2-lvmetad.socket
systemctl disable lvm2-lvmpolld.socket
systemctl disable iscsid.socket
systemctl disable iscsiuio.socket
#systemctl disable
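
The same list can be applied in one pass. The sketch below skips units that are not installed on the machine, so a missing package does not produce an error. The SYSTEMCTL variable is an addition for this example: set SYSTEMCTL=echo to rehearse the loop.

```shell
SYSTEMCTL=${SYSTEMCTL:-systemctl}

disable_units() {
    for unit in "$@"; do
        # "systemctl cat" fails for units that do not exist
        if $SYSTEMCTL cat "$unit" >/dev/null 2>&1; then
            $SYSTEMCTL disable "$unit"
        else
            echo "skipping $unit (not installed)"
        fi
    done
}

# usage: disable_units bluetooth dm-event iscsid iscsi libvirtd ModemManager multipathd iscsid.socket
```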

Erase unwanted packages

yum erase PackageKit # bugs users about security updates

Erase unwanted packages (CentOS7)

PackageKit # bugs users about security updates, hogs yum lock
perl-homedir # creates unwanted $HOME/perl5
ModemManager # thinks that all USB-attached devices are modems
pcp # sends error email to itself, does not work
abrt # sends email to root about useless crashes, i.e. crash of X when machine is rebooted
rear # some kind of backup and recovery tool, not clear what it does, but it sends email complaining how it is broken
bash-completion # "echo $HOME/<TAB>" becomes "echo \$HOME" (notice "\" added before "$") preventing tab-completion from doing anything useful.

yum -y erase PackageKit perl-homedir ModemManager pcp abrt abrt-libs abrt-gui-libs rear bash-completion

Configure external package repositories

yum install elrepo-release epel-release

Configure external package repositories (CentOS7)

EPEL: (additional packages)

yum install epel-release

ELREPO: (kernel drivers)

rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
rpm -Uvh https://www.elrepo.org/elrepo-release-7.0-2.el7.elrepo.noarch.rpm
yum -y install yum-plugin-fastestmirror

Install packages needed to continue with installation

(+CentOS7)

(these packages are sometimes missing; they are needed to follow the instructions below)

(SL6.5: libotf is a dependency of emacs - SL6.5 installer fails to install it)

yum install ed patch wget git libotf gdisk emacs

Configure TRIUMF packages

(only for machines on the TRIUMF network)

(TRIUMF kickstart usually installs this automatically)

rpm -vh --install  http://mirror.triumf.ca/triumf/6/x86_64/Packages/triumf-release-1.4-1.noarch.rpm
yum install triumf-ssh triumf-syslog

Configure TRIUMF packages (CentOS7)

(only for machines on the TRIUMF network)

# TL Was rpm -vh --install http://mirror.triumf.ca/triumf/6/x86_64/RPMS/triumf-release-1.4-1.noarch.rpm
rpm -vh --install  http://mirror.triumf.ca/triumf/6/x86_64/Packages/triumf-release-1.4-1.noarch.rpm
yum install triumf-ssh triumf-syslog

Configure Konstantin's scripts

(+CentOS7)

mkdir ~root/git
cd ~root/git
git clone http://ladd00.triumf.ca/~olchansk/git/scripts.git
cd scripts
git pull

Enable yum version lock

DO THIS ONLY IF NEEDED

yum install yum-plugin-versionlock
yum versionlock packagename # yum versionlock rpcbind
yum versionlock list # list locked packages
yum versionlock delete packagename # unlock given package
yum versionlock clear # delete all locks

Configure TRIUMF mirror of yum repositories (SL6)

(only for machines on TRIUMF network)

if /daq/mirror is available: /bin/cp ~/git/scripts/etc/daq-mirror-SL6.repo /etc/yum.repos.d/
if /triumfcs/mirror is available: /bin/cp ~/git/scripts/etc/triumfcs-mirror-SL6.repo /etc/yum.repos.d/
otherwise: /bin/cp ~/git/scripts/etc/triumf-SL6.repo /etc/yum.repos.d/

then disable external repositories:

yum clean all
yum-config-manager --disable epel
yum-config-manager --disable elrepo
yum-config-manager --disable sl
yum-config-manager --disable sl-security
yum-config-manager --disable sl6x
yum-config-manager --disable sl6x-security
yum clean all

Configure trusted ssh keys

(+CentOS7)

ssh localhost
interrupt by Ctrl-C
/bin/cp ~/git/scripts/etc/authorized_keys ~/.ssh/

Configure hardware sensors

yum install lm_sensors kmod-k10temp kmod-coretemp
sensors-detect (accept default answer to all questions - press ENTER)
service lm_sensors restart (to reload the kernel modules)
sensors (to see available sensors)

If no sensors are detected by standard drivers, follow motherboard-specific instructions at the bottom of this page.

Configure coretemp CPU sensors

On some machines, the coretemp driver for Intel CPU temperature sensors is not loaded after the above steps.

sensors | grep coretemp ### number of sensors reported should be the same as the number of CPU cores
if output is blank, add this to /etc/rc.local

emacs -nw /etc/rc.local
modprobe coretemp

Configure IPMI sensors

Some machines support the IPMI interface for monitoring the hardware: fan speeds, temperatures, voltages.

find out if IPMI is supported. Try this:

dmidecode | grep -i ipmi

if output is not blank, IPMI is maybe supported.

install and enable IPMI software:

yum install "OpenIPMI*" ipmitool
service ipmi start
ipmitool sensor ### to confirm IPMI is present. If output is blank, do not go further.
chkconfig ipmi on
chkconfig ipmievd on
service ipmi restart
service ipmievd restart
tail -100 /var/log/messages ### look at messages logged by ipmievd

(CentOS7) install and enable IPMI software:

yum install "OpenIPMI*" ipmitool
systemctl start ipmi
ipmitool sensor ### to confirm IPMI is present. If output is blank, do not go further.
systemctl list-unit-files | grep -i ipmi
systemctl enable ipmi
systemctl restart ipmi
systemctl status ipmi
systemctl enable ipmievd
systemctl restart ipmievd
systemctl status ipmievd
tail -100 /var/log/messages ### look at messages logged by ipmievd

if ipmievd complains about SEL buffer overflow, clear it manually:

ipmitool sel list ### show ipmi messages in raw format
ipmitool sel elist ### show ipmi messages in useful format
ipmitool sel elist > file ### save ipmi messages into a file
ipmitool sel clear  ### clear all accumulated ipmi messages

useful ipmi commands:
- ipmitool sensor -- read hardware sensors
- ipmitool sel elist -- report all accumulated messages

Configure SMARTD (CentOS7)

Default el7 smartd config files send deficient email notices about disk failures. Overwrite.

/bin/cp ~/git/scripts/etc/smartd.conf /etc/smartmontools/
/bin/cp ~/git/scripts/etc/smartd_warning.sh /etc/smartmontools/
systemctl restart smartd
systemctl status smartd

Enable User Disk Quotas (OPTIONAL)

(+CentOS7)

read http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/ch-disk-quotas.html
emacs -nw /etc/fstab, add "grpquota,usrquota" to filesystem options, e.g.:

[root@isdaq00 home1]# grep quota /etc/fstab
UUID=5a2aefbd-45db-475e-841e-12ec89220fbd /home1 ext4 defaults,grpquota,usrquota 1 2

cd /; umount /home1; mount /home1
quotacheck -cug /home1
quotacheck -avug
quotaon -av
quota system is now active
increase the soft quota time limit from default 7days to 30 or 60 days: edquota -t
set quotas for all users (see below)
setup warnquota:
- create warnquota config file: emacs -nw /etc/warnquota.conf

# values can be quoted:
MAIL_CMD        = "/usr/sbin/sendmail -t"
FROM            = root
SUBJECT         = User %i@%h exceeded allocated disk quota
CC_TO           = "root"
# If you set this variable CC will be used only when user has less than
# specified grace time left (examples of possible times: 5 seconds, 1 minute,
# 12 hours, 5 days)
# CC_BEFORE = 2 days
SUPPORT         = "root"
# Text in the beginning of the mail (if not specified, default text is used)
# This way text can be split to more lines
# Line breaks are done by '|' character
# The expressions %i, %h, %d, and %% are substituted for user/group name,
# host name, domain name, and '%' respectively. For backward compatibility
# %s behaves as %i but is deprecated.
MESSAGE         = User "%i" on "%h" has exceeded the allocated disk quota.||Please delete any unnecessary files on following filesystems or|contact the system administrator to increase your quota allocation:|
SIGNATURE       = --|automated email from warnquota

- note that %i@%h in the SUBJECT line do not seem to work
- create cron job: emacs -nw /etc/cron.daily/warnquota

#!/bin/sh
warnquota
#end

- chmod a+x /etc/cron.daily/warnquota
- touch /etc/crontab

Useful commands for managing quotas:

repquota -a | sort -n -k3 ### show quota of all users sorted by disk usage
edquota -u username ### open "vi" editor to change user quotas
repquota -a | grep username ### report quota for given user
setquota -u username 0 0 0 0 /home1 ### disable quotas for given user
setquota -u username 50000000 100000000 0 0 /home1 ### set quotas for 50GB soft and 100GB hard
edquota -t ### change user quota time limits
edquota -tg ### change group quota time limits
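
setquota counts block limits in 1 KiB blocks, so the 50GB and 100GB limits above are written as 50000000 and 100000000 (roughly 1000000 blocks per GB). Below is a sketch for applying the same limits to several users at once. The user names in the usage line are hypothetical, and the function only prints the commands.

```shell
# Print setquota commands; soft and hard limits are given in GB
# and converted at 1000000 one-KiB blocks per GB, as in the example above.
quota_commands() {
    soft_gb=$1
    hard_gb=$2
    fs=$3
    shift 3
    for user in "$@"; do
        echo "setquota -u $user $((soft_gb * 1000000)) $((hard_gb * 1000000)) 0 0 $fs"
    done
}

# usage: quota_commands 50 100 /home1 alice bob
```

Pipe the output to "sh" after checking it, then confirm with "repquota -a".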

Enable NFS V4 server (CentOS7)

create /etc/exports. example: (fsid numbers should be unique and increase 1,2,3,...)

/home1  @home_export(rw,no_root_squash,async,fsid=1)
/data1  @data_export(rw,no_root_squash,async,fsid=2)

check the netgroup file
- if using NIS: check NIS netgroup: ypcat -k netgroup
- if no NIS, create /etc/netgroup: @daqmachines (deap00,,) (deap01,,) (deap02,,)
- if no NIS, edit /etc/nsswitch.conf, make the netgroup line read: "netgroup: files"
enable things, start them:

firewall-cmd --get-services
firewall-cmd --permanent --add-service=nfs
firewall-cmd --reload
firewall-cmd --list-all
systemctl enable nfs-server
systemctl start nfs-server
systemctl status nfs
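
When a server exports many filesystems, the unique, increasing fsid numbers in /etc/exports are easy to get wrong by hand. The sketch below numbers the lines automatically, using the same options as the example above. The "path:netgroup" argument format is a convention invented for this helper.

```shell
# Print /etc/exports lines with fsid=1,2,3,... in argument order.
# Each argument is "path:netgroup" (format is an assumption of this sketch).
exports_lines() {
    fsid=1
    for entry in "$@"; do
        path=${entry%%:*}
        group=${entry#*:}
        printf '%s  @%s(rw,no_root_squash,async,fsid=%d)\n' "$path" "$group" "$fsid"
        fsid=$((fsid + 1))
    done
}

# usage: exports_lines /home1:home_export /data1:data_export > /etc/exports
```

After changing /etc/exports on a running server, "exportfs -ra" re-reads the file.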

Enable NFS V3 server (CentOS7)

ps -efw | grep rpc.mountd # should be running!
firewall-cmd --get-services
firewall-cmd --permanent --add-service=mountd
firewall-cmd --permanent --add-service=rpc-bind
firewall-cmd --reload
firewall-cmd --list-all

Enable NFS V3 server

edit /etc/hosts.allow, add or uncomment "mountd: 142.90.0.0/255.255.0.0"
create /etc/exports. example:

/home1  @home_export(rw,no_root_squash,async)
/data1  @data_export(rw,no_root_squash,async)

check the netgroup file
- if using NIS: check NIS netgroup: ypcat -k netgroup
- if no NIS, create /etc/netgroup: @daqmachines (deap00,,) (deap01,,) (deap02,,)
- if no NIS, edit /etc/nsswitch.conf, make the netgroup line read: "netgroup: files"
chkconfig nfs on
chkconfig nfslock on
service nfs restart

Then on ladd00 need to do

ssh to root@ladd00
edit /etc/auto.daq to add new machine...
make -C /var/yp

Enable NFS V4 SERVER (SL6)

if used with NIS, same as NFSv3
if used as standalone, need to edit idmapd.conf - set the "Domain" name to the same value on NFS server and NFS slave (default automagically determined value does not always work). More TBW.

Enable AMANDA backups

AMANDA backups are already enabled by TRIUMF kickstart installs. For non-kickstart installation, follow instructions at [http://amanda/~amanda], or look at "/triumfcs/trshare/olchansk/linux/amanda/amanda-enable.perl". As final step, use [https://helpdesk.triumf.ca] to contact TRIUMF CS to add this new machine to the amanda backup list.

yum install triumf-amanda

Enable AMANDA backups (CentOS7)

yum install amanda-client
systemctl list-unit-files | grep -i amanda
#systemctl enable amanda
systemctl enable amanda.socket
systemctl enable amanda-udp.socket
systemctl restart amanda.socket
systemctl restart amanda-udp.socket
firewall-cmd --get-services
firewall-cmd --permanent --add-service=amanda-client
firewall-cmd --reload
firewall-cmd --list-all
echo amanda.triumf.ca amanda amdump >> /var/lib/amanda/.amandahosts

On amanda server, add new machine to the disklist, then:

amcheck -c daily titan00

Enable DCACHE

DAQ dcache server is mounted as

/daq/pnfs/triumf.ca/data/

For Centos-7 machines, you need to adjust the firewall rules in order to be able to communicate with the trdata machines; this is only necessary if you are copying data to trdata. The firewall changes are

firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="142.90.100.212/32" port protocol="tcp" port="0-65535" accept'
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="142.90.107.156/32" port protocol="tcp" port="0-65535" accept'
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="142.90.100.219/32" port protocol="tcp" port="0-65535" accept'
firewall-cmd --reload
firewall-cmd --list-all
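
The three rules differ only in the source address, so they can be generated in a loop. In the sketch below, the rule is wrapped in single quotes so that the inner double quotes reach firewall-cmd intact. The function names are made up for this example, and the commands are printed rather than run.

```shell
# Build a rich rule accepting all TCP ports from one source address.
rich_rule() {
    printf 'rule family="ipv4" source address="%s/32" port protocol="tcp" port="0-65535" accept' "$1"
}

# Print the firewall-cmd lines for the three trdata addresses listed above.
trdata_firewall_commands() {
    for addr in 142.90.100.212 142.90.107.156 142.90.100.219; do
        echo "firewall-cmd --permanent --add-rich-rule='$(rich_rule "$addr")'"
    done
    echo "firewall-cmd --reload"
}

# usage: trdata_firewall_commands        (review, then pipe to "sh")
```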

The following instructions are no longer necessary:

# mkdir -p /pnfs
# edit /etc/rc.local, add to the end of file: "mount -o intr,rw,noac,hard,nfsvers=3 trdata00:/pnfs /pnfs &"
# . /etc/rc.local

For more information, see the TrdataDcache dcache page.

Configure CPU speed (CentOS7)

In el7 the CPU frequency selection is confused. On some machines the default governor is "conservative", on other machines it is "powersave".

The current configuration can be seen by: "cpupower frequency-info -p"

The actual cpu frequency can be seen by "cat /proc/cpuinfo | grep -i mhz" and by "cpupower monitor" (run them under "watch -d -n1").

The linux kernel documentation says "powersave" will set CPU frequency to the minimum value, forever. But on some machines (e.g. daq06, daq14) it is easy to see that the CPU frequency actually changes according to the CPU load. This is explained in the documentation for the "intel_pstate" driver.

On machines where CPU frequency seems always stuck at minimum, try this:

set the governor to "performance": cpupower frequency-set -g performance
see if frequency now changes according to load (good) or is stuck at maximum (not so good, but ok)
make it permanent by adding this command to /etc/rc.local - echo cpupower frequency-set -g performance >> /etc/rc.local

Configure Ganglia

SL6 Ganglia instructions (EPEL6 ganglia-3.7.2)

/bin/rm /etc/gmond.conf
yum install "*gmond*"
/bin/rm /etc/ganglia/conf.d/ganglia-triumf-daq.conf
/bin/cp -v /dev/null /etc/ganglia/conf.d/multicpu.conf
/bin/cp -v /dev/null /etc/ganglia/conf.d/netstats.pyconf
/bin/cp -v /dev/null /etc/ganglia/conf.d/diskstat.pyconf
/bin/cp -v /dev/null /etc/ganglia/conf.d/procstat.pyconf
/bin/cp ~/git/scripts/etc/gmond.conf /etc/ganglia/gmond.conf
chkconfig gmond on
service gmond restart

Configure Ganglia (Centos7)

CentOS7 Ganglia instructions (EPEL7 ganglia-3.7.2)

/bin/rm /etc/gmond.conf
yum -y install "ganglia-gmond*"
/bin/cp -v /dev/null /etc/ganglia/conf.d/multicpu.conf   # collects useless data
/bin/cp -v /dev/null /etc/ganglia/conf.d/netstats.pyconf # spews errors into syslog
/bin/cp -v /dev/null /etc/ganglia/conf.d/diskstat.pyconf # collects useless data
/bin/cp -v /dev/null /etc/ganglia/conf.d/procstat.pyconf # do not create /tmp/gmond.conf
/bin/cp ~/git/scripts/etc/gmond.conf /etc/ganglia/gmond.conf
systemctl enable gmond
systemctl restart gmond
systemctl status gmond

Configure TRIUMF DAQ packages

(+CentOS7)

cd /etc/yum.repos.d
wget http://daq.triumf.ca/~daqweb/yum/triumf-daq.repo

Install Konstantin's packages

(+CentOS7)

yum --disablerepo=\* --enablerepo=triumf-daq --skip-broken install diskscrub emailonreboot monitor_nfs "ganglia-*" triumf_nodeinfo

Install memtest and PXE boot

cd /boot
wget http://ladd00.triumf.ca/tftpboot/memtest86+-5.01.bin.gz
wget http://ladd00.triumf.ca/tftpboot/memtest86+-4.20.bin.gz
wget http://ladd00.triumf.ca/tftpboot/memtest86+-4.10
wget http://ladd00.triumf.ca/tftpboot/gpxe-1.0.1+-gpxe.lkrn

emacs -nw /boot/grub/grub.conf
title memtest86+-5.01
      root (hd0,0)
      kernel /boot/memtest86+-5.01.bin.gz
title memtest86+-4.20
      root (hd0,0)
      kernel /boot/memtest86+-4.20.bin.gz
title memtest86+-4.10
      root (hd0,0)
      kernel /boot/memtest86+-4.10
title pxeboot
      root (hd0,0)
      kernel /boot/gpxe-1.0.1+-gpxe.lkrn

Install node monitoring

(+CentOS7)

yum --disablerepo=\* --enablerepo=triumf-daq --skip-broken install triumf_nodeinfo
/usr/sbin/sendnodeinfo.perl --config ladd00.triumf.ca:8600
emacs -nw /etc/nodeinfo
/usr/sbin/sendnodeinfo.perl ladd00.triumf.ca:8600

Install gonodeinfo node monitoring

(+Ubuntu, +CentOS7)

go to https://bitbucket.org/dd1/gonodeinfo and follow the instructions:

yum -y install golang
mkdir ~/git
cd ~/git
git clone https://bitbucket.org/dd1/gonodeinfo.git
cd gonodeinfo
git pull
make
make install # install gonodeinfo agent
cd ~ # this is important

edit /etc/gonodeinfo.conf
change "Description", "Location", "User" and "Administrator" as appropriate (or delete them)
change "Servers" to read: Servers: ladd00.triumf.ca:8601
run gonodeinfo
if error is "connection refused", go to the nodeinfo server to add this client to the access control list:
on the gonodeinfo server: run gonodereceive -a daq13
try gonodeinfo again, there should be no error
on the gonodeinfo server: run gonodereport, look at the web pages, the new machine should be listed now

Install latest system updates

(+CentOS7)

yum update -y

Configure TRIUMF Printers

chkconfig cups off
service cups stop
yum install triumf-printers

Configure TRIUMF Printers (CentOS7)

systemctl stop cups
systemctl disable cups
echo "ServerName printers.triumf.ca" > /etc/cups/client.conf
lpstat -a

Disable syslog spam (CentOS7)

Default el7 config is spamming the syslog with useless messages "systemd: Starting Session", etc. Disable this:

echo auditctl -e 0 >> /etc/rc.local
echo /usr/bin/systemd-analyze set-log-level notice >> /etc/rc.local
/etc/rc.local

Install basic system packages (CentOS7)

(if starting from minimal system, basic system packages required:)

yum install -y which psmisc redhat-lsb-core xorg-x11-xauth xterm emacs-nox rsync tcpdump strace nfs-utils sysstat iftop tcsh

yum install -y gcc gcc-c++ gdb glibc-static libstdc++-static zlib zlib-devel openssl-devel httpd-tools

Install packages needed for QUARTUS, ROOT, EPICS and MIDAS DAQ

(+CentOS7)

yum install --skip-broken giflib.x86_64 sysstat "libusb-devel*" unixODBC-devel postgresql-devel libxml2-devel libXpm-devel libgfortran git compat-readline43 "graphviz*" dcap "tigervnc*" telnet glibc"*" strace "fftw*" libpng "freetype*" xpdf "xemacs*" tkcvs xterm mutt "*-g77*" joe "libXmu*" dcap-devel gsl-devel pcre-devel h5py gd-devel xorg-x11-fonts"*" minicom xfig"*" perl-BSD-Resource "net-snmp-*" readline-static git-all nasm imake tcl-devel gv xorg-x11-twm expat-devel screen compat-readline5 ImageMagick ImageMagick-devel wget alacarte scipy numpy sympy nedit gnuplot php-cli php-domxml-php4-php5 php-gd php-fpdf php-cli kdebase cmake tcpdump sqlite sqlite-devel kdegraphics gdisk lsof gconf-editor iftop tk-devel mcelog kdm blt itcl lz4 bzip2 pbzip2 apr-devel apr-util-devel net-tools golang"*" --exclude golang-cover"*"hg"*" --exclude golang"*"hg"*" --exclude golang-pkg"*" --exclude golang-github"*" --exclude golang"*"git"*" mesa"*" xerces-c"*" diffuse clang i2c-tools texlive-revtex texlive-revtex4 kile kbibtex xrdp glibc.i686 gimp gimp-data-extras perl-GD"*" perl-Math"*" perl-Statistics-Basic cmake3 cmake3-gui extra-cmake-modules python2-pip x2go"*"

(do not install boost on 32-bit machines)

yum install --skip-broken "boost-*"

(packages for 32-bit software compilation on 64-bit machines. this is optional)

yum install --skip-broken giflib.i386 giflib.i686 compat-libf2c-34.i386 compat-libf2c-34.i686 mysql-devel.i686 openssl-devel.i686 unixODBC-devel.i686 libstdc++-devel.i386 libstdc++-devel.i686 "zlib-*.i686" "libXext-*.i686" "libXtst-*.i686" glibc-static.i686 freetype.i686 fontconfig.i686 libpng.i686 libXrender.i686 glibc-devel.i686 libX11-devel.i686 libXpm-devel.i686 libXft-devel.i686 mysql-devel.i686 dcap-devel.i686 gsl-devel.i686 pcre-devel.i686 fontconfig-devel.i686 freetype-devel.i686 libpng-devel.i686 libjpeg-devel.i686 libgfortran.i686 libxml2-devel.i686 gd-devel.i686 readline-devel.i686 ncurses-devel.i686 libXdmcp.i686 readline-static.i686 compat-readline5.i686

yum install boost-devel.i686

(separately install these packages - they collide with the big bunch above)

yum install rdesktop

yum reinstall urw-fonts

Install libraries for PHYSICA (CentOS7)

To run physica built on el6 from git sources on el7, do this:

(building physica on el7 is not supported at this time)

(see more http://www.triumf.info/wiki/DAQwiki/index.php/PHYSICA)

yum -y install libX11.i686 gd.i686 libpng12.i686 readline.i686 compat-libf2c-34.i686

Install additional desktop environments (CentOS7)

# LXQT (from EPEL)
yum -y install "lxqt*"
# Cinnamon desktop (from EPEL)
yum -y install cinnamon
# KDE5 not available yet
# MATE (from epel)
yum -y groupinstall "MATE Desktop"
yum -y install mate-common mate-icon-theme-faenza mate-netspeed mate-sensors-applet mate-themes-extras mate-utils
yum -y erase ModemManager abrt abrt-libs abrt-gui-libs
# XFCE4 (from EPEL)
yum -y groupinstall xfce
yum -y install "xfce*plugin" xfce4-about --exclude xfce4-hamster-plugin
yum -y erase bash-completion

make the MATE desktop as default

cd ~root/git/scripts/
git pull
/bin/cp -v etc/lightdm_default_mate.conf /etc/lightdm/lightdm.conf.d/

lightdm login manager (from EPEL)

yum install lightdm lightdm-kde lightdm-qt lightdm-qt5

and switch from gdm to lightdm

systemctl disable gdm.service
systemctl enable lightdm.service
(systemctl stop gdm; systemctl restart lightdm) &

Make installation smaller (optional)

This is optional. Only do this if reducing the size of the OS image is very important.

yum erase "texlive*" "java*" "boost*"
yum erase "xemacs*"
yum erase "libstdc++-docs"

Install SMART scripts

(+CentOS7)

ln -sf ~/git/scripts/smart-status/smart-status.perl ~/

Install NTFS drivers

yum install ntfs-3g ntfsprogs ### (from EPEL)

Install HFS and HFS+ drivers (CentOS7)

yum --disablerepo=\* --enablerepo=elrepo install kmod-hfs kmod-hfsplus

Install Google Chrome web browser (64-bit SL6)

Google-chrome 27 is too old to use with recent MIDAS but it has working Flash:

rpm -vh --install https://daqshare.triumf.ca/~olchansk/google-chrome/google-chrome-stable-27.0.1453.110-202711.x86_64.rpm
/bin/rm /etc/cron.daily/google-chrome
yum-config-manager --disable google-chrome
yum-config-manager --disable google-chrome-64
google-chrome

Chromium 38 works with current MIDAS. No Flash, no PDF viewer:

yum install -y policycoreutils-python
rpm -vh --install https://daqshare.triumf.ca/~olchansk/google-chrome/chromium-browser-38.0.2125.111-1.el6.centos.x86_64.rpm
chromium-browser

Install Google Chrome web browser (64-bit CentOS7)

/bin/cp ~/git/scripts/etc/google-chrome-64.repo /etc/yum.repos.d/
yum install google-chrome-stable

Enable monitoring of HTTPS certificates

On SL6, CentOS7:

yum install crypto-utils
/etc/cron.daily/certwatch
strace -f /etc/cron.daily/certwatch  |& grep open  | grep crt

Enable 100dpi fonts for EPICS

(+CentOS7)

ln -s /usr/share/X11/fonts/100dpi /etc/X11/fontpath.d/

Enable crontab @reboot for MIDAS (CentOS7)

el7 has a bug - cron @reboot entries for normal users can run before autofs is ready, so if the home directory is on autofs/NFS, it cannot be accessed and the cron job fails. If MIDAS is supposed to be started by cron @reboot, it will not start (there *will* be an error message in /var/log/cron).

mkdir /etc/systemd/system/crond.service.d
echo -e "[Unit]\nAfter=ypbind.service autofs.service\n" > /etc/systemd/system/crond.service.d/local.conf
systemctl daemon-reload
systemctl cat crond.service

Explore the systemd dependency tree using "systemctl list-dependencies", maybe with "--all".

Visualize the exact boot sequence from previous boot: "systemd-analyze plot > xxx.svg", look at the svg file using a web browser.
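
The drop-in file above can also be written by a small helper, which makes the step repeatable. This is a minimal sketch: the function name write_crond_dropin and its optional directory argument are illustrative and are not part of the original scripts.

```shell
# Hypothetical helper: write the crond ordering drop-in into a directory.
# The directory argument exists only so the function can be tried outside /etc.
write_crond_dropin() {
    local dir="${1:-/etc/systemd/system/crond.service.d}"
    mkdir -p "$dir" || return 1
    # printf avoids the differences between shells in "echo -e" handling
    printf '[Unit]\nAfter=ypbind.service autofs.service\n' > "$dir/local.conf"
}

# Usage on a real machine (as root):
#   write_crond_dropin && systemctl daemon-reload && systemctl cat crond.service
```

mkdir -p does not fail if the directory already exists, so the helper can be run again after an OS update.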

Enable firewall for MIDAS (CentOS7)

Default el7 configuration prevents all access to servers running on the local machine, including access to MIDAS mhttpd (tcp port 8443) and mserver (all tcp ports).

To enable access to mhttpd:

firewall-cmd --add-port=8443/tcp --permanent
firewall-cmd --reload
firewall-cmd --list-all

To enable access to the mserver from a specific host: (replace 142.90.111.175 with the IP address of the permitted host)

firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="142.90.111.175/32" port protocol="tcp" port="0-65535" accept'
firewall-cmd --reload
firewall-cmd --list-all

Enable firewall for EPICS (CentOS7)

To enable access to TRIUMF EPICS servers, do this:

firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="142.90.132.0/23" accept'
firewall-cmd --reload
firewall-cmd --list-all

For UCN, the controls people seem to have EPICS set up on a different server; this might be true for CMMS as well. In this case the firewall rule should be:

firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="142.90.139.0/23" accept'
firewall-cmd --reload
firewall-cmd --list-all
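
The rich rules for MIDAS and EPICS differ only in the source address and the optional port range, so the rule text can be generated. This is a sketch only: the function name make_rich_rule is hypothetical, and the addresses in the usage comments are the ones from the text above.

```shell
# Hypothetical helper: build a firewalld rich rule for one source network.
# With a second argument the rule is also restricted to that tcp port range.
make_rich_rule() {
    local source="$1" ports="$2"
    if [ -n "$ports" ]; then
        printf 'rule family="ipv4" source address="%s" port protocol="tcp" port="%s" accept\n' "$source" "$ports"
    else
        printf 'rule family="ipv4" source address="%s" accept\n' "$source"
    fi
}

# Usage (as root):
#   firewall-cmd --permanent --add-rich-rule="$(make_rich_rule 142.90.111.175/32 0-65535)"
#   firewall-cmd --permanent --add-rich-rule="$(make_rich_rule 142.90.132.0/23)"
#   firewall-cmd --reload
```

Passing the rule through "$(...)" keeps it as a single argument, which avoids the nested-quote problem when typing the rule by hand.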

Disable gdm and X11 (OPTIONAL)

initctl stop prefdm
echo "start on never" > /etc/init/prefdm.override
echo "start on never" > /etc/init/splash-manager.override
initctl reload-configuration

then enable login on default console:

echo "plymouth quit" >> /etc/rc.local
echo "X_TTY=xxx/dev/tty1" >> /etc/sysconfig/init

Install JAVAWS (OPTIONAL)

to run Java "web start" jnlp files (EVO, SEEVOGH, etc): javaws Downloads/spider.jnlp
install javaws:
yum install icedtea-web icedtea-web-javadoc

Install firefox java plugin (OPTIONAL, DO NOT DO THIS)

This installs the Oracle Java plugin:

rpm -vh --install ~deap/jdk-7u15-linux-x64.rpm
ls -l /usr/lib64/mozilla/plugins/
ln -s /usr/java/jdk1.7.0_15/jre/lib/amd64/libnpjp2.so /usr/lib64/mozilla/plugins/
start firefox, go edit->preferences->general->manage add-ons->plugins
"java plugin 1.7.0_15" should be listed

Configure USB device permissions

(+CentOS7)

Configure USB device permissions for user access to USB-serial devices, Altera USB Blaster, etc.

create file /etc/udev/rules.d/99-usb-chmod.rules with these contents:

emacs -nw /etc/udev/rules.d/99-usb-chmod.rules
ACTION=="add", SUBSYSTEM=="usb_device", RUN+="/bin/chmod a+wr /dev/%c"
ACTION=="add", SUBSYSTEM=="usb_device", RUN+="/bin/chmod a+wr /proc/%c"
ACTION=="add", ENV{DEVTYPE}=="usb_device", RUN+="/bin/chmod a+wr $env{DEVNAME}"
ACTION=="add", ENV{DEVTYPE}=="usb_device", RUN+="/bin/chmod a+wr $env{DEVICE}"
ACTION=="add", ENV{PHYSDEVBUS}=="usb-serial", RUN+="/bin/chmod a+wr $env{DEVNAME}"
ACTION=="add", ENV{DEVPATH}=="/class/tty/ttyS*", RUN+="/bin/chmod a+wr $env{DEVNAME}"
ACTION=="add", SUBSYSTEM=="tty", DEVPATH=="*ttyUSB*", RUN+="/bin/chmod a+rw $env{DEVNAME}"
ACTION=="add", SUBSYSTEM=="tty", DEVPATH=="*ttyS*", RUN+="/bin/chmod a+rw $env{DEVNAME}"

apply new permissions: udevadm trigger --action=add
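
To confirm that the rules took effect, one can list the device nodes that are still not readable and writable by everybody. This is a minimal sketch, assuming GNU stat; the function name list_not_world_rw is illustrative.

```shell
# Hypothetical helper: print every given path that is NOT readable and
# writable by "other" users, i.e. where the udev chmod did not take effect.
list_not_world_rw() {
    local path mode
    for path in "$@"; do
        [ -e "$path" ] || continue
        mode=$(stat -c '%a' "$path") || continue   # GNU stat, octal mode
        # the last octal digit holds the permissions for "other";
        # 6 (rw-) and 7 (rwx) both include read and write
        case "$mode" in
            *6|*7) ;;
            *) printf '%s\n' "$path" ;;
        esac
    done
}

# Usage: list_not_world_rw /dev/ttyUSB* /dev/ttyS*
```

No output means that all listed devices are accessible to normal users.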

Disable modem-manager

The modem-manager will try to talk to any serial devices attached to USB serial ports. It assumes that those devices are modems and will send out modem-specific commands. If the devices are not modems and do not understand or do not like modem commands, well, that's too bad. modem-manager is installed by the ModemManager package required by the NetworkManager package, and there is no configuration setting to turn modem-manager off.

One way to disable it is: chmod a= /usr/sbin/modem-manager

Another way to disable it is by forced uninstall: rpm --erase --nodeps ModemManager

Remember to kill the running copy: killall -KILL modem-manager

Caveat: it is not clear whether modem-manager will be resurrected by an update to the NetworkManager or ModemManager packages.

Configure Altera jtagd

(if needed)

mkdir /etc/jtagd
echo 'Password = "123";' > /etc/jtagd/jtagd.conf
cp -pv /triumfcs/trshare/olchansk/altera/11.0/quartus/linux/pgm_parts.txt /etc/jtagd/jtagd.pgm_parts

start local jtagd: /triumfcs/trshare/olchansk/altera/11.0/quartus/bin/jtagd
test local connection: /triumfcs/trshare/olchansk/altera/11.0/quartus/bin/jtagconfig
test remote connection (add this machine to your .jtag.conf, run jtagconfig)

For more information, go to Quartus

Install EOS

Instructions from here: http://eos-docs.web.cern.ch/eos-docs/quickstart/setup_repo.html

rpm -vh --install https://dss-ci-repo.web.cern.ch/dss-ci-repo/eos/citrine/tag/el-7/x86_64/eos-repo-el7-generic-1.noarch.rpm
yum-config-manager --disable eos-citrine # disable auto-update because all packages are not signed
yum-config-manager --disable eos-dep # disable auto-update because all packages are not signed.
yum install eos-client eos-fuse --enablerepo=eos-citrine

Install fix for the el7 systemd dbus boot hang

Around early summer 2018, el7 started showing a boot problem. In a nutshell, there is a problem with the connection between dbus and systemd that prevents polkit, firewalld, etc from starting. The system eventually boots enough that one can ssh into it, but most things do not work. Notably, polkit is not running, firewalld is not running, and ssh login takes about 15-30 seconds.

The solution is to add a special systemd service that checks that dbus started correctly. It runs after dbus is started, but before dbus is used, and it restarts dbus in a loop with a delay until dbus starts correctly. In testing, dbus always starts correctly after the first retry.

cd ~root/git/scripts/etc
git pull
/bin/cp -vf systemd-check-dbus.perl /usr/bin/
/bin/cp -vf systemd-check-dbus.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable systemd-check-dbus
systemctl start systemd-check-dbus
systemctl status systemd-check-dbus

After linux boots, if everything was okay, the script will report this:

[root@iris01 ~]# systemctl status systemd-check-dbus
...
Feb 08 17:15:49 iris01.triumf.ca systemd[1]: Starting Check that systemd is registered with dbus...
Feb 08 17:15:49 iris01.triumf.ca sh[4283]: Starting check for systemd dbus connection
Feb 08 17:15:50 iris01.triumf.ca sh[4283]: List:       string "org.freedesktop.DBus"
Feb 08 17:15:50 iris01.triumf.ca sh[4283]: List:       string "org.freedesktop.systemd1"
Feb 08 17:15:50 iris01.triumf.ca sh[4283]: systemd1 dbus service exists, success!
Feb 08 17:15:50 iris01.triumf.ca sh[4283]: Finished check for systemd dbus connection
Feb 08 17:15:50 iris01.triumf.ca systemd[1]: Started Check that systemd is registered with dbus.

If the boot problem happened, the script will report that it restarted dbus.

Note: the systemd service file adjusts the start order of other services, this adjustment seems to reduce the probability of the problem.
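
The "check, restart, wait, check again" logic described above can be sketched as a generic retry loop. This is NOT the contents of systemd-check-dbus.perl (which is not shown here); the function name retry_until and the commands in the usage comment are placeholders.

```shell
# Generic retry loop (illustration only, not the actual systemd-check-dbus script).
# retry_until MAX DELAY CHECK_CMD FIX_CMD
#   runs CHECK_CMD; while it fails, runs FIX_CMD, sleeps DELAY seconds and
#   checks again; gives up after MAX repair attempts.
retry_until() {
    local max="$1" delay="$2" check="$3" fix="$4" n=0
    until sh -c "$check"; do
        if [ "$n" -ge "$max" ]; then
            echo "giving up after $n retries" >&2
            return 1
        fi
        n=$((n + 1))
        echo "check failed, retry $n" >&2
        sh -c "$fix"
        sleep "$delay"
    done
    return 0
}

# Usage with placeholder commands (the real check is done by the perl script):
#   retry_until 5 2 'systemctl is-active --quiet dbus' 'systemctl restart dbus'
```

The upper limit on retries matters for a boot-time service: without it, a permanently broken dbus would keep the loop running forever.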

Configure GRUB boot loader

edit /boot/grub/grub.conf, remove the "quiet" and "rhgb" options
edit /boot/grub/grub.conf, comment out (with "#") the "splashimage=" line
check that GRUB boot loader is installed on all system disks:
- dd if=/dev/sda bs=1 count=1024 2>&1 | strings | grep GRUB
- dd if=/dev/sdb bs=1 count=1024 2>&1 | strings | grep GRUB
if GRUB is not installed (e.g. on the 2nd disk of machines with a mirrored system disk), install it (but first check that /dev/sdb is the right disk):

# grub
grub> device (hd0) /dev/sdb
grub> root (hd0,0)
grub> setup (hd0)
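
The dd check above can be wrapped into a function that is run for every system disk. This is a minimal sketch, assuming GNU grep (for the -a option); the function name has_grub_signature is illustrative.

```shell
# Hypothetical helper: succeed if the first 1024 bytes of a disk (or image
# file) contain the string "GRUB", same idea as the dd | strings | grep check.
has_grub_signature() {
    # grep -a treats binary input as text, so "strings" is not needed
    dd if="$1" bs=512 count=2 2>/dev/null | grep -a -q GRUB
}

# Usage (as root), report both disks of a mirrored pair:
#   for d in /dev/sda /dev/sdb; do
#       if has_grub_signature "$d"; then echo "$d: GRUB found"; else echo "$d: GRUB missing"; fi
#   done
```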

Configure GRUB boot loader (CentOS7)

edit /etc/default/grub, remove "rhgb" and "quiet" from GRUB_CMDLINE_LINUX
grub2-mkconfig -o /boot/grub2/grub.cfg
grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg
grub2-editenv list # show contents of boot environment file
/bin/rm /boot/grub2/grubenv # remove stale settings, make grub2 boot from first entry in config file
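
The manual edit of /etc/default/grub can be scripted. This is a sketch, assuming GNU sed and the usual one-line GRUB_CMDLINE_LINUX="..." format; the function name strip_rhgb_quiet is hypothetical. Try it on a copy of the file first.

```shell
# Hypothetical helper: remove the words "rhgb" and "quiet" from the
# GRUB_CMDLINE_LINUX line of the given file; other lines are not touched.
strip_rhgb_quiet() {
    sed -i -E '/^GRUB_CMDLINE_LINUX=/ {
        :again
        s/([" ])(rhgb|quiet)([" ])/\1\3/
        t again
        s/  +/ /g
        s/" /"/
        s/ "/"/
    }' "$1"
}

# Usage (as root):
#   cp -pv /etc/default/grub /etc/default/grub.orig
#   strip_rhgb_quiet /etc/default/grub
#   grub2-mkconfig -o /boot/grub2/grub.cfg
```

The words are matched only when delimited by spaces or quotes, so an option that merely contains "quiet" as part of a longer name is left alone.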

Install memtest86+ (CentOS7)

yum -y install memtest86+
/bin/cp -vf /usr/share/memtest86+/20_memtest86+ /etc/grub.d/
/bin/chmod a+x /etc/grub.d/20_memtest86+ 
grub2-mkconfig -o /boot/grub2/grub.cfg

Configure GRUB boot loader (CentOS7)

DO NOT DO ANY OF THIS.

(maybe) grub2-install /dev/sda
check that GRUB boot loader is installed on all system disks:
- dd if=/dev/sda bs=1 count=1024 2>&1 | strings | grep GRUB
- dd if=/dev/sdb bs=1 count=1024 2>&1 | strings | grep GRUB
if GRUB is not installed, (--- unfinished)

Disable ELREPO

sed 's/enabled=.*/enabled=0/' -i /etc/yum.repos.d/elrepo_triumf.repo
sed 's/enabled=.*/enabled=0/' -i /etc/yum.repos.d/elrepo.repo

Special hardware settings

ASUS Crosshair mobo

use BIOS version 1207 or newer
(before CentOS7) sensors need these drivers from ELREPO: yum install --noplugins kmod-it87 kmod-k10temp; sensors-detect; service lm_sensors restart; sensors
CentOS7: installs correct drivers automatically

ASUS Crosshair-II mobo

use BIOS version 2607 or newer
for the onboard IDE to work, add "all-generic-ide" to kernel boot options in grub.conf
sensors need these drivers from ELREPO: yum install --noplugins kmod-it87 kmod-k10temp; sensors-detect; service lm_sensors restart; sensors

ASUS P7P55D EVO mobo

use BIOS version 2004 or newer
SL6 - install special driver for on board PCIe GigE network port and disable on board PCI GigE network port:
- yum --enablerepo elrepo install kmod-r8168 kmod-r8169
- # do not do this: sed 's/^blacklist/#blacklist/' -i /etc/modprobe.d/blacklist-r8169.conf
- reboot
- verify that correct drivers are loaded: ethtool -i eth0; ethtool -i eth1
- note: there will be no eth1 - r8169 driver is disabled.

ASUS P6X58-E-WS mobo

BIOS settings
- F1 or DEL to enter BIOS setup, F8 boot menu
- go to POWER->HW mon, confirm CPU temperature is around 30C (this shows that the heatsink is installed correctly; with a badly installed heatsink the temperature quickly goes up to 50-70C).
- Main menu: Storage config - SATA change IDE->AHCI
- System information: confirm BIOS version 301, CPU type, memory size
- AI Tweak: set DRAM frequency - AUTO->DDR3-1333
- Advanced->Onboard devices: LAN BOOT: enabled
- Power->HW monitor: CPU Q-FAN: enabled
- Boot->Settings: Quick boot: enabled; Full screen logo: disabled; Wait for F1: disabled
- Save and exit

ASUS E35M1-M PRO mobo

http://www.asus.com/Motherboards/E35M1M_PRO/#specifications
use BIOS version 1002 or newer
for CPU temperature: install kmod-k10temp from ELREPO (kmod-k10temp-0.0-4.el6.elrepo.x86_64.rpm)
for Sensors: yum --enablerepo elrepo install kmod-w83627ehf; modprobe w83627ehf; sensors
for Graphics: yum --enablerepo elrepo install kmod-fglrx fglrx-x11-drv
to enable booting from USB3, edit /etc/dracut.conf, change line "add_drivers" to read: add_drivers+="xhci-hcd"
to use multiple monitors, run "aticonfig --initial --heads=2 --adapter=1 --xinerama=on"; to change the screen layout, edit /etc/X11/xorg.conf. Only dual monitors DVI+HDMI seem to work. Triple monitors do not seem to work.

Sensors instructions below are obsolete (use driver from ELREPO)

for Sensors, install driver for NCT6776F chip from https://github.com/groeck/w83627ehf/archives/master (in the Makefile, change the line "KERNEL_BUILD=" to read: "KERNEL_BUILD:=/usr/src/kernels/$(TARGET)"):

cd ~root
wget http://ladd00.triumf.ca/~olchansk/linux/groeck-w83627ehf-dd3e543/w83627ehf.ko
echo "modprobe hwmon; modprobe hwmon-vid; modprobe k10temp; rmmod w83627ehf; insmod /root/w83627ehf.ko" >> /etc/rc.local

ASUS E45M1-M PRO mobo

https://www.asus.com/Motherboards/E45M1M_PRO/#specifications
use BIOS 1202 or newer
follow the E35M1-M PRO instructions above

ASUS P9X79 WS

http://www.asus.com/Motherboard/P9X79_WS/
use BIOS version 3101, 3401, 4701 or newer. If BIOS is 1305 or older, install P9X79-WS-CAP-Converter.ROM (BIOS 2902/3101), then the new BIOS.
(not needed for CentOS7) for CPU temperature, install coretemp
(not needed for CentOS7) for sensors, install driver for NCT6776F chip same as E35M1-M above.
BIOS Settings:
- enter "Advanced mode"
- Ai Tweaker -> Ai Overclock Tuner -> Set to "XMP" - this enables DDR3-1600 RAM speed vs DDR3-1333 by default
- Monitor -> CPU fan speed low limit -> Set to "200 RPM" - we are using high efficiency slow turning CPU coolers and the default 600 RPM is right on the edge of firing false warnings
- Boot -> Full screen logo -> Set to "disabled"
- Wait for F1 -> Set to "disabled"

ASUS P8B-M

use BIOS version 6103 or newer
for CPU temperature, install coretemp
for sensors, install driver for NCT6776F chip same as E35M1-M above.

SUPERMICRO X9SCL

yum install kmod-w83627ehf.x86_64 coretemp
xemacs -nw /etc/rc.local, add:

modprobe coretemp
modprobe w83627ehf

ASUS Z87-WS

cd ~root
wget http://ladd00.triumf.ca/~olchansk/linux/nct6775/nct6775.ko

Place the modprobe and insmod lines in /etc/rc.local to load the drivers at boot time

modprobe hwmon-vid
insmod /root/nct6775.ko

ASUS AM1M-A

use BIOS 602 or later
SL6.5 installer cannot use USB2 ports and the network. Use USB3 ports (blue colour) to boot USB installer (memtest, rescue, etc)
SL6.5 kernels require boot option "iommu=soft" or USB2 and network do not work. (USB3 - blue ports - seems okay)
install ATI/AMD video drivers from ELREPO (see below)
sensors chip is ITE IT8623E, for SL6, use standalone driver from lm_sensors. (2 fans rpm, 2 temperatures):

cd ~root
wget http://ladd00.triumf.ca/~olchansk/linux/it87.ko
echo modprobe hwmon_vid >> /etc/rc.local
echo insmod /root/it87.ko >> /etc/rc.local
. /etc/rc.local

for el7 use it87.ko driver:

cd ~root
wget https://daqshare.triumf.ca/~olchansk/linux/CentOS7/it87.ko
echo modprobe hwmon_vid >> /etc/rc.local
echo insmod /root/it87.ko >> /etc/rc.local
. /etc/rc.local
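
Running the "echo ... >> /etc/rc.local" commands a second time adds duplicate lines. A possible way to avoid this is an append-only-if-missing helper; the function name append_once is illustrative and is not part of the original instructions.

```shell
# Hypothetical helper: append a line to a file unless the exact line is
# already present (the file is created if it does not exist).
append_once() {
    local file="$1" line="$2"
    # -x: whole line must match, -F: fixed string, -q: quiet
    grep -qxF -e "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Usage (as root):
#   append_once /etc/rc.local "modprobe hwmon_vid"
#   append_once /etc/rc.local "insmod /root/it87.ko"
```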

sensors output:

[root@midemma02 ~]# sensors
radeon-pci-0008
Adapter: PCI adapter
temp1:        +22.0°C  (crit = +120.0°C, hyst = +90.0°C)

fam15h_power-pci-00c4
Adapter: PCI adapter
power1:           N/A  (crit =  25.00 W)

k10temp-pci-00c3
Adapter: PCI adapter
temp1:        +22.2°C  (high = +70.0°C)
                       (crit = +70.0°C, hyst = +69.0°C)

it8603-isa-0290
Adapter: ISA adapter
in0:          +0.96 V  (min =  +2.50 V, max =  +2.95 V)  ALARM
in1:          +2.23 V  (min =  +0.94 V, max =  +1.22 V)  ALARM
in2:          +2.03 V  (min =  +0.74 V, max =  +0.77 V)  ALARM
in3:          +2.00 V  (min =  +1.26 V, max =  +0.13 V)  ALARM
in4:          +2.23 V  (min =  +2.95 V, max =  +2.15 V)  ALARM
3VSB:         +3.36 V  (min =  +6.00 V, max =  +2.50 V)  ALARM
Vbat:         +3.22 V  
+3.3V:        +3.36 V  
fan1:         611 RPM  (min =  200 RPM)
fan2:         707 RPM  (min =  600 RPM)  ALARM
temp1:        +38.0°C  (low  = +122.0°C, high = +122.0°C)  sensor = thermistor
temp2:        +22.0°C  (low  = +119.0°C, high = -35.0°C)  ALARM  sensor = thermistor
temp3:       -128.0°C  (low  = +16.0°C, high = +93.0°C)  sensor = thermistor
intrusion0:  ALARM

[root@midemma02 ~]#

AMD "Athlon(tm) 5350 APU" graphics supports 2 monitors maximum (mobo has 3 video outputs, only 2 can be used together)

Intel SE7230NH1

front panel header connector pinout is like this:

PWR LED | 1  2|
        | 3  4|
PWR LED | 5  6|
HDD LED | 7  8|
HDD LED | 9 10|
PWR SW  |11 12| NIC1 LED
PWR SW  |13 14| NIC1 LED
RST SW  |15 16|
RST SW  |17 18|
        |19 20|
NMI SW  |21 22| NIC2 LED
NMI SW  |23 24| NIC2 LED
...     |...  |
        |33 34|

ASUS H110M-A/M.2

use BIOS 2003 or later
sensors chip is ??? for el7, use this driver:

cd ~root
wget https://daqshare.triumf.ca/~olchansk/linux/CentOS7/nct6775.ko
echo modprobe hwmon_vid >> /etc/rc.local
echo modprobe coretemp >> /etc/rc.local
echo insmod /root/nct6775.ko >> /etc/rc.local
. /etc/rc.local

sensors output:

[root@daq03 ~]# sensors
acpitz-virtual-0
Adapter: Virtual device
temp1:        +27.8°C  (crit = +119.0°C)
temp2:        +29.8°C  (crit = +119.0°C)

nct6793-isa-0290
Adapter: ISA adapter
in0:                       +0.34 V  (min =  +0.00 V, max =  +1.74 V)
in1:                       +1.02 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in2:                       +3.39 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in3:                       +3.39 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in4:                       +1.02 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in5:                       +0.15 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in6:                       +0.97 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in7:                       +3.38 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in8:                       +3.12 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in9:                       +1.00 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in10:                      +0.14 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in11:                      +0.12 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in12:                      +0.14 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in13:                      +0.12 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in14:                      +0.13 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
fan1:                     1041 RPM  (min =    0 RPM)
fan2:                     1020 RPM  (min =    0 RPM)
fan5:                        0 RPM  (min =    0 RPM)
fan6:                        0 RPM
SYSTIN:                   +119.0°C  (high = +98.0°C, hyst = +95.0°C)  sensor = thermistor
CPUTIN:                    +26.5°C  (high = +80.0°C, hyst = +75.0°C)  sensor = thermistor
AUXTIN0:                   +27.5°C    sensor = thermistor
AUXTIN1:                  +112.0°C    sensor = thermistor
AUXTIN2:                  +111.0°C    sensor = thermistor
AUXTIN3:                  +111.0°C    sensor = thermistor
PECI Agent 0:              +28.0°C  (high = +98.0°C, hyst = +95.0°C)
                                    (crit = +100.0°C)
PECI Agent 0 Calibration:  +25.5°C  
PCH_CHIP_CPU_MAX_TEMP:      +0.0°C  
PCH_CHIP_TEMP:              +0.0°C  
intrusion0:               ALARM
intrusion1:               ALARM
beep_enable:              disabled

coretemp-isa-0000
Adapter: ISA adapter
Physical id 0:  +31.0°C  (high = +80.0°C, crit = +100.0°C)
Core 0:         +31.0°C  (high = +80.0°C, crit = +100.0°C)
Core 1:         +28.0°C  (high = +80.0°C, crit = +100.0°C)

[root@daq03 ~]#
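
Long sensors listings like the one above are easier to check with a filter that prints only the temperatures at or above a limit (this picks out readings such as SYSTIN +119.0°C). This is a sketch that assumes the "label: +NN.N°C" layout shown above; the function name hot_sensors is hypothetical.

```shell
# Hypothetical filter: read "sensors" output on stdin and print the label and
# value of every temperature reading that is >= the given limit (in degrees C).
hot_sensors() {
    awk -v limit="$1" -F: '
        /°C/ && NF >= 2 {
            # the first signed number after the label is the current reading;
            # the values in parentheses (high, crit, hyst) come after it
            if (match($2, /[+-][0-9]+(\.[0-9]+)?/)) {
                value = substr($2, RSTART, RLENGTH) + 0
                if (value >= limit) printf "%s %.1f\n", $1, value
            }
        }'
}

# Usage: sensors | hot_sensors 80
```

Continuation lines that carry only limits, such as "(crit = +100.0°C)", have no label before a colon and are skipped.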

Supermicro X11SSH-F

blacklist the mei and mei_me drivers per http://www.supermicro.com/support/faqs/faq.cfm?faq=14537

[root@alpha00 ~]# more /etc/modprobe.d/blacklist.conf
blacklist mei
blacklist mei_me
[root@alpha00 ~]#
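
To check whether a module is already blacklisted before editing the file, something like the following can be used. This is a minimal sketch; the function name is_blacklisted and the optional directory argument are illustrative.

```shell
# Hypothetical helper: succeed if "blacklist <module>" appears as a whole
# line in any *.conf file of the modprobe configuration directory.
is_blacklisted() {
    local module="$1" dir="${2:-/etc/modprobe.d}"
    # -s suppresses errors about missing files; the whole line must match,
    # so "mei" does not match the "blacklist mei_me" line
    grep -Eqs "^[[:space:]]*blacklist[[:space:]]+${module}[[:space:]]*\$" "$dir"/*.conf
}

# Usage:
#   is_blacklisted mei    || echo "blacklist mei"    >> /etc/modprobe.d/blacklist.conf
#   is_blacklisted mei_me || echo "blacklist mei_me" >> /etc/modprobe.d/blacklist.conf
```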

mobo requires M.2 PCIe SSD (M.2 SATA SSD would not work. SATA SATA SSD ok)
boot from M.2 PCIe SSD requires UEFI boot (from an MSDOS partition on the SSD)

Configure X11 graphics

Special settings for DAQ

add the following at the end of /etc/X11/xorg.conf. This enables Ctrl-Alt-KP-/ and Ctrl-Alt-KP-* to unlock the keyboard after Altera Quartus crash:

Section "ServerFlags"
        Option "AllowDeactivateGrabs" "true"
        Option "AllowClosedownGrabs" "true"
EndSection

Install NVIDIA drivers

yum --enablerepo=elrepo install nvidia-detect
run: nvidia-detect
as instructed by nvidia-detect, install correct driver:
- yum --enablerepo=elrepo install kmod-nvidia
- yum --enablerepo=elrepo install kmod-nvidia-304xx
- yum --enablerepo=elrepo install kmod-nvidia-173xx
(before SL6.x: if it fails due to conflict with module-init-tools, run "yum --disablerepo \* --enablerepo elrepo update module-init-tools")
yum erase xorg-x11-glamor ### see http://elrepo.org/tiki/kmod-nvidia (search for glamor)
mv /etc/X11/xorg.conf /etc/X11/xorg.conf-xxx
nvidia-xconfig
(SL6) reboot
(SL5) /dev/MAKEDEV nvidia
(SL5) restart the X11 server (Ctrl-Alt-Backspace or "killall Xorg gdm-binary")
observe that X11 server restarts using the NVIDIA driver (big NVIDIA logo on startup)
if needed, login as root and run "nvidia-settings" to setup dual-screen configuration, etc

Install legacy NVIDIA drivers

For old NVIDIA cards:

GeForce FX 5500

wget http://us.download.nvidia.com/XFree86/Linux-x86/173.14.31/NVIDIA-Linux-x86-173.14.31-pkg1.run
sh ./NVIDIA-Linux-x86-173.14.31-pkg1.run

GeForce 6200 - NVIDIA Corporation NV44A [GeForce 6200]

yum install nvidia-x11-drv-304xx-304.121 --enablerepo=elrepo
nvidia-xconfig
rmmod nvidia
killall gdm-binary
login as root
nvidia-settings to setup multiple displays

Install ATI/AMD drivers

yum --enablerepo elrepo install kmod-fglrx fglrx-x11-drv
check that /etc/X11/xorg.conf section "Device" entry "Driver" says "fglrx"
run "aticonfig --initial" to create xorg.conf if existing one is not good
run "amdcccle" as root to configure dual-screens, etc

 Note: 'amdcccle' is a GUI, so you must run this command from within a running X session

killall Xorg

Install ATI/AMD drivers (CentOS7)

wget http://elrepo.org/linux/testing/el7/x86_64/RPMS/fglrx-x11-drv-15.12-3.el7.elrepo.x86_64.rpm
wget http://elrepo.org/linux/testing/el7/x86_64/RPMS/kmod-fglrx-15.12-3.el7.elrepo.x86_64.rpm
yum install acpid
rpm -vh --install kmod-fglrx-15.12-3.el7.elrepo.x86_64.rpm fglrx-x11-drv-15.12-3.el7.elrepo.x86_64.rpm
amdconfig -f --initial
grub2-mkconfig -o /boot/grub2/grub.cfg
reboot
login as root
amdcccle

NOTE: if both drivers (radeon and fglrx) are loaded, boot will hang. The radeon driver is supposed to be blacklisted by the grub "rdblacklist=radeon" entry, which is installed by running grub2-mkconfig.

Install Intel drivers for HD4600/Z87

SL6.5 has the required drivers for the socket 1150 machines with Intel HD4600 graphics and Z87 chipset.

ASUS Z87 WS motherboard has these video connections with corresponding Intel video port assignments, as reported by "xrandr":

DisplayPort - DP1/HDMI1
MiniDisplayPort - DP2/HDMI2
HDMI - HDMI3

Due to hardware limitations, 3 HDMI monitors using 2 passive DP-HDMI adapters (and 1 straight HDMI) cannot be used.

To use 3 monitors do this:

1st monitor: DisplayPort - DP-to-HDMI-passive-adapter - HDMI monitor (not tried: DP-to-DP-cable - DisplayPort monitor).
2nd monitor: MiniDisplayPort - MiniDP-to-DP-cable - DisplayPort monitor
3rd monitor: HDMI - HDMI-cable - HDMI monitor

With the monitors I have (Dell 1920x1200 VGA-HDMI-DP), the software thinks that there are 4 monitors: somehow both DP2 and HDMI2 see 1 monitor each, but the hardware cannot drive 4 monitors, so everything goes blank. To fix, disable HDMI2 (xrandr -display :0 --output HDMI2 --off) and enable DP2 (xrandr -display :0 --output DP2 --auto).

How to make this configuration permanent and how to assign monitor locations (left-right, etc), you figure it out.
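
To see which outputs the driver believes are connected (and so to spot the phantom 4th monitor), the xrandr listing can be filtered. This is a sketch that assumes the usual "NAME connected ..." / "NAME disconnected ..." line format of xrandr; the function name connected_outputs is illustrative.

```shell
# Hypothetical filter: read xrandr output on stdin and print the names of
# the outputs that are reported as "connected".
connected_outputs() {
    awk '$2 == "connected" { print $1 }'
}

# Usage:
#   xrandr -display :0 | connected_outputs
# If both DP2 and HDMI2 are listed, apply the fix from the text above:
#   xrandr -display :0 --output HDMI2 --off
#   xrandr -display :0 --output DP2 --auto
```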

Manual selection of monitor, video mode and resolution

Automatic selection of monitor and video mode usually works. When it does not, configure it manually:

physically go to the computer
login as root
run "nvidia-settings" on machines using the NVIDIA driver
run "aticonfig" on machines with the ATI/AMD driver (use "aticonfig --initial" for initial setup, and good luck with anything more complicated)
run "system-config-display".
- In the "hardware" tab, select monitor type: "generic LCD 1280x1024" or "generic LCD 1600x1200".
- In the "settings" tab, select "1280x1024" or "1600x1200" and "Thousands of colors".
- Press "ok", the display settings application should close.
Logout, the new login window should use the new settings.

Disable screen saver

If the machine is booted without any monitor connected, current video cards do not enable any video outputs. If a monitor is connected later, there is no video image and there is no easy way to get a video image.

This can be solved by configuring X11 to always enable some video output. Because the monitor type is not known when X11 starts, one has to select some standard video mode (i.e. VESA 1280x1024) on some video output (VGA, DVI or HDMI).

Only NVIDIA cards with the NVIDIA driver (from EPEL) are supported by these instructions.

create default xorg.conf: nvidia-xconfig
edit /etc/X11/xorg.conf
add monitor section for the fake monitor:

Section "Monitor"
    Identifier     "Monitor0"
    VendorName     "Unknown"
    ModelName      "Unknown"
    HorizSync       31.0 - 83.0
    VertRefresh     59.0 - 61.0
    Option         "DPMS" "off"
    ModeLine "1280x1024"   108.00   1280 1328 1440 1688   1024 1025 1028 1066 +hsync +vsync
EndSection

add output selection in the "Device" section:

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BoardName      "GeForce 210"
    #Option "ConnectedMonitor" "DFP"
    #Option "ConnectedMonitor" "CRT"
    Option "ConnectedMonitor" "CRT-1"
    Option "UseEDID" "no"
EndSection

add fake video mode to the "Screen" section:

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    DefaultDepth    24
    SubSection     "Display"
        Depth       24
        Modes       "1280x1024"
    EndSubSection
EndSection

disable screen saver and DPMS power off in the "ServerLayout" or "ServerFlags" section:

Section "ServerLayout"
    Identifier     "Layout0"
    Screen      0  "Screen0" 0 0
    InputDevice    "Keyboard0" "CoreKeyboard"
    InputDevice    "Mouse0" "CorePointer"
    Option         "Xinerama" "0"
    Option         "BlankTime" "0"
    Option         "StandbyTime" "0"
    Option         "SuspendTime" "0"
    Option         "OffTime" "0"
EndSection

Section "ServerFlags" 
    Option         "BlankTime" "0" 
    Option         "StandbyTime" "0" 
    Option         "SuspendTime" "0" 
    Option         "OffTime" "0" 
EndSection

Finish installation

logout and reboot the computer for all the changes to take effect

Configure HTTPS server (CentOS7)

This will configure the HTTPS/SSL certificate using "certbot" and "letsencrypt" and configure an HTTPS web server using apache httpd.

First, configure apache httpd:

yum install mod_ssl certwatch crypto-utils
cd /etc/httpd/conf.d/
mv ssl.conf ssl.conf-not-used ### remove the stock ssl.conf which refers to the localhost certificate that will expire in 1 year
touch ssl.conf ### create a blank file to prevent automatic updates from installing a stock ssl.conf file
rm /etc/pki/tls/certs/localhost.crt
create new file ssl-daq12.conf # use actual hostname instead of daq12

Listen 443 https
#SSLPassPhraseDialog exec:/usr/libexec/httpd-ssl-pass-dialog
SSLSessionCache         shmcb:/run/httpd/sslcache(512000)
SSLSessionCacheTimeout  300
SSLRandomSeed startup file:/dev/urandom  256
SSLRandomSeed connect builtin
SSLCryptoDevice builtin

<VirtualHost *:443>
ServerName daq12.triumf.ca
DocumentRoot /var/www/html
ErrorLog /var/log/httpd/daq12.log
SSLEngine on
# note SSLProtocol, SSLCipherSuite and some other settings are overwritten by /etc/letsencrypt/options-ssl-apache.conf
SSLProtocol all -SSLv2 -SSLv3
SSLCipherSuite HIGH:MEDIUM:!aNULL:!MD5:!SEED:!IDEA:!RC4
SSLCertificateFile /etc/pki/tls/certs/localhost.crt
SSLCertificateKeyFile /etc/pki/tls/private/localhost.key
#SSLCertificateChainFile /etc/pki/tls/certs/server-chain.crt
#ProxyPass /elog/ http://localhost:8082/ retry=1
#ProxyPass /      http://localhost:8080/ retry=1
Header always set Strict-Transport-Security "max-age=31536000; includeSubDomains"
<Location />
SSLRequireSSL
AuthType Basic
AuthName "DAQ password protected site"
Require valid-user
# create password file: touch /etc/httpd/htpasswd
# to add new user or change password: htpasswd /etc/httpd/htpasswd username
AuthUserFile /etc/httpd/htpasswd
</Location>
</VirtualHost>

stop httpd from listening on port 80: edit /etc/httpd/conf/httpd.conf, comment-out the line "Listen 80"
systemctl enable httpd
systemctl restart httpd
systemctl status httpd
try to access https://daq12.triumf.ca
- you should see a complaint about self-signed certificate
- you should see a request for password (do not login yet)
- if you get "connection refused", HTTPS port 443 may need to be enabled in the local firewall, then try again:

firewall-cmd --add-port=443/tcp --permanent
firewall-cmd --reload
firewall-cmd --list-all

Second, configure certbot:

(Note: as of 2018-01-18 certbot requires use of http port 80 to get the initial https certificate, renewal can continue to use the https port 443)

(Note: as of 2019-01-?? certbot requires use of port 80 for renewals)

check that port 80 is not used by anything:
netstat -an | grep LISTEN | grep ^tcp | grep 80
lsof -P | grep -i tcp | grep LISTEN | grep 80
if lsof reports that httpd is listening on port 80, follow the httpd instructions above (remove "Listen 80" from httpd.conf)
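
Note that "grep 80" also matches ports such as 8080 or 8000. A stricter check can look at the port number in the local address column. This is a sketch, assuming the column layout of "netstat -an" on Linux (proto, recv-q, send-q, local address, foreign address, state); the function name listening_on_port is hypothetical.

```shell
# Hypothetical filter: read "netstat -an" output on stdin, print the tcp
# LISTEN lines whose local address ends in ":<port>", succeed if any is found.
listening_on_port() {
    awk -v port="$1" '
        $1 ~ /^tcp/ && $NF == "LISTEN" && $4 ~ (":" port "$") { print; found = 1 }
        END { exit !found }
    '
}

# Usage:
#   netstat -an | listening_on_port 80 && echo "port 80 is in use"
```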

yum install certbot python2-certbot-apache # (from EPEL)
firewall-cmd --add-port=80/tcp --permanent
firewall-cmd --reload
firewall-cmd --list-all
certbot certonly --standalone --installer apache # then answer questions:
"activate HTTPS for daq12.triumf.ca" - say ok
"enter email address" - enter your own email address
"please read terms..." - read the terms and say "agree"
it will take a few moments...
"please choose..." - say "easy" (http access is disabled (a) by firewall, (b) by local configuration
"congratulations..." - say ok.
certbot install --apache --cert-name daq12.triumf.ca # then answer questions:
"choose redirect..." - say "1" (no redirect)
look inside ssl-daq12.conf to see that SSLCertificateFile & co point to certbot certificates in /etc/letsencrypt/live/daq12.triumf.ca/
enable automatic renewal

systemctl enable certbot-renew.timer
systemctl start certbot-renew.timer
systemctl list-timers --all

to check correct renewal and to update the certbot config file in /etc/letsencrypt/renewal, run this:
certbot renew --standalone --installer apache --force-renewal

NOTE: this certificate will expire in 3 months, automatic renewal should work starting with certbot-0.12.0-4.el7.noarch. Certificate expiration should be automatically detected by "certwatch" and email will be sent to local root user, to be forwarded to an actual person by ~root/.forward.

Third, activate password protection:

as shown in the config file above, create password file and initial user: (replace "midas" with specific username)

touch /etc/httpd/htpasswd
htpasswd /etc/httpd/htpasswd midas

Final test:

access https://daq12.triumf.ca - https status should be "green"
login with password should work
the apache httpd test page should load
check site security using the SSLlabs https tester. (I get grade "A-"): https://www.ssllabs.com/ssltest/

From here:

enable proxy for MIDAS mhttpd - uncomment redirect in the config file above
enable proxy for ELOG - ditto
setsebool -P httpd_can_network_connect 1
systemctl restart httpd

NOTE: if certbot fails with errors about 'module' object has no attribute 'pyopenssl', try this: pip install requests==2.6.0

Configure large RAID6 arrays

connect the disks
check the disks health
- run smart-status.perl
partition the disks
- yum install gdisk
- gdisk /dev/sdX
- delete all partitions: o
- create new partition: n, enter, enter, enter, fd00 (default sizes, partition type fd00)
- write and exit: w
check presence of all partitions:
- /bin/ls -l /dev/sd*1
prepare to use an external bitmap file
- touch /md6bitmap
- edit /etc/fstab, change entry for root filesystem from: "defaults 1 1" to "defaults 0 0"
- edit /boot/grub/grub.conf, change entry "kernel ... ro ..." to "kernel ... rw ..."
create raid array:
- mdadm --create /dev/md6 --level=6 --bitmap=/md6bitmap --raid-devices=10 /dev/sd[b-k]1
- mdadm -Ds >> /etc/mdadm.conf
- cleanup /etc/mdadm.conf
- echo "echo 16384 > /sys/block/md6/md/stripe_cache_size" >> /etc/rc.local
- echo "echo 1 > /sys/block/md6/md/sync_speed_min" >> /etc/rc.local
- source /etc/rc.local
observe raid array rebuild:
- watch -d -n1 "cat /proc/mdstat"
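
Instead of watching /proc/mdstat by eye, the rebuild progress can be extracted with a small filter. This is a sketch only: md_sync_progress is a hypothetical name, and it assumes the usual /proc/mdstat progress line of the form "resync = 0.5% (...)".

```shell
#!/bin/sh
# Sketch: print "<action> <percent>" for every md array that is currently
# syncing. Reads /proc/mdstat text on stdin (hypothetical helper).
md_sync_progress() {
    sed -nE 's/.*(resync|recovery|check|reshape) *= *([0-9.]+)%.*/\1 \2/p'
}

# Usage:
# md_sync_progress < /proc/mdstat
```

No output means no array is rebuilding or being checked at the moment.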

Configure ZFS

Install ZFS

(from here: https://github.com/zfsonlinux/zfs/wiki/RHEL-%26-CentOS)

Follow the instructions for "kABI-tracking kmod" - dkms modules seem to always mess up the system when upgrading to next release of zfs.

#rpm -vh --install http://archive.zfsonlinux.org/epel/zfs-release.el7.noarch.rpm
#yum install http://download.zfsonlinux.org/epel/zfs-release.el7.noarch.rpm
#yum install http://download.zfsonlinux.org/epel/zfs-release.el7_3.noarch.rpm
#yum install http://download.zfsonlinux.org/epel/zfs-release.el7_4.noarch.rpm
#yum install http://download.zfsonlinux.org/epel/zfs-release.el7_5.noarch.rpm
yum install http://download.zfsonlinux.org/epel/zfs-release.el7_6.noarch.rpm
yum-config-manager --disable zfs
yum-config-manager --enable zfs-kmod
yum install zfs
#sed 's/^SELINUX=.*/SELINUX=disabled/' -i /etc/selinux/config
echo USE_DISK_BY_ID=\'yes\' >> /etc/default/zfs
#shutdown -r now # required to load the zfs kernel modules and to disable selinux
modprobe zfs # should work
zpool status # should report no pools available
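
The "echo ... >> /etc/default/zfs" line appears in several recipes on this page, so running more than one of them appends the same setting repeatedly. A possible way to avoid duplicates is sketched below; append_once is a hypothetical helper, not part of zfs.

```shell
#!/bin/sh
# Sketch: append a line to a file only if that exact line is not there yet.
# $1: line to add, $2: file (created if missing)
append_once() {
    line=$1
    file=$2
    # -x: match whole line, -F: fixed string, -q: quiet
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Usage, equivalent to the echo command above:
# append_once "USE_DISK_BY_ID='yes'" /etc/default/zfs
```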

Note: zfs and selinux are not compatible: with selinux enabled, deleting files on zfs does not free space (the files are gone, but the usage reported by "df" does not go down; seen with zfs-0.6.5.7-1.el7.centos.x86_64), see https://github.com/zfsonlinux/zfs/issues/4845

Lock kernel and zfs packages

yum versionlock kernel
yum versionlock zfs

Misc commands

zpool status
zpool get all
zpool iostat 1
zpool iostat -v 1
zpool history
zpool scrub data14
zpool events
arcstat.py 1
cat /proc/spl/kstat/zfs/arcstats
echo 30000000000 > /sys/module/zfs/parameters/zfs_arc_meta_limit
echo 32000000000 > /sys/module/zfs/parameters/zfs_arc_max

zfs get all
zfs set dedup=verify zssd/nfsroot

zpool create data14 raidz2 /dev/sd[b-h]1
zfs create z8tb/data
zfs destroy z8tb/data
zpool add z10tb cache /dev/disk/by-id/ata-ADATA_SP550_2F4320041688
parted /dev/sdx mklabel GPT
blkid
zpool iostat -v -q 1
watch -d -n 1 "cat /proc/spl/kstat/zfs/arcstats | grep l2"
zfs set primarycache=metadata tank/datab
zfs set secondarycache=metadata tank/datab

zfs userspace -p -H zssd/home1
zfs groupspace ...
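
The arcstats file listed above is a long table of counters. A rough ARC hit ratio can be computed from its "hits" and "misses" rows, as in this sketch (arc_hit_percent is a hypothetical name; the counters are cumulative since the zfs module was loaded, so the result is a long-term average, not a current rate).

```shell
#!/bin/sh
# Sketch: ARC hit ratio in percent, from arcstats text on stdin.
# Each data row has the form: <name> <type> <value>
arc_hit_percent() {
    awk '$1 == "hits"   { h = $3 }
         $1 == "misses" { m = $3 }
         END {
             if (h + m > 0) printf "%.1f\n", 100 * h / (h + m)
             else print "n/a"
         }'
}

# Usage:
# arc_hit_percent < /proc/spl/kstat/zfs/arcstats
```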

Create raid1 (mirror) volume

echo USE_DISK_BY_ID=\'yes\' >> /etc/default/zfs
dracut -vf
zpool create zssd mirror /dev/sdaX /dev/sdbX
zpool set cachefile=none zssd
zpool set failmode=continue zssd
zpool status
zpool events
zpool get all
df /zssd
ls -l /zssd

Use whole disk for zfs mirror (RAID1)

echo USE_DISK_BY_ID=\'yes\' >> /etc/default/zfs
[root@daq13 ~]# parted /dev/sdb
(parted) mklabel GPT
(parted) q                                                                
[root@daq13 ~]# parted /dev/sdc
(parted) mklabel GPT                                                      
(parted) q                                                                
[root@daq13 ~]# blkid                                                     
/dev/sda1: UUID="ab920e4b-40ae-4551-aab8-f3e893d38830" TYPE="xfs" 
/dev/sdb: PTTYPE="gpt" 
/dev/sdc: PTTYPE="gpt" 
[root@daq13 ~]# zpool create z10tb mirror /dev/sdb /dev/sdc
[root@daq13 ~]# zpool status
  pool: z10tb
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        z10tb       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0

errors: No known data errors
[root@daq13 ~]# 
[root@daq13 ~]# zfs create z10tb/emma
[root@daq13 ~]# df -kl
Filesystem      1K-blocks     Used  Available Use% Mounted on
pool           9426697856        0 9426697856   0% /pool
pool/daqstore  9426697856        0 9426697856   0% /pool/daqstore
[root@daq13 ~]#

Enable ZFS at boot

systemctl enable zfs-import-cache
systemctl enable zfs-import-scan
systemctl enable zfs-mount
systemctl enable zfs-import.target
systemctl enable zfs.target

Replace failed disk

pull failed disk out
zpool status # identify failed disk zfs label (it should be labeled FAULTED or OFFLINE)
safe to reboot here
install new disk
partition new disk, e.g. "gdisk /dev/sdh": use "o" to create a new partition table, "n" to create a new partition (accept all default answers), "w" to save and exit
safe to reboot here
run tests on new disk (smart, diskscrub), if unhappy go back to "install new disk"
safe to reboot here
identify serial number of new disk, e.g. "smartctl -a /dev/sdh | grep -i serial" yields "Serial Number: WD-WCAVY0893313"
identify linux id of new disk: "ls -l /dev/disk/by-id | grep -i WD-WCAVY0893313" yields "ata-WDC_WD2002FYPS-01U1B0_WD-WCAVY0893313-part1"
zpool replace data11 zfs-label-of-failed-disk ata-WDC_WD2002FYPS-01U1B0_WD-WCAVY0893313-part1
zpool status should look like this:

[root@daq11 ~]# zpool status
  pool: data11
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Apr 29 11:51:03 2016
    24.7G scanned out of 795G at 32.3M/s, 6h46m to go
    3.00G resilvered, 3.11% done
config:

        NAME                                                   STATE     READ WRITE CKSUM
        data11                                                 DEGRADED     0     0     0
          raidz2-0                                             DEGRADED     0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WCAZA3872943-part1     ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WCAZA1973466-part1     ONLINE       0     0     0
            replacing-2                                        DEGRADED     0     0     0
              17494865033746374811                             FAULTED      0     0     0  was /dev/sdi1
              ata-WDC_WD2002FYPS-01U1B0_WD-WCAVY0893313-part1  ONLINE       0     0     0  (resilvering)
            ata-WDC_WD20EARS-00MVWB0_WD-WCAZA1973369-part1     ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WMAZA0858733-part1     ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WMAZA0819555-part1     ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WMAZA0857075-part1     ONLINE       0     0     0
            ata-WDC_WD2002FYPS-01U1B0_WD-WCAVY0347413-part1    ONLINE       0     0     0

errors: No known data errors

wait for raid rebuild ("resilvering") to complete
zpool status should look like this:

[root@daq11 ~]# zpool status
  pool: data11
 state: ONLINE
  scan: resilvered 96.2G in 1h44m with 0 errors on Fri Apr 29 13:35:40 2016
config:

        NAME                                                 STATE     READ WRITE CKSUM
        data11                                               ONLINE       0     0     0
          raidz2-0                                           ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WCAZA3872943-part1   ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WCAZA1973466-part1   ONLINE       0     0     0
            ata-WDC_WD2002FYPS-01U1B0_WD-WCAVY0893313-part1  ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WCAZA1973369-part1   ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WMAZA0858733-part1   ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WMAZA0819555-part1   ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WMAZA0857075-part1   ONLINE       0     0     0
            ata-WDC_WD2002FYPS-01U1B0_WD-WCAVY0347413-part1  ONLINE       0     0     0

errors: No known data errors
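
To follow a resilver from a script rather than by reading the full status output, the pool name, state and progress can be pulled out of "zpool status". This is a sketch under assumptions: zpool_summary is a hypothetical helper, and it relies on the status text looking like the listings above (newer zfs releases may word the progress line slightly differently).

```shell
#!/bin/sh
# Sketch: condense "zpool status" text (stdin) to one line per pool:
#   <pool> <state> [<resilver percent>]
zpool_summary() {
    awk '$1 == "pool:"  { pool = $2 }
         $1 == "state:" { state = $2 }
         /resilvered,.*% done/ {
             # pick the field that ends in "%"
             for (i = 1; i <= NF; i++) if ($i ~ /%$/) pct = $i
         }
         /^errors:/ { print pool, state (pct != "" ? " " pct : ""); pct = "" }'
}

# Usage:
# zpool status data11 | zpool_summary
```

For the two listings above this would print "data11 DEGRADED 3.11%" during the rebuild and "data11 ONLINE" once it is complete.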

Rename zfs pool

zpool export oldname
zpool import oldname z6tb

Misc

ZFS tunable parameters that may speed up resilvering:

https://www.reddit.com/r/zfs/comments/4192js/resilvering_raidz_why_so_incredibly_slow/
echo 0 > /sys/module/zfs/parameters/zfs_resilver_delay
echo 512 > /sys/module/zfs/parameters/zfs_top_maxinflight
echo 5000 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms
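
Not every zfs release has every tunable, and writing to a missing file under /sys/module/zfs/parameters just produces an error. A slightly more defensive way to apply the settings above is sketched here; set_zfs_param and the ZFS_PARAM_DIR override are hypothetical names introduced for illustration (the override also allows a dry run against a scratch directory).

```shell
#!/bin/sh
# Sketch: set a zfs module parameter only if it exists on this system.
# ZFS_PARAM_DIR may be pointed at a scratch directory for a dry run.
ZFS_PARAM_DIR=${ZFS_PARAM_DIR:-/sys/module/zfs/parameters}

set_zfs_param() {
    name=$1
    value=$2
    if [ -w "$ZFS_PARAM_DIR/$name" ]; then
        echo "$value" > "$ZFS_PARAM_DIR/$name" && echo "set $name=$value"
    else
        echo "skip $name (not present or not writable)"
    fi
}

# Usage, same values as above:
# set_zfs_param zfs_resilver_delay 0
# set_zfs_param zfs_top_maxinflight 512
# set_zfs_param zfs_resilver_min_time_ms 5000
```

These settings do not survive a reboot or a module reload.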

Enable periodic scrub:

cd ~/git/scripts
git pull
cd zfs
make install

Working with ZFS snapshots:

zfs list -t snapshot
cd ~/git; git clone https://github.com/zfsonlinux/zfs-auto-snapshot.git; cd zfs-auto-snapshot; make install

If ZFS becomes 100% full, "rm" will stop working. Space can still be freed by truncating a large file with "echo > bigfile"; afterwards "rm" works again.
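
The truncate-then-remove trick can be wrapped as below. This is a minimal sketch with a hypothetical helper name; note that if the file is also referenced by a snapshot, truncating it will not free its blocks.

```shell
#!/bin/sh
# Sketch: free space on a full filesystem by truncating a file in place
# first, then removing it.
free_space_then_remove() {
    : > "$1" && rm -- "$1"    # ":" writes nothing, so the file becomes empty
}

# Usage:
# free_space_then_remove /zssd/bigfile
```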

Performance notes

Go here: disk_benchmarks

Configure UEFI boot

Some motherboards can boot from NVMe (PCIe) SSDs only via UEFI boot. Do this:

partition the NVMe SSD using gdisk (must use a GPT partition table and must have a 512 MiB EFI system partition, formatted as MSDOS/FAT)

[root@alpha00 ~]# gdisk -l /dev/nvme0n1
GPT fdisk (gdisk) version 0.8.6 ...
Found valid GPT with protective MBR; using GPT.
Disk /dev/nvme0n1: 500118192 sectors, 238.5 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): 1A82CC87-2757-44ED-980F-C78E3681D9D3
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 500118158
Partitions will be aligned on 2048-sector boundaries
Total free space is 2014 sectors (1007.0 KiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048         1050623   512.0 MiB   EF00  EFI System
   2         1050624       500118158   238.0 GiB   8300  Linux filesystem
[root@alpha00 ~]#
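
The same layout can probably be produced non-interactively with sgdisk, which ships in the gdisk package. The sketch below only prints the commands so they can be reviewed before running them by hand; print_uefi_partition_cmds is a hypothetical helper, and the commands it prints destroy all data on the target disk.

```shell
#!/bin/sh
# Sketch: print (do NOT run) sgdisk commands for a 512 MiB EFI system
# partition followed by one Linux partition using the rest of the disk.
print_uefi_partition_cmds() {
    disk=$1
    echo "sgdisk --zap-all $disk"
    echo "sgdisk --new=1:0:+512M --typecode=1:EF00 --change-name=1:'EFI System' $disk"
    echo "sgdisk --new=2:0:0 --typecode=2:8300 $disk"
    echo "sgdisk --print $disk"
}

# Usage:
# print_uefi_partition_cmds /dev/nvme0n1
```

The output of the final "sgdisk --print" should be compared against the listing above before creating filesystems.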

create filesystems

mkfs.msdos /dev/nvme0n1p1
mkfs.xfs /dev/nvme0n1p2

prepare EFI partition

mkdir /mnt/efi
mount /dev/nvme0n1p1 /mnt/efi
mkdir -p /mnt/efi/efi/boot
cp /boot/vmlinuz... vmlinuz # copy the desired linux kernel
cp /boot/initramfs... initramfs.img # copy the matching initramfs file
#from /home/olchansk/sysadm/syslinux/syslinux-6.03 copy
cp .../efi64/efi/syslinux.efi .
cp .../efi64/com32/elflink/ldlinux/ldlinux.e64 .
cp syslinux.efi bootx64.efi

create syslinux config file: syslinux.cfg

default linux
label linux
kernel vmlinuz
append ro root=/dev/nvme0n1p2 nomodeset initrd=initramfs.img
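
If several machines need the same setup, the config file can be generated with the root device as a parameter. A minimal sketch follows: make_syslinux_cfg is a hypothetical helper, and the destination path in the usage comment is an assumption (it should be the directory where syslinux.efi was copied).

```shell
#!/bin/sh
# Sketch: write the syslinux.cfg shown above to stdout.
# $1: root device, e.g. /dev/nvme0n1p2
make_syslinux_cfg() {
    rootdev=$1
    cat <<EOF
default linux
label linux
kernel vmlinuz
append ro root=$rootdev nomodeset initrd=initramfs.img
EOF
}

# Usage (destination path is an assumption):
# make_syslinux_cfg /dev/nvme0n1p2 > /mnt/efi/efi/boot/syslinux.cfg
```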

prepare system partition

mkdir /mnt/tmp
mount /dev/nvme0n1p2 /mnt/tmp
rsync -avx / /mnt/tmp
cd /mnt/tmp
#edit etc/fstab
#edit etc/sysconfig/selinux # set selinux to permissive mode because rsync did not copy the selinux labels

unmount and reboot
restore selinux labels after first boot

#login as root
cd /
restorecon -R / # can also add "-v" to see progress, but runs much slower
#edit /etc/sysconfig/selinux # enable selinux
#shutdown -r now # reboot with selinux enabled