DEAP: Difference between revisions

From DaqWiki
daqwiki>Olchansk
 
(109 intermediate revisions by 3 users not shown)
== Links ==


* https://deap06.triumf.ca/ MIDAS status page
* deap00 links:
* https://deap06.triumf.ca/elog/ ELOG
* https://deapdaqgw.snolab.ca/ - MIDAS status page
* https://deap06.triumf.ca/ganglia/ GANGLIA system monitoring
* https://deapdaqgw.snolab.ca/elog/ - DEAP00 ELOG
* https://deap06.triumf.ca/nodeinfo/config.html computer configuration and status
* https://deapdaqgw.snolab.ca/ganglia/ - GANGLIA system monitoring
* https://deap06.triumf.ca/vme01/ VME crate 1
* https://deapdaqgw.snolab.ca/nodeinfo/config.html - computer configuration and status
* https://deap06.triumf.ca/vme02/ VME crate 2
* http://dug1.snolab.ca/ - data server
* https://deap06.triumf.ca/vme03/ VME crate 3
* https://deapdaqgw.snolab.ca/check/ - deapgw environment monitoring MIDAS status page
* https://deap06.triumf.ca:8443/ UPS
* (obsolete) https://deapdaqgw.snolab.ca/quotareport/quota.html - report deapdaqgw disk quota
* https://deap06.triumf.ca:8444/ Power Distribution Unit
* https://deapdaqgw.snolab.ca/zfsquotareport/zfsquota.html - report deapdaqgw disk use
* https://deap06.triumf.ca:8445/ KVM (deapkvm8)
* https://deapdaqgw.snolab.ca/pingmon/deapgw.html - deapgw network monitor
* https://deapkvm8 (192.168.1.17) (deapkvm8) ATEN IP KVM (only works from deap06 gateway machine) (deapuser, deapuser)
* https://deapdaqgw.snolab.ca/pingmon_deap00/deap00.html - deap00 network monitor
 
* TWiki links:
* https://www.snolab.ca/deap/private/TWiki/bin/view SNOLAB Wiki
* https://www.snolab.ca/deap/private/TWiki/bin/view/Main/DeapDAQ - main documentation page
 
* Direct links to DAQ hardware:
* https://deapdaqgw.snolab.ca/vme01/ VME crate 1 (all 3 VME crates: does not work from google-chrome, firefox is ok, safari is ok. after 10 sec reloads to the midas status page)
* https://deapdaqgw.snolab.ca/vme02/ VME crate 2
* https://deapdaqgw.snolab.ca/vme03/ VME crate 3
* https://deapdaqgw.snolab.ca:8443/ UPS (deapups, guest/guest)
* https://deapdaqgw.snolab.ca:8444/index.html Sentry Power Distribution Unit (deapcdu, deap/deap)
* https://deapdaqgw.snolab.ca:8445/ KVM (deapkvm8) (this proxy does not work)
* https://deapkvm8 (192.168.1.17) (deapkvm8) ATEN IP KVM (only works from firefox on deapdaqgw gateway machine - follow VNC instructions below for offsite access) (deapuser, deapuser)
* https://deapdaqgw.snolab.ca/deapcam01/ (deap, deapCAM3600, select "mobile mode sign-in")
* https://deapdaqgw.snolab.ca/deapcam02/
* https://deapdaqgw.snolab.ca/deapcam03/
 
* deapcam02 ssh tunnel: from local computer start tunnel: "ssh -v deap@deapdaqgw.snolab.ca -L localhost:8080:deapcam02:80" (you can omit "-v"), then on a local web browser, open "http://localhost:8080" ("8080" is the number from the ssh command, you can use some other number if port 8080 is already in use).
 
Fingerprints for the deapdaqgw.snolab.ca SSL certificate:
 
* SHA1: D3 06 C6 80 40 0D 0D 9F 83 E5 AD CD EB F2 BE 7A F2 E4 4A 67
* MD5: 7D 9E AD 4F 1B 03 D9 63 10 E5 3F 1E ED 40 A1 EF
 
Old fingerprints:
* SHA-256: 2A ED 88 3E 38 27 25 D0 1E 4D 48 A1 78 FC E0 0B 6E 58 00 FA A6 53 B1 FE 50 3D 91 CE C9 AB 08 19
* SHA-1: 1A FE 53 23 0A 22 F3 35 C8 20 CD 0E 0E F0 3F A7 72 B7 2F 29


== DAQ machines ==


* deapdaqgw: gateway machine (DHCP for deap00, UPS, CDU, NAT)
moved here: https://www.snolab.ca/deap/private/TWiki/bin/view/Main/InfoNet#DAQ_Hardware
* deap00: main daq machine (storage, home directories, central services, etc)
 
* deap01..05: A3818 daq machines
== Power up sequence ==
* deap06.triumf.ca: temporary network gateway
 
* deap07: spare A3818 daq machine (old deap00) (used for PCIe ADC DAQ)
* power up and turn on all 3 UPSes
* deap08: spare deap00 machine
* network switch, CDU and gateway machine are connected to non-switchable UPS ports, they should power up and boot
* lxdeap01: VME daq machine
* one should be able to ping the gateway machine
* deapvme01..03: VME crate power supplies
* ssh deapgw@deapdaqgw, ping deapups and deapcdu
* deapups: DAQ UPS unit
* open the UPS and CDU web pages through the HTTPS proxy
* deapcdu: DAQ power distribution unit
* from the CDU web page (outlet control), turn on deap00. If deap00 is off (0 power use) but outlet is "on", use the "reboot" action.
* deapkvm8: 8-port IP KVM
* wait for deap00 to boot (ping deap00)
* mscb520: MSCB-ETH bridge
* mhttpd and elog should start automatically
* open the DEAP MIDAS status page
* now one can ssh deap@deapdaqgw then ssh deap00
* on the MIDAS status page, start the slow controls frontends: UPS, CDU, VME crates and NutUps, clear all alarms. Do not start MPOD and SCB yet.
* all frontends should start "green", except "vme02" (and "mpod"), which should report "communication problem"
* on the CDU web page, turn on all power outlets (use "global control action" - "on" - "apply")
* wait for VME02 (and MPOD) to boot: their frontend status should turn "green"
* wait for deap01..deap05 to boot (no simple indication, but they should ping from deap00)
* from the MIDAS "VME" slow controls pages, turn on all 3 VME crates
* from the MIDAS "programs" page, start mpod and scb frontends. Their status should show "green"
* from the MIDAS "MPOD_HV" page, turn on the MPOD ("main switch ON", if ready to ramp up the voltages, "output ON")
* from the MIDAS "SCB" page, turn on all SCBs ("ON" button)
* wait for lxdeap01 to boot (should be able to ping from deap00)
* from the MIDAS "programs" page, start all daq frontends
* clear all MIDAS alarms
* start a run, wait for a bit, stop the run (to confirm all frontends are happy)
* DAQ is now ready to take data


== deapups connections ==


* C13(f) : Switched  Load 1 - N/C
TBW
* C13(f) : Switched  Load 2 - N/C
* C13(f) : Switched  Load 3 - N/C
* C13(f) : Unswitched Load 4 - Rack Fan Left
* C13(f) : Unswitched Load 4 - Rack Fan Centre
* C13(f) : Unswitched Load 4 - Rack Fan Right
* C13(f) : Unswitched Load 4 -
* C13(f) : Unswitched Load 4 -
* C19(f) : Unswitched Load 4 - CDU


== UPS configuration ==
=== Tripp-lite management software ===


ssh deap00 /var/tripplite/poweralert/console/pal_console.sh
<pre>
ssh root@deap00
/opt/nut/bin/upsdrvctl stop
(unplug USB cable from UPS, wait 10 sec, plug it back in)
service pald restart
/var/tripplite/poweralert/console/pal_console.sh
### restore NUT monitoring
service pald stop ### this takes a few minutes
/opt/nut/bin/upsdrvctl start
</pre>


=== USB connections ===


=== NUT UPS configuration ===
NB - UPS names are tied to the UPS serial numbers via the NUT config file!


* http://www.networkupstools.org/
* restart drivers: /opt/nut/bin/upsdrvctl start
* reload upsd: /opt/nut/sbin/upsd -c reload
* see ups status: /opt/nut/bin/upsc ups1


== deapcdu connections ==


<pre>
moved here: https://www.snolab.ca/deap/private/TWiki/bin/view/Main/HWconnection#CDU_Power_connections
1 : DEAP00
2 : DEAP01
3 : DEAP02
4 : DEAP03
5 : DEAP04
6 : DEAP05
7 : DEAPVME02
8 : DEAP07
------------
9 : DEAP08
10 : SCB1
11 : SCB2
12 : n/c
13 : n/c
14 : DEAPDAQGW (temp)
15 : DEAPMPOD
16 : DEAPKVM8
</pre>


== deapcdu snmp ==
* snmpset -v 2c -M +/home/deap/online/slow/fesnmp -m +Sentry3-MIB -c write deapcdu outletControlAction.1.1.3 i 2 ### turn off outlet 3


== Network configuration (TRIUMF) ==
== VME and MPOD snmp ==


DEAP DAQ machines are on the private network (see below).
* snmpwalk -v 2c -M +/home/deap/online/slow/fewiener -m +WIENER-CRATE-MIB -c guru deapvme01 crate
* snmpset -v 2c -M +/home/deap/online/slow/fewiener -m +WIENER-CRATE-MIB -c guru deapvme01 sysMainSwitch.0 i 1 ### turn crate on
* snmpset -v 2c -M +/home/deap/online/slow/fewiener -m +WIENER-CRATE-MIB -c guru deapvme01 sysMainSwitch.0 i 0 ### turn crate off


Gateway to TRIUMF network is 1U machine deap06.triumf.ca connected to the LADD-NIS cluster (deap account on ladd00).


Gateway services running on the gateway:
* DHCP server for the 192.168.1.x network (/etc/hosts, /etc/dhcp/dhcpd.conf)
* apache SSL/https proxy for MIDAS status page, ELOG, ganglia and nodeinfo (/etc/httpd/conf.d/ssl.conf, /etc/httpd/htpasswd)
* NAT proxy from private network to the TRIUMF network (/etc/rc.local). Makes the internet accessible from deapNN machines.


== Network configuration (DEAP) ==
=== Network numbers ===


Network numbers are assigned by deapdaqgw and deap00 DHCP servers:
moved here: https://www.snolab.ca/deap/private/TWiki/bin/view/Main/InfoNet#Network_Configuration_DEAP
 
=== Network cabling ===


<pre>
<pre>
192.168.1.x (netmask 255.255.255.0): main private network
deapdaqgw:
192.168.2.x: deap00a-deap01a connection
enp4s0 - eth0 - uplink (dhcp)
192.168.3.x: deap00b-deap02b connection
enp5s0 - eth1 - daq private network
192.168.4.x: deap00c-deap03c connection
 
192.168.5.x: deap00d-deap04d connection
deap00 mobo, e1000e driver -
eth0 - daq private network (dhcp)
eth1 - deap05e
deap00 pcie nic, igb driver - (top to bottom)
eth2 - deap01a
eth3 - deap02b
eth4 - deap03c
eth5 - deap04d
deap00 10gige nic, sfc driver -
eth6 - n/c
eth7 - dug1 (10gige)
 
deap01..deap05:
eth0 - daq private network
eth1 - secondary network direct link to deap00
 
lxdeap01:
eth0 or eth1 - daq private network (use either port)
</pre>
</pre>


=== DHCP servers ===
=== DHCP configuration ===
 
Main DHCP server is running on the gateway machine. It provides IP addresses to all devices on the main private network. apt install isc-dhcp-server, systemctl enable isc-dhcp-server
 
Additional DHCP server is running on deap00. It provides IP addresses to the secondary network links to deap00 (deap01a..deap04d)


Main DHCP server is deapdaqgw. ... On deap00 there is a dhcp server running
See following sections for more details.


DEAP network nodes with statically configured IP addresses:
 
* deapmod : Wiener MPOD firmware does not support DHCP
<pre>
deapmod : Wiener MPOD firmware does not support DHCP
</pre>


=== Gateway machine ===
deapdaqgw is the gateway machine that provides internet access to the DEAP DAQ cluster.


* NAT (/etc/rc.local,  
* NAT ("network address translation", see /etc/rc.local)
* IP address assignment via /etc/hosts
* DNS via dnsmasq serving contents of /etc/hosts and bridge to upstream DNS (configured in /etc/resolv.conf by upstream DHCP). apt install dnsmasq, systemctl enable dnsmasq, see Ubuntu instructions.
* DHCP for all machines via /etc/dhcp/dhcpd.conf. Special DHCP settings:
** "option routers" sets the "default route" through the gateway machine itself
** "option domain-name-servers" sets the DNS server in /etc/resolv.conf to dnsmasq on the gateway machine
** "option ntp-servers" specifies the time servers, (but not used by any hosts?)
** "option domain-name" is not specified, leaving the "domain" and "search" entries of /etc/resolv.conf blank (actually the entries are not there)
** unknown clients are assigned IP addresses in the range 192.168.x.200 through .250.
** MSCB nodes are assigned "infinite" leases to avoid a bug in MSCB firmware
** remember to "service dhcpd restart" after editing /etc/dhcp/dhcpd.conf
* HTTPS proxy for midas, elog, and other web-connected devices (see links above). Edit /etc/httpd/conf.d/*.conf
* cron job: zfsquotareport
* cron job: pingmon network monitor
* MIDAS experiment "check" (user deapgw)
 
=== deap00 machine ===
 
deap00 is the main machine for the DEAP DAQ cluster.
 
* DHCP for secondary network links to frontend machines, remember to "service dhcpd restart" after editing /etc/dhcp/dhcpd.conf
* NIS master
* NFS export of home disks, data disks (NFS exports list: edit /etc/netgroup, run "make -C /var/yp")
* httpd for ganglia, nodeinfo, etc
* xinetd, tftpd for booting deap01..05, lxdeap01
 
==== network port assignments ====
 
* eth0: main connection to the local network, IP address is assigned by DHCP from the gateway machine
* eth1: connected to deap05
* eth2..eth5: Intel 4-port card, ports are numbered from the top, connected to deap01..deap04 in order.
* eth6..eth7: 10GigE network card
 
==== disks configuration ====
 
* disk list: 2x120GB SSD, 2x3TB + 2x4TB HDD:
<pre>
[root@deap00 ~]# date
Fri Apr 10 18:00:12 EDT 2015
[root@deap00 ~]# ./smart-status.perl
      Disk                  model              serial    temperature  realloc  pending  uncorr  CRC err    RRER
  /dev/sda  WDC WD40EZRX-00SPEB0      WD-WCC4E0555417              41        0        0        0        0        0
  /dev/sdb  WDC WD40EZRX-00SPEB0      WD-WCC4E0602954              39        0        0        0        0        0
  /dev/sdc    ST3000DM001-9YN166            W1F0SG0W              31        0        0        0        6        -
  /dev/sdd    ST3000DM001-9YN166            W1F0THDE              33        0        0        0        0        -
  /dev/sde KINGSTON SV300S37A120G    50026B7744027BCB              33        0        ?        ?        ?        -
  /dev/sdf KINGSTON SV300S37A120G    50026B774909A85D              30        2        ?        ?        ?        -
[root@deap00 ~]#
</pre>
* spare disk: the 1st 4TB HDD is the hot spare. It is partitioned to be compatible with replacement in case of failure of any other disks. Only /dev/sda1 is used (as a bootable partition).
* filesystems:
** "/" is /dev/md4 RAID1 of sde1, sdf1, sdb1(W), sda1(W) (SSD, SSD, 4TB, 4TB). The "(W)" flag means "write mostly": reads are served from the SSDs, while writes go to both the SSDs and the HDDs.
** "/home" is /dev/md5 RAID5 of sdb3, sdc1, sdd1 (4TB, 3TB, 3TB)
** swap is /dev/md6 RAID5 of sdb2, sdc2, sdd2 (4TB, 3TB, 3TB)
** "/data" is /dev/md7 RAID5 of sdb4, sdc3, sdd3 (4TB, 3TB, 3TB)
** note that the 4TB /dev/sda is partitioned the same as /dev/sdb, but only /dev/sda1 is used; the rest of the disk serves as a "hot spare"
<pre>
[root@deap00 ~]# date
Mon Apr 13 14:49:03 EDT 2015
[root@deap00 ~]# cat /proc/mdstat | grep active | sort
md4 : active raid1 sda1[7](W) sdb1[4](W) sde1[5] sdf1[6]
md5 : active raid5 sdb3[5] sdc1[3] sdd1[0]
md6 : active raid5 sdb2[5] sdc2[3] sdd2[0]
md7 : active raid5 sdb4[5] sdc3[3] sdd3[0]
[root@deap00 ~]# date
Fri Apr 10 18:01:45 EDT 2015
[root@deap00 ~]# df -kl
Filesystem      1K-blocks      Used  Available Use% Mounted on
/dev/md4        115246492  52353428  57033052  48% /
tmpfs            16414940    1894852  14520088  12% /dev/shm
/dev/md5        403039088  240822952  141736292  63% /home
/dev/md7      5236247032 1709964112 3260290168  35% /data
[root@deap00 ~]#
</pre>
 
=== NIS configuration ===


Usernames, passwords and hostnames are distributed using NIS:
* deap00 is the master server
* there are no secondary servers
* hostnames are distributed using NIS (from deap00:/etc/hosts, MUST MATCH deap06:/etc/hosts!)
* to solve chicken-and-egg problem deap00 IP address has to be listed in each machine /etc/hosts (MUST MATCH deap06 and deap00 /etc/hosts!) (SL6.2+ NIS broadcast does not work so deap00 has to be listed in each machine /etc/yp.conf, also NFS filesystems are mounted before NIS is started).
* also NIS has to be listed in front of DNS in the "hosts:" entry of /etc/nsswitch.conf


DNS kludge:
=== Time configuration ===
* normally DNS would be used to distribute IP addresses and hostnames to the DHCP server, to deap00 and to other deap machines. But we do not have a private DNS server and the TRIUMF DNS server has the wrong IP addresses for deap machines (142.90.x.x).
 
* deap06 DHCP is telling all machines to use the TRIUMF DNS server (to resolve internet addresses - google, etc). To avoid confusion between local deap00, etc hostnames and deap00, etc hostnames from TRIUMF, /etc/nsswitch.conf "hosts:" entry has to list "nis" before "dns".
* deapdaqgw time is configured in /etc/ntp.conf (currently triumf, dsurface, ca.pool.ntp.org)
* hopefully the deap00, etc hostnames will be resolved correctly by the SNOlab DNS servers and all this kludging can go away.
* deapdaqgw DHCP configuration file /etc/dhcp/dhcpd.conf line "option ntp-servers" provides time servers to all dhcp clients (currently deapdaqgw, except deap00, deap01..05, lxdeap01: "option ntp-servers 0.0.0.0")
* deap00 and frontend machines:
** /etc/ntp.conf specifies: "server deapdaqgw iburst prefer" options are: iburst=quick sync, prefer=prefer to use this server
** in addition, the local dhcp client writes the dhcp ntp-servers "0.0.0.0" into /etc/ntp.conf, which seems to be benign, but we do not know how to turn this off.
 
deapdaqgw is used as the time master instead of deap00 to make the configuration more symmetric: deap00 and all frontend machines have the same time configuration instead of deap00 being different.
 
=== System monitoring tools ===


System monitoring tools:
* ganglia
* triumf_nodeinfo
* konstantin's ganglia packages (monitor_nfs, ganglia sensors, top, etc) - To install/update: yum --disablerepo="*" --enablerepo=konstantin update
* konstantin's ganglia packages (monitor_nfs, ganglia sensors, top, etc) - To install/update: see TRIUMF SL install instructions.
* diskscrub


== Backups ==
=== VNC access to the KVM ===
 
Generic VNC instructions: [[VNC]]


* backups of Linux images:
DEAP specific VNC instruction:
** backups of linux images are done to deap00:/data/root/backups using cron job on deap00:/etc/cron.d/backup.lxdaq.cron and deap00:~root/backup.lxdaq
* backups of home directories: NONE
* backups of data disks: NONE


== Creating boot disks for deap01..deap05 ==
* start local VNC client in "listen mode" as described at [[VNC]] (usually: vncviewer -listen 5500)
* ssh deap@deapdaqgw
* rm ~/.vnc/passwd
* vncserver -geometry 1600x1200 ### watch the output line: "desktop is deapdaqgw:2" <--- remember the value ":2" printed (it will be different each time)
* vncconfig -display localhost:2 -connect send.triumf.ca:5500  <--- localhost:2 uses the ":2" from the vncserver output, "send.triumf.ca:5500" is the location and port of your VNC client.
* inside VNC:
** open a terminal. We are deap@deapdaqgw
** inside terminal start: firefox
** if firefox complains about running elsewhere, find where it is running and kill it
** open https://deapkvm8
** login as administrator or deapuser (password deapuser)
** go to a console port, click "connect". If default configuration is correct, the java program for the KVM console window will open
** it may prompt for permission to run ATEN Java application
** it may prompt for which application to use to open "java web start" files. Select: Downloads/javaws (symlink to /usr/java/jre1.8.0_31/bin/javaws)
** NB: KVM firmware loads a 32-bit library libikvmlib which requires 32-bit java which requires "yum install gtk2.i686"
** the java application with the KVM console window should open
** if java refuses to start the applet due to "strict permissions", add an exception to this: run jcontrol, in the "security" tab, "edit site list...", add "https://deapkvm8"


=== mirrored 16GB USB Flash disks ===
== Backups ==


go here [[Cloning_raid1_boot_disks]]
Backups of all system disks (SSD and USB flash media) are done to the deap00 data disk. This includes deapdaqgw and deap00 SSDs:


=== V7865 single 8GB/16GB USB Flash disks ===
* deap00 cron job /etc/cron.d/backup.lxdaq.cron
* runs script deap00:~root/backup.lxdaq
* writes backups to deap00 data disk: deap00:/data/root/backups:
<pre>
[root@deap00 ~]# ls -l /data/root/backups/
total 56
-rwxr-xr-x  1 root root 4208 Dec  7  2012 clone.perl
dr-xr-xr-x 31 root root 4096 Jul  3 13:28 deap00  <--- backup of deap00 "/" (raid1 mirrored SSDs)
dr-xr-xr-x 28 root root 4096 Jul  3 13:32 deap01  <--- backups turned off
dr-xr-xr-x 28 root root 4096 Jul  3 13:32 deap02  <--- same
dr-xr-xr-x 28 root root 4096 Apr 19 18:44 deap03 <--- same
dr-xr-xr-x 28 root root 4096 Jul  3 13:32 deap04  <--- same
dr-xr-xr-x 28 root root 4096 Jul  3 13:32 deap05  <--- same
dr-xr-xr-x 28 root root 4096 Mar  2 21:32 deap07  <--- same
dr-xr-xr-x 28 root root 4096 Mar  2 21:32 deap08  <--- same
dr-xr-xr-x 29 root root 4096 Jul  3 11:10 deapdaqgw  <--- backup of deapdaqgw single SSD
dr-xr-xr-x 26 root root 4096 Jul  3 13:38 lxdeap01    <--- backup of lxdeap01 USB flash disk
-rwxr-xr-x  1 root root 7672 Dec  7  2012 uuidfix.perl
[root@deap00 ~]#
</pre>
* clone.perl and uuidfix.perl are the scripts for writing these backups back to bootable media (to 16GB USB flash disk or to 30/60GB SSD)


The V7865 VME processors use single USB flash disks. To create the boot disks,
The deap00 home disk, the deapdaqgw system disk, and the backups of the deap00, deap01 and lxdeap01 system disks are in turn backed up to the TRIUMF ladd00 data disk.
follow instructions for [[#64GB_SSD_boot_disks]], but clone "lxdeap01" instead of "deap01".


=== Single 8/16GB USB and 64GB SSD boot disks ===
* ladd00 cron job /etc/cron.d/backup.os.cron
* runs script /root/backup.os.all
* runs script "/root/backup.os deapdaqgw.snolab.ca" writes to /ladd/data0/backup.os/deapdaqgw.snolab.ca
* runs script /root/backup.deap:
<pre>
cd /ladd/data0/backup.os
rsync -avx --delete-after deapdaqgw.snolab.ca:/data/root/backups/deap00 deap/deap00 >> $lastlog 2>&1
rsync -avx --delete-after deapdaqgw.snolab.ca:/data/root/backups/deap01 deap/deap01 >> $lastlog 2>&1
rsync -avx --delete-after deapdaqgw.snolab.ca:/data/root/backups/lxdeap01 deap/lxdeap01 >> $lastlog 2>&1
rsync -avx --delete-after --max-size=100000000 deapdaqgw.snolab.ca:/home deap/home >> $lastlog 2>&1
</pre>
 
<pre>
[root@ladd00 ~]# ls -l /ladd/data0/backup.os/ /ladd/data0/backup.os/deap
/ladd/data0/backup.os/:
...
drwxr-xr-x  6 root root  4096 Jun 17 15:34 deap
dr-xr-xr-x 29 root root  4096 Jul  3 08:10 deapdaqgw.snolab.ca
-rw-r--r--  1 root root  2414 Jul  8 15:01 deapdaqgw.snolab.ca.last.log
-rw-r--r--  1 root root 113454 Jul  8 15:24 deap.last.log
...
/ladd/data0/backup.os/deap:
total 16
drwxr-xr-x 3 root root 4096 Jun 12 12:53 deap00
drwxr-xr-x 3 root root 4096 Jun 12 12:55 deap01
drwxr-xr-x 3 root root 4096 Jun 17 15:34 home
drwxr-xr-x 3 root root 4096 Jun 12 12:56 lxdeap01
[root@ladd00 ~]#
</pre>


* attach SSD disk to any of the deap01..deap05 machines (SATA+power)
There are no backups of the deap00 data disks.
* login as root to that machine
* "fdisk -l" to identify which /dev/sdX disk it is
* cd /data/root/backups
* ./clone.perl ./deap01 /dev/sdX
* observe that the script completes successfully and prints "Done. You can remove /dev/sdX and try to boot from it."
* disconnect the disk
* connect to new machine, try to boot from it

Latest revision as of 17:17, 14 February 2023

Links

  • deapcam02 ssh tunnel: from local computer start tunnel: "ssh -v deap@deapdaqgw.snolab.ca -L localhost:8080:deapcam02:80" (you can omit "-v"), then on a local web browser, open "http://localhost:8080" ("8080" is the number from the ssh command, you can use some other number if port 8080 is already in use).
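
The tunnel command can be wrapped in a small helper so that the camera name and the local port are easy to change. This is only a sketch: the function name deapcam_tunnel_cmd is hypothetical, and it only prints the ssh command, it does not open the tunnel.

```shell
# Print the ssh command that forwards a local port to the web interface
# (port 80) of a DEAP camera through the gateway machine.
# Hypothetical helper: it prints the command, it does not run it.
deapcam_tunnel_cmd() {
    cam=${1:-deapcam02}      # camera host name as seen from deapdaqgw
    port=${2:-8080}          # local port; pick another if 8080 is in use
    echo "ssh deap@deapdaqgw.snolab.ca -L localhost:${port}:${cam}:80"
}
```

For example, run the output of "deapcam_tunnel_cmd deapcam02 8081" and then open http://localhost:8081 in a local web browser.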

Fingerprints for the deapdaqgw.snolab.ca SSL certificate:

  • SHA1: D3 06 C6 80 40 0D 0D 9F 83 E5 AD CD EB F2 BE 7A F2 E4 4A 67
  • MD5: 7D 9E AD 4F 1B 03 D9 63 10 E5 3F 1E ED 40 A1 EF

Old fingerprints:

  • SHA-256: 2A ED 88 3E 38 27 25 D0 1E 4D 48 A1 78 FC E0 0B 6E 58 00 FA A6 53 B1 FE 50 3D 91 CE C9 AB 08 19
  • SHA-1: 1A FE 53 23 0A 22 F3 35 C8 20 CD 0E 0E F0 3F A7 72 B7 2F 29

DAQ machines

moved here: https://www.snolab.ca/deap/private/TWiki/bin/view/Main/InfoNet#DAQ_Hardware

Power up sequence

  • power up and turn on all 3 UPSes
  • network switch, CDU and gateway machine are connected to non-switchable UPS ports, they should power up and boot
  • one should be able to ping the gateway machine
  • ssh deapgw@deapdaqgw, ping deapups and deapcdu
  • open the UPS and CDU web pages through the HTTPS proxy
  • from the CDU web page (outlet control), turn on deap00. If deap00 is off (0 power use) but outlet is "on", use the "reboot" action.
  • wait for deap00 to boot (ping deap00)
  • mhttpd and elog should start automatically
  • open the DEAP MIDAS status page
  • now one can ssh deap@deapdaqgw then ssh deap00
  • on the MIDAS status page, start the slow controls frontends: UPS, CDU, VME crates and NutUps, clear all alarms. Do not start MPOD and SCB yet.
  • all frontends should start "green", except "vme02" (and "mpod"), which should report "communication problem"
  • on the CDU web page, turn on all power outlets (use "global control action" - "on" - "apply")
  • wait for VME02 (and MPOD) to boot: their frontend status should turn "green"
  • wait for deap01..deap05 to boot (no simple indication, but they should ping from deap00)
  • from the MIDAS "VME" slow controls pages, turn on all 3 VME crates
  • from the MIDAS "programs" page, start mpod and scb frontends. Their status should show "green"
  • from the MIDAS "MPOD_HV" page, turn on the MPOD ("main switch ON", if ready to ramp up the voltages, "output ON")
  • from the MIDAS "SCB" page, turn on all SCBs ("ON" button)
  • wait for lxdeap01 to boot (should be able to ping from deap00)
  • from the MIDAS "programs" page, start all daq frontends
  • clear all MIDAS alarms
  • start a run, wait for a bit, stop the run (to confirm all frontends are happy)
  • DAQ is now ready to take data
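
Several steps in the sequence above amount to "wait until the machine answers ping". A minimal sketch of such a wait loop follows. The function name wait_for_host and the default retry count and pause are assumptions, not part of the DEAP setup, and the ping options assume the Linux iputils ping.

```shell
# Poll a host with ping until it answers or the attempts run out.
# Assumes Linux iputils ping (-c count, -W timeout in seconds).
wait_for_host() {
    host=$1
    tries=${2:-60}           # number of attempts (assumed default)
    pause=${3:-5}            # seconds between attempts (assumed default)
    n=0
    while [ "$n" -lt "$tries" ]; do
        if ping -c 1 -W 2 "$host" >/dev/null 2>&1; then
            echo "$host is up"
            return 0
        fi
        n=$((n + 1))
        sleep "$pause"
    done
    echo "$host did not answer after $tries attempts" >&2
    return 1
}
```

From deap00 one could then run, for example: for h in deap01 deap02 deap03 deap04 deap05; do wait_for_host "$h" || break; done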

deapups connections

TBW

UPS configuration

Tripp-lite management software

ssh root@deap00
/opt/nut/bin/upsdrvctl stop
(unplug USB cable from UPS, wait 10 sec, plug it back in)
service pald restart
/var/tripplite/poweralert/console/pal_console.sh
### restore NUT monitoring
service pald stop ### this takes a few minutes
/opt/nut/bin/upsdrvctl start

USB connections

  • lsusb -v | grep -i product
  • lsusb -v | grep -i serial

NUT UPS configuration

NB - UPS names are tied to the UPS serial numbers via the NUT config file!

[ups1]
        driver = usbhid-ups
        port = auto
        desc = "ups1"
        serial = "2231ELCPS720300082"
[ups2]
        driver = usbhid-ups
        port = auto
        desc = "ups2"
        serial = "2211KW0PS733900093"
[ups3]
        driver = usbhid-ups
        port = auto
        desc = "ups3"
        serial = "2231ELCPS720300090"
  • restart drivers: /opt/nut/bin/upsdrvctl start
  • reload upsd: /opt/nut/sbin/upsd -c reload
  • see ups status: /opt/nut/bin/upsc ups1
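
To check all three UPSes at once, the upsc call can be put in a loop. This is a sketch that assumes the UPS names ups1..ups3 from the config above. The function name ups_status_all and the UPSC override variable are hypothetical.

```shell
# Print the ups.status variable of each UPS defined in the NUT config.
# UPSC can be overridden; the default is the path used on this page.
UPSC=${UPSC:-/opt/nut/bin/upsc}
ups_status_all() {
    for u in ups1 ups2 ups3; do
        # In NUT, "OL" means on line power and "OB" means on battery.
        printf '%s: %s\n' "$u" "$("$UPSC" "$u" ups.status 2>&1)"
    done
}
```

If a UPS was swapped without updating its serial number in the config, the corresponding line is expected to show an error message from upsc instead of a status.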

deapcdu connections

moved here: https://www.snolab.ca/deap/private/TWiki/bin/view/Main/HWconnection#CDU_Power_connections

deapcdu snmp

  • snmpwalk -v 2c -M +/home/deap/online/slow/fesnmp -m +Sentry3-MIB -c public deapcdu sentry3
  • snmpset -v 2c -M +/home/deap/online/slow/fesnmp -m +Sentry3-MIB -c write deapcdu outletControlAction.1.1.1 i 1 ### turn on outlet 1
  • snmpset -v 2c -M +/home/deap/online/slow/fesnmp -m +Sentry3-MIB -c write deapcdu outletControlAction.1.1.3 i 2 ### turn off outlet 3
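
The outlet commands follow one pattern, so they can be generated from an outlet number and an action. This sketch uses only the action codes shown above (1 for on, 2 for off). The function name cdu_outlet_cmd is hypothetical, and it prints the command instead of running it so that it can be checked first.

```shell
# Print the snmpset command that switches one deapcdu outlet on or off.
# Action codes 1 (on) and 2 (off) are the ones used in the examples above.
cdu_outlet_cmd() {
    outlet=$1
    case $2 in
        on)  action=1 ;;
        off) action=2 ;;
        *)   echo "usage: cdu_outlet_cmd OUTLET on|off" >&2; return 1 ;;
    esac
    echo "snmpset -v 2c -M +/home/deap/online/slow/fesnmp -m +Sentry3-MIB -c write deapcdu outletControlAction.1.1.${outlet} i ${action}"
}
```

For example, "cdu_outlet_cmd 3 off" prints the same command as the "turn off outlet 3" line above.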

VME and MPOD snmp

  • snmpwalk -v 2c -M +/home/deap/online/slow/fewiener -m +WIENER-CRATE-MIB -c guru deapvme01 crate
  • snmpset -v 2c -M +/home/deap/online/slow/fewiener -m +WIENER-CRATE-MIB -c guru deapvme01 sysMainSwitch.0 i 1 ### turn crate on
  • snmpset -v 2c -M +/home/deap/online/slow/fewiener -m +WIENER-CRATE-MIB -c guru deapvme01 sysMainSwitch.0 i 0 ### turn crate off
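
To read back the main switch state of all three crates, the same options can be used with snmpget. This is a sketch under assumptions: the crate names deapvme01..deapvme03 follow the naming used on this page, the function name vme_switch_query_cmds is hypothetical, and reading sysMainSwitch.0 with snmpget instead of walking the whole crate tree is a choice made here, not something taken from the DEAP scripts.

```shell
# Print one snmpget command per VME crate to read its main switch state.
# sysMainSwitch.0 is the OID used by the snmpset examples above.
vme_switch_query_cmds() {
    for crate in deapvme01 deapvme02 deapvme03; do
        echo "snmpget -v 2c -M +/home/deap/online/slow/fewiener -m +WIENER-CRATE-MIB -c guru ${crate} sysMainSwitch.0"
    done
}
```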


Network configuration (DEAP)

The DEAP DAQ cluster is configured for standalone running with or without an internet connection.

(NB: Some internet functions are required: access to NTP for time synchronization and access to Linux package repositories to install packages, etc)

Network numbers

moved here: https://www.snolab.ca/deap/private/TWiki/bin/view/Main/InfoNet#Network_Configuration_DEAP

Network cabling

deapdaqgw:
enp4s0 - eth0 - uplink (dhcp)
enp5s0 - eth1 - daq private network

deap00 mobo, e1000e driver -
eth0 - daq private network (dhcp)
eth1 - deap05e
deap00 pcie nic, igb driver - (top to bottom)
eth2 - deap01a
eth3 - deap02b
eth4 - deap03c
eth5 - deap04d
deap00 10gige nic, sfc driver -
eth6 - n/c
eth7 - dug1 (10gige)

deap01..deap05:
eth0 - daq private network
eth1 - secondary network direct link to deap00

lxdeap01:
eth0 or eth1 - daq private network (use either port)

DHCP configuration

Main DHCP server is running on the gateway machine. It provides IP addresses to all devices on the main private network. apt install isc-dhcp-server, systemctl enable isc-dhcp-server

Additional DHCP server is running on deap00. It provides IP addresses to the secondary network links to deap00 (deap01a..deap04d)

See following sections for more details.

DEAP network nodes with statically configured IP addresses:

  • deapmod : Wiener MPOD firmware does not support DHCP

Gateway machine

deapdaqgw is the gateway machine that provides internet access to the DEAP DAQ cluster.

  • NAT ("network address translation", see /etc/rc.local)
  • IP address assignment via /etc/hosts
  • DNS via dnsmasq serving contents of /etc/hosts and bridge to upstream DNS (configured in /etc/resolv.conf by upstream DHCP). apt install dnsmasq, systemctl enable dnsmasq, see Ubuntu instructions.
  • DHCP for all machines via /etc/dhcp/dhcpd.conf. Special DHCP settings:
    • "option routers" sets the "default route" through the gateway machine itself
    • "option domain-name-servers" sets the DNS server in /etc/resolv.conf to dnsmasq on the gateway machine
    • "option ntp-servers" specifies the time servers, (but not used by any hosts?)
    • "option domain-name" is not specified, leaving the "domain" and "search" entries of /etc/resolv.conf blank (actually the entries are not there)
    • unknown clients are assigned IP addresses in the range 192.168.x.200 through .250.
    • MSCB nodes are assigned "infinite" leases to avoid a bug in MSCB firmware
    • remember to "service dhcpd restart" after editing /etc/dhcp/dhcpd.conf
  • HTTPS proxy for midas, elog, and other web-connected devices (see links above). Edit /etc/httpd/conf.d/*.conf
  • cron job: zfsquotareport
  • cron job: pingmon network monitor
  • MIDAS experiment "check" (user deapgw)

deap00 machine

deap00 is the main machine for the DEAP DAQ cluster.

  • DHCP for secondary network links to frontend machines, remember to "service dhcpd restart" after editing /etc/dhcp/dhcpd.conf
  • NIS master
  • NFS export of home disks, data disks (NFS exports list: edit /etc/netgroup, run "make -C /var/yp")
  • httpd for ganglia, nodeinfo, etc
  • xinetd, tftpd for booting deap01..05, lxdeap01

network port assignments

  • eth0: main connection to the local network, IP address is assigned by DHCP from the gateway machine
  • eth1: connected to deap05
  • eth2..eth5: Intel 4-port card, ports are numbered from the top, connected to deap01..deap04 in order.
  • eth6..eth7: 10GigE network card

disks configuration

  • disk list: 2x120GB SSD, 2x3TB + 2x4TB HDD:
[root@deap00 ~]# date
Fri Apr 10 18:00:12 EDT 2015
[root@deap00 ~]# ./smart-status.perl 
      Disk                  model               serial     temperature  realloc  pending   uncorr  CRC err     RRER
  /dev/sda   WDC WD40EZRX-00SPEB0      WD-WCC4E0555417              41        0        0        0        0        0
  /dev/sdb   WDC WD40EZRX-00SPEB0      WD-WCC4E0602954              39        0        0        0        0        0
  /dev/sdc     ST3000DM001-9YN166             W1F0SG0W              31        0        0        0        6        -
  /dev/sdd     ST3000DM001-9YN166             W1F0THDE              33        0        0        0        0        -
  /dev/sde KINGSTON SV300S37A120G     50026B7744027BCB              33        0        ?        ?        ?        -
  /dev/sdf KINGSTON SV300S37A120G     50026B774909A85D              30        2        ?        ?        ?        -
[root@deap00 ~]# 
  • spare disk: the 1st 4TB HDD is the hot spare. It is partitioned to be compatible with replacement in case of failure of any other disks. Only /dev/sda1 is used (as a bootable partition).
  • filesystems:
    • "/" is /dev/md4, RAID1 of sde1, sdf1, sdb1(W), sda1(W) (SSD, SSD, 4TB, 4TB). The "W" flag means "write mostly": md avoids reading from these members, so reads come from the SSDs while writes go to both the SSDs and the HDDs
    • "/home" is /dev/md5, RAID5 of sdb3, sdc1, sdd1 (4TB, 3TB, 3TB)
    • swap is /dev/md6, RAID5 of sdb2, sdc2, sdd2 (4TB, 3TB, 3TB)
    • "/data" is /dev/md7, RAID5 of sdb4, sdc3, sdd3 (4TB, 3TB, 3TB)
    • note that the 4TB /dev/sda is partitioned the same as /dev/sdb, but only /dev/sda1 is used; the rest of the disk serves as a "hot spare"
[root@deap00 ~]# date
Mon Apr 13 14:49:03 EDT 2015
[root@deap00 ~]# cat /proc/mdstat | grep active | sort
md4 : active raid1 sda1[7](W) sdb1[4](W) sde1[5] sdf1[6]
md5 : active raid5 sdb3[5] sdc1[3] sdd1[0]
md6 : active raid5 sdb2[5] sdc2[3] sdd2[0]
md7 : active raid5 sdb4[5] sdc3[3] sdd3[0]
[root@deap00 ~]# date
Fri Apr 10 18:01:45 EDT 2015
[root@deap00 ~]# df -kl
Filesystem      1K-blocks       Used  Available Use% Mounted on
/dev/md4        115246492   52353428   57033052  48% /
tmpfs            16414940    1894852   14520088  12% /dev/shm
/dev/md5        403039088  240822952  141736292  63% /home
/dev/md7       5236247032 1709964112 3260290168  35% /data
[root@deap00 ~]# 
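
With this many md arrays sharing the same disks, a quick check for a degraded array is handy. The following is a minimal sketch under the assumption of the usual /proc/mdstat layout, where each array line is followed by a status line such as "[3/3] [UUU]" and a missing member shows up as "_". The function name degraded_arrays is hypothetical.

```shell
#!/bin/sh
# Sketch only: print the name of every md array that has a missing member.
# A healthy array shows e.g. "[3/3] [UUU]"; a degraded one e.g. "[3/2] [UU_]".
degraded_arrays() {
    awk '
        /^md[0-9]+ :/      { name = $1 }    # remember the current array name
        /\[[U_]*_[U_]*\]/  { print name }   # member status field contains "_"
    ' "${1:-/proc/mdstat}"
}

# Usage: prints nothing if all arrays are complete
#   degraded_arrays
```

An empty output means every array has all of its members; any name printed (for example md5) points at the array to inspect with "mdadm --detail".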

NIS configuration

Usernames, passwords and hostnames are distributed using NIS:

  • domain name: DEAP-NIS
  • deap00 is the master server
  • there are no secondary servers

Time configuration

  • deapdaqgw time is configured in /etc/ntp.conf (currently triumf, dsurface, ca.pool.ntp.org)
  • deapdaqgw DHCP configuration file /etc/dhcp/dhcpd.conf: the line "option ntp-servers" provides time servers to all DHCP clients (currently deapdaqgw; the exceptions are deap00, deap01..05 and lxdeap01, which get "option ntp-servers 0.0.0.0")
  • deap00 and frontend machines:
    • /etc/ntp.conf specifies "server deapdaqgw iburst prefer". The options are: iburst = quick initial sync, prefer = prefer this server over others
    • in addition, the local DHCP client writes the DHCP ntp-servers value "0.0.0.0" into /etc/ntp.conf. This seems to be benign, but we do not know how to turn it off.

deapdaqgw is used as the time master instead of deap00 to make the configuration more symmetric: deap00 and all frontend machines have the same time configuration instead of deap00 being different.
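
To confirm that a machine is really synchronized to deapdaqgw and not to the stray "0.0.0.0" entry, one can look at the peer that ntpd has selected. A minimal sketch, assuming the standard "ntpq -p" table in which the selected system peer is flagged with a leading "*". The function name ntp_sync_peer is hypothetical.

```shell
#!/bin/sh
# Sketch only: read "ntpq -p" output on stdin and print the selected peer.
ntp_sync_peer() {
    # the peer currently used for synchronization is flagged with "*"
    awk '/^\*/ { print substr($1, 2) }'
}

# Usage on deap00 or a frontend machine:
#   ntpq -p | ntp_sync_peer
```

On deap00 and the frontend machines the expected answer is deapdaqgw. No output means ntpd has not selected any peer yet.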

System monitoring tools

  • ganglia
  • triumf_nodeinfo
  • konstantin's ganglia packages (monitor_nfs, ganglia sensors, top, etc) - To install/update: see TRIUMF SL install instructions.

VNC access to the KVM

Generic VNC instructions: VNC

DEAP-specific VNC instructions:

  • start local VNC client in "listen mode" as described at VNC (usually: vncviewer -listen 5500)
  • ssh deap@deapdaqgw
  • rm ~/.vnc/passwd
  • vncserver -geometry 1600x1200 ### watch the output line: "desktop is deapdaqgw:2" <--- remember the value ":2" printed (it will be different each time)
  • vncconfig -display localhost:2 -connect send.triumf.ca:5500 <--- localhost:2 uses the ":2" from the vncserver output, "send.triumf.ca:5500" is the location and port of your VNC client.
  • inside VNC:
    • open a terminal. We are deap@deapdaqgw
    • inside terminal start: firefox
    • if firefox complains about running elsewhere, find where it is running and kill it
    • open https://deapkvm8
    • login as administrator or deapuser (password deapuser)
    • go to a console port, click "connect". If default configuration is correct, the java program for the KVM console window will open
    • it may prompt for permission to run ATEN Java application
    • it may prompt for which application to use to open "java web start" files. Select: Downloads/javaws (symlink to /usr/java/jre1.8.0_31/bin/javaws)
    • NB: KVM firmware loads a 32-bit library libikvmlib which requires 32-bit java which requires "yum install gtk2.i686"
    • the java application with the KVM console window should open
    • if java refuses to start the applet due to "strict permissions", add an exception to this: run jcontrol, in the "security" tab, "edit site list...", add "https://deapkvm8"
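
The display number has to be copied by hand from the vncserver output into the vncconfig command. A minimal sketch of automating that step, assuming the startup message contains "desktop is deapdaqgw:N" as quoted above. The function name vnc_display is hypothetical, and the pipeline in the usage comment assumes vncserver does not need to prompt for a new password. If it does prompt, run vncserver by hand as described and feed the printed line to the function.

```shell
#!/bin/sh
# Sketch only: extract the ":N" display suffix from the vncserver startup message.
vnc_display() {
    sed -n 's/.*desktop is [^:]*\(:[0-9][0-9]*\).*/\1/p'
}

# Usage on deapdaqgw (host:port is wherever your listening VNC client runs):
#   disp=$(vncserver -geometry 1600x1200 2>&1 | vnc_display)
#   vncconfig -display "localhost$disp" -connect send.triumf.ca:5500
```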

Backups

Backups of all system disks (SSD and USB flash media) are done to the deap00 data disk. This includes deapdaqgw and deap00 SSDs:

  • deap00 cron job /etc/cron.d/backup.lxdaq.cron
  • runs script deap00:~root/backup.lxdaq
  • writes backups to deap00 data disk: deap00:/data/root/backups:
[root@deap00 ~]# ls -l /data/root/backups/
total 56
-rwxr-xr-x  1 root root 4208 Dec  7  2012 clone.perl
dr-xr-xr-x 31 root root 4096 Jul  3 13:28 deap00     <--- backup of deap00 "/" (raid1 mirrored SSDs)
dr-xr-xr-x 28 root root 4096 Jul  3 13:32 deap01     <--- backups turned off
dr-xr-xr-x 28 root root 4096 Jul  3 13:32 deap02     <--- same
dr-xr-xr-x 28 root root 4096 Apr 19 18:44 deap03     <--- same
dr-xr-xr-x 28 root root 4096 Jul  3 13:32 deap04     <--- same
dr-xr-xr-x 28 root root 4096 Jul  3 13:32 deap05     <--- same
dr-xr-xr-x 28 root root 4096 Mar  2 21:32 deap07     <--- same
dr-xr-xr-x 28 root root 4096 Mar  2 21:32 deap08     <--- same
dr-xr-xr-x 29 root root 4096 Jul  3 11:10 deapdaqgw  <--- backup of deapdaqgw single SSD
dr-xr-xr-x 26 root root 4096 Jul  3 13:38 lxdeap01   <--- backup of lxdeap01 USB flash disk
-rwxr-xr-x  1 root root 7672 Dec  7  2012 uuidfix.perl
[root@deap00 ~]# 
  • clone.perl and uuidfix.perl are the scripts for writing these backups back to bootable media (to 16GB USB flash disk or to 30/60GB SSD)

Backups of the deap00 home disk, of deapdaqgw, and of the deap00, deap01 and lxdeap01 system disk backups are written to the TRIUMF ladd00 data disk.

  • ladd00 cron job /etc/cron.d/backup.os.cron
  • runs script /root/backup.os.all
  • runs script "/root/backup.os deapdaqgw.snolab.ca", which writes to /ladd/data0/backup.os/deapdaqgw.snolab.ca
  • runs script /root/backup.deap:
cd /ladd/data0/backup.os
rsync -avx --delete-after deapdaqgw.snolab.ca:/data/root/backups/deap00 deap/deap00 >> $lastlog 2>&1
rsync -avx --delete-after deapdaqgw.snolab.ca:/data/root/backups/deap01 deap/deap01 >> $lastlog 2>&1
rsync -avx --delete-after deapdaqgw.snolab.ca:/data/root/backups/lxdeap01 deap/lxdeap01 >> $lastlog 2>&1
rsync -avx --delete-after --max-size=100000000 deapdaqgw.snolab.ca:/home deap/home >> $lastlog 2>&1
[root@ladd00 ~]# ls -l /ladd/data0/backup.os/ /ladd/data0/backup.os/deap
/ladd/data0/backup.os/:
...
drwxr-xr-x  6 root root   4096 Jun 17 15:34 deap
dr-xr-xr-x 29 root root   4096 Jul  3 08:10 deapdaqgw.snolab.ca
-rw-r--r--  1 root root   2414 Jul  8 15:01 deapdaqgw.snolab.ca.last.log
-rw-r--r--  1 root root 113454 Jul  8 15:24 deap.last.log
...
/ladd/data0/backup.os/deap:
total 16
drwxr-xr-x 3 root root 4096 Jun 12 12:53 deap00
drwxr-xr-x 3 root root 4096 Jun 12 12:55 deap01
drwxr-xr-x 3 root root 4096 Jun 17 15:34 home
drwxr-xr-x 3 root root 4096 Jun 12 12:56 lxdeap01
[root@ladd00 ~]# 
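
Since all rsync output is appended to the "last.log" files, a failed backup is easy to miss. A minimal sketch of a check, assuming rsync reports failures with lines containing "rsync error:" (its usual final error message). The function name backup_log_errors is hypothetical and is not part of the existing backup scripts.

```shell
#!/bin/sh
# Sketch only: show rsync error lines from a backup log.
# Exit status is 0 if at least one error line was found, non-zero otherwise.
backup_log_errors() {
    grep -n 'rsync error:' "$1"
}

# Usage on ladd00:
#   if backup_log_errors /ladd/data0/backup.os/deap.last.log; then
#       echo "DEAP backup reported errors" >&2
#   fi
```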

There are no backups of the deap00 data disks.