# Discussion on Fermilab DUNE DAQ Teststand
###### tags: `DAQ` `DUNE` `artdaq` `sysadmin`
## To-dos:
* 1-2 page document covering (expected to have this before 1/24/2020):
* deliverable milestones for two years;
* procurement, system setup;
* exercise system administration workflows;
* demonstration of running the DAQ software and running various DAQ software tests;
* integration of FELIX board;
* etc.
* upgrading plan: additions, replacements over the next few (5?) years;
* interaction with other US DAQ teststands (NSF-supported teststand at Columbia).
* More accurate cost estimation (e.g. actual quotes from vendors).
## Motivation
* Maintain a "full-service" DAQ teststand at the host lab of the experiment throughout the lifetime of the DUNE experiment;
* In addition, the teststand itself will provide a testing environment for Fermilab folks for:
* practicing system integration before the DUNE FD;
* operating system configuration for DUNE DAQ-specific hardware;
* software development and integration tests;
* FELIX and timing board firmware build and deployment tests;
* timing system integration tests with the Fermilab accelerator signal.
## Near term goal
The near-term goal of this teststand is to study the network setup, storage technology, system integration, and the DAQ's core software in the following aspects:
1. Scalability - we would like to run as many processes on the teststand as in a single FD single-phase module;
2. Network IO - we would like to see whether the networking setup and the data flow software can handle the data rate across the cluster (a rough single-stream check sketch follows this list);
3. Disk IO - we would like to utilize state-of-the-art storage technology in the system.
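As one possible starting point for the Network IO item above, here is a minimal single-stream TCP throughput sketch in Python. It is only a sanity check, not a replacement for dedicated tools such as iperf3 (a single stream will generally not saturate a 40 GbE link), and the port number and transfer sizes below are arbitrary placeholders.

```python
#!/usr/bin/env python3
"""Single-stream TCP throughput check between two teststand nodes.

Run with no arguments on the receiving node, then run with the
receiver's hostname/IP as the only argument on the sending node.
"""
import socket
import sys
import time

PORT = 5201               # placeholder port
CHUNK = 4 * 1024 * 1024   # 4 MiB per send/recv call
TOTAL = 20 * 1024**3      # move ~20 GiB per test

def serve():
    """Accept one connection, drain it, and report the achieved rate."""
    with socket.create_server(("", PORT)) as srv:
        conn, addr = srv.accept()
        with conn:
            received = 0
            start = time.monotonic()
            while True:
                data = conn.recv(CHUNK)
                if not data:
                    break
                received += len(data)
            elapsed = time.monotonic() - start
    print(f"received {received / 1024**3:.1f} GiB from {addr[0]} "
          f"at {received * 8 / elapsed / 1e9:.2f} Gb/s")

def send(host):
    """Push TOTAL bytes of zeros to the receiver and report the rate."""
    payload = b"\0" * CHUNK
    sent = 0
    start = time.monotonic()
    with socket.create_connection((host, PORT)) as conn:
        while sent < TOTAL:
            conn.sendall(payload)
            sent += len(payload)
    elapsed = time.monotonic() - start
    print(f"sent {sent / 1024**3:.1f} GiB at {sent * 8 / elapsed / 1e9:.2f} Gb/s")

if __name__ == "__main__":
    serve() if len(sys.argv) == 1 else send(sys.argv[1])
```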
The initial version of the teststand is a multi-node system (4 nodes), with a network setup close to that for the future DUNE FD DAQ.
The capabilities of this initial version are orthogonal to protoDUNE's DAQ (and protoDUNE II's), where we are limited by **outdated hardware**. The majority of the servers on protoDUNE-SP are **more than 5 years old**, which makes it impossible to evaluate the latest storage technology. The servers are also connected to **lower-performance network switches**. In addition, the cluster is at CERN, where Fermilab does not have full control of the networking setup.
> [Giovanna: we have to find the right timing for purchasing HW features that are interesting.]
> [Pengfei: I agree. We expect to have a nice switch which can serve for many years to come, but we will upgrade/add new server nodes into the teststand over the coming years. We will keep track of relevant new technologies the industry provides and find the right timing to put those in the upgrade plan.]
In the initial system setup, we do not include any FELIX cards or timing boards. However, they are crucial parts of the DAQ and we will need to have them for integration tests in the future.
At the moment, there are ongoing efforts to have a miniDAQ system consisting of only one server but with a FELIX card, for studying hit-finding trigger algorithms and for testing APA readout. We, the Fermilab DAQ team members, will help with the system administration and software setup/configuration for that miniDAQ system involving FELIX cards.
## Proposed system configuration
### Minimal Setup (version 0)
This version of the teststand is our starting point. It should provide a basic testing environment for many things we do, and could/should be extended to a more capable system in the future.
* 1 network-controlled Power Distribution Unit (PDU);
* 1 10Gbps network switch with 40/100 Gbps uplinks (minimum of 4 uplinks);
> [GLM: should we have redundant NW topology for testing (maybe using VLANs)? What device do you have in mind?]
> [Pengfei: testing the redundant NW topology, especially its integration with the DAQ, is a very good idea and we have to demonstrate that before data taking at the DUNE FD. However, network switches are expensive. Having a redundant NW topology all the time may be overkill for the teststand. But we can always coordinate with Fermilab Networking and borrow their equipment for the testing period. To start, we may just need one network switch, but have multiple NICs on the servers for a future NW redundancy test.]
>
* 1 web camera;
* 2 datalogger nodes:
* 40 GbE NIC;
> [GLM: I would consider dual 100G]
> [Pengfei: good to have if budget permits (that would require a 100G-switch which can cost significantly more). 40G might be good for most of DAQ SW testing.]
>
* RAID controller on the motherboard and a minimum of four HDDs for a RAID 10 configuration;
* Fast SSDs with size >= 2TB;
> [GLM: how many?]
* 2 Intel Xeon 14-core (or above) CPUs;
> [GLM: should we consider AMDs?]
> [Pengfei: we are open to that. Even if we start with Intel in the beginning, we can put AMD in the upkeep/upgrade plan. We need to ask the vendor for a test system for evaluation first.]
* Memory of 64 GB or above (all channels supported by the memory controller shall be populated; see the DIMM population check sketch after this list).
> [GLM: I would consider 256GB, at least]
> [Pengfei: we may get that now, but if not, it will certainly be in the upkeep/upgrade plan. The constraint we have now is not on the total size but on the number of DIMMs. We need to have enough DIMMs to fill all the channels the motherboard supports to get the maximum memory speed.]
* enough PCIe slots for future hardware:
* FPGA or GPU for the module-level trigger;
* Adding more SSDs or other types of storage;
* 2 boardreader nodes:
* Dual 10GbE LAN and 1GbE (mgmt port for IPMI);
* Memory and CPUs can be the same as datalogger nodes;
* Fast SSDs or Intel Optane Persistent memory with size >= 2TB;
* enough PCIe slots for future hardware:
* 1 FELIX board;
* 1 timing board;
* 1 FPGA or GPU for trigger.
* 1 Rack (maybe from Fermilab surplus?)
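To go with the memory item in the list above, here is a minimal sketch (assuming `dmidecode` is installed and the script is run as root) that counts populated DIMM slots on a node; which slots map to which memory channels still has to be checked against the motherboard manual.

```python
#!/usr/bin/env python3
"""Count populated DIMM slots with dmidecode (run as root).

This only counts slots; compare the result against the channel layout
in the motherboard manual to confirm every channel has a module.
"""
import subprocess

def dimm_population():
    """Return (populated, total) DIMM slot counts parsed from dmidecode."""
    out = subprocess.run(
        ["dmidecode", "-t", "memory"],
        check=True, capture_output=True, text=True,
    ).stdout
    total = populated = 0
    for raw in out.splitlines():
        line = raw.strip()
        # Each "Memory Device" record reports either a module size
        # or "Size: No Module Installed" for an empty slot.
        if line.startswith("Size:"):
            total += 1
            if "No Module Installed" not in line:
                populated += 1
    return populated, total

if __name__ == "__main__":
    filled, slots = dimm_population()
    print(f"{filled} of {slots} DIMM slots populated")
```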
A rough estimate of the total cost is ~$40,000 without any FELIX card or timing board. The cost can be broken down as follows:
* PDU - $900
* Web camera - $50 for a USB webcam or $500 for an IP camera?
* Server nodes - $29,000 (2 @ $8,500 and 2 @ $6,000)
* Network switch - $10,000
One FELIX card will add ~$10,000 to the cost. Timing boards cost much less.
## Nevis DUNE miniDAQ
Talked with Georgia on Tuesday 12/17/2019.
* Initial version with one server + FELIX + GPU/FPGA;
* This system might also serve as an APA test kit in the future;
* Will be used for studying trigger algorithms with APA data from the FELIX card (not the module-level trigger or HLT);
* Trigger calculation is done on the same host as the FELIX card here;
* Server has enough PCIe slots for the FELIX card plus:
* FPGA accelerator card;
* GPU;
* Server spec is available (will be shared around soon);
* Exact spec will be available once Fermilab procurement gets a quote from the vendor.
* Would like help from Fermilab with setting up the OS, software configuration (e.g. setting up the EventBuilder to send data to the GPU), etc.; Georgia has travel money to support this.
==As a complement to the Nevis system, the Fermilab teststand can start with a multi-node system with a 40/100 Gb/s switch, but without FELIX cards. We (Fermilab folks) will help Nevis with the system configuration related to FELIX, and can build up our experience there.==
## Miscellaneous notes
==have upkeep plans for the coming years==
Some notes during discussion:
10 Gb NICs have a standard form factor; 40/100 Gb NICs have a different form factor.
10 Gb switch with 40/100 Gb uplinks (four uplinks)... (deep-buffer switch)
40 Gb NICs for the 2 data loggers.
Memory: follow the motherboard manufacturer's guide; make sure all channels are populated.
CPU: 38 cores for board readers plus 10 for other processes, 48 physical cores in total. (48 + 48 + 32 or 32 + 32 + 32 + 32) (four Silvers, 20 + 20 + 20 + 20) (higher clock speed)
Storage: RAID 10 (4 disks minimum, providing half of the total disk capacity) for software and the backed-up area; RAID 0 for scratch. Speed is a priority, capacity comes later. (A RAID controller with NVMe Gen5 support might be an issue.) RAID may not be necessary with multiple 3 GB/s NVMe drives, and NVMe may not be needed with RAID.
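As an illustration of the storage note above, here is a hedged sketch of how the RAID 10 (backed-up area) plus RAID 0 (scratch) layout could be assembled with Linux software RAID (`mdadm`) if the on-board controller were not used; all device names are placeholders and the commands are destructive.

```python
#!/usr/bin/env python3
"""Illustrative software-RAID layout for the storage notes above:
RAID 10 over four HDDs for the backed-up area, RAID 0 over two
fast drives for scratch.

WARNING: the device names are placeholders and these commands destroy
whatever is currently on those devices.
"""
import subprocess

HDDS = ["/dev/sdb", "/dev/sdc", "/dev/sdd", "/dev/sde"]  # placeholder HDDs
FAST = ["/dev/nvme0n1", "/dev/nvme1n1"]                  # placeholder NVMe/SSDs

def run(cmd):
    """Print and execute a command, stopping on the first failure."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# RAID 10 over four disks: usable capacity is half of the raw total.
run(["mdadm", "--create", "/dev/md0", "--level=10", "--raid-devices=4", *HDDS])

# RAID 0 scratch area: maximum throughput, no redundancy.
run(["mdadm", "--create", "/dev/md1", "--level=0", "--raid-devices=2", *FAST])

# Filesystems (xfs chosen arbitrarily for this sketch).
run(["mkfs.xfs", "/dev/md0"])
run(["mkfs.xfs", "/dev/md1"])
```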
OS:
CentOS 7 -- 2023 (might skip CentOS 8)
CentOS 9 -- 2024
Option: 4U 4-socket servers.