# Galaxy CVMFS brain dump

> Draft agenda/notes/questions for meeting with the Galaxy Australia team on Thu 23 March 2023, 1pm.
> Alex, Steele, Greg

---

## PURPOSE

Deploy Galaxy CernVM-FS (stratum-1) onto AARNet infrastructure.

### OBJECTIVES

1. Expose a new Galaxy CernVM-FS stratum-1 server to the CLI environment (BYOD).
2. Decommission the existing stratum-1 currently on NeCTAR **(TBC)**.
3. Harmonise Galaxy digital infrastructure with the new AARNet physical infrastructure.

---

### ISSUES TO DISCUSS

#### OPTIONS

1. Implement a duplicate Galaxy CernVM-FS stratum-1 server on AARNet infrastructure (objectives #1 & #3 only).
2. Replace the existing stratum-1 server currently on NeCTAR (all objectives above, including #2).
3. Implement a replacement Galaxy CernVM-FS stratum-1 server on AWS infrastructure until on-prem resources are available (objectives #1, #2).
4. Implement a duplicate Galaxy CernVM-FS stratum-1 server on AWS infrastructure until on-prem resources are available (objective #1).

#### ASSUMPTIONS

- Does the new repo replace the existing one on NeCTAR (Option 2), or will both be maintained (Option 1)?
- Define priorities, responsibilities and timeframe.

#### TECHNICAL REQUIREMENTS

- Tech specs (VM resources, storage requirements). Current Galaxy CVMFS storage is 23 TB?
- Need to get hold of the /etc/cvmfs configuration directory from the existing Galaxy stratum-1 VM (should be OK - contains no secrets).
- (If Option 1) How/when to switch Galaxy over to the new stratum-1?
- Whether/how to implement load balancing? Geo-location?
- Acceptance criteria.
- Deployment timing (go-live).
- AWS deployment:
  a. VPC
  b. EC2 instance (type + EBS volumes (OS, spool))
  c. S3
  d. DNS alias from an AARNet hostname to the S3 endpoint. CVMFS clients fetch over plain HTTP (HTTPS would defeat proxy caching), so an alias should be sufficient without certificates (untested).
- On-prem deployment (migration from AWS when resources are available).
- Security groups.
- Access? (If only accessible within the Galaxy environment rather than public to the world, this could impact the S3 deployment.)

> **Note:**
> NeCTAR resources for S1 + webserver (from previous discussion with Simon):
>
> - CentOS 7
> - 16 GB RAM
> - 8 vCPU
> - 30 GB disk
> - 20 TB storage (using 10-12 TB, with regular garbage collection to delete content downstream that may have been deleted upstream)

#### NON-FUNCTIONAL REQUIREMENTS

- Downstream facility requirements (local proxy and clients pulling from the Galaxy stratum-1).
- Must be able to sync updates from the stratum-0 (*hourly, daily, weekly*) (a minimal freshness-check sketch follows this list).
- Must be able to ingest content up to *XX SIZE* (need projections for Galaxy stratum-0 growth).
- Repo discovery layer (BYOD).
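As a starting point for the acceptance criteria and the sync requirement above, a minimal freshness-check sketch: it pulls a repository manifest (`.cvmfspublished`) from a stratum-1 over plain HTTP and reports the revision and publication timestamp, which can then be compared against the stratum-0 (or the existing NeCTAR stratum-1). It also doubles as a quick test that the plain-HTTP DNS alias works. The hostname below is a placeholder and the repository name is only illustrative; the manifest field letters (`S` = revision, `T` = timestamp, `N` = name) follow the CVMFS manifest format.

```python
#!/usr/bin/env python3
"""Rough freshness check for a CVMFS stratum-1 replica.

Fetches the .cvmfspublished manifest over plain HTTP and prints the
repository name, revision and publication timestamp. The hostname and
repository below are placeholders, not agreed endpoints.
"""
import sys
import time
from urllib.request import urlopen

STRATUM1 = "http://cvmfs-stratum1.example.aarnet.edu.au"  # placeholder hostname
REPO = "data.galaxyproject.org"                           # illustrative repo name


def read_manifest(base_url: str, repo: str) -> dict:
    """Return the key/value fields of .cvmfspublished for a repository."""
    url = f"{base_url}/cvmfs/{repo}/.cvmfspublished"
    with urlopen(url, timeout=30) as resp:
        raw = resp.read()
    fields = {}
    for line in raw.split(b"\n"):
        if line == b"--":  # signature section follows; stop parsing
            break
        if line:
            fields[line[:1].decode()] = line[1:].decode(errors="replace")
    return fields


def main() -> int:
    m = read_manifest(STRATUM1, REPO)
    published = int(m.get("T", "0"))
    age_hours = (time.time() - published) / 3600 if published else float("inf")
    print(f"repository : {m.get('N', REPO)}")
    print(f"revision   : {m.get('S', '?')}")
    print(f"published  : {time.ctime(published)} (~{age_hours:.1f} h ago)")
    # Example acceptance threshold: flag the replica if it is more than a day
    # behind; tune this to whatever sync cadence is agreed (hourly/daily/weekly).
    return 1 if age_hours > 24 else 0


if __name__ == "__main__":
    sys.exit(main())
```

Running the same check against the stratum-0 (or the current NeCTAR stratum-1) and diffing the revision values would show whether the new replica is keeping up.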
#### NOT DOING

- Stratum-0 unpacked containers
- ?

### QUESTIONS

List of questions to be addressed as a result of this requirements document:

- What are the strategic objectives for AARNet in replacing the existing Galaxy stratum-1 server on NeCTAR? (i.e. what does it give us and/or the user community?)
  a. Simon's team doesn't have to manage it? Managed service? (Could potentially have been only Simon managing the existing one.)
  b. Stable stratum-1 repos.
- To clarify: is the expectation that we host and manage the CVMFS S1, or does Galaxy expect to manage a host that we provision and maintain?
- Exactly how production-grade ("proddy") does the new stratum-1 server need to be? There are numerous sustainability requirements for official prod status (e.g. funding, security, availability, performance, support, RTO, RPO, etc.).
- Do we want to:
  - register the new stratum-1 server with Galaxy? (mandatory if Option 2)
  - provide redundant infrastructure for high availability, or just fail over to existing stratum-1 servers?
  - enable the CVMFS-standard GeoAPI to redirect clients to geographically closer stratum-1 servers? (mandatory if Option 2?)
  - ~~investigate the potential for publishing unpacked Singularity container images from a separate stratum-0 server to reduce storage volumes and network traffic? (Galaxy currently only publishes packed images)~~
  - maintain (or replace) the existing PoC BYOD stratum-0 server and publish its repos on the new stratum-1 server?
- Can we bootstrap the new stratum-1 server by initially copying the 23 TB of datastores directly from the filesystem of the existing stratum-1 on NeCTAR, instead of pulling from the Galaxy stratum-0 (or the NeCTAR stratum-1) via HTTP? (A rough transfer-time estimate is sketched at the end of these notes.)
- Will Galaxy continue to maintain their Ansible playbooks for CVMFS?
- Is the current stratum-1 server mirroring **all** of the Galaxy repos, or a restricted selection? (This should be resolved if we can get the /etc/cvmfs directory from the current Galaxy stratum-1 server.)
  a. Need to clarify which repos we need to mirror.
- How are we planning to manage what goes on the S1? Do we let Galaxy make pull requests to the git repo for CVMFS repositories? (i.e. who decides which repos are replicated?)
- Is the current stratum-1 server accessible from outside the Galaxy environment? (e.g. from Pawsey or NCI)
- What (if any) SLAs are there on the current stratum-1 server?
- Check whether Galaxy ran into the spool storage space issue on the initial snapshot, given their disk was only 30 GB.
- Has the required CVMFS S1 storage changed since? (It was previously mentioned as 10-12 TB in use with 20 TB of capacity.)
- Do you have any metrics for CVMFS usage/object GET requests to the current stratum-1 server? For update traffic to the stratum-0 server?
- Who is our primary contact for Galaxy Australia?
- Can we steal any CloudStor resources when it is shut down?

### COMMENTS

- (AI) It will be easier if we do **NOT** regard the initial AWS setup as production (could we call it "UAT"?), and only address the transition to production once it is on AARNet infrastructure.
- (AI) We should probably **NOT** look at decommissioning the existing Galaxy server until the new (prod) one is deployed on AARNet infrastructure. It could get messy managing a transition if we also have to update the Galaxy registration.
- (AI) It will be difficult to hide the fact that the initial deployment is on AWS, so it would probably not be advisable to try. The Galaxy people are well aware of the infrastructure work underway at AARNet, so this should be sufficient justification for using external infrastructure as an interim measure. Nobody should really care as long as the service is reliable, supported, and meets user needs.
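As a footnote to the bootstrap question in the list above (copying ~23 TB of datastore from the existing NeCTAR stratum-1 rather than re-pulling it from the stratum-0): a rough transfer-time estimate. The 23 TB figure comes from these notes; the link speeds and the 70% effective-throughput factor are illustrative assumptions only.

```python
# Back-of-envelope transfer time for bootstrapping the new stratum-1.
# 23 TB comes from the notes above; link speeds and the assumed 70%
# effective throughput (protocol overhead, disk I/O, contention) are guesses.

DATASTORE_TB = 23
EFFICIENCY = 0.7  # assumed fraction of nominal bandwidth actually achieved

for gbps in (1, 5, 10):
    effective_gbps = gbps * EFFICIENCY
    seconds = DATASTORE_TB * 8_000 / effective_gbps  # 1 TB ~= 8,000 Gb (decimal)
    print(f"{gbps:>2} Gb/s link: ~{seconds / 3600:.1f} h (~{seconds / 86400:.1f} days)")
```

Whichever way the numbers land, a bulk copy of the existing datastore is likely to be faster and gentler on the Galaxy stratum-0 than a full initial snapshot pulled over HTTP, but that should be confirmed with the Galaxy team (and depends on whether the NeCTAR filesystem can be read directly).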