---
tags: SA
---

# UoD Power outage 2020-10-10

## References

- [image.sc announcement](https://forum.image.sc/t/ome-resources-down-due-to-uod-outage/43957)
- [UoD services spreadsheet](https://docs.google.com/spreadsheets/d/1P6F9s9DS0bp372VQUPxCLDYKdPoREg7_KFdNzEd-C-8/edit#gid=0)

## 2020-10-15

Attending: Seb, Jason, Josh, Simon, J-M, Will, Dom, Petr, June, Frances

### Status

- Code references (Josh/Chris)
    - upgrade checks (known)
    - `omero import` --> download hangs
    - CloudFlare, S3, etc.
    - Where on the priority list?
- Shim aka new-registry (Josh/Chris)
    - Lambdas: GS: Haven't paid a dollar (UoD: dollars a month)
    - Storage: DigitalOcean postgres (multi-master) no effort
        - 120-160 USD / month
        - 120 GB (1 standby, auto-failover)
    - Note: flatted IP table
    - Steps
        - Update https://github.com/ome/qa-shim
        - Create database
        - Spin up via terraform
        - Change DNS
- **Hardware**
    - Simon: was working with JD. All production VMs are back online esp. high priority. Some networking issues.
    - Engineer coming this morning to fix remaining GPFS boxes. No redundancy
        - Should have no downtime
- www.openmicroscopy.org
    - deployed on GitHub pages
    - several components missing (QA, forums, site redirects, Schemas). Causes PRs to fail validation
    - Options: switch back to ome-www or keep working with GH pages (need redirect work)
        - Simon: in-progress work on redirects
        - J-M: GH pages seems to be the way
        - Jason: main concern is that spending time now means splitting resources
        - J-M: if 1-2 days away, worth investigating
    - Test website and add the schemas by EOB
    - List all broken
- gate.openmicroscopy.org
    - Working with ports 22 + 443
- downloads.openmicroscopy.org/docs.openmicroscopy.org
    - Seem to be working. Tested via several builds
    - J-M: looking at it.
      Everything is working
    - Petr: presentations **ok**
- artifacts.openmicroscopy.org
    - currently redirected to GS artifactory
    - some issues with artifacts
    - question of whether we switch back to our own artifactory
    - J-M: working on Maven Central/Sonatype
        - Focusing on Bio-Formats first
    - Seb: wait until we need to push artifacts to switch?
    - How many days of investigation before deciding to switch to the old workflow?
        - J-M: need to sort out various issues (shared used, GPG)
        - David: 1 week necessary to test various things
- learning.openmicroscopy.org
    - Jason to email Paul and ask to test. Minimal login works
    - Josh: only potential issue would be corruption when PSQL would shut down
        - No reason to think otherwise?
        - J-M: did some testing on outreach/workshop
        - Josh: assuming initial outage only looked like power outage
        - Simon: a priori yes
- outreach.openmicroscopy.org/workshop.openmicroscopy.org
    - Petr: fully working
- demo.openmicroscopy.org/pub-omero.openmicroscopy.org
    - Petr: fully tested and working (**yes** per spreadsheet)
    - Seb: looking at pub-omero today
- nightshade.openmicroscopy.org
    - Simon: starting the omero-server should be fine
    - Josh: assuming no storage bump. Just access problem?
    - Jason: nightshade mailing list?
        - Simon: wait for GPFS to be back
        - Josh: we're still in maintenance window
        - Petr: to start drafting
- idr-redmine.openmicroscopy.org
    - Frances: confirmed it's working. Updating issues
- merge-ci.openmicroscopy.org/latest-ci.openmicroscopy.org
    - Seb: could not ssh into the boxes
    - Simon to check the DRAC
- ci.openmicroscopy.org
    - running
    - J-M: test these jobs if we are not switching the release workflow
- image.sc
    - Seb to draft

## 2020-10-14

Attending: Will, Seb, Petr, Dom, Josh, Simon, Frances, J-M, Jason, David

### Status

- Hardware
    - Simon: 2 arrays down. One out-of-support, one under warranty. Out-of-support one was fixed with no charge.
      Dell engineer or UoD IT person needs to go to Data Centre (need to ask who)
    - Getting VMware back online requires UoD IT technician
    - Potentially getting VMware back except nightshade, devspaces
    - Impact on current services of priority 4-5
        - gate back
        - downloads/docs still unavailable
        - other web-prod would work
        - website would come back
        - nightshade requires GPFS fix
        - demo/learning could come back
        - ome-lochy (Redmine, monitoring requires GPFS for persistent storage but could potentially bring back Redmine without attachments)
- Artifactory ("running")
    - Josh: DNS updated from UoD to GS.
        - Mirrors scijava.
        - Note that actions can time out while mirroring is taking place.
        - There may be JARs not there. Caching fun
        - Some changes on GS side at code level to be reviewed (bioformats2raw)
    - Are we rolling back or switching to GS artifactory as the long-term plan?
        - Jason: much harder discussion. Problems we are having also affect NPSC, proteomics, MRC...
        - Seb: artifactory is one of the first to come up
        - Simon: need resiliency
        - Seb: have 3 of them effectively (Scijava + OME + GS) containing a portion of the artifacts
        - Simon: encode them in the builds to start relying on them
        - JRS: need to make this decision about what is the ideal.
        - JMB: pushing to Maven Central will also help (may be easier to push these days). More resiliency.
        - JMB: can't yet build, i.e. not "fixed"
- Investigations
    - See artifactory discussion above
    - Website (Simon):
        - website is in 2 parts. Static content straightforward to move to GH pages
        - get our static content on GH pages?
        - Long-term could look at CDN
        - JRS: minimal viable? Seb: static pages and then need things to add. Most critical is **/Schemas**
        - Simon: basically same as https://snoopycrimecop.github.io/www.openmicroscopy.org/
        - JRS: use one as the backup? Simon: could keep a list of DNS changes if there's another outage.
    - Docs (J-M): looked at various hosting options (e.g. GitHub pages).
      javadoc should be automatically available via javadoc.io.
        - David available to discuss Bio-Formats deployments?
        - Josh: capture the impossibility to build first. J-M: to start with omero-model
    - Petr: snoopy + www

### Next steps

- Simon: redirect WWW to GitHub? Seb to change the DNS.
    - Simon working on redirects
    - Seb working on the Schemas
- Josh: propose **registry shim** with help from Glencoe
    - Simon: GDPR ok? Will check.
- J-M + David: look into artifact hosting
    - Potentially include Dom for omero-* artifacts
- As soon as VMware is up, test learning + demo + outreaches
    - Simon: expecting no need to rerun the playbooks for Ansible managed systems
- Python: Josh push omero-web to readthedocs
    - Seb: omero-py?
- Misc
    - List of External Resources we're focusing on
        - PyPI
        - ReadTheDocs
        - javadoc.io
        - Maven Central
        - GitHub Pages
        - GitHub Releases
    - Presentations (Petr)
        - Possible to have resiliency?
        - Seb: GitHub pages?
        - Simon: hitting limits?
        - e.g. https://ome.github.io/presentations
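
The per-service testing noted above (e.g. "test learning + demo + outreaches" once VMware is up) could be partly scripted. The following is only a rough sketch, not a tool the team used: the hostnames are a subset copied from the 2020-10-15 status list, while the function names `http_status` and `check_hosts`, the use of HTTPS, and the "status below 400 means up" rule are all assumptions made for illustration. It does not cover non-HTTP checks such as SSH to gate or starting the OMERO server on nightshade.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

# Subset of hostnames from the 2020-10-15 status list.
# Assumption: each one answers on https://<host>/.
HOSTS = [
    "www.openmicroscopy.org",
    "downloads.openmicroscopy.org",
    "docs.openmicroscopy.org",
    "artifacts.openmicroscopy.org",
    "learning.openmicroscopy.org",
    "outreach.openmicroscopy.org",
    "workshop.openmicroscopy.org",
    "demo.openmicroscopy.org",
    "idr-redmine.openmicroscopy.org",
    "merge-ci.openmicroscopy.org",
    "latest-ci.openmicroscopy.org",
    "ci.openmicroscopy.org",
]


def http_status(host, timeout=10.0):
    """Return the HTTP status of a GET to https://<host>/, or 0 if unreachable."""
    request = Request(f"https://{host}/", headers={"User-Agent": "outage-check-sketch"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        return err.code
    except (URLError, OSError):
        # DNS failure, refused connection, timeout, TLS error, ...
        return 0


def check_hosts(hosts, probe=http_status):
    """Map each host to "up" or "down".

    Hypothetical rule: any status from 1 to 399 counts as "up".
    `probe` is injectable so the logic can be exercised without network access.
    """
    results = {}
    for host in hosts:
        status = probe(host)
        results[host] = "up" if 0 < status < 400 else "down"
    return results
```

Calling `check_hosts(HOSTS)` returns a dictionary that could be compared by hand against the UoD services spreadsheet. A passing HTTP check only shows the front end responds, so logins, imports and builds would still need the manual tests described in the notes.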