# IIPC WAC 2023

* IIPC - International Internet Preservation Consortium
* Holds two annual conferences for web archivers, over three days
  * IIPC General Assembly - https://netpreserve.org/ga2023/
  * Web Archiving Conference - https://netpreserve.org/event/wac2023/
* Event program: https://netpreserve.org/ga2023/programme/wac/

Attendees: digital archivists, librarians and researchers from universities, governments and NGOs, often with very large archives or very large targets to archive.

Tiny sampling below

Countries
- Natl Library of Singapore
- US Library of Congress
- Library of Alexandria
- Natl Library of Brazil
- Natl Library of S Korea
- Natl Library of the Netherlands

Unis
- Archivists for Cambridge U's 36 libraries
- Drexel
- Old Dominion U
- UT San Antonio
- NYU
- Harvard
- Stanford

Other orgs
- Internet Archive
- MOMA NY
- Arkiwera (Sweden)
- Kadoc (Belgium)
- New Design Congress
- The Feminist Institute
- Webrecorder
- Shift Collective

## Rough themes

Archive sizes
- a few w/ multiple PB
- a cluster in the 100s of TBs
- most talking <100GB, but often working multiple archives of that size

Org / program structure / funding
- mostly small teams
- archive/preservation initiatives come and go (at every level)
- lots of people do it as an add-on to their regular job
- funding is hard to get, both internal and external

Tools / technology
- Heritrix
- Webrecorder showed up all over
- Python scripts, everywhere
- lots of open source tooling
- a clear set of tools, but lots of time putting them together - years in most cases
- crawls are long, hard to get right, expensive, and have to be repeated even when successful
- no tools for things like "how big is domain x" in any dimension (a rough workaround sketch follows at the end of this section)

Services used
- a bit of proprietary crawling (but not much)
- Amazon S3
- AWS Open Data Program (750tb for the End of Term program)

Storage
- no specific mentions of storage, aside from big == hard (an informal survey here would be interesting)

Content to archive
- social media is on everyone's minds
- no discussion of protected/closed social (eg fb/insta)
- ethical concerns but no clear framework

Access to archives
- is usually people
- sometimes even physical spaces *only*
- no common digital access path except maybe "website"
- nothing at all really about syncing/transfer/apis
- much awareness of legal risks

Who
- very slow moving, under-resourced and risk-averse groups with little or no funding or time
- lots of experimentation happening over long periods of time
- migration to new tools/svcs is slow / delayed / rare
- likes: open source + in-house

Vibe
- the job is so complex/gargantuan/difficult in so many ways that there's a humorous fatalism in the community

Misc
- no talk of crypto, p2p or distributed networks of any kind. one person I got as far as explaining Filecoin and they looked confused and walked away quickly.
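Tangent on the "how big is domain x" gap above: one rough workaround is to ask the Wayback Machine's public CDX API how many URLs it has already captured for a domain - that only measures what the Internet Archive holds, not the live site, so it's a lower-bound probe at best. A minimal sketch, assuming the `requests` library and the public endpoint at web.archive.org; the domain and row limit are placeholders.

```python
import requests


def wayback_capture_count(domain: str, limit: int = 100_000) -> int:
    """Count unique captured URLs for a domain in the Wayback Machine (capped at `limit`)."""
    resp = requests.get(
        "https://web.archive.org/cdx/search/cdx",
        params={
            "url": domain,
            "matchType": "domain",  # include subdomains
            "output": "json",
            "fl": "urlkey",
            "collapse": "urlkey",   # one row per unique URL
            "limit": limit,
        },
        timeout=120,
    )
    resp.raise_for_status()
    rows = resp.json() if resp.text.strip() else []
    return max(len(rows) - 1, 0)  # first row is the field-name header


if __name__ == "__main__":
    print(wayback_capture_count("example.com"))  # placeholder domain
```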
## Opportunities / strategy / next actions

TODO: write up thoughts

## Glossary / mentions / tools / tech

- WARC ("we made warcs"), mentioned constantly
- WACZ (not much tho, too new)
- Heritrix
- Webrecorder (everyone uses something they made, seems like)
- Conifer
- Common Crawl
- "fred" or "a fred"
- Memento ("mementos")
- WAT/WET
- WARC Metadata Sidecar Record
- CDX/CDXJ
- Parquet
- Zipnum
- Python
- pywb
- Zenodo
- Internet Archive
- SWH
- Wayback Machine
- SOLR
- SOLRWayback
- OpenWayback
- ArchiveTeam

## Notes from various talks

- don't have time to do everything
- built new tools
- browsers don't do what we need
- archive size: 26tb
- pdf pdf pdf
- government docs
- crawl, recrawl
- too hard to save/crawl/process

--

- ml classification
- not enough labeled data for advanced models
- many collections are <20k documents

--

- collabs with Archive-It
- migration to Folio (?)
- get warcs local to work with them
- 33 US states work with Archive-It
- hard to switch learning/tooling from one metadata system to another
- moving from TF to PyTorch

--

- designing for repeatability
- notebooks over cmd line scripts
- tool-building-ness of tools is not great

--

- "the aggregating expression of a web page"
- "the depth is hidden from our bibliographic descriptive metadata"
- (watching what slides this audience takes photos of is wild)
- Ilya keeps being mentioned

--

- Camunda workflow engine
- Bibliotheca Alexandrina 36tb
- in every talk: hardware not adequate
- BnF DataLab

--

- "have you met an IT dept?!" (can't deploy the sw u want, etc)
- compressing 1pb to 40tb
- social media archiving
- cost cost cost - tools, services, compute
- convincing mgmt
- lack of expertise
- hey look webrecorder again

--

- 42 orgs!!! in the dutch/belgian collab
- only 1 org had experience
- most had experimented
- half wanted in-house everything (!)

--

- Harvard LIL / Perma.cc
- hey webrecorder, wacz and replaywebpage again
- shift work, power, infra to the client side
- which is empowering for orgs (BOOM)
- s3 + fly.io
- no money
- conversations started, but then ghosted by big social - or even banned
- lobbied even from the eu - no results

--

- Screaming Frog proprietary crawl service
- Heritrix

--

- Library of Congress
- 3.7pb
- mirrorweb crawl vendor

Netherlands domain crawl
- teams of <20
- also have jobs outside of archiving
- collections selected manually
- Web Curator Tool (WCT)
- Heritrix
- have to visit KBLab physically to view in the reading room
- legal issues
  - no controlling law, sounds like, a grey area
  - goal: preserve everything, including websites
  - opt-out principle
  - ongoing joint lobby for legislation - this would give legal clarity
- goal is the whole domain
  - est 6m websites
  - est 100tb
- last pilot was in 2000

--

- new pilot is a Frisian crawl
- they have their own domain, .frl - smaller than .nl, so a better test case
- also crawling other stuff like the Frisian Wikipedia
- about 10k domains
- out of scope:
  - long term preservation
  - new collection
  - public access
- website -> seeds -> netarchive -> Heritrix -> warcs
- output is warcs and logfiles
- pilot crawl was 180gb
- crawls are 5 days
- learnings:
  - integrity: file-level checksums (see the sketch after the Cambridge notes below)
  - authenticity: through reproducibility

--

Cambridge
- 36 libraries
- lots of multimedia and mobile usage
- Archive of Tomorrow project - health info
- intranet
- used HTTrack and Conifer
- need an open source tool
- 2FA is a problem
- "not sure if already captured this page or not" (CIDs?) - see the sketch below
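A hedged illustration of the two threads flagged above (file-level checksums for integrity, and the "have I already captured this page?" question): a minimal Python sketch assuming the warcio library and local `.warc.gz` files from a crawl. The directory name and the URL -> digest index are placeholders of mine, not anything a speaker described.

```python
import hashlib
from pathlib import Path

from warcio.archiveiterator import ArchiveIterator


def file_checksum(path: Path) -> str:
    """SHA-256 of the whole WARC file, for fixity/integrity records."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def payload_digests(path: Path) -> dict:
    """Map URL -> payload digest for every response record in one WARC."""
    seen = {}
    with path.open("rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            uri = record.rec_headers.get_header("WARC-Target-URI")
            digest = record.rec_headers.get_header("WARC-Payload-Digest")
            if digest is None:
                # Crawler didn't write a digest: hash the payload ourselves.
                digest = "sha256:" + hashlib.sha256(
                    record.content_stream().read()
                ).hexdigest()
            seen[uri] = digest
    return seen


if __name__ == "__main__":
    for warc in sorted(Path("crawl-output").glob("*.warc.gz")):  # placeholder dir
        print(warc.name, file_checksum(warc))
        index = payload_digests(warc)
        print(f"  {len(index)} response records indexed for 'already captured?' checks")
```

The payload digest here is the same mechanism that CDX indexes and WARC revisit records lean on for deduplication, so this manual check scales up into the CDX/CDXJ tooling listed in the glossary.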
--

Los Alamos
- single domain
- 20 subs
- tons of link rot and content drift
- 679gb for the first crawl

--

Brazil
- concerns about gov destroying docs, or just not saving

--

EOT
- 500tb