# IIPC WAC 2023
* IIPC - International Internet Preservation Consortium
* Holds two annual conferences for web archivists, over three days
* IIPC General Assembly - https://netpreserve.org/ga2023/
* Web Archiving Conference - https://netpreserve.org/event/wac2023/
* Event program: https://netpreserve.org/ga2023/programme/wac/
Attendees: digital archivists, librarians and researchers from universities, governments and NGOs, often with very large archives or very large targets to archive
Tiny sampling below
Countries
- Natl Library of Singapore
- US Library of Congress
- Library of Alexandria
- Natl Library of Brazil
- Natl Library of S Korea
- Natl Library of the Netherlands
Unis
- Archivists for Cambridge U's 36 libraries
- Drexel
- Old Dominion U
- UT San Antonio
- NYU
- Harvard
- Stanford
Other orgs
- Internet Archive
- MOMA NY
- Arkiwera (Sweden)
- Kadoc (Belgium)
- New Design Congress
- The Feminist Institute
- WebRecorder
- Shift Collective
## Rough themes
archive sizes
- a few w/ multiple PB
- cluster in the 100s of TBs
- most talking <100GB, but often working with multiple archives of that size
Org / program structure / funding
- mostly small teams
- archive/preservation initiatives come and go (at every level)
- lots of people do it as add-on to their regular job
- funding is hard to get, both internal and external
Tools / Technology
- Heritrix
- webrecorder showed up all over
- python scripts, everywhere
- lots of open source tooling
- clear set of tools, but lots of time putting them together - years in most cases
- crawls are long, hard to get right, expensive, have to be repeated even when successful
- no tools for things like "how big is domain x" in any dimension
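There's no off-the-shelf answer to "how big is domain X", but a rough first pass is possible against the Internet Archive's public CDX API. A minimal sketch of that idea (my own illustration, not something shown at the conference; the domain and limit are placeholders):

```python
# Rough estimate of how many captures the Internet Archive reports for a
# domain, plus a crude lower bound on archived bytes from a sample.
import requests

CDX_API = "https://web.archive.org/cdx/search/cdx"

def domain_capture_sample(domain, limit=10000):
    params = {
        "url": domain,
        "matchType": "domain",           # include subdomains
        "output": "json",
        "fl": "urlkey,timestamp,length",
        "limit": limit,
    }
    rows = requests.get(CDX_API, params=params, timeout=60).json()
    if not rows:
        return 0, 0, 0
    captures = rows[1:]                  # first row is the field header
    unique_urls = {r[0] for r in captures}
    # 'length' is the compressed size of the WARC record, in bytes
    total_bytes = sum(int(r[2]) for r in captures if r[2].isdigit())
    return len(captures), len(unique_urls), total_bytes

n, urls, size = domain_capture_sample("example.com")   # placeholder domain
print(f"{n} captures, {urls} unique URLs, ~{size / 1e6:.1f} MB sampled")
```

This only measures what's already in one archive's index, and only on a sample, so it's a lower bound on the archive, not a size estimate of the live domain.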
Services used
- a bit of proprietary crawling (but not much)
- Amazon S3
- AWS Open Data Program (750tb for End of Term program)
Storage
- no specific mentions of storage, aside from big == hard (informal survey here would be interesting)
Content to archive
- social media is on everyone's minds
- no discussion of protected/closed social (eg fb/insta)
- ethical concerns but no clear framework
Access to archives
- is usually people
- sometimes even physical spaces *only*
- no common digital access path except maybe "website"
- nothing at all really about syncing/transfer/apis
- lots of awareness of legal risks
Who
- very slow moving, under-resourced and risk averse groups with little or no funding or time
- lots of experimentation happening over long periods of time
- migration to new tools/svcs is slow / delayed / rare
- likes: open source + in-house
Vibe
- the job is so complex/gargantuan/difficult in so many ways that there's a humorous fatalism in the community
Misc
- no talk of crypto, p2p or distributed networks of any kind. with one person I got as far as explaining Filecoin, and they looked confused and walked away quickly.
## Opportunities / strategy / next actions
TODO: write up thoughts
## Glossary / mentions / tools / tech
- WARC ("we made warcs"), mentioned constantly; see the reading sketch after this list
- WACZ (not much tho, too new)
- Heritrix
- WebRecorder (everyone uses something they made, seems like)
- Conifer
- Common Crawl
- "fred" or "a fred"
- Memento ("mementos")
- WAT/WET
- WARC Metadata Sidecar Record
- CDX/CDXJ
- Parquet
- Zipnum
- Python
- pywb
- Zenodo
- Internet Archive
- SWH
- Wayback Machine
- SOLR
- SOLRWayback
- OpenWayback
- ArchiveTeam
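Since WARC comes up constantly, here is a minimal sketch of what working with one looks like in Python using warcio (the library behind several of the tools above); the filename is a placeholder:

```python
from warcio.archiveiterator import ArchiveIterator

# "crawl.warc.gz" is a placeholder filename for any crawl output
with open("crawl.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            # Each response record carries the original URL and HTTP headers
            uri = record.rec_headers.get_header("WARC-Target-URI")
            status = record.http_headers.get_statuscode()
            print(status, uri)
```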
## Notes from various talks
- don't have time to do everything
- built new tools
- browsers don't do what we need
- archive size: 26tb
- pdf pdf pdf
- government docs
- crawl, recrawl
- too hard to save/crawl/process
--
- ml classification
- not enough labeled data for advanced models
- many collections are <20k documents
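With collections under ~20k documents and scarce labels, the practical starting point is a classical baseline rather than a deep model. A hedged sketch of that kind of baseline (my illustration, not code from the talk) using scikit-learn:

```python
# TF-IDF + logistic regression baseline for small, sparsely labeled collections.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def baseline_classifier(texts, labels):
    # texts: extracted page/document text; labels: manually assigned categories
    model = make_pipeline(
        TfidfVectorizer(max_features=50_000, ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000, class_weight="balanced"),
    )
    # With few labels, honest cross-validation matters more than the model choice
    scores = cross_val_score(model, texts, labels, cv=5, scoring="f1_macro")
    print(f"macro-F1: {scores.mean():.3f} +/- {scores.std():.3f}")
    return model.fit(texts, labels)
```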
--
- collabs with Archive-it
- migration to Folio (?)
- get warcs local to work with them
- 33 US states work with Archive-it
- hard to switch learning/tooling from one metadata system to another
- moving from TF to PyTorch
--
- designing for repeatability
- notebooks over cmd line scripts
- tool-building-ness of tools is not great
--
- "the aggregating expression of a web page "
- "the depth is hidden from our bibliographic descriptive metadata"
- (watching what slides this audience takes photos of is wild)
- ilya keeps being mentioned
--
- Camunda workflow engine
- Bibliotheca Alexandrina 36tb
- in every talk: hardware not adequate
- BnF DataLab
--
- "have you met an IT dept?!" (can't deploy the sw u want, etc)
- compressing 1pb to 40tb
- social media archiving
- cost cost cost
- tools, services, compute
- convincing mgmt
- lack of expertise
- hey look webrecorder again
--
- 42 orgs!!! in the dutch/belgian collab
- only 1 org had experience
- most had experimented
- half wanted in-house everything (!)
--
- Harvard LIL / Perma.cc
- hey webrecorder, wacz and replaywebpage again
- shift work, power, infra to client side
- which is empowering for orgs (BOOM)
- s3 + fly.io
- no money
- conversations started, but then ghosted by big social
- or even banned
- lobbied even from eu
- no results
--
- Screaming Frog proprietary crawl service
- Heritrix
--
- Library of Congress
- 3.7pb
- mirrorweb crawl vendor
--
Netherlands domain crawl
- teams of <20
- also have jobs outside of archiving
- collections selected manually
- Web Curator Tool (WCT)
- Heritrix
- have to visit KBLab physically to view in the reading room
- legal issues
- no controlling law, sounds like, grey area
- goal: preserve everything, including websites
- opt-out principle
- ongoing joint lobby for legislation
- this would give legal clarity
- goal is whole domain
- est 6m websites
- est 100tb
- last pilot was in 2000
--
- new pilot is Frisian crawl
- they have own domain, .frl
- smaller than .nl
- so better testcase
- also crawling other stuff like Frisian Wikipedia
- about 10k domains
- out of scope
- long term preservation
- new collection
- public access
- website -> seeds -> netarchive -> Heritrix -> warcs
- output is warcs and logfiles
- pilot crawl was 180gb
- crawls are 5 days
- learnings:
- integrity: file level checksums
- authenticity: through reproducibility
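A minimal sketch of what "integrity through file-level checksums" can look like in practice: hash every WARC in a crawl's output directory and write a manifest. The directory and manifest names are hypothetical, not from the talk:

```python
# Write a sha256 manifest for all WARC files produced by a crawl.
import hashlib
from pathlib import Path

def sha256_file(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(crawl_dir, manifest="manifest-sha256.txt"):
    crawl_dir = Path(crawl_dir)
    with open(crawl_dir / manifest, "w") as out:
        for warc in sorted(crawl_dir.glob("*.warc.gz")):
            out.write(f"{sha256_file(warc)}  {warc.name}\n")

write_manifest("frisian-pilot-output")   # placeholder directory name
```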
--
Cambridge
- 36 libraries
- lots of multimedia and mobile usage
- Archive of Tomorrow project - health info
- intranet - used httrack and conifer
- need an open source tool
- 2fa is a problem
- "not sure if already captured this page or not" (CIDs?)
--
Los Alamos
- single domain
- 20 subs
- tons of link rot and content drift
- 679gb for first crawl
--
Brazil
- concerns about gov destroying docs, or just not saving
--
EOT (End of Term)
- 500tb