---
tags: ukwa
---
# UKWA Technical Overview
## Approach
- Dedicate resources: a small cross-organizational, multi-disciplinary services team.
- Use industry-standard open source systems where possible.
- Use custom components only where needed.
- Where possible, share the load through open source projects:
    - Collaborate with other web archives.
    - Partner on project-funded work where appropriate.
    - Fund vendors to maintain and improve open tools.
## Technical Roles
- _Web Archiving Technical Lead_ has overall responsibility and leads implementation.
- _Senior Software Engineer_ leads development of main user-facing components (website and curation tool).
- _Senior Web Archiving Engineer_ leads management and maintenance of web archive software and hardware.
- _Web Archiving Technical Services Engineer_ assists in hardware and software maintenance, including reporting and data integration.
## High-level Architecture
The system follows a 'files-in-the-middle' architecture:
- Ingress:
    - A web crawler downloads websites into WARC archives.
- Storage:
    - Preserves the WARC archives & metadata.
- Egress:
    - Provides access to the data from the WARC & metadata files, as 'playback' services, data sets, etc.
Data standards are used to separate the parts, allowing each to evolve independently over time.
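Because the WARC format is the standard 'in the middle', any tool that understands it can consume what the crawler produced. A minimal sketch of this, using the open source `warcio` library (the filename is a placeholder):
```python
# Minimal sketch of reading archived data via the standard WARC format,
# using the open source 'warcio' library; 'example.warc.gz' is a placeholder.
from warcio.archiveiterator import ArchiveIterator

with open('example.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        # 'response' records hold the archived HTTP responses.
        if record.rec_type == 'response':
            url = record.rec_headers.get_header('WARC-Target-URI')
            date = record.rec_headers.get_header('WARC-Date')
            print(date, url)
```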
## Ingress Systems
```graphviz
digraph hierarchy {
rankdir=LR;
node [style=filled,fillcolor="#ffefef",fontname=Arial]
Web [label="Live Websites"];
W3ACT [label="Curation Tool"];
CrawlJobs [label="Crawl Job\nSpecifications",shape="rectangle"];
CrawlSystem [label="Crawl Engines\n(e.g. Heritrix3+)",style=filled,fillcolor="#ffff88"];
ThirdParty [label="Other Sources",style=filled,fillcolor="#ddffaa"];
WARCs [label="WARCs & Logs\n('AIP')",shape="rectangle",style=filled,fillcolor="#ffff88"];
ArchivalStorage [label="Archival Storage",style=filled,fillcolor="#ddffaa"];
Web -> CrawlSystem;
W3ACT -> CrawlJobs;
CrawlJobs -> CrawlSystem;
CrawlSystem -> WARCs;
ThirdParty -> WARCs;
WARCs -> ArchivalStorage;
{rank = same; W3ACT; CrawlJobs; };
{rank = same; Web;CrawlSystem; WARCs;};
{rank = same; ThirdParty; ArchivalStorage;}
}
```
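The production crawl engine is Heritrix3, but the essence of the ingress flow (fetch live web content, record it as WARC) can be sketched in a few lines with `warcio`'s `capture_http` helper; the seed URL and output filename are illustrative:
```python
# Illustrative sketch of ingress: fetch a live page and record the HTTP
# traffic as WARC records. The real crawl engine is Heritrix3; this only
# demonstrates the output format. URL and filename are placeholders.
from warcio.capture_http import capture_http
import requests  # note: must be imported after capture_http

with capture_http('crawl-output.warc.gz'):
    requests.get('https://www.example.co.uk/')
```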
## Egress Systems
```graphviz
digraph hierarchy {
rankdir=LR;
node [style=filled,fillcolor="#ffefef",fontname=Arial]
W3ACT [label="Curation Tool"];
ArchivalStorage [label="Archival Storage",style=filled,fillcolor="#ddffaa"];
WARCs [shape="rectangle",style=filled,fillcolor="#ffff88"];
WARCServer [label="WARC Record\nServer"];
ContentIndex [label="Content Index\n(OutbackCDX)",style=filled,fillcolor="#ffff88"];
Indexer [label="WARC Extraction\n(webarchive-discovery)",style=filled,fillcolor="#ffff88"];
SearchIndex [label="Full-text Search\nIndex (Apache Solr)",style=filled,fillcolor="#ddffaa"];
PyWB [label="Archive Web Page\nPlayback (PyWB)",style=filled,fillcolor="#ffff88"];
Metadata [shape="rectangle", label="Descriptive\nMetadata"];
ArchivalStorage -> WARCs [arrowhead="none"];
WARCs -> {WARCServer,ContentIndex} -> PyWB -> Website
WARCs -> Indexer;
Indexer-> SearchIndex;
SearchIndex -> Website;
W3ACT -> Metadata [arrowhead="none"];
Metadata -> Website;
{rank=same; Metadata; WARCs; W3ACT; ArchivalStorage;}
}
```
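To see how playback hangs together: PyWB asks the content index which WARC records hold the captures of a given URL, then fetches those records for rendering. A sketch of that lookup against OutbackCDX's CDX server API (the host, port, and index name are assumptions):
```python
# Sketch of a playback lookup: query the content index (OutbackCDX) for the
# captures of a URL. Host, port, and index name are assumptions.
import requests

resp = requests.get(
    'http://localhost:8080/my-index',   # hypothetical OutbackCDX collection
    params={'url': 'https://www.example.co.uk/'},
)
for line in resp.text.splitlines():
    # Each CDX line includes the capture timestamp and the WARC filename
    # and offset that the playback service uses to fetch the raw record.
    print(line)
```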
Note that indexing is relatively heavy work, requiring decent CPU power and I/O bandwidth.
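On the full-text side, the website queries the Solr index built by `webarchive-discovery`. A hedged sketch of such a query (the core name and field names are assumptions, not the actual UKWA schema):
```python
# Sketch of a full-text search query against the Solr index; the core name
# and field names are assumptions rather than the actual UKWA schema.
import requests

resp = requests.get(
    'http://localhost:8983/solr/webarchive/select',  # hypothetical core
    params={'q': 'content:"web archiving"', 'rows': 10},
)
for doc in resp.json()['response']['docs']:
    print(doc.get('url'), doc.get('crawl_date'))
```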
## Document Harvester
- Process crawl logs to find unique PDF URLs from crawls of 'Watched Targets' (see the sketch after this list).
- Work out the association between each PDF and its Watched Target.
- Extract a first estimate of the metadata from the crawled data.
- Push this back to W3ACT for processing by catalogers.
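The first step can be sketched as a simple scan over the Heritrix crawl log; this assumes the standard crawl.log column layout and is not the production implementation:
```python
# Sketch of the first step: scan a Heritrix crawl log for unique,
# successfully fetched PDFs. Assumes the standard crawl.log column layout
# (status in column 2, URL in column 4, MIME type in column 7).
unique_pdfs = set()
with open('crawl.log') as log:
    for line in log:
        fields = line.split()
        if len(fields) < 7:
            continue
        status, url, mime = fields[1], fields[3], fields[6]
        if status == '200' and mime == 'application/pdf':
            unique_pdfs.add(url)

for url in sorted(unique_pdfs):
    print(url)
```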
Overall, this is quite difficult to generalize across publishers, i.e. there are per-website development and maintenance costs.