---
tags: ukwa
---

# UKWA Technical Overview

## Approach

- Dedicate resources: a small cross-organizational, multi-disciplinary services team.
- Use industry-standard open source systems where possible.
- Use custom components only where needed.
- Where possible, share the load through open source projects:
    - Collaborate with other web archives.
    - Partner on project-funded work where appropriate.
    - Fund vendors to maintain and improve open tools.

## Technical Roles

- _Web Archiving Technical Lead_ has overall responsibility and leads implementation.
- _Senior Software Engineer_ leads development of the main user-facing components (the website and the curation tool).
- _Senior Web Archiving Engineer_ leads management and maintenance of web archive software and hardware.
- _Web Archiving Technical Services Engineer_ assists in hardware and software maintenance, including reporting and data integration.

## High-level Architecture

A 'files-in-the-middle' architecture:

- Ingress:
    - The web crawler downloads websites into WARC archives.
- Storage:
    - Preserves the WARC archives & metadata.
- Egress:
    - Provides access to the data held in the WARC & metadata files, as 'playback' services, data sets, etc.

Data standards are used to separate these parts, allowing each to evolve independently over time.

# Ingress Systems

```graphviz
digraph hierarchy {
    rankdir=LR;
    node [style=filled,fillcolor="#ffefef",fontname=Arial]

    Web [label="Live Websites"];
    W3ACT [label="Curation Tool"];
    CrawlJobs [label="Crawl Job\nSpecifications",shape="rectangle"];
    CrawlSystem [label="Crawl Engines\n(e.g. Heritrix3+)",style=filled,fillcolor="#ffff88"];
    ThirdParty [label="Other Sources",style=filled,fillcolor="#ddffaa"];
    WARCs [label="WARCs & Logs\n('AIP')",shape="rectangle",style=filled,fillcolor="#ffff88"];
    ArchivalStorage [label="Archival Storage",style=filled,fillcolor="#ddffaa"];

    Web -> CrawlSystem;
    W3ACT -> CrawlJobs;
    CrawlJobs -> CrawlSystem;
    CrawlSystem -> WARCs;
    ThirdParty -> WARCs;
    WARCs -> ArchivalStorage;

    {rank = same; W3ACT; CrawlJobs;};
    {rank = same; Web; CrawlSystem; WARCs;};
    {rank = same; ThirdParty; ArchivalStorage;}
}
```

# Egress Systems

```graphviz
digraph hierarchy {
    rankdir=LR;
    node [style=filled,fillcolor="#ffefef",fontname=Arial]

    W3ACT [label="Curation Tool"];
    ArchivalStorage [label="Archival Storage",style=filled,fillcolor="#ddffaa"];
    WARCs [shape="square",style=filled,fillcolor="#ffff88"];
    WARCServer [label="WARC Record\nServer"];
    ContentIndex [label="Content Index\n(OutbackCDX)",style=filled,fillcolor="#ffff88"];
    Indexer [label="WARC Extraction\n(webarchive-discovery)",style=filled,fillcolor="#ffff88"];
    SearchIndex [label="Full-text Search\nIndex (Apache Solr)",style=filled,fillcolor="#ddffaa"];
    PyWB [label="Archive Web Page\nPlayback (PyWB)",style=filled,fillcolor="#ffff88"];
    Metadata [shape="rectangle",label="Descriptive\nMetadata"];

    ArchivalStorage -> WARCs [arrowhead="none"];
    WARCs -> {WARCServer, ContentIndex} -> PyWB -> Website;
    WARCs -> Indexer;
    Indexer -> SearchIndex;
    SearchIndex -> Website;
    W3ACT -> Metadata [arrowhead="none"];
    Metadata -> Website;

    {rank=same; Metadata; WARCs; W3ACT; ArchivalStorage;}
}
```

Note that indexing is relatively heavy work, requiring decent CPU power and I/O bandwidth.
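To make the egress path concrete: playback resolves a URL against the content index, which returns the WARC filename and byte offset of each capture, and the record is then read straight out of archival storage. Below is a minimal sketch of that two-step lookup, assuming a local OutbackCDX instance and the `requests` and `warcio` Python packages; the host, collection name, and WARC directory are illustrative, not taken from the actual UKWA deployment.

```python
"""Minimal sketch of the egress lookup path, assuming an OutbackCDX
instance at localhost:8080 with a collection called 'ukwa' and WARC
files stored under /warcs/ -- none of these names come from the
source document. Requires the `requests` and `warcio` packages."""
import requests
from warcio.archiveiterator import ArchiveIterator

CDX_SERVER = "http://localhost:8080/ukwa"  # hypothetical host and collection


def lookup(url):
    """Ask the content index for captures of `url`; return the WARC
    filename and byte offset of the first match."""
    resp = requests.get(CDX_SERVER, params={"url": url})
    resp.raise_for_status()
    first = resp.text.strip().splitlines()[0]
    # CDX11 fields: urlkey timestamp original mimetype statuscode
    #               digest redirecturl robotflags length offset filename
    fields = first.split(" ")
    return fields[10], int(fields[9])


def read_record(warc_path, offset):
    """Seek straight to the record's byte offset and parse it with warcio."""
    with open(warc_path, "rb") as stream:
        stream.seek(offset)
        record = next(iter(ArchiveIterator(stream)))
        uri = record.rec_headers.get_header("WARC-Target-URI")
        return uri, record.content_stream().read()


if __name__ == "__main__":
    filename, offset = lookup("http://example.org/")
    uri, body = read_record("/warcs/" + filename, offset)
    print(uri, len(body), "bytes")
```

This index-then-fetch round trip is essentially what the playback service performs for each resource of an archived page, which is one reason the content index needs to be fast.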
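The full-text side can be exercised the same way, through Solr's standard `/select` API. In this hedged sketch the core name and field names (`content`, `title`, `url`) are assumptions for illustration; the actual schema is defined by the webarchive-discovery indexing configuration.

```python
"""Hedged sketch of querying the full-text search index via Solr's
standard /select API. The core name ('ukwa') and the field names
(content, title, url) are illustrative assumptions, not taken from
the source. Requires the `requests` package."""
import requests

SOLR_SELECT = "http://localhost:8983/solr/ukwa/select"  # hypothetical core


def search(phrase, rows=10):
    """Run a phrase query and return (url, title) pairs."""
    params = {"q": f'content:"{phrase}"', "rows": rows, "wt": "json"}
    resp = requests.get(SOLR_SELECT, params=params)
    resp.raise_for_status()
    docs = resp.json()["response"]["docs"]
    return [(doc.get("url"), doc.get("title")) for doc in docs]


if __name__ == "__main__":
    for url, title in search("web archive"):
        print(f"{url}\t{title}")
```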
# Document Harvester

- Process crawl logs, finding unique PDF URLs from crawls of 'Watched Targets' (sketched below).
- Work out the association between each PDF and its Watched Target.
- Extract a first estimate of the metadata from the crawled data.
- Push this back to W3ACT for processing by catalogers.

Overall, this is quite difficult to generalize across publishers, i.e. there are per-website development and maintenance costs.
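As a rough illustration of the first step, the sketch below scans a Heritrix3 crawl log for unique, successfully fetched PDFs, keeping each URL's 'via' (referrer) as a first clue for the association step. It assumes the standard Heritrix3 `crawl.log` column layout (status in column 2, URL in column 4, 'via' in column 6, MIME type in column 7); the log path is a placeholder.

```python
"""Rough sketch of the Document Harvester's log-scanning step,
assuming the standard Heritrix3 crawl.log column layout. The log
path is a placeholder, not a real UKWA path."""
import sys


def unique_pdf_urls(log_path):
    """Yield each distinct, successfully fetched PDF URL together with
    its 'via' URL -- a first clue for linking it to a Watched Target."""
    seen = set()
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            parts = line.split()
            if len(parts) < 7:
                continue  # skip malformed or truncated lines
            status, url, via, mime = parts[1], parts[3], parts[5], parts[6]
            if mime != "application/pdf" or url in seen:
                continue
            if status.isdigit() and 200 <= int(status) < 300:
                seen.add(url)
                yield url, via


if __name__ == "__main__":
    log_path = sys.argv[1] if len(sys.argv) > 1 else "crawl.log"
    for url, via in unique_pdf_urls(log_path):
        print(f"{url}\t(via {via})")
```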