Observed problem with link extraction on dk-web data

---
title: Observed problem with link extraction on dk-web data
tags: project
date: 2022-06-30
author: per@cas.au.dk, márton
---

# Observed problem with link extraction on dk-web data

## Backstory

CHCAA is assisting Niels Brügger with the analysis and visualization of interconnectedness between domains in the Web DK data collection. This collection stems from the Danish Web Archive and was originally analysed at The Royal Danish Library (KB) in the project [Probing a nation’s web domain — the historical development of the Danish web](https://kulturarvscluster.kb.dk/projekter/p002) from 2015 to 2018 (P002). Back then, the following people were involved:

* Niels Brügger, AU
* Janne Nielsen, AU
* Ulrich Have, AU
* Ditte Laursen, KB
* Asger Askov Bleking, KB
* Per Møldrup-Dalum, KB

When the original project ended, the data was transferred to AU and is now managed by CHCAA. At CHCAA, the following people are working with the data:

* Kenneth Enevoldsen
* Márton Kardos

## The Problem

Márton has discovered discrepancies between the number of hypertext links he extracts from the data and the number of links calculated in the original part of the project at KB. When we

* filter the 2006 data on "kulturkanon"
* extract all links based on a simple regular expression

the number of links between domains in this set exceeds the number of links between the same domains in the P002 linkgraf data.

### Example

...coming...

## Tasks

- [ ] Contact Ulrich for access to any existing and available source code
- [ ] Look at the original data for the existence of Solr links data
- [ ] Make a minimum example of the discrepancy
- [ ] Contact KB to inquire about this discrepancy
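As a starting point for the minimum example, the two steps described under The Problem (filter on a keyword, then extract links with a simple regular expression) might look roughly like the sketch below. This is not Márton's code or the P002 pipeline. The record format (pairs of page URL and HTML text), the regular expression, the `www.` stripping, and the function names `domain_of` and `count_domain_links` are all assumptions made for illustration, and the sample records are invented.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumed "simple regular expression": absolute http(s) URLs in href attributes.
HREF_RE = re.compile(r"""href\s*=\s*["'](https?://[^"'\s>]+)["']""", re.IGNORECASE)


def domain_of(url):
    """Return the lower-cased host of a URL, without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def count_domain_links(records, keyword):
    """Count (source domain, target domain) links in pages that mention keyword."""
    counts = Counter()
    for page_url, html in records:
        if keyword not in html.lower():
            continue  # filter step: keep only pages mentioning the keyword
        src = domain_of(page_url)
        for target in HREF_RE.findall(html):
            dst = domain_of(target)
            if dst and dst != src:  # skip links within the same domain
                counts[(src, dst)] += 1  # every occurrence is counted
    return counts


# Invented records, not taken from the archive.
records = [
    ("http://www.example.dk/a",
     '<p>kulturkanon</p><a href="http://kb.dk/x">1</a><a href="http://kb.dk/x">2</a>'),
    ("http://other.dk/", '<a href="http://kb.dk/">no keyword here</a>'),
]
print(count_domain_links(records, "kulturkanon"))
```

Even in a sketch this small, several counting choices change the totals: whether repeated links on one page count once or every time, whether `www.` is stripped, and whether links within a domain are included. These are worth pinning down on both sides before comparing numbers.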