---
tags: application
---

# INFRA-TECH - DARIAH - WP6: Use cases

# Project Title
--------------

__WebDistill for Euro-foundation Models -- Merging national web archives using distillation of pre-trained monolingual foundation models__

Description of the specific research activities
-----------------------------------------------------

Web archives provide the largest and most complete collections of contemporary linguistic data (e.g. a single year of the DK web contains >10 TB of deduplicated text data). Following the general trend of pre-training language models, the Danish partners develop comprehensive language-specific models (e.g. DaCy, DanNLP) that can be retrained for specific tasks (e.g. NER, semantic annotation, anonymization). Multilingual models trained on web archives offer a *promised land* for a range of research applications, because they allow for zero-shot task transfer, transfer between languages, and handling of code-mixed text. Current multilingual models, however, suffer from multiple issues (e.g. limited capacity, skewed pre-training data, sub-optimal vocabularies), and monolingual models therefore tend to be preferred.

_WebDistill_ alleviates these multilingual issues by pre-training monolingual so-called foundation\* models (Enevoldsen et al. 2021) on web archives and merging them using the MergeDistill framework (Khanuja et al. 2021). To illustrate the wide applicability, we apply the merged multilingual models to semantic annotation of web resources and data anonymization of social media.

The use case develops a modular pipeline consisting of four components:

1. Web-based National Foundation Models for Denmark (DK) and Portugal (PT)
2. WebDistill in the cloud for DK and PT
3. Application 1: Semantic annotation of web resources (A mapping of the textual web landscape 2006-2015)
4. Application 2: Deep WebDistill Anonymization (Data anonymization of unstructured social media data)

\*) A foundation model is a model that is trained on broad data at scale and can be adapted to a wide range of downstream tasks and applications.

Collaborators
----------------

* Kristoffer Nielbo (lead), Center for Humanities Computing Aarhus and DeiC Interactive HPC, Aarhus University
* Niels Brügger, NetLab and Centre for Internet Studies, Aarhus University

Budget
-----------------------

...

6.3 DARIAH: testing of application of anonymization to born-digital (especially social media archives) use cases in T6.1

to

DARIAH: testing of anonymization tools and services for Danish code-mixed social media archives

------------------------