---
tags: application
---

# Project Portfolio 2022 #

This is a collection of abstracts for CHCAA research projects that are in need of funding. Currently three projects: _The Euclidean Error_, _Danish Foundation Models_, and _Danish Wav2vec 2.0_.

## The Euclidean Error in Temporal Reconstructions of Cultural Heritage -- Event Detection and Description in the Danish Golden Age ##

Historically, the Danish Golden Age signifies the large-scale socio-cultural state change from a regional and kin-related group identity to a national group identity. This event abstraction, however, is formally flawed because it depicts complex historical processes as point-like events, that is, as a collection of Euclidean primitives with exact temporal and spatial coordinates but without extension. Most historians would agree with this observation but argue that the abstraction is useful as a temporal reconstruction of past events. We argue, however, that representing events as points (i.e., the Euclidean Error) makes us inattentive to the underlying dynamics of history and, in the case of the Danish Golden Age, prone to biased and erroneous inferences. In this project, we propose to reconstruct and assess developmental trajectories of the Danish Golden Age and its impact on Denmark using all\* available textual and pictorial cultural heritage data.

Cultural and societal history is characterized by changes between more or less stable states of group organization. Some state changes are well known and have distinct event signatures (i.e., they constitute _History_); other state changes are harder to describe (i.e., they have been lost in _history_) due to noisy event signatures, their embedded nature, and classification errors. To reconstruct the dynamics of such complex event hierarchies, we combine new deep learning techniques for multimodal representation learning with recent advances in information theory and fractal geometry.
More specifically, we are able to _detect_ when and where an event evolves in the event hierarchy of history, and to _describe_ its dynamic signature in terms of its novel, transient, and resonant dynamic properties.

Beyond advancing our understanding of Danish history, specifically the cultural heritage of the Danish Golden Age, the project will develop a FAIR and open database of event representations that will allow researchers and cultural heritage institutions to reproduce and explore Danish history without obtaining data licenses or training compute-intensive representations. Furthermore, the entire codebase is provided under an open-source license (OSI-approved MIT), and, to ensure wide usage and application of the insights, tools for event detection and description are provided as web applications that allow for interactive exploration without code. For the general public, the project will, in collaboration with the data providers, create part of a physical exhibition at SMK and an online Golden Age experimentarium.

Applicants and partners:

* Co-PI\*: Katrine F. Baunvig & Kristoffer L. Nielbo
* AU Centres: Center for Humanities Computing Aarhus & Grundtvig Center
* Data providers:
  * SMK: National Gallery of Denmark
  * KB: The Royal Danish Library
  * DSL: Det Danske Sprog- og Litteraturselskab
  * GV: Grundtvigs Værker

\*) preferably Co-Principal Investigators

Needs:

* 1-2 PhDs
* 1 Postdoc
* approx. 500K for data providers, licenses, compute &c.

Targets:

* Augustinus 2022

## Danish Foundation Models ##

Denmark has in recent years seen an increase in the application of large pretrained language models in both industry and research. These have the advantage of being pretrained once and then obtaining impressive performance on downstream tasks with only little training data. This is especially important for low- and middle-resource languages such as Danish, which lack large-scale datasets.
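The pretrain-once, fine-tune-cheaply paradigm can be illustrated with a toy numerical sketch (all weights and data below are invented for illustration and have nothing to do with any real language model): a frozen feature extractor stands in for the pretrained model, and only a small task head is fit on a handful of labelled examples.

```python
import numpy as np

# Toy sketch of transfer learning: a frozen "pretrained" encoder plus a small
# task head trained on very few labelled examples. All weights and data are
# invented for illustration; no real pretrained model is involved.

def pretrained_features(x):
    """Stand-in for a frozen pretrained encoder: a fixed nonlinear projection."""
    W = np.linspace(-1.0, 1.0, 8).reshape(1, 8)  # fixed weights, "pretrained once"
    return np.tanh(x @ W)                        # shape (n_samples, 8)

# Downstream task with only four labelled examples (labels in {-1, +1})
x_train = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y_train = np.array([-1.0, -1.0, 1.0, 1.0])

# "Fine-tuning" reduces to fitting a linear head by least squares,
# because the encoder itself stays frozen.
head, *_ = np.linalg.lstsq(pretrained_features(x_train), y_train, rcond=None)

def predict(x):
    """Classify inputs with the frozen encoder and the learned head."""
    return (pretrained_features(x) @ head > 0).astype(int)

print(predict(x_train))  # -> [0 0 1 1]: the tiny head already separates the classes
```

The point of the sketch is only that the expensive part (the encoder) is trained once and reused, while the task-specific part is small enough to fit on very little labelled data.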
In Denmark, two models, Ælæctra and the Danish BERT by BotXO, have seen the largest adoption, but they are often outperformed by non-Danish language models, such as the Norwegian BERT trained by the National Library of Norway, or by multilingual language models. This is likely due to the Danish models being smaller or using older model architectures. This points to the potential of creating a large Danish language model, which, however, requires access to large quantities of text and substantial computing resources.

The Center for Humanities Computing Aarhus is uniquely situated in this regard, having access to compute resources through the UCloud infrastructure as well as access to data through existing research projects, which in turn would benefit from improved Danish language models.

Training large language models is, however, typically a large effort, requiring, among other things, filtering the data to exclude duplicates and pornographic content, as these have been shown to notably affect model performance. We have therefore started a collaboration between research and industry, led by the Center for Humanities Computing Aarhus, with industry and other research collaborators providing invaluable expertise. This also allows us to ensure that the models are trained on a wide variety of data sources, leading to better representation of e.g. news and social media. Similarly, it allows us to validate the models to a much larger extent than would otherwise be possible. Lastly, by training the models collaboratively, we ensure that the environmental cost associated with training large language models is incurred only once, instead of at each individual institution.

### Datasets ###

The datasets currently available to the project for training:

| Dataset     | Description                                                                  | Public |
| ----------- | ---------------------------------------------------------------------------- | ------ |
| DAGW        | Danish Gigaword. A wide-coverage dataset of Danish text.                     | Yes    |
| HopeTwitter | A dataset of tweets collected as a part of the HOPE project.                 | No     |
| DaNews      | A dataset consisting of Danish newspapers.                                   | No     |
| Reddit-da   | A Danish subsection of Reddit.                                               | Yes    |
| Netarkivet  | A subsection of the "Danish" internet collected by the Royal Danish Library. | No     |
| mC4         | A cleaned part of the Common Crawl dataset.                                  | Yes    |

### Models ###

Currently, the plan is to train:

- An encoder model, DeBERTa v3, with applications in tasks such as text classification, named entity recognition, dependency parsing, and more.
- An encoder-decoder model, T5 v1.1, with applications in question answering, text generation, and translation.

Potentially, other models might be included:

- Long-range models, with applications in long-text analysis scenarios.
- Distilled versions of the models, with applications in low-compute scenarios.

Consortium:

- Aarhus University, Center for Humanities Computing Aarhus
- ITU
- KMD
- Ekstra Bladet
- The Royal Danish Library

Needs:

* 2 PhDs
* 2 Postdocs
* 1 software engineer (2-4 yrs)
* 300-500K for compute

Targets:

* Innovationsfonden 2023

## Danish Wav2vec 2.0 ##

Pre-training of large neural networks, so-called foundation models [1], has led to a paradigm shift in all fields of machine learning. Instead of starting from scratch whenever a new task is to be solved, practitioners now start from a pre-trained model which has learned to generate a highly useful representation of the data that can be further fine-tuned for downstream applications. This is the case for both Natural Language Processing (NLP) [2] and Computer Vision [3], and has recently become feasible for the audio domain as well [4]. Foundation models require massive amounts of both data and compute power to be trained [1]. For instance, the English wav2vec 2.0 model was pre-trained on 53,000 hours of speech and further fine-tuned for speech-to-text on 960 hours of transcribed data [4].
However, to date, the largest Danish foundation model for audio was only trained on 1,300 hours of speech data [5]. The main factors behind this discrepancy are a lack of publicly available speech data and the large demands on computational resources. Through a collaboration with the Royal Danish Library, we have gained access to a large catalogue of public radio dating back to 1989 and consisting of approximately 240,000 hours of data. Training a large wav2vec 2.0 model on this dataset would position Denmark at the forefront of speech technology and greatly advance the development of systems for e.g. automatic speech recognition (ASR, i.e. speech-to-text).

Consortium:

* The Royal Danish Library; data access.
* Alvenir (alvenir.ai); a Danish company specializing in ASR. Alvenir developed the first Danish wav2vec 2.0 model and will assist with the training of the large wav2vec 2.0 model in this project.
* The Alexandra Institute; the institute is in the process of creating a new dataset for ASR in Danish. Fine-tuning the model trained in this project on their data would likely yield a new state of the art for Danish ASR.
* DR (Danish Broadcasting Corporation); DR has an abundance of speech data in its archives from movies, interviews, news, etc. The more varied the model's training data, the better it will generalize. As such, we would benefit from more heterogeneous training data, and DR could benefit from better tools for e.g. automatic transcription or subtitling.
* Capturi.

Needs:

* 2 PhDs
* 1 Postdoc
* 500K for compute

References:

[1] R. Bommasani et al., "On the opportunities and risks of foundation models," arXiv preprint arXiv:2108.07258, 2021.

[2] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," arXiv preprint arXiv:1810.04805, May 2019. [Online]. Available: http://arxiv.org/abs/1810.04805

[3] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," arXiv preprint arXiv:1512.03385, Dec. 2015. [Online]. Available: http://arxiv.org/abs/1512.03385

[4] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations," arXiv preprint arXiv:2006.11477, Oct. 2020. [Online]. Available: http://arxiv.org/abs/2006.11477

[5] Alvenir, "Alvenir/wav2vec2-base-da · Hugging Face." https://huggingface.co/Alvenir/wav2vec2-base-da (accessed Feb. 22, 2022).