---
tags: nlib
---
# Workshop at Nelijärve - Accessing National Library texts
## Workshop
Usernames have now been removed. Ask if you still want access.
## Setting up in the JupyterLab environment
- Go to the webpage jupyter.hpc.ut.ee/ and log in.
- Pick the first (default) option, with 1 CPU core, 8 GB memory, and a 6 h time limit.
- Create a new notebook and set the kernel to R.
- Convenience shortcut: add the following code in Settings -> Advanced Settings Editor... -> Keyboard Shortcuts, in the User Preferences box on the left. This gives you the RStudio shortcut Ctrl-Shift-M for inserting the %>% pipe operator in JupyterLab.
```
{
    "shortcuts": [
        {
            "command": "notebook:replace-selection",
            "selector": ".jp-Notebook",
            "keys": ["Ctrl Shift M"],
            "args": {"text": "%>% "}
        }
    ]
}
```
## Contents
The library access package comprises four native commands and a file system.
- get_digar_overview() - gives an overview of the collection (issue-level)
- get_subset_meta() - gives meta information on a subset (article-level)
- do_subset_search() - runs a search on the subset and writes the results to a file
- get_concordances() - extracts words-in-context (concordances) from the search results

Any R packages can be used to work with the data in between. The native commands are built on the data.table package.
---
## Starting up
1) First, install the required package
```
#Install the remotes package if needed. JupyterLab should already have it.
#install.packages("remotes")
#Since the JupyterLab we use does not have write access to the
#system library, we install packages into a local folder.
dir.create("~/R_pckg")
remotes::install_github("peeter-t2/digar.txts", lib="~/R_pckg/", upgrade="never")
```
2) Load the package that was installed:
```
library(digar.txts,lib.loc="~/R_pckg/")
```
3) Use get_digar_overview() to get an overview of the collection (issue-level).
```
all_issues <- get_digar_overview()
```
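To see what the overview table contains, you can inspect it with base R helpers (optional):
```
str(all_issues)   # column types of the issue-level table
head(all_issues)  # first few rows
```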
4) Build a custom subset using any tools in R. Here is a tidyverse-style example.
```
library(tidyverse)
subset <- all_issues %>%
  filter(DocumentType=="NEWSPAPER") %>%
  filter(year>1880&year<1940) %>%
  filter(keyid=="postimeesew")
```
5) Get meta information on that subset with get_subset_meta(). If this information will be reused, it can be useful to store it with the commented lines.
```
subset_meta <- get_subset_meta(subset)
#potentially write to file, for easier access if returning to it
#readr::write_tsv(subset_meta,"subset_meta_postimeesew1.tsv")
#subset_meta <- readr::read_tsv("subset_meta_postimeesew1.tsv")
```
6) Do a search with do_subset_search(). This exports the search results into a file. do_subset_search() ignores case.
```
do_subset_search(searchterm="lurich", searchfile="lurich1.txt",subset)
```
7) Read in the search results, using any R tools. It is useful to name the id and text columns id and txt.
```
library(data.table) # for fread(); skip if data.table is already attached
texts <- fread("lurich1.txt", header=F)[,.(id=V1,txt=V2)]
```
8) Get concordances with the get_concordances() command.
```
concs <- get_concordances(searchterm="[Ll]urich", texts=texts, before=30, after=30, txt="txt", id="id")
```
## Workshop
Install and load the package.
```
dir.create("R_pckg")
remotes::install_github("peeter-t2/digar.txts",lib="~/R_pckg/",upgrade="never")
library(digar.txts,lib.loc="~/R_pckg/")
```
We also use tidyverse packages here.
```
library(tidyverse)
```
Get collection info.
```
all_issues <- get_digar_overview()
```
Explore metadata: newspapers 1880-1940.
```
subset <- all_issues %>%
  filter(DocumentType=="NEWSPAPER") %>%
  filter(year>1880&year<1940)
```
Let's look at the subset.
```
# Issue counts per newspaper
subset %>%
  count(keyid, sort=T)
# Issues per newspaper per year
subset %>%
  count(keyid, year) %>%
  ggplot(aes(x=year, y=n)) +
  geom_point()

# Newspapers with more than 3000 issues in total
subset %>%
  count(keyid, year) %>%
  group_by(keyid) %>%
  mutate(sum=sum(n)) %>%
  filter(sum>3000) %>%
  summarise(max(sum))

# Yearly issue counts for the larger newspapers, as points
subset %>%
  count(keyid, year) %>%
  group_by(keyid) %>%
  mutate(sum=sum(n)) %>%
  filter(sum>3000) %>%
  ggplot(aes(x=year, y=n, color=keyid)) +
  geom_point()

# The same as stacked columns
subset %>%
  count(keyid, year) %>%
  group_by(keyid) %>%
  mutate(sum=sum(n)) %>%
  filter(sum>3000) %>%
  ggplot(aes(x=year, y=n, fill=keyid)) +
  geom_col()

# And split into one panel per newspaper
subset %>%
  count(keyid, year) %>%
  group_by(keyid) %>%
  mutate(sum=sum(n)) %>%
  filter(sum>3000) %>%
  ggplot(aes(x=year, y=n, fill=keyid)) +
  geom_col() +
  facet_wrap(~keyid)
```
Pick postimeesew and focus on that.
```
subset <- all_issues %>%
  filter(DocumentType=="NEWSPAPER") %>%
  filter(year>1880&year<1940) %>%
  filter(keyid=="postimeesew")
```
```
subset_meta <- get_subset_meta(subset)
```
Explore the different section types (genres) in the metadata and the word counts.
```
# Words per year
subset_meta %>%
  group_by(year) %>%
  summarise(words=sum(LogicalSectionTextWordCount)) %>%
  ggplot(aes(x=year, y=words)) +
  geom_col()

# Words per year by section type
subset_meta %>%
  group_by(year, LogicalSectionType) %>%
  summarise(words=sum(LogicalSectionTextWordCount)) %>%
  ggplot(aes(x=year, y=words, fill=LogicalSectionType)) +
  geom_col()

# Store yearly article and word counts for later comparisons
subset_articlecounts <- subset_meta %>%
  count(year)
subset_wordcounts <- subset_meta %>%
  group_by(year) %>%
  summarise(words=sum(LogicalSectionTextWordCount))
```
Do a search.
```
do_subset_search(searchterm="lurich", searchfile="lurich1.txt",subset)
texts <- fread("lurich1.txt",header=F)[,.(id=V1,txt=V2)]
concs <- get_concordances(searchterm="[Ll]urich",texts=texts,before=30,after=30,txt="txt",id="id")
```
Join with metadata.
```
texts_w_meta <- texts %>%
  left_join(subset_meta %>%
              select(LogicalSectionID, LogicalSectionTitle, LogicalSectionType,
                     LogicalSectionTextWordCount, MeanOCRAccuracyVol, docid, year),
            by=c("id"="LogicalSectionID"))
texts_w_meta %>%
  count(year) %>%
  ggplot(aes(x=year, y=n)) +
  geom_col()
text_articlecounts <- texts_w_meta %>%
  count(year)
text_wordcounts <- texts_w_meta %>%
  group_by(year) %>%
  summarise(words=sum(LogicalSectionTextWordCount))
```
Compare the found articles with the whole corpus.
```
subset_articlecounts %>%
  left_join(text_articlecounts, by="year") %>%
  ggplot(aes(x=year, y=n.y/n.x)) +
  geom_line()
```
Look at general wordcounts.
```
library(tidytext)
wordcounts <- texts_w_meta %>%
  unnest_tokens(word, txt) %>%
  count(word, sort=T)
```
Filter out stopwords (anti-join with a stopword list).
```
stopwords <- readr::read_csv("https://datadoi.ee/bitstream/handle/33/78/estonian-stopwords.txt?sequence=1&isAllowed=y", col_names=FALSE) %>%
  rename(word=X1)
contentwords <- wordcounts %>%
  anti_join(stopwords, by="word")
contentwords %>%
  head(20)
```
Compare with another corpus.
```
do_subset_search(searchterm="konrad mä[ge]", searchfile="magi.txt",subset)
texts2 <- fread("magi.txt",header=F)[,.(id=V1,txt=V2)]
nrow(texts2)
texts_w_meta2 <- texts2 %>% left_join(subset_meta %>% select(LogicalSectionID,LogicalSectionTitle,LogicalSectionType,LogicalSectionTextWordCount,MeanOCRAccuracyVol,docid,year),by=c("id"="LogicalSectionID"))
```
```
wordcounts2 <- texts_w_meta2 %>%
  unnest_tokens(word, txt) %>%
  count(word, sort=T)
contentwords2 <- wordcounts2 %>%
  anti_join(stopwords, by="word") %>%
  mutate(set="mägi")
contentwords2 %>%
  head(20)
contentwords <- wordcounts %>%
  anti_join(stopwords, by="word") %>%
  mutate(set="lurich")
# Find the words that distinguish the two sets
tf_idf <- contentwords %>%
  rbind(contentwords2) %>%
  bind_tf_idf(word, set, n)
tf_idf %>%
  arrange(desc(tf_idf)) %>%
  group_by(set) %>%
  filter(n>5) %>%
  filter(!stringr::str_detect(word,"[0-9]")) %>%
  mutate(row_number=row_number()) %>%
  filter(row_number<21) %>%
  ggplot(aes(x=set, y=row_number, label=word)) +
  geom_label()
```
Look at preceding context.
```
concs_before <- get_concordances(searchterm="[Ll]urich",texts=texts,before=15,after=0,txt="txt",id="id")
str(concs_before)
```
```
concs_before %>%
  unnest_tokens(word, context) %>%
  count(word, sort=T) %>%
  head(20)
```
```
concs_before %>%
  filter(str_detect(context, "jõuumees"))
```
---
Now the same with more texts. Let's search for 'auru' (steam), 'elekt' (electricity), and 'hobu' (horse).
```
do_subset_search(searchterm="auru", searchfile="aur.txt",subset)
texts3 <- fread("aur.txt",header=F)[,.(id=V1,txt=V2)]
texts_w_meta3 <- texts3 %>% left_join(subset_meta %>% select(LogicalSectionID,LogicalSectionTitle,LogicalSectionType,LogicalSectionTextWordCount,MeanOCRAccuracyVol,docid,year),by=c("id"="LogicalSectionID"))
do_subset_search(searchterm="elekt", searchfile="elekter.txt",subset)
texts4 <- fread("elekter.txt",header=F)[,.(id=V1,txt=V2)]
texts_w_meta4 <- texts4 %>% left_join(subset_meta %>% select(LogicalSectionID,LogicalSectionTitle,LogicalSectionType,LogicalSectionTextWordCount,MeanOCRAccuracyVol,docid,year),by=c("id"="LogicalSectionID"))
do_subset_search(searchterm="hobu", searchfile="hobu.txt",subset)
texts5 <- fread("hobu.txt",header=F)[,.(id=V1,txt=V2)]
texts_w_meta5 <- texts5 %>% left_join(subset_meta %>% select(LogicalSectionID,LogicalSectionTitle,LogicalSectionType,LogicalSectionTextWordCount,MeanOCRAccuracyVol,docid,year),by=c("id"="LogicalSectionID"))
text_articlecounts3 <- texts_w_meta3 %>%
count(year) %>% mutate(set="aur")
text_articlecounts4 <- texts_w_meta4 %>%
count(year) %>% mutate(set="elekter")
text_articlecounts5 <- texts_w_meta5 %>%
count(year) %>% mutate(set="hobu")
text_articlecounts3 %>%
rbind(text_articlecounts4) %>%
rbind(text_articlecounts5) %>%
left_join(subset_articlecounts,by="year") %>%
ggplot(aes(x=year,y=n.x/n.y,color=set))+
geom_line()
```
unnest_tokens() already takes a while with 20,000 texts.
```
wordcounts3 <- texts_w_meta3 %>%
  unnest_tokens(word, txt) %>%
  count(word, sort=T)
auru <- wordcounts3 %>%
  filter(str_detect(word, "auru"))
```
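If you want to see how long the tokenization takes, you can wrap the call in system.time() (base R, purely optional):
```
# Optional: time the tokenization and counting step
system.time(
  wordcounts3 <- texts_w_meta3 %>%
    unnest_tokens(word, txt) %>%
    count(word, sort=T)
)
```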
```
auru %>%
  head(20)
```
With 30,000 texts unnest_tokens() takes even longer.
```
wordcounts4 <- texts_w_meta4 %>%
  unnest_tokens(word, txt) %>%
  count(word, sort=T)
elek <- wordcounts4 %>%
  filter(str_detect(word, "elekt"))
elek %>%
  head(20)
head(20)
```
With 50,000 texts unnest_tokens() takes longer still. There are faster ways to tokenize in R; unnest_tokens() is mostly good for smaller text collections (see the sketch after the next code block).
```
wordcounts5 <- texts_w_meta5 %>%
  unnest_tokens(word, txt) %>%
  count(word, sort=T)
hobu <- wordcounts5 %>%
  filter(str_detect(word, "hobu"))
hobu %>%
  head(20)
head(20)
```
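As one illustration of a faster approach, here is a minimal sketch using data.table and stringi directly. The choice of stri_extract_all_words() is ours for this example (stringi may need to be installed first), and its tokenization differs slightly from what unnest_tokens() produces.
```
# Faster word counts with data.table + stringi (illustrative sketch)
library(data.table)
library(stringi)

# texts5 was read with fread() above, so it is already a data.table
tokens5 <- texts5[, .(word = tolower(unlist(stri_extract_all_words(txt)))), by = id]
wordcounts5_fast <- tokens5[, .N, by = word][order(-N)]
head(wordcounts5_fast, 20)
```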
---
### Language 1
simple
- <- - save values
- str() - overview of a table
- %>% - pipe values forward
- filter() - filter by some value
- count() - count occurrences of values
- mutate() - make a new column
- head() - take the first n rows

ggplot
- ggplot(aes(x=x,y=y,color=color,label=label))+
- geom_point()+
- geom_line()+
- geom_col()+
- geom_label()

extra
- n() - number of rows in a group
- row_number() - running row number within a group
- min() - smallest value
- max() - largest value
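A small combined example of these verbs, reusing the subset table built above (the 1900 cut-off and the decade column are arbitrary, for illustration only):
```
# Count issues per newspaper and year, keep later years,
# add a new column, and glance at the result
subset %>%
  filter(year > 1900) %>%
  count(keyid, year) %>%
  mutate(decade = (year %/% 10) * 10) %>%
  head(10)
```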
### Language 2
Tidytext commands
- unnest_tokens() - splits text into smaller units (words by default)
- bind_tf_idf() - finds the distinguishing words based on word-document frequencies
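A tiny self-contained illustration of both commands on invented toy data (the two mini-documents below are made up for demonstration only):
```
library(tidytext)
library(dplyr)

toy <- tibble::tibble(doc = c("a", "b"),
                      txt = c("aur ja elekter", "hobune ja vanker"))

toy %>%
  unnest_tokens(word, txt) %>%   # split the texts into words
  count(doc, word) %>%           # word counts per document
  bind_tf_idf(word, doc, n)      # tf-idf marks words specific to each document
```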