---
tags: nlib
---
# Workshop at Nelijärve - Accessing National Library texts
## Workshop
Usernames have now been removed. Ask if you still want access.
## Setting up in the JupyterLab environment
- Go to the webpage jupyter.hpc.ut.ee/ and log in.
- Pick the first (default) option, with 1 CPU core, 8 GB memory, and a 6 h time limit.
- Create a new notebook and set the kernel to R.
- Convenience shortcut: add the following code in Settings -> Advanced Settings Editor... -> Keyboard Shortcuts, in the User Preferences box on the left. This gives you the RStudio shortcut Ctrl-Shift-M for inserting the %>% pipe operator in JupyterLab.
```
{
    "shortcuts": [
        {
            "command": "notebook:replace-selection",
            "selector": ".jp-Notebook",
            "keys": ["Ctrl Shift M"],
            "args": {"text": "%>% "}
        }
    ]
}
```
## Contents
The library access package comprises four native commands and a file system.
- get_digar_overview() - gives an overview of the collection (issue-level)
- get_subset_meta() - gives meta information on a subset (article-level)
- do_subset_search() - runs a search on the subset and writes the results to a file
- get_concordances() - extracts words-in-context (concordances) from the search results

Any R packages can be used to work with the data in between. The native commands are built on the data.table package.
---
## Starting up
1) First, install the required package
```
#Install the remotes package if needed. JupyterLab should already have it.
#install.packages("remotes")
#Since the JupyterLab we use does not have write access to the
#system library, we install packages into a local folder.
dir.create("~/R_pckg")
remotes::install_github("peeter-t2/digar.txts", lib="~/R_pckg/", upgrade="never")
```
2) Load the package that was installed:
```
library(digar.txts,lib.loc="~/R_pckg/")
```
3) Use get_digar_overview() to get an overview of the collection (issue-level).
```
all_issues <- get_digar_overview()
```
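To see what the overview table contains, you can inspect it with base R helpers (optional):
```
str(all_issues)   # column types of the issue-level table
head(all_issues)  # first few rows
```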
4) Build a custom subset using any tools in R. Here is a tidyverse-style example.
```
library(tidyverse)
subset <- all_issues %>%
  filter(DocumentType=="NEWSPAPER") %>%
  filter(year>1880&year<1940) %>%
  filter(keyid=="postimeesew")
```
5) Get meta information on that subset with get_subset_meta(). If this information will be reused, it can be useful to store it with the commented lines.
```
subset_meta <- get_subset_meta(subset)
#potentially write to file, for easier access if returning to it
#readr::write_tsv(subset_meta,"subset_meta_postimeesew1.tsv")
#subset_meta <- readr::read_tsv("subset_meta_postimeesew1.tsv")
```
6) Do a search with do_subset_search(). This exports the search results into a file. do_subset_search() ignores case.
```
do_subset_search(searchterm="lurich", searchfile="lurich1.txt",subset)
```
7) Read in the search results, using any R tools. It is useful to name the id and text columns id and txt.
```
library(data.table) # for fread(); skip if data.table is already attached
texts <- fread("lurich1.txt", header=F)[,.(id=V1,txt=V2)]
```
8) Get concordances with the get_concordances() command.
```
concs <- get_concordances(searchterm="[Ll]urich", texts=texts, before=30, after=30, txt="txt", id="id")
```
## Workshop
Install and load the package.
```
dir.create("R_pckg")
remotes::install_github("peeter-t2/digar.txts",lib="~/R_pckg/",upgrade="never")
library(digar.txts,lib.loc="~/R_pckg/")
```
We also use tidyverse packages here.
```
library(tidyverse)
```
Get collection info.
```
all_issues <- get_digar_overview()
```
Explore metadata: newspapers 1880-1940.
```
subset <- all_issues %>%
  filter(DocumentType=="NEWSPAPER") %>%
  filter(year>1880&year<1940)
```
Let's look at the subset.
```
# Issue counts per newspaper
subset %>%
  count(keyid, sort=T)
# Issues per newspaper per year
subset %>%
  count(keyid, year) %>%
  ggplot(aes(x=year, y=n)) +
  geom_point()

# Newspapers with more than 3000 issues in total
subset %>%
  count(keyid, year) %>%
  group_by(keyid) %>%
  mutate(sum=sum(n)) %>%
  filter(sum>3000) %>%
  summarise(max(sum))

# Yearly issue counts for the larger newspapers, as points
subset %>%
  count(keyid, year) %>%
  group_by(keyid) %>%
  mutate(sum=sum(n)) %>%
  filter(sum>3000) %>%
  ggplot(aes(x=year, y=n, color=keyid)) +
  geom_point()

# The same as stacked columns
subset %>%
  count(keyid, year) %>%
  group_by(keyid) %>%
  mutate(sum=sum(n)) %>%
  filter(sum>3000) %>%
  ggplot(aes(x=year, y=n, fill=keyid)) +
  geom_col()

# And split into one panel per newspaper
subset %>%
  count(keyid, year) %>%
  group_by(keyid) %>%
  mutate(sum=sum(n)) %>%
  filter(sum>3000) %>%
  ggplot(aes(x=year, y=n, fill=keyid)) +
  geom_col() +
  facet_wrap(~keyid)
```
Pick postimeesew and focus on that.
```
subset <- all_issues %>%
  filter(DocumentType=="NEWSPAPER") %>%
  filter(year>1880&year<1940) %>%
  filter(keyid=="postimeesew")
```
```
subset_meta <- get_subset_meta(subset)
```
Explore the different section types (genres) in the metadata and the word counts.
```
# Words per year
subset_meta %>%
  group_by(year) %>%
  summarise(words=sum(LogicalSectionTextWordCount)) %>%
  ggplot(aes(x=year, y=words)) +
  geom_col()

# Words per year by section type
subset_meta %>%
  group_by(year, LogicalSectionType) %>%
  summarise(words=sum(LogicalSectionTextWordCount)) %>%
  ggplot(aes(x=year, y=words, fill=LogicalSectionType)) +
  geom_col()

# Store yearly article and word counts for later comparisons
subset_articlecounts <- subset_meta %>%
  count(year)
subset_wordcounts <- subset_meta %>%
  group_by(year) %>%
  summarise(words=sum(LogicalSectionTextWordCount))
```
Do a search.
```
do_subset_search(searchterm="lurich", searchfile="lurich1.txt",subset)
texts <- fread("lurich1.txt",header=F)[,.(id=V1,txt=V2)]
concs <- get_concordances(searchterm="[Ll]urich",texts=texts,before=30,after=30,txt="txt",id="id")
```
Join with metadata.
```
texts_w_meta <- texts %>%
  left_join(subset_meta %>%
              select(LogicalSectionID, LogicalSectionTitle, LogicalSectionType,
                     LogicalSectionTextWordCount, MeanOCRAccuracyVol, docid, year),
            by=c("id"="LogicalSectionID"))
texts_w_meta %>%
  count(year) %>%
  ggplot(aes(x=year, y=n)) +
  geom_col()
text_articlecounts <- texts_w_meta %>%
  count(year)
text_wordcounts <- texts_w_meta %>%
  group_by(year) %>%
  summarise(words=sum(LogicalSectionTextWordCount))
```
Compare the found articles with the whole corpus.
```
subset_articlecounts %>%
  left_join(text_articlecounts, by="year") %>%
  ggplot(aes(x=year, y=n.y/n.x)) +
  geom_line()
```
Look at general wordcounts.
```
library(tidytext)
wordcounts <- texts_w_meta %>%
  unnest_tokens(word, txt) %>%
  count(word, sort=T)
```
Filter out stopwords (anti-join with a stopword list).
```
stopwords <- readr::read_csv("https://datadoi.ee/bitstream/handle/33/78/estonian-stopwords.txt?sequence=1&isAllowed=y", col_names=FALSE) %>%
  rename(word=X1)
contentwords <- wordcounts %>%
  anti_join(stopwords, by="word")
contentwords %>%
  head(20)
```
Compare with another corpus.
```
do_subset_search(searchterm="konrad mä[ge]", searchfile="magi.txt",subset)
texts2 <- fread("magi.txt",header=F)[,.(id=V1,txt=V2)]
nrow(texts2)
texts_w_meta2 <- texts2 %>% left_join(subset_meta %>% select(LogicalSectionID,LogicalSectionTitle,LogicalSectionType,LogicalSectionTextWordCount,MeanOCRAccuracyVol,docid,year),by=c("id"="LogicalSectionID"))
```
```
wordcounts2 <- texts_w_meta2 %>%
  unnest_tokens(word, txt) %>%
  count(word, sort=T)
contentwords2 <- wordcounts2 %>%
  anti_join(stopwords, by="word") %>%
  mutate(set="mägi")
contentwords2 %>%
  head(20)
contentwords <- wordcounts %>%
  anti_join(stopwords, by="word") %>%
  mutate(set="lurich")
# Find the words that distinguish the two sets
tf_idf <- contentwords %>%
  rbind(contentwords2) %>%
  bind_tf_idf(word, set, n)
tf_idf %>%
  arrange(desc(tf_idf)) %>%
  group_by(set) %>%
  filter(n>5) %>%
  filter(!stringr::str_detect(word,"[0-9]")) %>%
  mutate(row_number=row_number()) %>%
  filter(row_number<21) %>%
  ggplot(aes(x=set, y=row_number, label=word)) +
  geom_label()
```
Look at preceding context.
```
concs_before <- get_concordances(searchterm="[Ll]urich",texts=texts,before=15,after=0,txt="txt",id="id")
str(concs_before)
```
```
concs_before %>%
  unnest_tokens(word, context) %>%
  count(word, sort=T) %>%
  head(20)
```
```
concs_before %>%
  filter(str_detect(context, "jõuumees"))
```
---
Now the same with more texts. Let's search for 'auru' (steam), 'elekt' (electricity), and 'hobu' (horse).
```
do_subset_search(searchterm="auru", searchfile="aur.txt",subset)
texts3 <- fread("aur.txt",header=F)[,.(id=V1,txt=V2)]
texts_w_meta3 <- texts3 %>% left_join(subset_meta %>% select(LogicalSectionID,LogicalSectionTitle,LogicalSectionType,LogicalSectionTextWordCount,MeanOCRAccuracyVol,docid,year),by=c("id"="LogicalSectionID"))
do_subset_search(searchterm="elekt", searchfile="elekter.txt",subset)
texts4 <- fread("elekter.txt",header=F)[,.(id=V1,txt=V2)]
texts_w_meta4 <- texts4 %>% left_join(subset_meta %>% select(LogicalSectionID,LogicalSectionTitle,LogicalSectionType,LogicalSectionTextWordCount,MeanOCRAccuracyVol,docid,year),by=c("id"="LogicalSectionID"))
do_subset_search(searchterm="hobu", searchfile="hobu.txt",subset)
texts5 <- fread("hobu.txt",header=F)[,.(id=V1,txt=V2)]
texts_w_meta5 <- texts5 %>% left_join(subset_meta %>% select(LogicalSectionID,LogicalSectionTitle,LogicalSectionType,LogicalSectionTextWordCount,MeanOCRAccuracyVol,docid,year),by=c("id"="LogicalSectionID"))
text_articlecounts3 <- texts_w_meta3 %>%
count(year) %>% mutate(set="aur")
text_articlecounts4 <- texts_w_meta4 %>%
count(year) %>% mutate(set="elekter")
text_articlecounts5 <- texts_w_meta5 %>%
count(year) %>% mutate(set="hobu")
text_articlecounts3 %>%
rbind(text_articlecounts4) %>%
rbind(text_articlecounts5) %>%
left_join(subset_articlecounts,by="year") %>%
ggplot(aes(x=year,y=n.x/n.y,color=set))+
geom_line()
```
unnest_tokens() already takes a while with 20,000 texts.
```
wordcounts3 <- texts_w_meta3 %>%
  unnest_tokens(word, txt) %>%
  count(word, sort=T)
auru <- wordcounts3 %>%
  filter(str_detect(word, "auru"))
```
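If you want to see how long the tokenization takes, you can wrap the call in system.time() (base R, purely optional):
```
# Optional: time the tokenization and counting step
system.time(
  wordcounts3 <- texts_w_meta3 %>%
    unnest_tokens(word, txt) %>%
    count(word, sort=T)
)
```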
```
auru %>%
  head(20)
```
With 30,000 texts unnest_tokens() takes even longer.
```
wordcounts4 <- texts_w_meta4 %>%
  unnest_tokens(word, txt) %>%
  count(word, sort=T)
elek <- wordcounts4 %>%
  filter(str_detect(word, "elekt"))
elek %>%
  head(20)
head(20)
```
With 50,000 texts unnest_tokens() takes longer still. There are faster ways to tokenize in R; unnest_tokens() is mostly good for smaller text collections (see the sketch after the next code block).
```
wordcounts5 <- texts_w_meta5 %>%
  unnest_tokens(word, txt) %>%
  count(word, sort=T)
hobu <- wordcounts5 %>%
  filter(str_detect(word, "hobu"))
hobu %>%
  head(20)
head(20)
```
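As one illustration of a faster approach, here is a minimal sketch using data.table and stringi directly. The choice of stri_extract_all_words() is ours for this example (stringi may need to be installed first), and its tokenization differs slightly from what unnest_tokens() produces.
```
# Faster word counts with data.table + stringi (illustrative sketch)
library(data.table)
library(stringi)

# texts5 was read with fread() above, so it is already a data.table
tokens5 <- texts5[, .(word = tolower(unlist(stri_extract_all_words(txt)))), by = id]
wordcounts5_fast <- tokens5[, .N, by = word][order(-N)]
head(wordcounts5_fast, 20)
```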
---
### Language 1
simple
- <- - save values
- str() - overview of a table
- %>% - pipe values forward
- filter() - filter by some value
- count() - count occurrences of values
- mutate() - make a new column
- head() - take the first n rows

ggplot
- ggplot(aes(x=x,y=y,color=color,label=label))+
- geom_point()+
- geom_line()+
- geom_col()+
- geom_label()

extra
- n() - number of rows in a group
- row_number() - running row number within a group
- min() - smallest value
- max() - largest value
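A small combined example of these verbs, reusing the subset table built above (the 1900 cut-off and the decade column are arbitrary, for illustration only):
```
# Count issues per newspaper and year, keep later years,
# add a new column, and glance at the result
subset %>%
  filter(year > 1900) %>%
  count(keyid, year) %>%
  mutate(decade = (year %/% 10) * 10) %>%
  head(10)
```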
### Language 2
Tidytext commands
- unnest_tokens() - splits text into smaller units (words by default)
- bind_tf_idf() - finds the distinguishing words based on word-document frequencies
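A tiny self-contained illustration of both commands on invented toy data (the two mini-documents below are made up for demonstration only):
```
library(tidytext)
library(dplyr)

toy <- tibble::tibble(doc = c("a", "b"),
                      txt = c("aur ja elekter", "hobune ja vanker"))

toy %>%
  unnest_tokens(word, txt) %>%   # split the texts into words
  count(doc, word) %>%           # word counts per document
  bind_tf_idf(word, doc, n)      # tf-idf marks words specific to each document
```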