---
tags: tekstiaine
---
# Notes taken during class (SVUH.00.093 tekstid2 09.04.2022)
I am collecting the notes taken during class here so they are easier to copy.
```
Code can be written here
```
str_detect, str_extract, str_extract_all
```
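library(tidyverse)  # provides %>% and the stringr functions

# str_detect: is there a match at all (TRUE/FALSE)?
"meie oleme siin" %>% str_detect("me")
"meie oleme siin" %>% str_detect("te")
# str_extract: the first match, or NA if there is none
"meie oleme siin" %>% str_extract("me")
"meie oleme siin" %>% str_extract("te")
# str_extract_all: every match, returned as a list
"meie oleme siin" %>% str_extract_all("me")
# Two ways to match both the capitalized and the lowercase form
"Meie oleme siin" %>% str_extract_all("me|Me")
"Meie oleme siin" %>% str_extract_all("[Mm]e")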
```
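The three functions differ mainly in the shape of what they return. A minimal sketch on a two-element vector; the vector `laused` and the result names are illustrative and not part of the class material:

```r
library(stringr)

# Hypothetical example sentences
laused <- c("meie oleme siin", "teie olete seal")

# One TRUE/FALSE per input string
leitud <- str_detect(laused, "me")
# The first match per input string (NA if nothing matches)
esimene <- str_extract(laused, "[mt]e")
# A list holding one character vector of all matches per input string
koik <- str_extract_all(laused, "[mt]e")

leitud
esimene
lengths(koik)  # number of matches in each string
```

Because `str_extract_all` returns a list, it usually needs `unnest()` or `lengths()` before the result can be used as an ordinary column.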
Kõrts (tavern) and kirik (church): all word forms together
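No code was noted down for this step. One possible approach, assuming the goal is to catch every inflected form of the two words, is to anchor the stem at the start of the word and allow any ending. The word list below is made up for illustration:

```r
library(stringr)

# Hypothetical word list; the real data would be a column of tokens
sonad <- c("kõrts", "kõrtsis", "kirikusse", "kiri", "maja")

# "^" anchors the stem at the start of the word, so any ending may follow
on_vorm <- str_detect(sonad, "^(kõrts|kirik)")
# Which of the two stems matched (NA for all other words)
tyvi <- str_extract(sonad, "^(kõrts|kirik)")

on_vorm
tyvi
```

This sketch also matches compounds that begin with either stem, so the results may need a manual check.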


Examples of location searches
```
# Where in the text do words containing "must" (black) or "valge" (white) occur?
asukohad %>%
  filter(str_detect(word,"must")|str_detect(word,"valge")) %>%
  mutate(type=str_extract(word,"must|valge")) %>%
  ggplot(aes(x=asukoht,y=type))+
  geom_point()
# The same search for "elu" (life) and "surm" (death)
asukohad %>%
  filter(str_detect(word,"elu")|str_detect(word,"surm")) %>%
  mutate(type=str_extract(word,"elu|surm")) %>%
  ggplot(aes(x=asukoht,y=type))+
  geom_point()
```
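The `asukohad` table is not defined in these notes. Judging by the position formula used for bigrams and five-grams elsewhere in the notes, it could plausibly be built as in this sketch. The data is a toy example and the name `asukohad_naide` is made up here:

```r
library(dplyr)
library(tibble)

# Hypothetical token table, one word per row in text order
sonad <- tibble(word = c("must", "kass", "valge", "koer", "mustad"))

# Relative position of each word in the text, strictly between 0 and 1
asukohad_naide <- sonad %>%
  mutate(nr = row_number(), n = n()) %>%
  mutate(asukoht = nr / (n + 1))

asukohad_naide
```

Dividing by `n + 1` rather than `n` keeps the last word just below 1, so texts of different lengths can share the same axis.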






---
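One- and two-word expressions containing rubla
```
# Letters (a preceding word or a prefix), optional spaces, then "rubla" plus at least one letter.
# The trailing "+" means the bare form "rubla" is not matched; use "*" there to include it.
rahad <- raamat1 %>%
  mutate(leiud=str_extract_all(txt,"[a-zõäüöA-ZÕÄÖÜ]+( )*rubla[a-zõäüöA-ZÕÄÖÜ]+"))
```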
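To see what this pattern does and does not catch, it can be tried on a few short strings. The example strings and the variant ending in `*` are assumptions added for illustration:

```r
library(stringr)

# Hypothetical example lines
naited <- c("maksis sada rublatükki", "sai kolm rubla", "hõberublane raha")

# Pattern from the notes: at least one letter must follow "rubla"
muster_pluss <- "[a-zõäüöA-ZÕÄÖÜ]+( )*rubla[a-zõäüöA-ZÕÄÖÜ]+"
# Variant: letters after "rubla" are optional, so the bare form also matches
muster_tarn <- "[a-zõäüöA-ZÕÄÖÜ]+( )*rubla[a-zõäüöA-ZÕÄÖÜ]*"

leiud_pluss <- str_extract_all(naited, muster_pluss)
leiud_tarn <- str_extract_all(naited, muster_tarn)

lengths(leiud_pluss)  # matches per line with the original pattern
unlist(leiud_tarn)    # all matches with the relaxed pattern
```

Which variant is right depends on whether bare "rubla" after a numeral should count as a hit.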
Finding spaces: how many spaces are there on each line
```
tühikud <- raamat1 %>%
  mutate(leiud=str_extract_all(txt," ")) %>%
  unnest(leiud) %>%  # one row per space found
  count(txt)         # rows per line = number of spaces on that line
```
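An alternative that gives the count directly is `str_count`. This is a sketch on toy data, not the approach used in class; `raamat_naide` and `tyhikuid` are made-up names:

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical lines of text
raamat_naide <- tibble(txt = c("üks kaks kolm", "neli", "viis kuus"))

# Count the spaces on each line without extracting them first
tyhikud_naide <- raamat_naide %>%
  mutate(tyhikuid = str_count(txt, " "))

tyhikud_naide
```

Unlike the extract-and-unnest version, this keeps lines that contain no spaces at all, with a count of zero.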
Finding bigrams that end in rubla
```
bigrammid %>%
  filter(str_detect(bigram,"rubla$"))  # "$" anchors the match to the end of the bigram
```
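The `bigrammid` table itself is not created in these notes. A minimal sketch of how it might be produced with `unnest_tokens` from tidytext, assuming a text column named `txt` as in the other blocks and using a made-up sentence:

```r
library(dplyr)
library(stringr)
library(tibble)
library(tidytext)

# Hypothetical one-line text
raamat_naide <- tibble(txt = "ta maksis sada rubla ja läks")

# Split the text into overlapping two-word sequences
bigrammid_naide <- raamat_naide %>%
  unnest_tokens(bigram, txt, token = "ngrams", n = 2)

# Keep only the bigrams whose last word is exactly "rubla"
rubla_lopus <- bigrammid_naide %>%
  filter(str_detect(bigram, "rubla$"))

rubla_lopus
```

Setting `n = 5` in the same call would presumably give the five-grams used further on.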
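Finding the top 10 five-grams step by step.
```
# Step 1: the ten most frequent five-grams
top10 <- viisgrammid %>%
  count(fivegram,sort=TRUE) %>%
  filter(row_number()<11)
# Step 2: relative position (between 0 and 1) of every five-gram in the text
asukohad_viisgrammid <- viisgrammid %>%
  mutate(nr=row_number(), n=n()) %>%
  mutate(asukoht=nr/(n+1)) %>%
  ungroup()
# Step 3: keep only the positions of the top 10 and plot them
top10 %>%
  left_join(asukohad_viisgrammid,by="fivegram") %>%
  ggplot()+
  geom_point(aes(x=asukoht,y=fivegram))
```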

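Bigram locations.
```
# Compute each bigram's relative position in the whole text first, then filter;
# filtering first would number only the matching rows.
asukohad_bigrammid <- bigrammid %>%
  mutate(nr=row_number(), n=n()) %>%
  mutate(asukoht=nr/(n+1)) %>%
  ungroup() %>%
  filter(str_detect(bigram,"ei saanud"))
asukohad_bigrammid %>%
  ggplot()+
  geom_point(aes(x=asukoht,y=bigram))
```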
----
----
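How to keep the capital letters but still remove capitalized stop words
```
raamat1_sonad2 %>%
  count(word,sort=TRUE) %>%
  mutate(original_word=word) %>%   # keep the original capitalization
  mutate(word=tolower(word)) %>%   # lowercase copy used for matching
  anti_join(stopwords,by="word")   # drop stop words regardless of case
```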
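The effect is easiest to see on a tiny example. The word table and the stop word list below are made up for illustration; the real `stopwords` table is assumed to have a lowercase `word` column:

```r
library(dplyr)
library(tibble)

# Hypothetical tokens with their original capitalization
sonad <- tibble(word = c("Ja", "Tallinn", "ja", "linn", "Tallinn"))
# Hypothetical lowercase stop word list
stopwords_naide <- tibble(word = c("ja", "ning"))

tulemus <- sonad %>%
  count(word, sort = TRUE) %>%
  mutate(original_word = word) %>%       # remember the capitalized form
  mutate(word = tolower(word)) %>%       # match against the lowercase list
  anti_join(stopwords_naide, by = "word")

tulemus
```

Both "Ja" and "ja" are removed, while "Tallinn" keeps its capital letter in `original_word`.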
Top words of each chapter without stop words
```
peatykid_sonad2 <- raamat1 %>%
  group_by(chapter) %>%
  unnest_tokens(word,txt,to_lower=F)
peatykid_sonad2 %>%
  anti_join(stopwords,by="word") %>%
  filter(!str_detect(word,"[A-ZÕÄÖÜ]")) %>%  # drop words containing capital letters
  group_by(chapter) %>%
  count(word,sort=TRUE) %>%                   # rank words by frequency within each chapter
  mutate(row_number=row_number()) %>%
  filter(row_number<11) %>%
  filter(chapter<11) %>%
  ggplot(aes(x=chapter,y=row_number,label=word))+
  geom_label()
```
Distinctive words of chapters 10-20
```
tf_idfs_peatykid %>%
  mutate(row_number=row_number()) %>%
  filter(row_number<11) %>%
  filter(chapter %in% c(10:20)) %>%
  ggplot(aes(x=chapter,y=row_number,label=word))+
  geom_label()
```
Try it yourself! Find the distinctive words for chapters 10-20. Raivo's solution:
```
tf_idfs_peatykid <- raamat1_sonad %>%
  group_by(chapter) %>%
  count(word,sort=TRUE) %>%
  bind_tf_idf(word,chapter,n) %>%
  arrange(desc(tf_idf))  # highest tf-idf (most distinctive) first
tf_idfs_peatykid %>%
  mutate(row_number=row_number()) %>%
  filter(row_number<11) %>%
  filter(9<chapter&chapter<21) %>%
  ggplot(aes(x=chapter,y=row_number,label=word))+
  geom_label()
```
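What `bind_tf_idf` computes can be checked by hand on a tiny table: tf is a word's share of its chapter, idf is the natural log of the number of chapters divided by the number of chapters containing the word. The counts below are invented for illustration:

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical word counts for two chapters
loendid <- tibble(
  chapter = c(1, 1, 2, 2),
  word    = c("ja", "rubla", "ja", "hobune"),
  n       = c(3, 1, 2, 2)
)

tulemus <- loendid %>%
  bind_tf_idf(word, chapter, n)

# "ja" occurs in both chapters, so its idf and tf_idf are 0;
# "rubla" and "hobune" occur in one chapter each, so idf = log(2)
tulemus
```

This is why frequent function words drop out of the tf-idf ranking even without a stop word list.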
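Reading in multiple texts
The Sys.setenv call is needed in the virtual environment; it raises the buffer size that vroom uses when reading from connections.
```
Sys.setenv(VROOM_CONNECTION_SIZE = "500000")
filelist <- list.files("data/uiboaed_ilukirjandus/soned",full.names=T)
# Read every file line by line and record which file each line came from
texts <- map_df(filelist, ~ tibble(txt = read_lines(.x)) %>%
                  mutate(filename = .x)) %>%
  mutate(filename= basename(filename))
```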
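The blocks that follow work on a `words` table that these notes never create. Presumably it is the tokenized version of `texts`. The sketch below reproduces the whole chain on two temporary files, so the folder, file names and sentences are all made up:

```r
library(dplyr)
library(purrr)
library(readr)
library(tibble)
library(tidytext)

# Write two small hypothetical texts into a temporary folder
kaust <- file.path(tempdir(), "naidis_tekstid")
dir.create(kaust, showWarnings = FALSE)
writeLines(c("Hobune jooksis.", "Sõda algas."), file.path(kaust, "a.txt"))
writeLines("Hobused ja sõda.", file.path(kaust, "b.txt"))

# Same reading pattern as in the notes: one row per line, plus the file name
filelist <- list.files(kaust, full.names = TRUE)
texts <- map_df(filelist, ~ tibble(txt = read_lines(.x)) %>%
                  mutate(filename = basename(.x)))

# Assumed step: one row per word, keeping the file name alongside
words <- texts %>%
  unnest_tokens(word, txt)

words %>% count(filename)
```

`unnest_tokens` lowercases the words by default, which is why the later searches use lowercase patterns.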
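Answers to the questions about multiple texts
```
# Number of words in each text
words %>%
  count(filename) %>%
  ggplot(aes(y=filename,x=n))+
  geom_col()
# Relative frequency of each word across all texts
words %>%
  count(word,sort=TRUE) %>%
  mutate(sagedus=n/sum(n))
# The ten most frequent words overall
words %>%
  count(word,sort=TRUE) %>%
  mutate(sagedus=n/sum(n)) %>%  # computed before filtering, so it is relative to all words
  filter(row_number()<11) %>%
  ggplot(aes(y=word,x=n))+
  geom_col()
# Word counts within each text
words %>%
  group_by(filename) %>%
  count(word,sort=TRUE)
```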
Selecting a single text
```
words %>%
  filter(str_detect(filename,'Vambola')) %>%
  count(word,sort=TRUE) %>%
  filter(row_number()<11)
```
The most frequent words in each text.
```
words %>%
  group_by(filename) %>%
  count(word,sort=TRUE) %>%
  mutate(rownr=row_number()) %>%  # rank within each text
  filter(rownr<11)
```
The same word across multiple texts
```
words %>%
  group_by(filename) %>%
  count(word,sort=TRUE) %>%
  mutate(freq=n/sum(n))%>%
  filter(str_detect(word,"sõda")) %>%
  # filter(row_number()<10) %>%  # sort already put them in the right order
  ggplot(aes(y=filename,x=freq,color=filename))+
  geom_point()+
  guides(color="none")
```
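The key step here is that `count` keeps the grouping by `filename`, so `n/sum(n)` is a share within each text rather than across the whole corpus. A small check on invented data (`words_naide` is a made-up table):

```r
library(dplyr)
library(tibble)

# Hypothetical tokens from two texts
words_naide <- tibble(
  filename = c("a", "a", "a", "b", "b"),
  word     = c("sõda", "rahu", "sõda", "sõda", "hobune")
)

sagedused <- words_naide %>%
  group_by(filename) %>%
  count(word, sort = TRUE) %>%
  mutate(freq = n / sum(n))   # share of the word within its own text

# The frequencies should add up to 1 inside every text
kokku <- sagedused %>%
  summarise(kokku = sum(freq))

sagedused
kokku
```

Relative frequencies make long and short texts comparable, which raw counts would not.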
Add the words next to the points as well.
```
words %>%
  group_by(filename) %>%
  count(word,sort=TRUE) %>%
  mutate(freq=n/sum(n)) %>%
  filter(str_detect(word,"^hobu")) %>%  # every word form that starts with "hobu"
  ggplot(aes(y=filename,x=freq,label=word))+
  geom_point()+
  geom_text()
```