# NLP on Local Gazetteers
slide: https://frama.link/mpi
data & code : https://sharedocs.huma-num.fr/wl/?id=1S5xuGEuwiOGCtAp6f1vfofUioNXcMGB
---
## Who am I?
Pierre Magistry
(pierre.magistry@univ-amu.fr)
- PhD in Computational Linguistics
- postdoc at Aix-Marseille Univ., working on NLP for historical documents of Modern China (1830~1949)
- working mostly with Scala, Python and R
---
# Today's topic:
# Language Modeling
---
## Tutorial and materials for NLP
- the "bible": https://web.stanford.edu/~jurafsky/slp3/
- Python toolkits: NLTK, spaCy
- deep learning frameworks
- flair https://github.com/zalandoresearch/flair
- allenNLP : https://demo.allennlp.org/
---
## Language Modeling
-> slides from SLP, chapter 4
---
## What about 文言文 ?
- what are words?
- what are sentences?
- what are documents?
- what about sinograms?
----
## The most important thing is...
**Preprocessing and data curation**
- clear goal
- clean input
- well defined units
----
## back to SLP
---
## Hands on
### Tools
- using KenLM https://kheafield.com/code/kenlm/
- for python `pip install kenlm`
- in a jupyter notebook
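Once a binary model exists (see the training commands later in these slides), a line can be scored from Python. A minimal sketch, guarded so it only runs if the model file is present; `model.bin` is a hypothetical path.

```python
# Hypothetical usage of the kenlm Python bindings; "model.bin" stands for
# a binary model produced with lmplz/build_binary and may not exist yet.
import os

if os.path.exists("model.bin"):
    import kenlm
    model = kenlm.Model("model.bin")
    line = "山 川 城 郭"  # one "sentence": sinograms separated by spaces
    print(model.score(line, bos=True, eos=True))  # total log10 probability
    print(model.perplexity(line))                 # per-word perplexity
```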
----
### Goal
- an LM-based distance between gazetteers
- using the perplexity of cross-trained models
- we train one model per gazetteer
- we compute its perplexity on all the others
- a simple visualization of the results
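The cross-training step above can be sketched as follows. This is only an illustration: `models`, `corpora`, and the `perplexity(model, text)` helper are hypothetical names (the helper would wrap e.g. kenlm's `Model.perplexity`).

```python
# Sketch of the cross-perplexity comparison (all names are hypothetical):
# `models` and `corpora` are dicts keyed by gazetteer name, and
# `perplexity(model, text)` returns the perplexity of `text` under `model`.
def cross_perplexity(models, corpora, perplexity):
    """ppl[i][j] = perplexity of gazetteer j's text under gazetteer i's model."""
    return {i: {j: perplexity(models[i], corpora[j]) for j in corpora}
            for i in models}
```

Lower perplexity means the model finds the other gazetteer's language more predictable, i.e. more similar. Note the matrix is asymmetric, so you may want to symmetrize it before visualizing it as a distance.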
----
### Data
- text from the Rise API
- I took only some large gazetteers
- each line is treated as a "sentence"
- each sinogram as a "word" (spaces replaced by _ )
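The preprocessing described above can be sketched in a few lines; `to_sinogram_words` is a hypothetical helper name.

```python
def to_sinogram_words(line):
    """Split a line into one-character "words" for the LM, keeping
    original spaces as the placeholder token "_" so they survive
    whitespace tokenization (hypothetical helper)."""
    return " ".join("_" if c == " " else c for c in line.strip())

print(to_sinogram_words("池州府 志"))  # → "池 州 府 _ 志"
```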
----
#### training the models
```bash
lmplz -o 4 < input.text > model.arpa
build_binary model.arpa model.bin
```
#### See the Notebook
or LM.py