# NLP on Local Gazetteers

<!-- Put the link to this slide here so people can follow -->

slides: https://frama.link/mpi

data & code: https://sharedocs.huma-num.fr/wl/?id=1S5xuGEuwiOGCtAp6f1vfofUioNXcMGB

---

## Who am I?

Pierre Magistry (pierre.magistry@univ-amu.fr)

- PhD in Computational Linguistics
- postdoc at Aix-Marseille Univ., working on NLP for historical documents of Modern China (1830–1949)
- working mostly with Scala, Python, and R

---

# Today's topic:
# Language Modeling

---

## Tutorials and materials for NLP

- the bible: https://web.stanford.edu/~jurafsky/slp3/
- Python toolkits: NLTK, spaCy
- deep-learning frameworks:
  - flair: https://github.com/zalandoresearch/flair
  - AllenNLP: https://demo.allennlp.org/

---

## Language Modeling

-> slides from SLP, chapter 4

---

## What about 文言文 (Classical Chinese)?

- what are words?
- what are sentences?
- what are documents?
- what about sinograms?

----

## The most important thing is...

**Preprocessing and data curation**

- a clear goal
- clean input
- well-defined units

----

## Back to SLP

---

## Hands-on

### Tools

- using KenLM: https://kheafield.com/code/kenlm/
- for Python: `pip install kenlm`
- in a Jupyter notebook

----

### Goal

- an LM-based distance between gazetteers
- using the perplexity of cross-trained models:
  - we train one model on each gazetteer
  - we compute its perplexity on all the others
- a simple visualization of the result

----

### Data

- text from the Rise API
- I took only some of the large gazetteers
- each line is a "sentence"
- each sinogram is a "word" (spaces replaced by `_`)

----

#### Training the models

`lmplz -o 4 < input.text > model.arpa`

`build_binary model.arpa model.bin`

#### See the Notebook or LM.py
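----

#### The idea in miniature

The cross-perplexity distance can be sketched without KenLM, using a toy character-level unigram model with add-one smoothing. The function names and the two toy "gazetteer" strings below are illustrative, not part of the actual pipeline (which uses 4-gram KenLM models):

```python
import math
from collections import Counter

def train_unigram(text):
    """Count characters ("words" here, as in the slides) in a training text."""
    return Counter(text)

def perplexity(counts, text, vocab_size):
    """Add-one-smoothed unigram perplexity of `text` under `counts`."""
    total = sum(counts.values())
    logp = sum(math.log((counts.get(ch, 0) + 1) / (total + vocab_size))
               for ch in text)
    return math.exp(-logp / len(text))

# Two toy "gazetteers": the model trained on one should find the other
# more surprising (higher perplexity) than its own training text.
gaz_a = "aaabbbaaabbb"
gaz_b = "cccdddcccddd"
vocab = len(set(gaz_a + gaz_b))
model_a = train_unigram(gaz_a)
print(perplexity(model_a, gaz_a, vocab))  # low: all characters seen in training
print(perplexity(model_a, gaz_b, vocab))  # high: only smoothing mass left
```

In the real experiment, each gazetteer plays both roles in turn, giving a full matrix of cross-perplexities to visualize.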