# Data
## slides for OECD
https://docs.google.com/presentation/d/1TfsMH2fTfoVF7-3keuFCA7EhEOTznzzz7nMDNDacuvE/edit?usp=sharing
## Awful Data
Claire:
UK_02.pdf - page 6 table
--> pypdf misses spaces and does not indicate the table format
UK_07.pdf - page 4 big T pypdf is ok
page 6 figure with text partially extracted, words cut (e.g. "Connectivit y Horiz ontal OS")
UK_08 page 6 graph
page 7: 3 columns that pypdf didn't recognize
Carlos:
UK_119, page 6, 8 (interesting to look at)
UK_90, page 5, 9, 11 (interesting to look at)
UK_82, page 5 (text in img format)
## major issues
3 columns docs (both pypdf and nougat)
graph / figures with text on it
## manually added documents
missing UK_06 --> downloaded from internet as a PDF
## Symbols and text to remove
* © + what follows
* ●
* ■
* words longer than x characters (pypdf missed spaces, e.g. `ethicsandsafetyinthepublicsector`)
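
A minimal sketch of how the removals above could be applied to pypdf output. The function names, the threshold of 25 characters (the notes leave "x" undecided), and the reading of "what follows" as "the rest of the line" are all assumptions for illustration.

```python
import re

MAX_WORD_LEN = 25  # assumed value for "x"; tune on real documents


def clean_pypdf_line(line: str) -> str:
    """Strip the listed symbols from one line of extracted text."""
    # Drop the copyright sign and everything after it on this line (assumption).
    line = line.split("©", 1)[0]
    # Remove the bullet glyphs.
    line = re.sub(r"[●■]", "", line)
    # Collapse the whitespace left behind.
    return re.sub(r"\s+", " ", line).strip()


def overlong_words(text: str, max_len: int = MAX_WORD_LEN) -> list[str]:
    """Return tokens longer than max_len, likely words fused by missing spaces."""
    return [word for word in text.split() if len(word) > max_len]
```

Flagging overlong tokens rather than deleting them straight away makes it easier to check how often the threshold catches legitimate long words or URLs.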
### Nougat
* consecutive dots or any consecutive symbol (UK_82 p. 2)
* Sometimes `\({}^{<footnote_nr>}\)` is used instead of footnote
* `[MISSING_PAGE_EMPTY:<page_nr>]`
* `[MISSING_PAGE_FAIL:39]`
* links
* footnotes
* repeated text
* figure captions
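
A rough sketch of regex cleanup for the Nougat artifacts noted above (page placeholders, inline footnote markers, runs of a repeated symbol). The patterns and the "three or more repeats" cutoff are assumptions, not tested against the full corpus; `#` is excluded so Markdown headings survive.

```python
import re

# Page placeholders such as [MISSING_PAGE_EMPTY:12] or [MISSING_PAGE_FAIL:39].
MISSING_PAGE = re.compile(r"\[MISSING_PAGE_(?:EMPTY|FAIL):\d+\]")
# Inline footnote markers of the form \({}^{3}\).
FOOTNOTE_MARK = re.compile(r"\\\(\{\}\^\{\d+\}\\\)")
# Three or more repeats of the same symbol (e.g. "....."), except "#".
SYMBOL_RUN = re.compile(r"([^\w\s#])\1{2,}")


def clean_nougat(text: str) -> str:
    """Remove placeholders and footnote markers, collapse symbol runs to one symbol."""
    text = MISSING_PAGE.sub("", text)
    text = FOOTNOTE_MARK.sub("", text)
    text = SYMBOL_RUN.sub(r"\1", text)
    return text
```

Note that the symbol-run pattern would also collapse `---` separators, so it should be applied with care if Markdown tables or rules need to be preserved.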
## corrupt files
Should no longer be in the drive.
## random about files
official English translation for DEU_13 so we can compare with our translation
## PDF to Text evaluation
| pdf2text tool name | Processing time / page | Quality of pure text | Completeness | Filtering of redundant data | inclusion of titles, figure descriptions etc. | Format | Issues |
|---|---|---|---|---|---|---|---|
|[nougat](https://github.com/facebookresearch/nougat)|3-4s|10/10|Few missing and failed pages per large doc |removes headers/footers, otherwise provides a label "Footnote 2:" |table descriptions, tables (`\begin{tabular}...`)||---|
# Models
# Lit/Tools Review
[Information Retrieval Google Doc](https://docs.google.com/document/d/1tCzKefkn2KPKQFqileZ5RX4YMXmbhnABHMA7wzoC4AM/edit?usp=sharing)
## qs for Jan
- Clarify deliverable formats
- Summary of possible exact deliverables
- knowledge of uncertainty
- Interface Notebook ok?
- Interact via chat with Doc?
Possible analyses
- User summary
- Document level analysis
### Summary from Meeting 18.10.2023 (carlos)
Example qs:
- In what context is AI being mentioned?
- What steps are being proposed for X?
- Who should be interacted with at point X?
Suggestions from OECD for analysis:
- Maybe try to split what is being done into categories, e.g. for AI innovation: is it a uni program? job training? how relevant is it?
- Try to get a sense of magnitude
Further points:
- Don't worry about time relations
- Having anything is better than having good preprocessing
- Document summary not that central
- They want quantitative results.
# Idea Dump
- Have list of prompts based on the type of technology. But have a separate query that takes in all these technologies and the corresponding prompts, allowing user to input a new technology and generating prompts based upon that
- Explore translating documents vs translating prompts + multi-lingual embedding model: [Cohere API works for this](https://txt.cohere.com/search-cohere-langchain/#overview-of-multilingual-semantic-search)
- If the chunk size is small, removing stopwords helps, otherwise it doesn't matter
- Make sure not to upload the GPT key to GitHub. Maybe have an encrypted .byte file, or upload a txt file in Colab to fetch the key from
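
One possible way to keep the key out of the repo, sketched under assumptions: the environment variable name `OPENAI_API_KEY`, the file name `openai_key.txt`, and the helper `load_api_key` are placeholders chosen for illustration. The key file would need to be listed in `.gitignore` (or uploaded to the Colab session only).

```python
import os
from pathlib import Path


def load_api_key(env_var: str = "OPENAI_API_KEY", key_file: str = "openai_key.txt") -> str:
    """Read the key from an environment variable, falling back to a local text file."""
    key = os.environ.get(env_var)
    if key and key.strip():
        return key.strip()
    path = Path(key_file)
    if path.is_file():
        # The file is expected to contain only the key, and must not be committed.
        return path.read_text(encoding="utf-8").strip()
    raise RuntimeError(f"No API key found in ${env_var} or in {key_file}")
```

In a notebook this would be called once at the top, e.g. `api_key = load_api_key()`, so the key never appears in a cell.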