# Data

## slides for OECD

https://docs.google.com/presentation/d/1TfsMH2fTfoVF7-3keuFCA7EhEOTznzzz7nMDNDacuvE/edit?usp=sharing

## Awful Data

Claire:

- UK_02.pdf
  - page 6: table --> pypdf misses spaces and does not indicate the table format
- UK_07.pdf
  - page 4: big T, pypdf is ok
  - page 6: figure with text partially extracted, words cut (Connectivit y Horiz ontal OS)
- UK_08
  - page 6: graph
  - page 7: 3 columns, which pypdf didn't recognize

Carlos:

- UK_119, pages 6, 8 (interesting to look at)
- UK_90, pages 5, 9, 11 (interesting to look at)
- UK_82, page 5 (text in img format)

## major issues

- 3-column docs (both pypdf and nougat)
- graphs / figures with text on them

## manually added documents

UK_06 was missing --> downloaded from the internet as a PDF

## Symbols and text to remove

* © + what follows
* ●
* ■
* words longer than x characters (pypdf missed spaces, e.g. ethicsandsafetyinthepublicsector)

### Nougat

* consecutive dots or any consecutive symbol (UK_82 p. 2)
* Sometimes `\({}^{<footnote_nr>}\)` is used instead of a footnote
* `[MISSING_PAGE_EMPTY:<page_nr>]`
* `[MISSING_PAGE_FAIL:39]`
* links
* footnotes
* repeated text
* figure captions

## corrupt files

There should be no more corrupt files in the drive.

## random about files

Get the official English translation of DEU_13 so we can compare it with our translation.

## PDF to Text evaluation

| pdf2text tool name | Processing time / page | Quality of pure text | Completeness | Filtering of redundant data | Inclusion of titles, figure descriptions etc. | Format | Issues |
|---|---|---|---|---|---|---|---|
| [nougat](https://github.com/facebookresearch/nougat) | 3-4s | 10/10 | Few missing and failed pages per large doc | removes headers/footers, otherwise provides a label "Footnote 2:" | table descriptions, tables (`begin{tabular}...`) | | --- |

# Models

# Lit/Tools Review

[Information Retrieval Google Doc](https://docs.google.com/document/d/1tCzKefkn2KPKQFqileZ5RX4YMXmbhnABHMA7wzoC4AM/edit?usp=sharing)

## qs for Jan

- Clarify deliverable formats
- Summary of possible exact deliverables
- Knowledge of uncertainty
- Interface: is a notebook ok?
- Interact via chat with the doc?

Possible analyses:

- User summary
- Document-level analysis

### Summary from Meeting 18.10.2023 (carlos)

Example qs:

- In what context is AI being mentioned?
- What steps are being proposed for X?
- Who should be interacted with at point X?

Suggestions from OECD for analysis:

- Maybe try to split what is being done into categories, e.g. for AI innovation: is it a uni program? job training? how relevant is it?
- Try to get a sense of magnitude

Further points:

- Don't worry about time relations
- Having anything at all is better than having good preprocessing
- Document summary is not that central
- They want quantitative results.

# Idea Dump

- Keep a list of prompts per technology type, plus a separate query that takes in all these technologies and their prompts, so that a user can input a new technology and get prompts generated for it
- Explore translating documents vs translating prompts + multi-lingual embedding model: [Cohere API works for this](https://txt.cohere.com/search-cohere-langchain/#overview-of-multilingual-semantic-search)
- If the chunk size is small, removing stopwords helps; otherwise it doesn't matter
- Make sure not to upload the GPT key to GitHub. Maybe keep it in an encrypted .byte file, or upload a txt file in Colab and fetch the key from there
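
A possible starting point for the rules listed under "Symbols and text to remove" is sketched below. It is only a rough sketch, not the pipeline's actual cleaning code: the function name `clean_text`, the default of 25 for the "x" in "words longer than x characters", the choice to cut © through the end of its line, and the threshold of 3+ repeats for "consecutive symbols" are all assumptions made for illustration. Links, footnotes, repeated text and figure captions are not handled here.

```python
import re

# Assumed default for the "x" in "words longer than x characters".
MAX_WORD_LEN = 25

# © plus everything after it on the same line (assumed scope of "what follows").
COPYRIGHT_RE = re.compile(r"©[^\n]*")
# Bullet glyphs left over from slide-style PDFs.
BULLET_RE = re.compile(r"[●■]")
# Nougat page markers such as [MISSING_PAGE_EMPTY:4] or [MISSING_PAGE_FAIL:39].
MISSING_PAGE_RE = re.compile(r"\[MISSING_PAGE_(?:EMPTY|FAIL):\d+\]")
# Nougat inline footnote markers such as \({}^{12}\).
FOOTNOTE_RE = re.compile(r"\\\(\{\}\^\{\d+\}\\\)")
# Three or more repeats of the same non-word symbol, e.g. "....." (assumed threshold).
REPEATED_SYMBOL_RE = re.compile(r"([^\w\s])\1{2,}")


def clean_text(text, max_word_len=MAX_WORD_LEN):
    """Apply the removal rules to extracted text, keeping line breaks."""
    text = COPYRIGHT_RE.sub("", text)
    text = BULLET_RE.sub("", text)
    text = MISSING_PAGE_RE.sub("", text)
    text = FOOTNOTE_RE.sub("", text)
    # Collapse a run of identical symbols down to a single one.
    text = REPEATED_SYMBOL_RE.sub(r"\1", text)

    cleaned_lines = []
    for line in text.split("\n"):
        # Drop over-long tokens (likely words glued together by pypdf)
        # and normalise whitespace within the line.
        words = [w for w in line.split() if len(w) <= max_word_len]
        cleaned_lines.append(" ".join(words))
    return "\n".join(cleaned_lines)
```

Dropping over-long tokens also removes most URLs, which matches the note about removing links but loses the glued words entirely; splitting them back into words would need a separate step.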