Private Semi-structured and Multi-modal RAG w/ LLaMA2 and LLaVA
===
###### tags: `LLM`
###### tags: `ML`, `NLP`, `NLU`, `langchain`, `PDF`, `unstructured`
[TOC]
:::info
## [Semi_structured_multi_modal_RAG_LLaMA2.ipynb](https://github.com/langchain-ai/langchain/blob/master/cookbook/Semi_structured_multi_modal_RAG_LLaMA2.ipynb?ref=blog.langchain.dev)
:::
<br>
## 241 Environment
- ### Development environment
```
conda activate langchain
```
- ### unstructured
- Incomplete install: `pip install unstructured`
- error
`ModuleNotFoundError: No module named 'pdf2image'`
- Full install: `pip install "unstructured[all-docs]"`
[[github] Unstructured-IO / unstructured](https://github.com/Unstructured-IO/unstructured?tab=readme-ov-file#installing-the-library)
---
```
$ pip list | grep unstructured
unstructured 0.12.4
unstructured-client 0.18.0
unstructured-inference 0.7.23
unstructured.pytesseract 0.3.12
```
---
```
$ pipdeptree -p unstructured
Warning!!! Possibly conflicting dependencies found:
* langchain-core==0.1.23
- packaging [required: >=23.2,<24.0, installed: 24.1]
------------------------------------------------------------------------
unstructured==0.12.4
├── backoff [required: Any, installed: 2.2.1]
├── beautifulsoup4 [required: Any, installed: 4.12.2]
│ └── soupsieve [required: >1.2, installed: 2.5]
├── chardet [required: Any, installed: 5.2.0]
├── dataclasses-json [required: Any, installed: 0.5.14]
│ ├── marshmallow [required: >=3.18.0,<4.0.0, installed: 3.20.1]
│ │ └── packaging [required: >=17.0, installed: 24.1]
│ └── typing-inspect [required: >=0.4.0,<1, installed: 0.9.0]
│ ├── mypy-extensions [required: >=0.3.0, installed: 1.0.0]
│ └── typing_extensions [required: >=3.7.4, installed: 4.9.0]
├── emoji [required: Any, installed: 2.10.1]
├── filetype [required: Any, installed: 1.2.0]
├── langdetect [required: Any, installed: 1.0.9]
│ └── six [required: Any, installed: 1.16.0]
├── lxml [required: Any, installed: 4.9.3]
├── nltk [required: Any, installed: 3.8.1]
│ ├── click [required: Any, installed: 8.1.7]
│ ├── joblib [required: Any, installed: 1.3.2]
│ ├── regex [required: >=2021.8.3, installed: 2023.12.25]
│ └── tqdm [required: Any, installed: 4.66.1]
├── numpy [required: Any, installed: 1.25.2]
├── python-iso639 [required: Any, installed: 2024.2.7]
├── python-magic [required: Any, installed: 0.4.27]
├── rapidfuzz [required: Any, installed: 3.6.1]
├── requests [required: Any, installed: 2.31.0]
│ ├── certifi [required: >=2017.4.17, installed: 2023.7.22]
│ ├── charset-normalizer [required: >=2,<4, installed: 3.2.0]
│ ├── idna [required: >=2.5,<4, installed: 3.4]
│ └── urllib3 [required: >=1.21.1,<3, installed: 2.0.4]
├── tabulate [required: Any, installed: 0.9.0]
├── typing_extensions [required: Any, installed: 4.9.0]
├── unstructured-client [required: >=0.15.1, installed: 0.18.0]
│ ├── certifi [required: >=2023.7.22, installed: 2023.7.22]
│ ├── charset-normalizer [required: >=3.2.0, installed: 3.2.0]
│ ├── dataclasses-json-speakeasy [required: >=0.5.11, installed: 0.5.11]
│ │ ├── marshmallow [required: >=3.18.0,<4.0.0, installed: 3.20.1]
│ │ │ └── packaging [required: >=17.0, installed: 24.1]
│ │ └── typing-inspect [required: >=0.4.0,<1, installed: 0.9.0]
│ │ ├── mypy-extensions [required: >=0.3.0, installed: 1.0.0]
│ │ └── typing_extensions [required: >=3.7.4, installed: 4.9.0]
│ ├── idna [required: >=3.4, installed: 3.4]
│ ├── jsonpath-python [required: >=1.0.6, installed: 1.0.6]
│ ├── marshmallow [required: >=3.19.0, installed: 3.20.1]
│ │ └── packaging [required: >=17.0, installed: 24.1]
│ ├── mypy-extensions [required: >=1.0.0, installed: 1.0.0]
│ ├── packaging [required: >=23.1, installed: 24.1]
│ ├── python-dateutil [required: >=2.8.2, installed: 2.8.2]
│ │ └── six [required: >=1.5, installed: 1.16.0]
│ ├── requests [required: >=2.31.0, installed: 2.31.0]
│ │ ├── certifi [required: >=2017.4.17, installed: 2023.7.22]
│ │ ├── charset-normalizer [required: >=2,<4, installed: 3.2.0]
│ │ ├── idna [required: >=2.5,<4, installed: 3.4]
│ │ └── urllib3 [required: >=1.21.1,<3, installed: 2.0.4]
│ ├── six [required: >=1.16.0, installed: 1.16.0]
│ ├── typing_extensions [required: >=4.7.1, installed: 4.9.0]
│ ├── typing-inspect [required: >=0.9.0, installed: 0.9.0]
│ │ ├── mypy-extensions [required: >=0.3.0, installed: 1.0.0]
│ │ └── typing_extensions [required: >=3.7.4, installed: 4.9.0]
│ └── urllib3 [required: >=1.26.18, installed: 2.0.4]
└── wrapt [required: Any, installed: 1.16.0]
```
- ### pdf2image
```
$ pipdeptree -p pdf2image
Warning!!! Possibly conflicting dependencies found:
* langchain-core==0.1.23
- packaging [required: >=23.2,<24.0, installed: 24.1]
------------------------------------------------------------------------
pdf2image==1.17.0
└── pillow [required: Any, installed: 10.2.0]
```
- ### Use Ollama (ollama.ai) to run LLaMA2 locally
https://python.langchain.com/docs/guides/local_llms
- [Download Ollama](https://ollama.com/download)
`$ curl -fsSL https://ollama.com/install.sh | sh`
```
>>> Installing ollama to /usr/local/bin...
>>> Adding ollama user to render group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
Created symlink /etc/systemd/system/default.target.wants/ollama.service → /etc/systemd/system/ollama.service.
>>> NVIDIA GPU installed.
```
- demo
```
$ ollama serve
$ ollama pull llama2
$ ollama pull llama2:13b-chat
$ ollama pull llama2:70b-chat
$ ollama list
$ ollama run llama2:13b-chat
```

- how to stop ollama?
[Stop Ollama #690](https://github.com/ollama/ollama/issues/690)
```
$ pgrep ollama
74877
$ kill 74877
```
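Once `ollama serve` is running and a model has been pulled, the server can also be called from Python without LangChain. The sketch below is a minimal example, assuming the default port 11434 (the same port that shows up in the `ConnectionError` later in this note) and Ollama's `/api/generate` endpoint with `stream` set to false. The helper names `build_generate_request` and `generate` are made up for illustration.

```python
import json
import urllib.request

# Assumption: Ollama listens on its default address.
OLLAMA_URL = "http://localhost:11434"


def build_generate_request(prompt, model="llama2:13b-chat", base_url=OLLAMA_URL):
    """Build a POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the prompt and return the generated text (needs a running server)."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["response"]
```

Calling `generate("Why is the sky blue?")` only works while the server is up; if it raises a connection error, check `ollama serve` first.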
<br>
## Troubleshooting
### Errors in partition_pdf(...)
- ### PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?
- Fix: https://gist.github.com/Dayjo/618794d4ff37bb82ddfb02c63b450a81
- `apt update` fails
- E: The repository 'https://nvidia.github.io/libnvidia-container/stable/ubuntu20.04/amd64 Release' no longer has a Release file.
- Fix
- https://github.com/NVIDIA/nvidia-container-toolkit/issues/94
- Edit `/etc/apt/sources.list.d/nvidia-container-toolkit.list` and remove `ubuntu20.04`
- E: Unable to locate package libopenjpeg-dev
- Fix
- https://github.com/1adrianb/face-alignment/issues/172#issuecomment-626382330
- Install `libopenjp2-7-dev` instead
```
apt install libopenjp2-7-dev
```
- Note: on ubuntu22.04, running `make` in the poppler-0.48.0 folder fails
`error: argument 2 of '__atomic_load' must not be a pointer to a 'volatile' type`
- https://stackoverflow.com/questions/57535731/
```
conda install -c conda-forge poppler
```
- This installed successfully on ubuntu22.04
- ### TesseractNotFoundError: tesseract is not installed or it's not in your PATH.
- Fix 1 (failed):
- https://stackoverflow.com/questions/50951955/
`pip install pytesseract` -> still fails (this installs only the Python wrapper, not the `tesseract` binary)
- Fix 2:
- https://stackoverflow.com/questions/66659166/
- https://tesseract-ocr.github.io/tessdoc/Installation.html
`sudo apt install tesseract-ocr`
(installs system-wide)
- ### ModuleNotFoundError: No module named 'langchain_community'
- Fix: https://stackoverflow.com/questions/77998568/
```
pip install langchain-community langchain-core
```
- ### ConnectionError: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/chat/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f25293a64d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
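- See the Ollama setup above: "Connection refused" on port 11434 means the Ollama server is not running, so start it with `ollama serve` first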
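
Three of the errors above come from outside Python: poppler and tesseract are system binaries, and Ollama is a separate server. A small preflight check can catch them before `partition_pdf(...)` runs. This is a minimal sketch, assuming the poppler tools are named `pdfinfo` and `pdftoppm` and that Ollama uses port 11434 as in the error message; `missing_binaries` and `is_port_open` are hypothetical helper names.

```python
import shutil
import socket


def missing_binaries(names=("pdfinfo", "pdftoppm", "tesseract")):
    """Return the command names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


def is_port_open(host="localhost", port=11434, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("missing binaries:", missing_binaries())
    print("ollama reachable:", is_port_open())
```

An empty list and `True` mean the system-level requirements are in place; anything else points back to the matching fix above.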
<br>
### Errors in Add to vectorstore
- ### ImportError: Could not import gpt4all library. Please install the gpt4all library to use this embedding model: pip install gpt4all
```
pip install gpt4all
```
- ### OSError: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /.../python3.11/site-packages/gpt4all/llmodel_DO_NOT_MODIFY/build/libllmodel.so)
- Minimal reproduction:
[Generating embeddings](https://docs.gpt4all.io/gpt4all_python_embedding.html)
```python=
from gpt4all import GPT4All, Embed4All
text = 'The quick brown fox jumps over the lazy dog'
embedder = Embed4All()
output = embedder.embed(text)
print(output)
```
OSError: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by .../python3.11/site-packages/gpt4all/llmodel_DO_NOT_MODIFY/build/libllmodel.so)
- Possible fixes:
- [GLIBC_2.32 not found when running under Ubuntu 20.04.6 #25](https://github.com/huggingface/llm-ls/issues/25)
- [libc.so.6: version `GLIBC_2.34' not found #1480](https://github.com/nomic-ai/gpt4all/issues/1480)
- ### ImportError: Could not import chromadb python package. Please install it with `pip install chromadb`.
```
pip install chromadb
```
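
The `GLIBC_2.32' not found` error above means the system C library is older than the one the prebuilt `libllmodel.so` was linked against. The sketch below checks the local glibc version from Python before importing gpt4all. It assumes 2.32 is the minimum, taken from the error message itself; `parse_version` and `glibc_at_least` are hypothetical helper names.

```python
import platform


def parse_version(text):
    """Turn a dotted version such as '2.31' into a comparable tuple (2, 31)."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def glibc_at_least(required="2.32", libc_ver=None):
    """Return True/False for glibc systems, or None if the libc is unknown."""
    name, version = libc_ver if libc_ver is not None else platform.libc_ver()
    if name != "glibc" or not version:
        return None  # not glibc, or the version could not be detected
    return parse_version(version) >= parse_version(required)


if __name__ == "__main__":
    print("libc:", platform.libc_ver())
    print("glibc >= 2.32:", glibc_at_least("2.32"))
```

A `False` result suggests the host itself is too old for the wheel, which is consistent with moving to a newer base image rather than patching libc in place.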
<br>
## Switching to an ubuntu22.04 image
### Troubleshooting
- ImportError: libGL.so.1: cannot open shared object file: No such file or directory
- [docker环境里安装opencv ImportError: libGL.so.1: cannot open shared object file: No such file or directory](https://blog.csdn.net/Max_ZhangJF/article/details/108920050)
```
pip uninstall opencv-python
pip install opencv-python-headless
```
- This led to new problems
- ModuleNotFoundError: No module named 'cv2.typing'; 'cv2' is not a package
- ImportError: libGL.so.1: cannot open shared object file: No such file or directory
- https://stackoverflow.com/questions/55313610/
```
RUN apt-get update && apt-get install ffmpeg libsm6 libxext6 -y
```
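
Errors like `libGL.so.1: cannot open shared object file` only appear when the module is actually imported, so checking that a package is installed is not enough. A rough sketch that imports each module and records the first error is shown below; `import_status` is a hypothetical helper name, and the module list in the usage note is taken from the errors in this note.

```python
import importlib


def import_status(names):
    """Try to import each module and map its name to 'ok' or the error text."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = "ok"
        except (ImportError, OSError) as exc:
            # ModuleNotFoundError is a subclass of ImportError; missing shared
            # libraries may surface as ImportError or OSError.
            status[name] = f"{type(exc).__name__}: {exc}"
    return status
```

For example, `import_status(["cv2", "unstructured.partition.pdf", "gpt4all", "chromadb"])` run inside the container shows in one pass which imports still fail after a rebuild.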
### Dockerfile
```Dockerfile=
# Usages:
# - build the image (may take 11 minutes)
# $ docker build -t multi-modal-rag .
#
# - run the image
# $ docker run --rm -it -p 58888:8888 --gpus='"device=0"' -v `pwd`/workspace:/workspace multi-modal-rag
# base image
FROM ubuntu:22.04
# [Software Repository]
#
# update the Ubuntu software repository
#
# - for the 'tzdata' package, or you will meet:
# E: Unable to locate package tzdata
#
# - for the 'software-properties-common' package, or you will meet:
# E: Unable to locate package software-properties-common
#
RUN apt-get update \
&& apt-get install -y curl wget
# Install conda
# -------------------------------------------------------------------
# Install Miniconda 75.7 MB
#
# Note:
# - The conda command might be updated frequently.
# - https://docs.conda.io/projects/conda/en/latest/release-notes.html
# - 4.12.0 (2022-03-08)
# - 4.11.0 (2021-11-22)
# - Since Miniconda is smaller (75.7 MB),
# let's always get the Miniconda installer from online.
#
# - wget
# -q, --quiet:
# quiet (no output)
#
# - Miniconda3-latest-Linux-x86_64.sh
# -b:
# run install in batch mode (without manual intervention),
# it is expected the license terms are agreed upon
#
RUN wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh \
&& chmod +x ./Miniconda3-latest-Linux-x86_64.sh \
&& ./Miniconda3-latest-Linux-x86_64.sh -b \
&& rm ./Miniconda3-latest-Linux-x86_64.sh
# Add Conda path to your environment path
ENV PATH="/root/miniconda3/bin:${PATH}"
RUN conda --version
# Update Conda if needed
# -------------------------------------------------
# ==> WARNING: A newer version of conda exists. <==
# current version: 4.12.0
# latest version: 4.14.0
#
# Please update conda by running
# -------------------------------------------------
RUN conda update -y -n base -c defaults conda
# Install Python 3.9
RUN conda create -y --name python3.9 python=3.9
# Init conda configuration
RUN conda init
# set the default environment in Dockerfile
RUN echo 'conda activate python3.9' >> ~/.bashrc
# ===================================================================
# install other related packages in conda env python3.9
# ===================================================================
# - note
# $ conda activate python3.9 # cannot work
#
# - troubleshooting
# $ source ~/.bashrc
# /bin/sh: 1: source: not found --> SHELL ["/bin/bash", "-c"]
#
# for the source command, or you will get `source` not found
SHELL ["/bin/bash", "-c"]
RUN source activate python3.9 \
&& pip install jupyterlab ipywidgets \
&& pip install langchain unstructured[all-docs] pydantic lxml
# for CV2
# - @: from unstructured.partition.pdf import partition_pdf
# - https://stackoverflow.com/questions/55313610/
RUN apt-get update && apt-get install ffmpeg libsm6 libxext6 -y
# for poppler
# PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?
RUN source activate python3.9 \
&& conda install -y -c conda-forge poppler \
&& pip install gpt4all chromadb
# for tesseract
# TesseractNotFoundError: tesseract is not installed or it's not in your PATH.
RUN apt install tesseract-ocr -y
# for ChatOllama
RUN curl -fsSL https://ollama.com/install.sh | sh
#RUN ollama serve && ollama pull llama2:13b-chat
CMD source activate python3.9 && jupyter lab --ip 0.0.0.0 --port 8888 --ServerApp.token='' --ServerApp.password='' --no-browser --allow-root /workspace
```
<br>
<hr>
<hr>
<br>
## langchain
### text_splitter
> Source text: [state_of_the_union.txt](https://github.com/langchain-ai/langchain/blob/master/docs/docs/modules/state_of_the_union.txt)
```python
from langchain.document_loaders import TextLoader
loader = TextLoader('state_of_the_union.txt', encoding='utf8')
documents = loader.load()
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)
```
```
texts[0]
```
> Document(page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.', metadata={'source': 'state_of_the_union.txt'})
```
texts[1]
```
> Document(page_content='He met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. \n\nGroups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland. \n\nIn this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight. \n\nLet each of us here tonight in this Chamber send an unmistakable signal to Ukraine and to the world. \n\nPlease rise if you are able and show that, Yes, we the United States of America stand with the Ukrainian people. \n\nThroughout our history we’ve learned this lesson when dictators do not pay a price for their aggression they cause more chaos. \n\nThey keep moving. \n\nAnd the costs and the threats to America and the world keep rising.', metadata={'source': 'state_of_the_union.txt'})
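
`texts[1]` above starts with the same sentences that end `texts[0]`; that repetition is what `chunk_overlap=200` produces. The sketch below shows the idea in plain Python: split on a separator, pack pieces greedily up to `chunk_size`, and carry the tail of each chunk into the next one. It is a simplified illustration under those assumptions, not LangChain's actual implementation, and `split_with_overlap` is a hypothetical name.

```python
def split_with_overlap(text, chunk_size, chunk_overlap, separator="\n\n"):
    """Greedy separator-based chunking that repeats trailing pieces as overlap."""
    pieces = [piece for piece in text.split(separator) if piece]
    chunks, current = [], []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and len(candidate) > chunk_size:
            # The current chunk is full: emit it.
            chunks.append(separator.join(current))
            # Drop leading pieces until the remainder fits in the overlap
            # budget and leaves room for the incoming piece.
            while current and (
                len(separator.join(current)) > chunk_overlap
                or len(separator.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


if __name__ == "__main__":
    sample = "aaaa\n\nbbbb\n\ncccc\n\ndddd"
    for chunk in split_with_overlap(sample, chunk_size=10, chunk_overlap=4):
        print(repr(chunk))
```

Each chunk in the sample shares one paragraph with its neighbour, which mirrors how "He met the Ukrainian people." appears in both documents above.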
### embeddings
- Notes moved to [langchain](https://hackmd.io/DF0QVi5IS3uDq2gIj-dfJg)
<br>
## References
### RAG
- [什麼是RAG檢索增強生成](https://www.webcomm.com.tw/blog/rag/)
### ChatOllama
- [[參數] langchain_community.chat_models.ollama.ChatOllama](https://api.python.langchain.com/en/latest/chat_models/langchain_community.chat_models.ollama.ChatOllama.html#)
### langchain
- [利用LangChain實作ChatPDF:問個問題,輕鬆找出文件重點](https://edge.aif.tw/express-langchain-chatpdf/)
```python
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.document_loaders import TextLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
```
- [LangChain 与 Chroma 的大模型语义搜索应用](https://cloud.baidu.com/qianfandev/topic/267979)
```python
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import SentenceTransformerEmbeddings
```
- [[D21] LangChain 專題實做 - 文本嵌入與向量資料庫](https://ithelp.ithome.com.tw/articles/10327164)

- [【LangChain】检索器(Retrievers)](https://blog.csdn.net/u013066244/article/details/131843772)
```python=
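from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.embeddings import GPT4AllEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain_community.chat_models import ChatOllama

# Load the article and split it into overlapping chunks.
loader = TextLoader('日本親子旅行.txt', encoding='utf8')
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=700, chunk_overlap=60)
texts = text_splitter.split_documents(documents)

# Embed the chunks locally and index them in Chroma.
embeddings = GPT4AllEmbeddings()
db = Chroma.from_documents(texts, embeddings)
retriever = db.as_retriever()

# `qa` was not defined in the original snippet. Assumption: a RetrievalQA
# chain over a local LLaMA2 model served by Ollama.
llm = ChatOllama(model='llama2:13b-chat')
qa = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
qa.run('帶著嬰兒通關有哪些好處?')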
```
> " Based on the article, here are some benefits of taking a self-guided trip to Japan with children:\n\n1. Flexibility: With a self-guided trip, you can create your own itinerary and schedule, allowing for more flexibility in terms of pacing and activities.\n2. Cost-effective: Planning your own trip can be more cost-effective than relying on a travel agency or tour operator.\n3. Family bonding: Traveling with children can be a great opportunity for family bonding and creating lasting memories.\n4. Improved language skills: Children can learn about Japanese culture and language, which can be an enriching experience.\n5. Exposure to new experiences: Japan offers a wide range of unique experiences, such as onsen (hot springs), temples, and traditional Japanese cuisine, which can be an exciting exposure for children.\n6. Personalized experience: With a self-guided trip, you can tailor the experience to your family's interests and preferences.\n7. Development of problem-solving skills: Traveling with children can require problem-solving and flexibility, which can be beneficial for their development.\n8. Learning about different cultures: Japan is a country with a rich culture, and traveling there with children can provide an opportunity for them to learn about different customs and traditions.\n9. Creating lifelong memories: A self-guided trip to Japan with children can create lifelong memories of the experiences and adventures had during the trip."
- [The Power of LangChain’s Question Answer Framework](https://medium.com/@kbdhunga/enhancing-conversational-ai-the-power-of-langchains-question-answer-framework-4974e1cab3cf)
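
The retriever example above (`Chroma.from_documents(...)` followed by `db.as_retriever()`) boils down to: embed every chunk, embed the query, and return the chunks whose vectors are closest. The sketch below is a conceptual toy that uses word counts in place of a real embedding model and cosine similarity for ranking. It is not how Chroma or GPT4AllEmbeddings work internally, and `embed`, `cosine`, and `retrieve` are hypothetical names.

```python
import math
from collections import Counter


def embed(text):
    """Toy 'embedding': lower-cased word counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(chunks, key=lambda chunk: cosine(query_vector, embed(chunk)), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    notes = [
        "ollama runs llama2 locally",
        "poppler provides pdfinfo",
        "tesseract does ocr",
    ]
    print(retrieve("how to run llama2", notes, k=1))
```

A real embedding model also matches on meaning rather than exact words, which is why the LangChain version can retrieve relevant chunks even when the wording differs.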
