# Embedding Vitalik
> Books speak volumes when minds engage.
> *- ChatGPT*
<center><img src="https://hackmd.io/_uploads/B1Nn9BcZ6.png" width=400/></center>
## ChatGPT as an author's digital twin
ChatGPT is very good at summarizing texts. With the right system prompt, it can even be instructed to emulate the author of a text.
> system prompt >
> Answer the user's questions as if you were the author of the following text.
However, even with the model *gpt-3.5-turbo-16k*, which supports a 16k context window, this approach is limited to texts such as articles and blog posts of about 20 pages.
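To get a feeling for this limit, here is a rough sketch. It assumes the common rule of thumb of about four characters per token for English text, which is only an approximation (the real count depends on the tokenizer). The class, the method names and the reserve for the answer are made up for this illustration.

```java
class ContextWindowCheck {

    // Assumption: roughly 4 characters per token for English text (rule of thumb, not exact).
    static final int APPROX_CHARS_PER_TOKEN = 4;

    /** Rough token estimate for a text, rounded up. */
    static int estimateTokens(String text) {
        return (text.length() + APPROX_CHARS_PER_TOKEN - 1) / APPROX_CHARS_PER_TOKEN;
    }

    /** True if the text plus a reserve for the answer fits into the context window. */
    static boolean fitsIntoContext(String text, int contextWindowTokens, int reservedForAnswer) {
        return estimateTokens(text) + reservedForAnswer <= contextWindowTokens;
    }

    public static void main(String[] args) {
        // Hypothetical long text of 100,000 characters
        String longText = "word ".repeat(20_000);
        System.out.println("Estimated tokens: " + estimateTokens(longText));
        System.out.println("Fits into a 16k window: " + fitsIntoContext(longText, 16_000, 1_000));
    }
}
```

With this estimate, a text of 100,000 characters is already well beyond a 16k window, before the question and the answer are even counted.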
What if I want to speak with longer texts, even books? And can I combine texts with different topics?
## Embedding and Fine-tuning
There are 4 different ways to influence the answers of GPT:
* user prompts
* system prompts
* embeddings
* fine-tuning
While _user prompts_ and _system prompts_ are obvious and used in everyday conversations with ChatGPT, _embedding_ and _fine-tuning_ are more sophisticated techniques.
There are many lengthy descriptions of these techniques. In short, the use cases are:
* **_Fine-tuning_**: Teach the model _how_ to answer a question (e.g. structure/format, personality, etc)
* **_Embedding_**: Provide the model with new/specific information with which to answer questions.
(see elmstedt's comment at [community.openai.com](https://community.openai.com/t/chatgpt-3-5s-fine-tuning-or-embeddings-or-both/328061/3))
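To make the embedding idea more tangible: texts are mapped to vectors, and the segments whose vectors are closest to the question's vector are handed to the model as additional information. The following is a minimal sketch of such a lookup using cosine similarity. The tiny three-dimensional vectors, the segment texts and all names are invented for illustration; a real embedding model produces vectors with hundreds of dimensions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class EmbeddingLookupSketch {

    /** Cosine similarity of two non-zero vectors of equal length. */
    static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the texts of the maxResults segments most similar to the query vector. */
    static List<String> mostSimilar(Map<String, double[]> store, double[] query, int maxResults) {
        List<String> texts = new ArrayList<>(store.keySet());
        // Negate the similarity so that the most similar segment comes first
        texts.sort(Comparator.comparingDouble((String t) -> -cosineSimilarity(store.get(t), query)));
        return texts.subList(0, Math.min(maxResults, texts.size()));
    }

    public static void main(String[] args) {
        // Hypothetical, hand-made "embeddings" of three text segments
        Map<String, double[]> store = new LinkedHashMap<>();
        store.put("Segment about proof of personhood", new double[]{0.9, 0.1, 0.0});
        store.put("Segment about Community Notes", new double[]{0.1, 0.9, 0.1});
        store.put("Segment about ZK-EVMs", new double[]{0.0, 0.2, 0.9});

        double[] question = {0.8, 0.2, 0.1}; // stands in for the embedded question
        System.out.println(mostSimilar(store, question, 1));
    }
}
```

The model itself is not changed by this. It only receives the best-matching segments as part of the prompt, which is why embeddings add knowledge while fine-tuning mainly shapes the form of the answer.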
This is exactly what I saw in my experiments. I started with fine-tuning, and the model did not really include the knowledge of the texts in its answers. However, as I had used a Q&A format for training, it adopted this "new" format perfectly for its answers.
## Embedding Vitalik's blog posts
Even if you are not affiliated with the "crypto" space, you might have heard of [Vitalik Buterin](https://en.wikipedia.org/wiki/Vitalik_Buterin). In [his blog](https://vitalik.ca/) he publishes posts on different topics, which are really long and cumbersome to read, but often brilliant, creative and informative.
I like to read them, but afterwards I am often not sure if I really understood those posts. I'd like to ask questions and get answers to better grasp the topics.
From what we have learned about LLMs and customization, we could just add those texts to the user or system prompt and go for it. Quite often, however, even _one_ of Vitalik's blog posts exceeds the 16k context window.
<img src="https://hackmd.io/_uploads/SJxG_nzza.png" width=500/>
_Most recent blog posts at the time of writing. The plain text alone of two of them already exceeds the default 4k context window._
Since fine-tuning failed for this use case (there was no knowledge stored _at all_ and only the format could be adjusted, as stated above), I wanted to try embeddings.
## LangChain4j "Talk to Documents"
Watch the [great talk by Lize Raes about LangChain4j](https://www.youtube.com/watch?v=BD1MSLbs9KE) at this year's Devoxx. All code mentioned in the video is in the LangChain4j repository, and the [12th example ("Talk to Documents") is included in the Devoxx samples](https://github.com/langchain4j/langchain4j-examples/blob/main/devoxx/src/main/java/_12_ChatWithDocumentsExamples.java), so this is a great starting point for my test.
For this example, I am using the embedding model [Hugging Face all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), which is wrapped in the [LangChain4j](https://github.com/langchain4j) class `AllMiniLmL6V2EmbeddingModel`.
---
I reused the `_12_ChatWithDocumentsExamples` class and added JBang comments, so you can start the example locally with:
`jbang https://gist.github.com/ice09/ae5409de706fbd5e08183f7632b44434 OPENAI_API_KEY`
[JBang](https://www.jbang.dev/) will read and run the sources from the remote location. You just have to pass your OPENAI_API_KEY as the first argument and can then start a conversation with GPT, using the three blog posts as embeddings.
---
There are four steps included:
1. Create a local embedding model for _HF all-MiniLM-L6-v2_
2. Create embeddings for the three most recent posts of Vitalik's blog
3. Create an `OpenAiChatModel`
4. Create a `ConversationalRetrievalChain` with the chat model and the embeddings
(_Full code is stored as Gist here: https://gist.github.com/ice09/ae5409de706fbd5e08183f7632b44434_)
```java
// Create local embedding model
EmbeddingModel embeddingModel = new AllMiniLmL6V2EmbeddingModel();
EmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();
EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
        .documentSplitter(DocumentSplitters.recursive(500, 0))
        .embeddingModel(embeddingModel)
        .embeddingStore(embeddingStore)
        .build();

// Create embeddings from web site content
Document document = new TextDocumentParser(DocumentType.TXT).parse(...);
ingestor.ingest(document);

// Create OpenAI chat model for gpt-3.5-turbo
OpenAiChatModel assistant = OpenAiChatModel.builder()
        .modelName("gpt-3.5-turbo")
        .apiKey(OPENAI_API_KEY)
        .timeout(Duration.ofMinutes(5))
        .build();

// Create conversational chain with embedding store and chat model
ConversationalRetrievalChain chain = ConversationalRetrievalChain.builder()
        .chatLanguageModel(assistant)
        .retriever(EmbeddingStoreRetriever.from(embeddingStore, embeddingModel))
        .build();
```
![](https://hackmd.io/_uploads/Hk4NqBez6.png)
_Our Java toolchain: JBang executes a remote Gist, creates embeddings and queries GPT_
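The call `DocumentSplitters.recursive(500, 0)` in the listing above cuts each post into segments before they are embedded. As a rough mental model only, the sketch below splits a text into fixed-size character windows with an optional overlap. It is deliberately simpler than the recursive splitter used in the example, and I assume here that the two arguments stand for the maximum segment size and the overlap in characters. The class and its parameters are my own illustration.

```java
import java.util.ArrayList;
import java.util.List;

class FixedSizeSplitterSketch {

    /**
     * Splits text into windows of at most maxSegmentSize characters.
     * Consecutive windows share `overlap` characters (0 = no overlap).
     */
    static List<String> split(String text, int maxSegmentSize, int overlap) {
        if (maxSegmentSize <= 0 || overlap < 0 || overlap >= maxSegmentSize) {
            throw new IllegalArgumentException("need 0 <= overlap < maxSegmentSize");
        }
        List<String> segments = new ArrayList<>();
        int step = maxSegmentSize - overlap;
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + maxSegmentSize, text.length());
            segments.add(text.substring(start, end));
            if (end == text.length()) {
                break; // last window reached, do not emit a tail that is pure overlap
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        String text = "x".repeat(1_200); // hypothetical text of 1,200 characters
        // Without overlap: windows of 500, 500 and 200 characters
        System.out.println(split(text, 500, 0).size());
        // With an overlap of 100 characters, neighbouring windows share some text
        System.out.println(split(text, 500, 100).size());
    }
}
```

Smaller segments make the lookup more precise but give the model less surrounding text per hit, and an overlap reduces the risk that a thought is cut in half at a segment border.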
## Sample Dialogs
Below are some sample dialogs with this setup. They show the differences between a plain GPT and one using embeddings, and that embeddings seem to be a very good solution for this use case.
The sample dialogs were created after embedding these blog posts:
* [Should Ethereum be okay with enshrining more things in the protocol?](https://vitalik.ca/general/2023/09/30/enshrinement.html)
* [What do I think about Community Notes?](https://vitalik.ca/general/2023/08/16/communitynotes.html)
* [What do I think of biometric proof of personhood?](https://vitalik.eth.limo/general/2023/07/24/biometric.html)
with a `MessageWindowChatMemory` of 0, i.e. the Q&A is "stateless": no context is involved except the provided embeddings.
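"Stateless" means that every question is answered from scratch: the prompt consists only of the retrieved segments and the current question, with no earlier turns. A hedged sketch of such a prompt assembly follows. The template wording and all names are assumptions for illustration and not necessarily what `ConversationalRetrievalChain` actually sends.

```java
import java.util.List;

class StatelessPromptSketch {

    /** Builds a single prompt from the question and the retrieved segments; no chat history is included. */
    static String buildPrompt(String question, List<String> retrievedSegments) {
        StringBuilder prompt = new StringBuilder();
        prompt.append("Answer the following question: ").append(question).append("\n\n");
        prompt.append("Base your answer on the following information:\n");
        for (String segment : retrievedSegments) {
            prompt.append("- ").append(segment).append('\n');
        }
        return prompt.toString();
    }

    public static void main(String[] args) {
        // Hypothetical segments, standing in for the result of the embedding lookup
        List<String> segments = List.of(
                "Hypothetical segment about social-graph-based proof of personhood.",
                "Hypothetical segment about biometric proof of personhood.");
        System.out.println(buildPrompt("What is proof-of-personhood?", segments));
    }
}
```

Because nothing from previous questions is carried over, each answer in the dialogs below depends only on the question itself and on the segments found for it.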
### Simple questions related to one blog post's topic
> **What is proof-of-personhood?**
#### With embeddings
> Proof-of-personhood refers to the concept of verifying and establishing the identity of an individual in a digital or online environment. It is a means of distinguishing between humans and bots or automated systems. There are two main approaches to proof-of-personhood: social-graph-based and biometric.
> Social-graph-based proof of personhood relies on the idea of vouching, where verified individuals vouch for the identity of others. For example, if Alice, Bob, Charlie, and David are all verified humans and they all vouch for Emily, then Emily is likely considered a verified human as well. Incentives are often used to encourage accurate vouching, with penalties for false claims.
> On the other hand, biometric proof of personhood involves verifying physical or behavioral traits unique to humans. This could include fingerprint recognition, facial recognition, voice recognition, or other biometric markers that help distinguish humans from bots or differentiate between individuals.
> Many projects and implementations combine both social-graph-based and biometric techniques to create a more robust proof-of-personhood solution. The goal is to achieve the security properties required for various applications without relying on centralized identity solutions or risking the dominance of money or closed communities.
> While the development of an effective proof-of-personhood solution may take time and experimentation, it is seen as a valuable concept that can enhance trust and security in digital interactions.
#### Plain GPT
> Proof-of-personhood refers to a method or system that verifies an individual's identity or existence as a unique human being. It involves providing evidence or credentials to establish that an individual is a real person with legal rights and responsibilities.
> This concept is often discussed in the context of emerging technologies such as blockchain and digital identity systems, where the need for reliable and secure verification of personhood arises. The aim is to prevent fraud, identity theft, and ensure the integrity of online transactions and interactions.
> Proof-of-personhood mechanisms may involve various forms of identification, biometric data, or other means of authentication to establish and validate an individual's personal identity.
These are really good answers to such an open question. The embeddings clearly pay off here: the plain GPT answer is good given that it has no additional knowledge, but the embeddings answer is considerably better and matches the information in the blog post (e.g. differentiating between social-graph-based and biometric proofs).
> **How are biometric and social-based proof-of-personhood related?**
#### With embeddings
> [...left out similar sections...]
> In practice, many projects combine both biometric and social-based techniques to enhance the security and reliability of proof-of-personhood. Biometric traits can provide strong individual identification, while social connections and vouching can add a layer of trust and scalability to the verification process. The combination of these techniques aims to achieve better security properties without the drawbacks of centralized approaches.
#### Plain GPT
> [...left out similar sections...]
> In summary, while biometric proof-of-personhood focuses on physical or behavioral characteristics, social-based proof-of-personhood revolves around social connections and activities. Although they are distinct approaches, there can be instances where they intersect, such as with facial recognition technology.
While both answers are impressive, the conclusion of the embeddings-fed one grasps "the spirit" of the original blog post better, namely that biometric and social-based proofs have different goals and can be complementary (e.g. with respect to scalability).
> **What are zks used for?**
#### With embeddings
> ZK, in this context, refers to Zero-Knowledge proofs. Zero-Knowledge proofs are cryptographic protocols that allow a prover to demonstrate knowledge of certain information without revealing the actual information itself. They are used to enhance privacy and security in various applications.
> In the given information, ZK-EVMs (Zero-Knowledge Ethereum Virtual Machines) are mentioned as a potential target for enshrining into the Ethereum protocol. ZK-EVMs are implementations that utilize Zero-Knowledge proofs to verify the execution of Ethereum-like blocks inside a ZK-SNARK (Zero-Knowledge Succinct Non-Interactive Argument of Knowledge).
> By employing Zero-Knowledge proofs, ZK-EVMs can provide privacy and scalability benefits to Ethereum and other blockchain platforms.
> The use of Zero-Knowledge proofs in ZK-EVMs can address concerns regarding bugs in the ZK-code. The mentioned systems have some form of a "security council" mechanism that can override the proving system in case of a bug, ensuring the integrity and security of the ZK-EVM implementation.
> Overall, Zero-Knowledge proofs are used in applications like ZK-EVMs to enhance privacy, security, and scalability in blockchain and cryptographic systems.
#### Plain GPT
> ZKs, or Zero-Knowledge Proofs, are cryptographic protocols used to prove the validity of a statement without revealing any additional information beyond the fact that the statement is true. They are primarily used for enhancing privacy and security in various applications. Some common uses of ZKs include:
> 1. Authentication: ZKs can be used to prove the authenticity of a user without revealing their actual identity or any other personal information.
> 2. Digital currencies: ZKs are employed in some cryptocurrencies, like Zcash, to ensure transaction privacy by proving the validity of a transaction without disclosing the sender, recipient, or transaction amount.
> 3. Data privacy: ZKs can be applied to protect sensitive data by proving certain properties about the data without revealing its actual content.
> 4. Access control: ZKs can enable users to prove their access rights to resources or systems without disclosing any additional information about themselves.
> 5. Verifiable computations: ZKs can be used to prove the correctness of computations or algorithms without sharing the actual input or intermediate steps.
> Overall, ZKs provide a powerful tool for achieving privacy, security, and trust in various domains where proving statements without revealing details is essential.
I like the plain GPT answer because, without context, it is closer to the question. However, if I want to query for information from the blog posts, the embeddings answer is to the point.
> **Explain in detail how Community Notes work.**
#### With embeddings
> Community Notes is a feature on Twitter that serves as a fact-checking and anti-misinformation tool. It allows users to attach context notes to tweets, providing additional information or clarifications about the content. The feature was originally called Birdwatch and was introduced as a pilot project in January 2021, gradually expanding its reach since then.
> One notable aspect of Community Notes is that it embodies the principles of decentralization and open participation, aligning with the values often associated with the cryptocurrency and blockchain space. Unlike traditional fact-checking processes that rely on a centrally selected group of experts, Community Notes can be written and voted on by anyone on the platform.
> The visibility of Community Notes is determined by an open-source algorithm. Rather than being controlled by a centralized authority, the algorithm decides which notes are shown or not shown. Twitter provides a detailed guide explaining how the algorithm works, and users can even download the data containing published notes and votes to run the algorithm locally and verify its output. This level of transparency and openness in the algorithm aims to promote credibility and neutrality.
> While the Community Notes algorithm may not be perfect, it strives to achieve a high level of credibility and neutrality in presenting fact-checking information. It allows for broad participation, enabling a diverse range of perspectives and expertise to contribute to the fact-checking process. Users across the political spectrum have found value in the informative nature of these notes, especially in the context of contentious political topics.
> In summary, Community Notes on Twitter function as a decentralized fact-checking tool that relies on open participation and an algorithmic approach to determine the visibility of notes. It aims to combat misinformation and provide valuable context to users, embodying principles closely associated with the crypto world.
#### Plain GPT
> Community notes are a collaborative feature that allows multiple users to contribute and edit information on a specific topic or subject. These notes function as a shared resource, where users can share their knowledge, insights, and experiences with others.
> Here's a step-by-step explanation of how community notes work:
> [...]
> In summary, community notes are a collaborative platform where users can contribute, edit, and share information on various topics. This collective effort creates a valuable and reliable resource for the community, enabling the pooling of knowledge and fostering ongoing learning and engagement.
Again, the embeddings version of course knows more about the topic and focuses on the specific details, but beyond that it captures the message of the blog post very well.
But now, let's try to combine this! This is the "leverage effect" of embeddings, and it is what makes them better than just "bigger" context windows.
### Questions related to all given information about different topics
> **How are ZKS, Community Notes and Proof-of-Personhood related to the concept of decentralization?**
#### With embeddings
> ZKS, Community Notes, and Proof-of-Personhood are all related to the concept of decentralization in the context of crypto protocols.
> [...]
> Overall, ZKS, Community Notes, and Proof-of-Personhood contribute to decentralization by reducing the reliance on centralized entities, enhancing privacy and security, promoting community-driven decision-making, and establishing trust among participants in a decentralized network.
#### Plain GPT
> ZKS, Community Notes, and Proof-of-Personhood are all related to the concept of decentralization in various ways.
> [...]
> Overall, ZKS, Community Notes, and Proof-of-Personhood contribute to decentralization by addressing key aspects such as privacy, governance, and identity verification, which are crucial for creating trust, fairness, and transparency within decentralized ecosystems.
This is a good example of how the embeddings version summarizes in a way that stays closer to the information given in the embeddings.
> **Tell me a shared aspect between Community Notes, Proof-of-Personhood and ZKS. How are these concepts related, what do they have in common?**
#### With embeddings
> A shared aspect between Community Notes, Proof-of-Personhood, and Zero-Knowledge Proofs (ZKs) is the aim of enhancing trust, security, and reliability within their respective domains.
> [...]
> In summary, Community Notes, Proof-of-Personhood, and ZKs are related through their shared objective of enhancing trust, security, and reliability. Community Notes aim to provide accurate information, Proof-of-Personhood seeks to establish the authenticity of individuals, and ZKs enhance privacy and confidentiality. Ultimately, all three concepts contribute to building more trustworthy and secure systems or platforms.
#### Plain GPT
> Community Notes, Proof-of-Personhood, and ZKS (Zero-Knowledge Succinct Non-Interactive Arguments of Knowledge) are all concepts related to privacy, identity verification, and security in digital systems.
> [...]
> In summary, these concepts are related through their shared focus on privacy, identity verification, and security in digital systems. While Community Notes emphasizes privacy in messaging, Proof-of-Personhood focuses on verifying personhood without revealing personal information, and ZKS enables proving knowledge without disclosing the actual knowledge itself.
Even though the conclusions are quite similar, the details of the embeddings version are again more biased towards the embedded information, and it is able to back the same conclusion with the correct facts from the blog posts.
### Try it out
Have fun and play around with your own embeddings, and consider changing the parameters, which I left at their defaults for this example. Results with adjusted parameters are likely to be even better.
See the LangChain4j "Talk to Documents" section above for step-by-step instructions on how to run the example code of this post.