### THIS IMPLEMENTATION IS NOW COMPLETE = RELEASED NOV 2025
**BETA LINK https://hi-aleph.com/**
**GITHUB https://github.com/Aleph-Org/aleph**
---
### **1 Month MVP Implementation Log**
Prepared by Maz for Aleph
Doc created 01092025 // Last updated 05092025
:::success
Objective: to create a prototype of an AI audiobook web app for a general consumer base, for testing and user feedback purposes. I will use this document as a daily-to-weekly log of progress on the prototype implementation and GTM rollout.
:::
**Deliverables**
* A working demo prototype (PWA) with a starter library of c.10–15 audiobooks
* A pre-launch landing page with early-bird waitlist functionality
* A GTM strategy and supporting demo materials for beta user onboarding and testing
* A prioritised backlog for ongoing development towards a production-ready v1
### **4-week plan overview**
| | **W1** | **W2** | **W3** | **W4** |
|----------------------|------------------|---------|---------|---------|
| **Goals**| Existing repo review, understanding of audio processing workflow, basic concept with frontend interface, incorporate first book and test LLM function | UI refinement to enable rollout of additional books in a modular UI design, introduce 3 new books with 3 chapters each, product roadmap with product phases agreed, concept iteration and feature refinement i.e. system instructions, user scenario testing, communicate feature trade-offs for v1 if applicable | GTM and user testing flow preparation, landing page design, copy and v1 development, web app codebase review and modification, introduce x additional books | Rollout preparation for MVP for waitlist sign-up purposes, testing and user feedback, communication flow for sign-ups and access to MVP for testing |
### **Roadmap overview: Phase 1 and Phase 2** (Under review)
### **01092025**
Review of the cloned Aleph repo in Cursor to support prototype build foundation and understand codebase and workflow fundamentals.
Key takeaways from the review so far include the following, which I need to double-check with Giacomo and Masoud to confirm my understanding is correct:
1. There is an emphasis on **sentence-level synchronization**, making the processing very granular; this is done intentionally to get a level of detail foundational/generalist models can't.
2. The **audio processing workflow** comprises multiple phases (details to be confirmed with the team).
3. **Book metadata structure** utilising a hierarchical metadata system for audiobooks with sentence-level synchronization.
Key metadata features for documentation:
- Precise timing: Each sentence has exact `begin` and `end` timestamps (in seconds)
- Context-aware: System can identify current sentence during playback
- Progress tracking: Uses metadata to calculate reading progress across chapters
Codebase functions to be aware of (a rough sketch of the first two follows this list):
- Sentence detection: `getCurrentSentenceIndex()` finds current sentence based on playback time
- Context extraction: `getCurrentContext(n)` gets surrounding text (n words)
- Position persistence: Saves/restores playback position using cookies
- Multi-chapter support: Handles books with multiple chapters
- Natural Voice Control:
  - "Next chapter" → AI calls `next_chapter()` function
  - "Go back" → AI calls `previous_chapter()` function
  - "Resume the book" → AI calls `stop_the_chat()` function
**End of day achievement:** basic frontend interface running locally, on a separate branch of the forked repo.
{%preview https://screen.studio/share/6XQMOjUZ %}
### 02092025
Based on my understanding of the book data processing workflow, set up an **implementation** for a test book in the existing repo, with the aim of having basic functionality by end of day.
**Recommended UX flow**
* Start reading: instantiate the audio manager with `bartleby-chapters.json`, then auto-play.
* Unmute to ask: clicking the mic pauses playback, opens the mic connection, sends current reading context, and plays the AI’s spoken response. When the AI is done, it sends an audio control back (e.g., play/seek) which your player applies; the voice client then disconnects.
**Possible modifications** (both sketched after this list):
- Auto-resume fallback: if the server ever doesn’t send a “resume” control, you can remember “wasPlaying” before connect and resume on disconnect.
- Push-to-talk: wrap ConnectButton so press-and-hold = connect, release = disconnect for a true “unmute” feel.
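A rough sketch of both modifications together, assuming a voice client exposing `connect()`/`disconnect()` and an `HTMLAudioElement`-style player; the component and prop names are hypothetical, not the existing `ConnectButton`:

```tsx
import { useRef } from "react";

interface PushToTalkProps {
  player: HTMLAudioElement;                      // the audiobook player
  voiceClient: { connect: () => Promise<void>; disconnect: () => Promise<void> };
}

// Hypothetical push-to-talk wrapper: press-and-hold = connect, release = disconnect.
export function PushToTalkButton({ player, voiceClient }: PushToTalkProps) {
  const wasPlaying = useRef(false);

  const handlePressStart = async () => {
    wasPlaying.current = !player.paused;         // remember playback state before connecting
    player.pause();
    await voiceClient.connect();                 // open the mic / realtime session
  };

  const handlePressEnd = async () => {
    await voiceClient.disconnect();
    if (wasPlaying.current) player.play();       // auto-resume fallback if no "resume" control arrived
  };

  return (
    <button
      onMouseDown={handlePressStart} onMouseUp={handlePressEnd}
      onTouchStart={handlePressStart} onTouchEnd={handlePressEnd}
    >
      Hold to talk
    </button>
  );
}
```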
**End of day achievement:** updated landing page UI and pop-up modal focusing on the Bartleby player. The audio plays and the user can connect the mic to interrupt, but I still need to set up the OpenAI and Daily.co APIs.
{%preview https://screen.studio/share/2A11ZZ6Z %}
### **03092025**
```
📞 sync team call @ 8 am PST w Masoud and Giacomo.
```
Implement the relevant environment API keys to test the user's ability to converse with the LLM about the book in focus: Bartleby, the Scrivener. The current setup in the repo includes:
- Existing AI integration: The system already uses **OpenAI Realtime Beta for voice conversations** with an AI assistant named "Caroline" who acts as an English literature professor
- Context-aware system: The AI receives **book context** (title, author, current reading position) when users connect
- Voice commands: Users can already **ask questions about books** and control playback navigation
**API keys set up for prototype testing**
- **OpenAI API**; also looking into incorporating their Responses API to store conversations for future analysis (a rough sketch below)
- **Daily.co API** and **Pipecat Private API setup**
https://github.com/daily-co/pipecat-cloud-starter
> OpenAI Responses API would be useful for:
> - Storing user conversations for improving the experience
> - Analysing what types of questions users ask
> - Building training data for a custom book-discussion model
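As a rough sketch (not yet wired in), logging a conversation turn via the Responses API could look like the following, assuming the official `openai` Node SDK; the helper name and model choice are placeholders:

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Hypothetical helper: store a user question plus book context as a Response so it can be
// retrieved later for analysis of what readers ask about.
async function logBookQuestion(question: string, bookTitle: string, context: string) {
  const response = await openai.responses.create({
    model: "gpt-4o-mini",            // placeholder model for logging/analysis
    store: true,                     // keep the conversation retrievable for later analysis
    metadata: { book: bookTitle },
    input: [
      { role: "system", content: `The reader is listening to "${bookTitle}". Context: ${context}` },
      { role: "user", content: question },
    ],
  });
  return response.output_text;
}
```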
**End of day achievement:** {%preview https://screen.studio/share/SQ4RuT3M?private-access=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzaGFyZWFibGVMaW5rSWQiOiIxMzI3MzE4Yi0xMDhjLTQxOTItYTE0OS0zZjYxOGNjZWNiOGMiLCJpYXQiOjE3NTcxMDQwODV9.Ypsl3MAdZkHf_sV3WYaOoPD6Eyk14ureHVUONMbNHmw %}
**Based on the current status of the forked repo, here is how the LLM knows about the book content:**
1. Book metadata from `bartleby-chapters.json`
```json
{
  "title": "Bartleby, the Scrivener",
  "author": "Herman Melville",
  "notes": "Bartleby is pronounced Bar-tell-bee-"
}
```
2. Full book text: `bartleby.json` contains 3,617 timestamped sentences with the complete story text
```json
[
  {"begin": "0.000", "end": "2.320", "text": "I am a rather elderly man."},
  {"begin": "2.320", "end": "22.600", "text": "The nature of my avocations..."}
  // ... entire book text with timestamps
]
```
3. Context extraction: the `getCurrentContext(100)` function sends the ~100 words around the user's current listening position to the LLM/player.
Summary of current LLM understanding:
✅ Book title, author, pronunciation notes
✅ Current 100 words around where user is listening
✅ Caroline's personality (English professor character)
❌ Limited context (just 100 words around current position)
❌ No full book knowledge (doesn't have the entire story memorised)
*Note to self: chapter based data pipeline to be shared by Giacomo this week based on team sync call this morning.*
**Data Flow When User Clicks "Connect":**
Frontend → Backend → OpenAI
1. Frontend `(useRTVIClient.ts:329-340)`
```typescript
systemInstruction += `Book: ${bookInfo.title}\n`;
systemInstruction += `Author: ${bookInfo.author}\n`;
systemInstruction += `Notes: ${bookInfo.notes}\n`;
systemInstruction += `\nCurrent context: ${sentenceText}`; // 100 words around current position
```
2. Backend `(server.py:201)`: receives `systemInstruction` from the frontend (illustrative wiring sketched below)
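For reference, a hedged sketch of how these pieces could be wired together on connect; the endpoint name and function signatures here are illustrative, not the actual `useRTVIClient.ts` implementation:

```typescript
// Defined in the sentence-detection sketch earlier in this log (illustrative only).
declare function getCurrentContext(
  sentences: { begin: string; end: string; text: string }[],
  currentTime: number,
  n: number
): string;

interface BookInfo { title: string; author: string; notes: string }

// Illustrative wiring for the "Connect" click: build the instruction string and hand it
// to the backend (server.py:201), which forwards it to the OpenAI Realtime session.
async function onConnect(
  bookInfo: BookInfo,
  sentences: { begin: string; end: string; text: string }[],
  currentTime: number
): Promise<void> {
  const sentenceText = getCurrentContext(sentences, currentTime, 100); // ~100 words of context

  const systemInstruction =
    `Book: ${bookInfo.title}\n` +
    `Author: ${bookInfo.author}\n` +
    `Notes: ${bookInfo.notes}\n` +
    `\nCurrent context: ${sentenceText}`;

  // The endpoint name below is made up; the real transport goes through the RTVI client.
  await fetch("/api/connect", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ systemInstruction }),
  });
}
```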
### 04092025
Review existing setup with the OpenAI API and other examples to test performance.
**Existing set up includes:**
- API Type: OpenAI Realtime Beta WebSocket API (not standard Chat Completions), accessed through the Pipecat framework
- Model: gpt-4o-transcribe for STT + built-in TTS
- Voice: shimmer (one of OpenAI's neural voices)
- Integration: Fully integrated STT + LLM + TTS in one service
**Relevant links**
https://platform.openai.com/audio/tts
https://platform.openai.com/docs/api-reference/audio/createSpeech
(this shows the endpoints, i.e. `POST https://api.openai.com/v1/audio/speech` generates audio from the input text; this is NOT what we are using, we are using the OpenAI Realtime Beta WebSocket API through the Pipecat framework)
We have four different TTS services imported but are only using one for testing:
✅ OpenAIRealtimeBetaLLMService (current) - All-in-one: STT+LLM+TTS (keep using for now; best performing compared to the options below)
⚠️ OpenAITTSService (available) - Standard OpenAI TTS API
⚠️ ElevenLabsTTSService (available) - ElevenLabs TTS
⚠️ GeminiMultimodalLiveLLMService (available) - Google's TTS
Testing today:
- context
- interruption
- conversational/natural
**Testing**
✅ OpenAIRealtimeBetaLLMService (current) - All-in-one: STT+LLM+TTS
{%preview https://screen.studio/share/S8P1woCY?private-access=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzaGFyZWFibGVMaW5rSWQiOiIzYWY2NjYyNi1kZTU4LTQ2ZjktYTZlOC04OTQzYTg1OTUyMjUiLCJpYXQiOjE3NTcxMDQwNjB9.RKDxPu10o3fSPKHp94Xp9z1-Lzhj178lIg5NCakrGmg %}
--> Current prompt in single_bot.py
```
SYSTEM_INSTRUCTION = """
You are Caroline, a young professor of english and deeply knowledgeable about 19th and 20th-century American and English literature.
Your aim is to support the user understanding of the story through friendly conversation, slowly building a deep appreciation for the literature without rushing through points.
Output Format
- Start with a concise statement, elaborating further only upon follow-up queries.
- Diversify responses in length, structure, and tone to avoid predictability.
- As appropriate, keep some answers succinct, sometimes even one-worded.
Notes
- Steer clear of directly referencing plot outcomes or future events to prevent spoilers.
- Focus on the story's context, character analysis, and historical background.
- Ensure your demeanor is calm, conversational, and dynamic, employing varied vocabulary and expressions. Tailor interactions for long-term engagement rather than short-term understanding, embodying a friendly conversational tone
This is the book the user is going through and the last few sentences listened before initiating the conversation (the last sentence might have been interrupted half way):
"""
```
--> Enhanced System Prompt for interruption + continuation flow:
```
SYSTEM_INSTRUCTION = """
You are Caroline, a young professor of English literature, specializing in 19th and 20th-century American and English works.
CORE BEHAVIOR: Interruption-Based Reading Companion
You are designed to be INTERRUPTED by users at any point in their audiobook journey. When users pause their reading to ask questions:
1. **Answer their question** based on where they currently are in the story
2. **Keep responses concise** (30-60 seconds max) to respect their reading flow
3. **After answering, ALWAYS ask**: "Shall I continue reading?" or "Ready to continue?" or similar natural transition
Context Awareness
- You know EXACTLY where the user is in their reading journey
- Reference the current scene/passage they just heard
- Avoid spoilers - only discuss what they've already experienced
- Use their current context to provide relevant insights
Conversation Style
- **Natural and conversational** - like a friend discussing books over coffee
- **Vary your transitions**: "Shall I continue?", "Ready to go back?", "Want to keep reading?"
- **Match the story's tone** - more formal for classics, casual for modern works
- **Be encouraging** about their reading progress
Response Length
- **Quick questions**: 1-2 sentences + transition
- **Complex analysis**: 3-4 sentences max + transition
- **Always end with invitation to continue reading**
This is the book the user is currently experiencing and their exact position:
"""
```
### 05092025
```
📞 sync call @ 8 am PST w Giacomo.
Alignment on product rollout strategy and roles:
- Landing page + prototype = Lead Maz
- Main app backend and architecture - lead Giacomo
- Main iOS app development front-end/fullstack - ?
Deep dive into:
- data processing/pipeline per book = chapter to metadata to process files,
- LLM methodology, logic and behaviour
- Roadmap with product phases
```
**Call notes and areas to further develop:**
**Pipeline overview (Giacomo to provide complete version; a rough sketch follows this list)**
* the chapter is broken down by splitting it into paragraphs
* each paragraph is sent to OpenAI
* OpenAI sends back an audio file for each chunk/paragraph
* once all the files are generated, they are merged into a single file
* sentence-based segmentation is applied (https://www.readbeyond.it/aeneas/docs/clitutorial.html)
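A rough Node/TypeScript sketch of that pipeline as I currently understand it; the real pipeline is Giacomo's Python tooling and may differ at every step, so the file names, TTS endpoint usage, naive MP3 merge, and Aeneas invocation below are all assumptions for illustration:

```typescript
import { execSync } from "node:child_process";
import { readFileSync, writeFileSync } from "node:fs";

// 1. Break the chapter into paragraphs (assumed blank-line separated in the source text).
const chapter = readFileSync("chapter01.txt", "utf8");
const paragraphs = chapter.split(/\n\s*\n/).map(p => p.trim()).filter(Boolean);

// 2. Send each paragraph to OpenAI TTS and collect the returned audio.
async function synthesise(text: string): Promise<Buffer> {
  const res = await fetch("https://api.openai.com/v1/audio/speech", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: "tts-1", voice: "shimmer", input: text }),
  });
  return Buffer.from(await res.arrayBuffer());
}

async function run() {
  const chunks: Buffer[] = [];
  for (const p of paragraphs) chunks.push(await synthesise(p));

  // 3. Merge the per-paragraph files into a single chapter file (naive concatenation here;
  //    the real pipeline presumably uses a proper audio tool such as ffmpeg).
  writeFileSync("chapter01.mp3", Buffer.concat(chunks));

  // 4. Sentence-level forced alignment with Aeneas (CLI usage per the tutorial linked above).
  execSync(
    'python -m aeneas.tools.execute_task chapter01.mp3 chapter01.txt ' +
    '"task_language=eng|is_text_type=plain|os_task_file_format=json" chapter01-align.json'
  );
}

run();
```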
**User scenario and model performance testing cases to write up:**
* Different modalities of conversation
* Editorial power = the user issues a command to control the player, i.e. go back, or reference a RAG
`for example user prompt: 'go back to the section in the story where Turkey is introduced'`
{%preview https://screen.studio/share/2jSJOabR?private-access=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzaGFyZWFibGVMaW5rSWQiOiJmZjZjMDkxZC04Yzc5LTRmYzQtOTA3NS05ZjYyMzg0MGU1MzMiLCJpYXQiOjE3NTcxMDM3NDl9.hqkGwP49N2q9zxuOPh5MiLmaczavLP3VWuCJDR3QPKU %}
**Product roadmap breakdown (incoming) and analytics:**
- ✅ To agree on basic functionality for mvp
- ☑️ Data collection for Mixpanel and data gathering, i.e. how people are interacting with this = add Maz to this
**Relevant areas for development to consider for future app:**
* wake word detection = initiation of the detection (https://en.wikipedia.org/wiki/Voice_activity_detection)
* https://github.com/pipecat-ai/smart-turn
**Exploration on changing endpoint to latest openai gpt-realtime model** (summary https://chatgpt.com/share/68bb3538-73b0-800e-a69b-162b1afe51b6)
- TL;DR: our setup is Pipecat + Python instead of the OpenAI SDK directly, so Pipecat needs to add support for gpt-realtime in order for us to avoid potential errors on execution (ref https://github.com/pipecat-ai/pipecat/issues/2581?utm_source=chatgpt.com)
**End of the week pending tasks in order of priority based on this week's progress:**
1. ✅ Data pipeline/processing and framework from Giacomo: vibe-code an interface if need be so that others (including Maz) can also process books. DEADLINE = when would you be able to provide this? If not the interface, the documentation would be a good first start...
2. ✅ User requirements and product phase roadmap: Chester is leading this; I will then take it over from him to work on it with Giacomo. DEADLINE = I am expecting to receive a draft from Chester next week
3. ☑️ User scenario prompt testing doc: listing different versions of responses based on commands made by the user and measuring performance, i.e. go back, ask X and reference RAG (post-MVP). DEADLINE = v1 draft 08/09.09.2025
4. 🚧 Landing page + prototype UI/UX and testing and modification will continue in W2.
- Fix BartlebyPlayer UI: update to the AudioPlayer component for consistency across all audio players for future books
- Clean up repo
{%preview https://screen.studio/share/aUaBGnHS %}
### **08092025**
**Prototype refinement week 2 objectives include:** (1) Organising the repo to separate the front end from the backend, (2) Making UI components consistent across pages rather than specific to a single book, *i.e. in Store add a rule to use the UI AudioPlayer component for every book for consistency*, (3) Introducing book tags etc. as part of the data model (a sketch of a possible shape follows below), (4) Creating the foundational requirements to roll out more books.
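For point (3), a hedged sketch of what a richer per-book entry could look like; these field names are my proposal, not the current `books.json` schema:

```typescript
// Proposed (not yet implemented) book entry, adding tags and per-chapter assets so the
// same AudioPlayer component can render any book.
interface ChapterEntry {
  title: string;
  audioUrl: string;        // merged chapter audio
  alignmentUrl: string;    // sentence-level begin/end/text JSON
}

interface BookEntry {
  slug: string;            // URL-safe name, e.g. "bartleby-the-scrivener"
  title: string;
  author: string;
  notes?: string;          // e.g. pronunciation hints for the voice assistant
  tags: string[];          // e.g. ["classic", "short-story"]
  coverUrl?: string;
  chapters: ChapterEntry[];
}
```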
**Book data processing findings:** to process the 2nd book, I have personally chosen 'The Last Economy' ([docs](https://thelasteconomy.com/docs/the-last-economy/introduction) and [github](https://github.com/Intelligent-Internet/Symbioism-TLE)) for practice and to show on the prototype web app frontend.
1. Time-aligned transcript prep
```
Time-aligned transcript of a chapter, broken into JSON objects. Each object contains:
1. "begin" → start timestamp in seconds (string formatted to 3 decimals)
2. "end" → end timestamp in seconds (same format)
3. "text" → the sentence or segment spoken during that interval
This structure is typical for audiobooks when the text is synced with audio (sometimes called SMIL-style or TextGrid-style alignment). It allows you to:
- Highlight text as audio plays (karaoke-style read-along).
- Jump to exact parts of the audio when clicking a sentence.
- Slice audio into sentence-level clips for interactive experiences.
- Do analytics like reading speed, sentence duration, or word timing if you add word-level detail.
A few notes on what you have:
- The "begin" and "end" values line up with seconds in the audio track. For example, "begin": "0.000", "end": "10.560" means that first sentence spans the first 10.56 seconds of the chapter.
- These alignments were likely generated by a forced-alignment tool (like Aeneas, which we already use: https://readbeyond.it/aeneas/docs/clitutorial.html, or WhisperX, or AWS Transcribe with timestamps).
- The "text" isn’t always exactly a “sentence” in grammar terms — it’s a segment that matches pauses in the audio, so sometimes it will be a fragment or multiple sentences.
(Chat summary https://chatgpt.com/share/68bf6f6e-a150-800e-bfe7-f2fe5207720b)
```
2. Feeding the chapter to the LLM
```
Book Selection: BookSelectionScreen.tsx loads books from books.json and maps titles to URL names
System Instruction: Currently hardcoded for "Caroline, a young professor of english" focused on 19th/20th-century literature
Voice: Already set to "shimmer" ✅
Book Context: Gets sent via the systemInstruction parameter with book info + current listening context
```
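A tiny sketch of the title-to-URL-name mapping mentioned above; the helper name is illustrative, not the actual `BookSelectionScreen.tsx` code:

```typescript
// Hypothetical slug helper for book routing:
// "Bartleby, the Scrivener" -> "bartleby-the-scrivener"
function titleToUrlName(title: string): string {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")   // collapse punctuation/whitespace into hyphens
    .replace(/^-+|-+$/g, "");      // trim leading/trailing hyphens
}

console.log(titleToUrlName("Bartleby, the Scrivener")); // "bartleby-the-scrivener"
console.log(titleToUrlName("The Last Economy"));        // "the-last-economy"
```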
### 09092025
```
📞 sync call @ 8 am PST w Masoud, Giacomo and Chester
Action items
- Maz to continue working on the web app prototype, incorporating feedback on the design and layout, and to provide updates for review
- Complete product roadmap per product phase as led on call by Chester
- Giacomo to upload the data processing pipeline to IOF and work on compiling 100 books across the agreed-upon categories.
```
Giacomo has delivered the following for book processing at scale: http://138.201.22.12:5005/

### 12092025
Continued work on forked repo tasks
- system instruction edit and test
- change over API keys
- scope demo for Pinar and next steps
Front-end next steps
- each book to have individual pages following the same UI params for better SEO
- ensure web app responsiveness on desktop and mobile
- get ready to merge into main repo
- create small branches for edits in main repo
- discuss sign-in/login capability in the prototype
{%preview https://screen.studio/share/wW43ROdR %}