# [WIP] Think AR
## Background
- AR glasses that connect to iPhone app
- Wants on-device iOS inference and function calling / JSON responses (i.e. speak -> open apps)
- Likely to prototype on M4 iPads to prep for M4 iPhones
- Already using whisper.cpp for STT (possibly via ffmpeg + whisper.cpp?)
## Scope
### 1. Llama.cpp in Swift
- `cortex-swift`: A Swift package that runs recent LLM architectures (Llama 3, Phi-3)
- Encapsulates the [llama.cpp inference engine](https://github.com/janhq/cortex.llamacpp) and the [Cortex server](https://github.com/janhq/cortex).
- We'll likely implement language bindings to Swift using Cortex's [engine interface](https://github.com/janhq/cortex/blob/dev/cortex-cpp/cortex-common/EngineI.h). This is extensible if Android support is needed in the future.
- Cortex server includes production-level features such as request queues, an OpenAI-compatible API, and other components that enable rapid development on top of the underlying AI capabilities.
- Likely to chain with the existing whisper.cpp setup, similar to [Talk Llama](https://github.com/ggerganov/whisper.cpp/tree/master/examples/talk-llama); see the sketch below.
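As a rough sketch of the intended `cortex-swift` surface, assuming bindings modeled on Cortex's `EngineI.h` (all names and signatures below are illustrative assumptions, not the final API):
```swift
import Foundation

// Hypothetical surface for the proposed `cortex-swift` package, loosely
// mirroring Cortex's EngineI.h (loadModel / handleChatCompletion / unloadModel).
protocol CortexEngine {
    func loadModel(configPath: URL) throws
    func chatCompletion(messages: [[String: String]],
                        completion: @escaping (Result<String, Error>) -> Void)
    func unloadModel() throws
}

// Intended flow, mirroring talk-llama: a whisper.cpp transcript goes in,
// the LLM reply comes back asynchronously.
func respond(to transcript: String, using engine: CortexEngine) {
    let messages = [
        ["role": "system", "content": "You are a helpful assistant."],
        ["role": "user", "content": transcript]
    ]
    engine.chatCompletion(messages: messages) { result in
        switch result {
        case .success(let reply): print("LLM: \(reply)")
        case .failure(let error): print("Inference failed: \(error)")
        }
    }
}
```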
#### Not in Scope
- See all [Inference frameworks](#inference-frameworks)
### 2. Custom Model(s)
- A custom model for iOS function calling / JSON responses
- Synthetic training datasets for functions and popular APIs
- Construct IPC function calls
- Construct deep links (iOS IPC is not always great; see the sketch after this list)
- **Expected speed**: `6-20 tokens/second` depending on model size, i.e. roughly 4.5-15 words per second. Note: the average speaking rate is 150 wpm (2.5 words per second), so even the low end outpaces speech.
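As a minimal sketch of the deep-link path, assuming the model emits a small JSON intent (the `OpenAppIntent` shape and URL schemes below are illustrative, not a committed format):
```swift
import Foundation

// Hypothetical intent shape the custom model would emit as JSON, plus a
// mapping to iOS deep links. Fields and URL schemes are assumptions.
struct OpenAppIntent: Codable {
    let app: String      // e.g. "maps"
    let query: String?   // e.g. "coffee near me"
}

func deepLink(for intent: OpenAppIntent) -> URL? {
    let q = intent.query?
        .addingPercentEncoding(withAllowedCharacters: .urlQueryAllowed) ?? ""
    switch intent.app {
    case "maps": return URL(string: "maps://?q=\(q)")        // Apple Maps
    case "mail": return URL(string: "mailto:?subject=\(q)")
    default:     return nil                                  // unknown app: no-op
    }
}
// The host app would then hand the URL to UIApplication.shared.open(_:).
```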
### 3. Function calling vs JSON output
- Model outputs structured JSON for manual, programmatic post-processing
- Cortex will enable a JSON response format so that information is easy to extract (this is better for limited context lengths and fast integration)
`POST /chat/completions`
```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant that outputs in JSON."
    },
    {"role": "user", "content": "Who won the world series in 2020?"}
  ],
  "response_format": {
    "type": "json_object",
    "schema": {
      "type": "object",
      "properties": {"team_name": {"type": "string"}},
      "required": ["team_name"]
    }
  }
}
```
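On the Swift side, the constrained output is then trivial to post-process; a minimal sketch using Codable (the struct name is ours, the schema comes from the example above):
```swift
import Foundation

// Decode the schema-constrained model output with Codable; the struct
// mirrors the example schema (`team_name` only).
struct WorldSeriesAnswer: Codable {
    let team_name: String
}

let raw = #"{"team_name": "Los Angeles Dodgers"}"#  // example model output
if let data = raw.data(using: .utf8),
   let answer = try? JSONDecoder().decode(WorldSeriesAnswer.self, from: data) {
    print(answer.team_name)  // "Los Angeles Dodgers"
}
```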
#### Not in Scope?
- Function calling as a full OpenAI-compatible [Tool](https://platform.openai.com/docs/guides/function-calling/function-calling-behavior)
- Function chaining, parallel function calling, etc.
## Timeline
*This timeline is aggressive and will require us to internally reprioritize. To be confirmed.*
- Week 1: Understand requirements, validate course of action.
- Week 2: LLM running on-device via `cortex-swift`
- Week 3: Train custom model(s)
- Week 4: Function calling / JSON response format
# Appendix
## Models
The following foundation models are suitable as base models for further post-training.
| Model | Description | Peak memory | Disk size | Performance | License |
| -------- | -------- | -------- | -------- |-------- | ---- |
| Phi-3 Small (7B) | Text | TBD | TBD | TBD | [MIT](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/blob/main/LICENSE) |
| Mistral 7B v0.3 Instruct | Text | TBD | TBD | TBD | [Apache 2.0](https://mistral.ai/news/announcing-mistral-7b/) |
## Finetuning
Post-training methods under consideration:
| Method | Description | Data |
| -------- | -------- | -------- |
| Supervised Fine-tuning (SFT) | Fine-tune on a curated dataset to align the model on token formats, tone, and style, and likely add a limited amount of knowledge | Conversation datasets, synthetic datasets, structured datasets (for function calling) |
| Direct Preference Optimization (DPO) | Alternative to RLHF | Binary preference dataset |
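For the structured (function-calling) SFT data, a single synthetic sample might look like the following; the format and intent fields are purely illustrative and tie to the deep-link sketch above:
```json
{
  "messages": [
    {"role": "user", "content": "Open Maps and find coffee near me"},
    {"role": "assistant", "content": "{\"app\": \"maps\", \"query\": \"coffee near me\"}"}
  ]
}
```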
We can also explore other SOTA fine-tuning techniques:
- ORPO
- DoRA
- LoRA/QLoRA
In addition, we can explore:
- Micro-training
- RAG quality improvement
- Continual learning (fine-tuning without losing base capabilities)
## Inference Frameworks
| Tool | Description | iPhone | Android | Performance* | License |
|------------------------------------------------------------------------|----------------------------------------------------------------|----------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------|
| [llamacpp](https://github.com/ggerganov/llama.cpp) | Cross-platform inference in C++ | ✅ Delivery mechanism needed | ✅ JNI wrapper/solution needed | ~9 TPS (Phi-3 Small on iPhone) <br> ~4 TPS (Mistral 7B on iPhone) | [MIT](https://github.com/ggerganov/llama.cpp?tab=MIT-1-ov-file#readme) |
| [MLX](https://github.com/ml-explore/mlx) | ML framework for inference & training on Apple Silicon | ✅ [MLX Swift](https://github.com/ml-explore/mlx-swift) | Not compatible | ~8.5 TPS (Llama 3 8B on iPhone) | [MIT](https://github.com/ml-explore/mlx?tab=MIT-1-ov-file#readme) |
| [MLC](https://github.com/mlc-ai/mlc-llm) | Universal deployment for on-device inference | ✅ MLC Swift (direct binding & API) | ✅ OpenCL | ~11 TPS (Mistral 7B on iPhone) | [Apache 2.0](https://github.com/mlc-ai/mlc-llm?tab=Apache-2.0-1-ov-file#readme) |
| [Executorch](https://github.com/pytorch/executorch) | On-device AI across mobile, embedded, and edge devices for PyTorch | ✅ Bundles CoreML | ✅ Bundles JNI wrapper | 11.5 TPS ([Llama 2 7B on OnePlus 12](https://github.com/pytorch/executorch/blob/main/examples/models/llama2/README.md#performance)) | [BSD](https://github.com/pytorch/executorch?tab=License-1-ov-file#readme) |
| [onnx genai](https://github.com/microsoft/onnxruntime-genai?tab=readme-ov-file) | On-device AI across mobile and desktop; iOS support upcoming, not yet available | ✅ Can use [CoreML Execution Provider](https://onnxruntime.ai/docs/execution-providers/CoreML-ExecutionProvider.html) | ✅ [NNAPI Execution Provider](https://onnxruntime.ai/docs/execution-providers/NNAPI-ExecutionProvider.html) | N/A | [MIT](https://github.com/microsoft/onnxruntime-genai/blob/main/LICENSE) |
*\*Crowdsourced performance stats that we will need to verify.*
## Reference
- llama.cpp on A-series chips: https://github.com/ggerganov/llama.cpp/issues/4358
- llama.cpp on M-series chips: https://github.com/ggerganov/llama.cpp/discussions/4167
- Apple Transformer (not LLM) on Neural Engine: https://machinelearning.apple.com/research/neural-engine-transformers
- Explanation for Apple Neural Engine: https://github.com/hollance/neural-engine