# [WIP] Think AR
## Background
- AR glasses that connect to iPhone app
- Wants on-device iOS inference and function calling / JSON responses (i.e. speak -> open apps)
- Likely to prototype on M4 iPads to prep for M4 iPhones
- Already using whisper.cpp for STT (possibly via ffmpeg + whisper.cpp?)
## Scope
### 1. Llama.cpp in Swift
- `cortex-swift`: A Swift package that runs recent LLM architectures (Llama 3, Phi-3)
- Encapsulates the [llama.cpp inference engine](https://github.com/janhq/cortex.llamacpp) and the [Cortex server](https://github.com/janhq/cortex).
- We'll likely implement language bindings to Swift using Cortex's [engine interface](https://github.com/janhq/cortex/blob/dev/cortex-cpp/cortex-common/EngineI.h). This is extensible if Android support is needed in the future.
- Cortex server includes production-level features such as request queues, an OpenAI-compatible API, and other components that enable rapid development on top of the underlying AI capabilities.
- Likely to chain with the existing whisper.cpp setup, similar to [Talk Llama](https://github.com/ggerganov/whisper.cpp/tree/master/examples/talk-llama); see the sketch below.
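As a rough sketch of the intended `cortex-swift` surface, assuming bindings modeled on Cortex's `EngineI.h` (all names and signatures below are illustrative assumptions, not the final API):
```swift
import Foundation

// Hypothetical surface for the proposed `cortex-swift` package, loosely
// mirroring Cortex's EngineI.h (loadModel / handleChatCompletion / unloadModel).
protocol CortexEngine {
    func loadModel(configPath: URL) throws
    func chatCompletion(messages: [[String: String]],
                        completion: @escaping (Result<String, Error>) -> Void)
    func unloadModel() throws
}

// Intended flow, mirroring talk-llama: a whisper.cpp transcript goes in,
// the LLM reply comes back asynchronously.
func respond(to transcript: String, using engine: CortexEngine) {
    let messages = [
        ["role": "system", "content": "You are a helpful assistant."],
        ["role": "user", "content": transcript]
    ]
    engine.chatCompletion(messages: messages) { result in
        switch result {
        case .success(let reply): print("LLM: \(reply)")
        case .failure(let error): print("Inference failed: \(error)")
        }
    }
}
```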
#### Not in Scope
- See all [Inference frameworks](#inference-frameworks)
### 2. Custom Model(s)
- A custom model for iOS function calling / JSON responses
- Synthetic training datasets for functions and popular APIs
- Construct IPC function calls
- Construct deep links (iOS IPC is not always great; see the sketch after this list)
- **Expected speed**: `6-20 tokens/second` depending on model size, i.e. roughly 4.5-15 words per second. Note: the average speaking rate is 150 wpm (2.5 words per second), so even the low end outpaces speech.
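As a minimal sketch of the deep-link path, assuming the model emits a small JSON intent (the `OpenAppIntent` shape and URL schemes below are illustrative, not a committed format):
```swift
import Foundation

// Hypothetical intent shape the custom model would emit as JSON, plus a
// mapping to iOS deep links. Fields and URL schemes are assumptions.
struct OpenAppIntent: Codable {
    let app: String      // e.g. "maps"
    let query: String?   // e.g. "coffee near me"
}

func deepLink(for intent: OpenAppIntent) -> URL? {
    let q = intent.query?
        .addingPercentEncoding(withAllowedCharacters: .urlQueryAllowed) ?? ""
    switch intent.app {
    case "maps": return URL(string: "maps://?q=\(q)")        // Apple Maps
    case "mail": return URL(string: "mailto:?subject=\(q)")
    default:     return nil                                  // unknown app: no-op
    }
}
// The host app would then hand the URL to UIApplication.shared.open(_:).
```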
### 3. Function calling vs JSON output
- Model outputs structured JSON for manual, programmatic post-processing
- Cortex will enable a JSON response format so that information is easy to extract (this is better for limited context lengths and fast integration)
`POST /chat/completions`
```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant that outputs in JSON."
    },
    {"role": "user", "content": "Who won the world series in 2020?"}
  ],
  "response_format": {
    "type": "json_object",
    "schema": {
      "type": "object",
      "properties": {"team_name": {"type": "string"}},
      "required": ["team_name"]
    }
  }
}
```
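On the Swift side, the constrained output is then trivial to post-process; a minimal sketch using Codable (the struct name is ours, the schema comes from the example above):
```swift
import Foundation

// Decode the schema-constrained model output with Codable; the struct
// mirrors the example schema (`team_name` only).
struct WorldSeriesAnswer: Codable {
    let team_name: String
}

let raw = #"{"team_name": "Los Angeles Dodgers"}"#  // example model output
if let data = raw.data(using: .utf8),
   let answer = try? JSONDecoder().decode(WorldSeriesAnswer.self, from: data) {
    print(answer.team_name)  // "Los Angeles Dodgers"
}
```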
#### Not in Scope?
- Function calling as a full OpenAI-compatible [Tool](https://platform.openai.com/docs/guides/function-calling/function-calling-behavior)
- Function chaining, parallel function calling, etc.
## Timeline
*This timeline is aggressive and will require us to internally reprioritize. To be confirmed.*
- Week 1: Understand requirements, validate course of action.
- Week 2: LLM running on-device via `cortex-swift`
- Week 3: Train custom model(s)
- Week 4: Function calling / JSON response format
# Appendix
## Models
The following foundation models are suitable as base models for further post-training.
| Model | Description | Peak memory | Disk size | Performance | License |
| -------- | -------- | -------- | -------- |-------- | ---- |
| Phi-3 Small (7B) | Text | TBD | TBD | TBD | [MIT](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/blob/main/LICENSE) |
| Mistral 7B v0.3 Instruct | Text | TBD | TBD | TBD | [Apache 2.0](https://mistral.ai/news/announcing-mistral-7b/) |
## Finetuning
Post-training methods under consideration:
| Method | Description | Data |
| -------- | -------- | -------- |
| Supervised Fine-tuning (SFT) | Fine-tune on a curated dataset to align the model on token formats, tone, and style, and likely add a limited amount of knowledge | Conversation datasets, synthetic datasets, structured datasets (for function calling) |
| Direct Preference Optimization (DPO) | Alternative to RLHF | Binary preference dataset |
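For the structured (function-calling) SFT data, a single synthetic sample might look like the following; the format and intent fields are purely illustrative and tie to the deep-link sketch above:
```json
{
  "messages": [
    {"role": "user", "content": "Open Maps and find coffee near me"},
    {"role": "assistant", "content": "{\"app\": \"maps\", \"query\": \"coffee near me\"}"}
  ]
}
```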
We can also explore other SOTA fine-tuning techniques:
- ORPO
- DoRA
- LoRA/QLoRA
In addition, we can explore:
- Micro-training
- RAG quality improvement
- Continual learning (fine-tuning without losing base capabilities)
## Inference Frameworks
| Tool | Description | iPhone | Android | Performance* | License |
|------------------------------------------------------------------------|----------------------------------------------------------------|----------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------|
| [llamacpp](https://github.com/ggerganov/llama.cpp) | Cross-platform inference in C++ | ✅ Delivery mechanism needed | ✅ JNI wrapper/solution needed | ~9 TPS (Phi-3 Small on iPhone) <br> ~4 TPS (Mistral 7B on iPhone) | [MIT](https://github.com/ggerganov/llama.cpp?tab=MIT-1-ov-file#readme) |
| [MLX](https://github.com/ml-explore/mlx) | ML framework for inference & training on Apple Silicon | ✅ [MLX Swift](https://github.com/ml-explore/mlx-swift) | Not compatible | ~8.5 TPS (Llama 3 8B on iPhone) | [MIT](https://github.com/ml-explore/mlx?tab=MIT-1-ov-file#readme) |
| [MLC](https://github.com/mlc-ai/mlc-llm) | Universal deployment for on-device inference | ✅ MLC Swift (direct binding & API) | ✅ OpenCL | ~11 TPS (Mistral 7B on iPhone) | [Apache 2.0](https://github.com/mlc-ai/mlc-llm?tab=Apache-2.0-1-ov-file#readme) |
| [Executorch](https://github.com/pytorch/executorch) | On-device AI across mobile, embedded, and edge devices for PyTorch | ✅ Bundles CoreML | ✅ Bundles JNI wrapper | 11.5 TPS ([Llama 2 7B on OnePlus 12](https://github.com/pytorch/executorch/blob/main/examples/models/llama2/README.md#performance)) | [BSD](https://github.com/pytorch/executorch?tab=License-1-ov-file#readme) |
| [onnx genai](https://github.com/microsoft/onnxruntime-genai?tab=readme-ov-file) | On-device AI across mobile and desktop; iOS support upcoming, not yet available | ✅ Can use [CoreML Execution Provider](https://onnxruntime.ai/docs/execution-providers/CoreML-ExecutionProvider.html) | ✅ [NNAPI Execution Provider](https://onnxruntime.ai/docs/execution-providers/NNAPI-ExecutionProvider.html) | N/A | [MIT](https://github.com/microsoft/onnxruntime-genai/blob/main/LICENSE) |
*\*Crowdsourced performance stats that we will need to verify.*
## Reference
- llama.cpp on A-series chips: https://github.com/ggerganov/llama.cpp/issues/4358
- llama.cpp on M-series chips: https://github.com/ggerganov/llama.cpp/discussions/4167
- Apple Transformer (not LLM) on Neural Engine: https://machinelearning.apple.com/research/neural-engine-transformers
- Explanation for Apple Neural Engine: https://github.com/hollance/neural-engine