# Cortex Roadmap 🧠

**Mission**
> To add intelligence to all hardware, in particular robots

**Vision**
> To become the go-to tool to put AI on any device

This document highlights where Cortex currently is in terms of development, what needs to be addressed (ordered by priority), and what lies ahead in 2025.

## Table of Contents

1. State of Cortex
2. Path to Success
   - Configuration
   - Portability
   - Python Engine
   - Data Management
   - Performance
   - Benchmarks & Metrics
   - Model Management
3. Content
   - Documentation
   - Guides
   - Videos
   - Conferences
4. Next Stage
   - Cortex Platform
     - Functional Requirements
     - Non-Functional Requirements
     - Design
     - Monetization
   - Model Hub
   - Hardware Integration
5. QA
   - Tests
   - CI/CD
   - Hardware Assurance
6. Roadmap & Action Items
   - Milestones
   - TODOs
   - v2

## 1. State of Cortex

The current version of Cortex, `v1.0.10`, allows developers to run LLMs across different platforms after installation, but it falls short of being v1-ready.

**What's working well**

- Cortex has a clean CLI with straightforward commands, taking inspiration from the way Docker manages images and initializes containers.
- The OpenAI-compatible API provides a familiar way to communicate with models, making it easy for even non-developers to start using Cortex with a few commands. :raised_hands:
- The API docs generated upon starting the server are very good. :ok_hand:

**What needs improvement**

- Cortex doesn't provide a complete way to customize the `.cortexrc` file it creates in the home directory of its host. This forces users to manually tweak settings like `apiServerHost:` to `0.0.0.0` in order to self-host Cortex on a remote machine.
- Not enough educational materials like videos, tutorials, and guides. At the moment, users need to know, or have an idea of, what they want to do when they come to Cortex.
- The distribution of Cortex is too manual. Users need to go to the documentation and copy a command in order to use it.
- The multi-branch approach doesn't seem to offer a lot of benefits on top of the one-branch-to-rule-them-all approach. It would be beneficial to A/B test this hypothesis inside the `cortexso` hub to make sure we want to continue investing time in this approach.
- `llama.cpp`, our main engine, comes from a repo called `cortex.llamacpp`, which adds complexity to the codebase.
- Interactions with a model are stateless.
- It is not possible to use different modalities at the moment.

**Competitors**

- Ollama
  - They have good momentum, to the point that users do the marketing for them. This is a great place to be in.
- [ZML](https://github.com/zml/zml)
  - Uses a granular inference pipeline where the model's forward computation is compiled into an accelerator-specific executable, taking advantage of type-safe tensor constructs and explicit buffer management to minimize overhead and give fine-grained control over memory and compute operations. The video below by one of the founders is quite good.

{%youtube hLHITkWb77s %}

## 2. Path to Success

The following path is meant to serve as a tentative blueprint for making Cortex's internals achieve a high degree of reliability, flexibility, and usability in 2025, alongside good distribution.

### Configuration

At the moment, Cortex has minimal support for editing its own configuration via the CLI or HTTP. This makes it challenging for developers who want to deploy it on a VM in the cloud, or in any environment where the ultimate goal is to let Cortex talk to other tools via the server.
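Today the workaround is manual. The sketch below illustrates it using the defaults from the `.cortexrc` dump shown further down; the restart commands and the `/v1/models` route (implied by the OpenAI-compatible API) are assumptions rather than documented behavior.

```sh
# Manual workaround today (illustrative): expose the API server, then restart
sed -i 's/^apiServerHost: 127.0.0.1$/apiServerHost: 0.0.0.0/' ~/.cortexrc
cortex stop && cortex start

# From another machine, check the OpenAI-compatible API (default port 39281)
curl http://<vm-ip>:39281/v1/models
```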
The first stage in making Cortex more configurable would involve adding a CLI flag for most of the options in the `.cortexrc` file. At the moment, the generated `.cortexrc` file contains the following parameters:

```
logFolderPath: /home/user/cortexcpp
logLlamaCppPath: ./logs/cortex.log
logTensorrtLLMPath: ./logs/cortex.log
logOnnxPath: ./logs/cortex.log
dataFolderPath: /home/user/cortexcpp
maxLogLines: 100000
apiServerHost: 127.0.0.1
apiServerPort: 39281
checkedForUpdateAt: 1740630061
checkedForLlamacppUpdateAt: 1740628158149
latestRelease: v1.0.10
latestLlamacppRelease: v0.1.49
huggingFaceToken: hf_****************************
gitHubUserAgent: ""
gitHubToken: ""
llamacppVariant: linux-amd64-avx2-cuda-12-0
llamacppVersion: v0.1.49
enableCors: true
allowedOrigins:
  - http://localhost:39281
  - http://127.0.0.1:39281
  - http://0.0.0.0:39281
proxyUrl: ""
verifyProxySsl: true
verifyProxyHostSsl: true
proxyUsername: ""
proxyPassword: ""
noProxy: example.com,::1,localhost,127.0.0.1
verifyPeerSsl: true
verifyHostSsl: true
sslCertPath: ""
sslKeyPath: ""
supportedEngines:
  - llama-cpp
  - onnxruntime
  - tensorrt-llm
  - python-engine
  - python
checkedForSyncHubAt: 0
```

To start a server, we currently only offer three options:

```shell
cortex start --port 7777 --loglevel DBUG --help
```

The most crucial option needed in early 2025 is undoubtedly `apiServerHost`, so that Cortex can be deployed on a remote VM. Ideally, we would provide users with the full menu of configurations when starting the Cortex server. For example:

First, a little abstraction for better DX:

- `logFolderPath` --> `--logspath </path/to/nirvana>`
- `logLlamaCppPath` --> `--logsllama </path/to/llamaland>`
- `logTensorrtLLMPath` --> Needs to be removed 🪓
- `logOnnxPath` --> `--logsonnx </path/to/devsdevsdevs>`
- `dataFolderPath` --> `--datapath </path/to/dataland>`
- `maxLogLines` --> `--loglines <100000>`
- `apiServerHost` --> `--host <0.0.0.0>`
- `apiServerPort` --> `--port <7777>` ✅
- `checkedForUpdateAt` --> ... not needed to start the server ☕
- `checkedForLlamacppUpdateAt` --> ... not needed to start the server ☕
- `latestRelease` --> ... not needed to start the server ☕
- `latestLlamacppRelease` --> ... not needed to start the server ☕
- `huggingFaceToken` --> `--hf-token <token>`
- `gitHubUserAgent` --> `--gh-agent <that-thing>`
- `gitHubToken` --> `--gh-token <that-token>`
- `llamacppVariant` --> ... not needed to start the server ☕
- `llamacppVersion` --> ... not needed to start the server ☕
- `enableCors` --> `--cors 1` (1 = true & 0 = false)
- `allowedOrigins` --> `--origins <list of origins>`
- `proxyUrl` --> `--proxy-url "https://hey.you"`
- `verifyProxySsl` --> `--verify-proxy`
- `verifyProxyHostSsl` --> `--verify-proxy-host`
- `proxyUsername` --> `--proxy-username`
- `proxyPassword` --> `--proxy-password`
- `noProxy` --> `--no-proxy <example.com,::1,localhost,127.0.0.1>`
- `verifyPeerSsl` --> `--verify-ssl-peer`
- `verifyHostSsl` --> `--verify-ssl-host`
- `sslCertPath` --> `--ssl-cert-path`
- `sslKeyPath` --> `--ssl-key-path`
- `supportedEngines` --> ... not needed to start the server ☕
- `checkedForSyncHubAt` --> ... not needed to start the server ☕

```sh
cortex start --host "0.0.0.0" \
             --port 7777 \
             --hf-token "<some-token>" \
             --cors 1 \
             --logspath "/some/interesting/path" \
             ...
```

The second stage would involve letting Cortex live as a long-standing process on the system where it is installed, using `systemd` or whatever init mechanism is available on the user's device. We could implement this in different ways. Here is one example:

```sh
sudo touch /etc/systemd/system/cortex.service
sudo chmod 664 /etc/systemd/system/cortex.service
```

In the `cortex.service` file we would include:

```txt
[Unit]
Description=Cortex

[Service]
ExecStart=/usr/path/to/cortex/binary start

[Install]
WantedBy=multi-user.target
```

Then we can reload `systemctl` with:

```sh
sudo systemctl daemon-reload
```

And operate on the service as a long-standing process:

```shell
sudo systemctl start cortex.service
sudo systemctl stop cortex.service
sudo systemctl restart cortex.service
sudo systemctl enable cortex.service
systemctl status cortex.service
```

### Portability

Portability here means making Cortex more accessible via package managers or other distribution channels that might be more appropriate for different hardware. For example, users targeting tiny devices or micro-controllers might opt for Alpine Linux, as it weighs on average 30-50 MB. One pathway to get Cortex installed would be via the respective package manager of each platform; in the case of Alpine, that would be through `apk`.

Ideally, we would add the required workflows to our CI to distribute Cortex via:

Mac
- Homebrew - `brew`
- Nix - `nix`
- MacPorts - `port`

Windows
- Chocolatey - `choco`
- Scoop
- Winget

Linux
- Alpine - `apk`
- Arch - `pacman` or `yay` / `paru`
- Fedora - `dnf`
- Debian - `apt` or `apt-get`
- NixOS - `nix`

In addition, we would provide Docker images with Cortex installed in different OS environments, for example:

- `menloltd/cortex-ubuntu:latest`
- `menloltd/cortex-ubuntu-nogpu:latest`
- `menloltd/cortex-arch:latest`
- `menloltd/cortex-fedora:latest`
- `menloltd/cortex-alpine:latest`
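None of these packages or images exist yet, but a sketch of the target developer experience might look like the following (the package names, image tags, and the `--host` flag all come from the proposals above and are not available today):

```sh
# Aspirational install paths (packages not yet published)
brew install cortex        # macOS
sudo apk add cortex        # Alpine
sudo dnf install cortex    # Fedora

# Aspirational container usage, reusing the proposed --host flag
docker run -p 39281:39281 menloltd/cortex-ubuntu:latest \
  cortex start --host 0.0.0.0 --port 39281
```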
### Flexibility via the Python Engine

The Python engine will provide flexibility in different ways, the main two being development velocity and library ecosystem. In addition, the Python engine would provide **the ability to serve models of different modalities** like image, audio, video, robotics actions, and so on, the **ability to serve unquantized models** if the user desires it, and the **ability to offer additional services** on top of it. For example:

**Custom Tools**: Users might want to create tools that interact with their deployed models. These might include bespoke benchmarking tools, metrics, and so on.

**Fine-tuning**:
- On-device fine-tuning could happen as follows:
  - The user sends a copy of the training file and runs the fine-tuning step on the device.
  - If a Cortex server sits in a central Menlo or Raspberry Pi providing intelligence to other devices where the data is being collected, copies of the model could be sent to those devices for fine-tuning and the updated weights sent back to the main device for integration.
- Via their own cloud - Menlo Cloud(?)

### Data Management

Single-node applications might not have access to the internet for a while, or might work on networks with limited bandwidth. This means that providing a way to save interactions with the model, or to save logs and other kinds of metadata, would enable different use cases that users would appreciate and that we can add functionality on top of, for example, the fine-tuning option previously mentioned.

**Data Saved to the Database**

The `cortexcpp` directory we create when Cortex is installed on a system contains a `cortex.db` SQLite database. This database could be leveraged to store interactions with different models and provide capabilities such as:

- **long-term memory** - similar to how systems like mem0 and MemGPT manage memory for LLM applications.
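As a purely illustrative sketch of what that could look like (the table name and columns below are assumptions, not the current schema of `cortex.db`):

```sh
# Illustrative only: one possible table for persisting chat interactions
# inside the existing cortex.db (schema is an assumption, not current behavior)
sqlite3 ~/cortexcpp/cortex.db <<'SQL'
CREATE TABLE IF NOT EXISTS chat_messages (
  id         INTEGER PRIMARY KEY AUTOINCREMENT,
  model      TEXT NOT NULL,            -- e.g. "model-name:quantization"
  role       TEXT NOT NULL,            -- "system" | "user" | "assistant"
  content    TEXT NOT NULL,
  created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
SQL
```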
**Logs**

We do collect a limited amount of logs from both the server and the CLI, but there is a lot of room for improvement. We could provide different views into the logs via the CLI, for example, a simple TUI that pops up with `cortex view logs` and lets you scan the different information available in them. Something similar to the [tui-logger crate](https://github.com/gin66/tui-logger) below:

![](https://github.com/gin66/tui-logger/blob/master/doc/demo_v0.14.4.gif?raw=true)

**Metadata**

Information coming from the model providers, highly customized models, metrics, benchmarks, and more can all be considered metadata. Organizing these in easily accessible tables inside `cortex.db` would add a delightful touch many tools lack.

### Performance

- Ability to run models on NVIDIA and non-NVIDIA GPUs.
- Ability to run efficiently on CPU-only architectures.

### Benchmarks & Metrics

Being able to provide benchmark data on a per-model and per-hardware basis would serve the following purposes:

- It will provide developers with useful information regarding their model and hardware of interest.
- It will help populate the Model Hub's **BenchCards**, which are a sister to the Model Cards provided on the HuggingFace Hub.

At the moment, `robobench` provides a suite of benchmarks for Cortex across seven areas:

1. Model Initialization

   Tracks the model's startup performance:
   - Disk to RAM loading time :heavy_check_mark:
   - Cold vs warm start times :heavy_check_mark:
   - Model switching overhead :heavy_check_mark:
   - Memory spike during initialization :heavy_check_mark:
   - Multi-GPU loading efficiency (when available) :x: (not thoroughly tested)
   - Initial memory footprint :heavy_check_mark:

2. Runtime Performance

   Measures inference capabilities:
   - Time to first token (latency) :heavy_check_mark:
   - Tokens per second (throughput) :heavy_check_mark:
   - Token generation consistency :x: (not thoroughly tested)
   - Streaming performance :x: (not thoroughly tested)
   - Response quality vs speed tradeoffs :x: (not thoroughly tested)
   - Context window utilization :heavy_check_mark:
   - KV cache efficiency (somewhat useful)
   - Memory usage per token :heavy_check_mark:
   - Batch processing efficiency :x: (not thoroughly tested)

3. Resource Utilization

   Monitors system resource usage:
   - Memory management patterns
     - Peak usage
     - Growth patterns
     - Cache efficiency
     - Fragmentation
   - Hardware utilization
     - CPU core scaling
     - GPU memory bandwidth
     - PCIe bandwidth
     - Temperature impacts
     - Power consumption

4. Advanced Processing

   Evaluates complex scenarios:
   - Multi-model GPU sharing :pray: ideal
   - Layer allocation efficiency :pray: ideal
   - Inter-model interference :pray: ideal (pipeline setting with more than one model loaded)
   - Memory sharing effectiveness
   - Multi-user performance
   - Request queuing behavior
   - Resource contention handling

5. Workload Performance

   Tests different scenarios:
   - Short vs long prompt handling
   - Code generation performance
   - Mathematical computation speed
   - Multi-language capabilities
   - System prompt impact
   - Mixed workload handling :pray: ideal (very cool with multiple models of different modalities loaded)
   - Session management :pray: ideal
   - Error recovery :pray: ideal (similar to the recovery behavior in area 7 but with a different level of detail)

6. System Integration

   Measures API and system performance:
   - API latency :heavy_check_mark:
   - Bandwidth utilization :pray: ideal
   - Connection management :pray: ideal
   - WebSocket performance :pray: ideal
   - Request queue behavior :pray: ideal
   - Inter-process communication :pray: ideal
   - Monitoring overhead :thinking_face: maybe

7. Reliability and Stability

   Tracks long-term performance:
   - Performance degradation patterns :pray: ideal
   - Memory leak detection :pray: ideal
   - Error rates and types :pray: ideal
   - Recovery behavior
   - Thermal throttling impact :pray: ideal
   - System stability under load :pray: ideal

**Usage and Output**

Robobench provides these metrics via a simple CLI:

```bash
# Basic benchmark
robobench "model-name:quantization"

# Specific benchmark type
robobench "model-name:quantization" --type runtime

# Extended stability test
robobench "model-name:quantization" --type stability --duration 24
```

Results are displayed in clear, formatted tables and can be exported to JSON as well. The ideal pipeline would be that when these benchmarks are run, either through CI or a separate workflow, the results are fed directly into the new Cortex Model Hub, keeping its BenchCards up to date. What we are currently not including is information regarding open benchmarks like MMLU, SWE, MATH, and so on.
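A rough sketch of the benchmark-to-Model-Hub step mentioned above; the JSON export flag and the ingest endpoint are hypothetical, since neither is specified yet:

```sh
# Hypothetical CI step: run a benchmark and push the results to the Model Hub
# (--output and the ingest URL are assumptions, not existing features)
robobench "model-name:quantization" --type runtime --output results.json

curl -X POST "https://hub.example.com/api/benchcards" \
  -H "Content-Type: application/json" \
  -d @results.json
```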
### Model Management

At the moment, model management is fairly basic. Planned additions include:

- Hardware-based suggestion of models upon installation
- Model merging capabilities

## 3. Content

### Documentation

Cortex will continue to grow and include features that will not be available in previous versions, or that might be removed at a later time. Because of this, we want to improve our documentation strategy and include:

**Different versions:** This means keeping documentation for up to 3 to 5 versions back, to allow developers to consult the docs that match the version they run. If Cortex will be used on edge devices, or in situations where the internet is flaky or not available, we should assume developers and companies using Cortex won't be able to live on the bleeding edge of our software. It is therefore important that we give them the appropriate documentation for their version as things change. Some examples:

The Zig Programming Language
![Pasted image 20250303113118](https://hackmd.io/_uploads/Bk6Fb6Xjyl.png)

Pydantic
![Pasted image 20250303114719](https://hackmd.io/_uploads/rJejW67iyl.png)

**Feedback Widgets**

As our software matures, it would be useful to passively get feedback on how things are going for our users. A nice way of doing this is via widgets at the bottom of the pages in our docs. OrbitCSS does this quite nicely.

![Pasted image 20250303115043](https://hackmd.io/_uploads/rkB2Z6Xokg.png)

**Chatbot or Search & Ask functionality similar to Claire**

The chatbot piece might be overkill, but a nice search and chat bar could be quite useful. I find the way Drizzle separates the two quite nice.

![Pasted image 20250303115756](https://hackmd.io/_uploads/Hy-RZ67ikl.png)

Their search bar is powered by Algolia, but the chatbot is a separate widget powered by [inkeep](https://inkeep.com).

![Pasted image 20250303115920](https://hackmd.io/_uploads/rkiAW67oke.png)

### Guides

For developers and companies to adopt Cortex, we need to show them what they can do with it. That means creating examples running models via Cortex on Raspberry Pis, Orange Pis, Arduinos, and others, with practical and/or cool use cases.

**Examples**

![Pasted image 20250303122717](https://hackmd.io/_uploads/Sybxfpmikg.png)

- Smart Camera --> adjust lenses, increase accuracy, adjust for movement, detect depth, etc., all via a model
- Mobile phones --> fine-tune a model on-device to better match the user's behavior when using their phone.
- Personal laptop --> offline use cases.
- Grocery cart --> small device to detect items in the cart.
- Support animal --> device that detects the environment and helps the dog take better care of the person.

### Videos

Video tutorials are key for showing developers how to use our software, troubleshooting different situations that might arise, and increasing engagement. The lineup for videos this year includes at least one a month on our YouTube channel, covering the following topics:

- Introduction to Cortex
- Use Cases on Top of Cortex
  - Structured Outputs
  - Guardrails
  - Tool-calling
  - MCP
  - ...
- How-to
  - Deploy on a Raspberry Pi
  - Deploy on an Orange Pi
  - Deploy on an Arduino
  - ...
- Practical & Fun Examples
  - Smart Home use cases
  - Airplane coding
  - ...

### Conferences

Conferences represent the perfect environment to connect with developers and potential users of our software. In addition, they let you see in real time what is working well and what isn't, and take notes to iterate faster. Some conferences we might want to attend:

- OSS
  - Open Source Summit Japan
- Python
  - PyCon
  - PyData
  - SciPy
  - [EuroSciPy](https://euroscipy.org/2025/)
  - EuroPython
- C++
  - C++ Now
  - CppCon
  - Cpp North
- AI
  - AI_dev
- Science
  - ODSC
  - Sci

Meetup events can often be mini-conferences in and of themselves, and while these are more location-dependent, they are a good opportunity to get the wider Menlo team involved in community-related activities.

## 4. Next Stage

The next stage of Cortex involves thinking about the future and **making it a sustainable product**. This means monetizing the value it adds to teams at companies and corporations while keeping the core of it available in OSS form to users. We want to provide a batteries-included tool with everything a team would otherwise want to build around Cortex themselves if **the NeoCortex Platform** didn't exist. Here are some ideas.

### Cortex Platform

> **NeoCortex** or simply
> **Cortex Platform**

![Pasted image 20250302124612](https://hackmd.io/_uploads/SJYEGpQj1l.png)

The Cortex Platform would be the management hub for the deployment of Cortex onto multiple devices. It would provide users with a way to oversee, test, and manage their deployed instances of Cortex on devices such as the Menlo Pi, Raspberry Pi, Orange Pi, and more, using an intuitive and extensible user interface.

The following two sections describe the dish we offer our patrons, what we do in the kitchen to make it, and how it should look when it reaches our patrons' plates.

#### Functional Requirements

Control instances of Cortex on different platforms.

![Pasted image 20250302130851](https://hackmd.io/_uploads/rkjIzTmske.png)

Connections could be managed via SSH and secret access keys. Menlo Pis, for example, would provide a quick and straightforward experience for connecting to the Cortex Platform. Other platforms with an OS could use hardware-specific Docker images, the OS package manager, or another method to provide access to Cortex on their bespoke hardware.

The platform would allow users to load and unload, test, quantize, and benchmark models. In addition, it would provide data management and sync capabilities:

- Benchmark hardware
- Track metrics
- Troubleshoot via SSH
- Fine-tuning
  - on-device
  - on-prem
  - cloud
  - further customized setups using the Cortex Platform Sync Layer
- Hardware-model visualization
- Model merging
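To make the SSH-based connection idea concrete, the platform could start out as little more than the existing CLI driven over SSH. A minimal sketch, assuming hypothetical device hostnames and key paths (and that subcommands like `cortex ps` remain available):

```sh
# Illustrative only: what the platform might run against each registered device
# (host names, key path, and subcommands are assumptions)
for device in menlo-pi-01 raspberry-pi-kitchen orange-pi-lab; do
  ssh -i ~/.ssh/cortex_platform_key "cortex@${device}.local" \
      "cortex --version && cortex ps"
done
```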
#### Non-Functional Requirements

- Note-taking capabilities
- Collaborate with team members(?)
  - Comment on a deployment
  - Start a thread
- Notifications via Slack or Discord regarding a deployment

#### Design

![Pasted image 20250303125113](https://hackmd.io/_uploads/S1rYG6QjJe.png)
![Pasted image 20250303182451](https://hackmd.io/_uploads/BkE9G6mi1g.png)
![Pasted image 20250303125451](https://hackmd.io/_uploads/H1soMTQsyl.png)
![image](https://hackmd.io/_uploads/Hk5MRlNjJx.png)
![Pasted image 20250303200441](https://hackmd.io/_uploads/S1-6zpXs1e.png)

#### Monetization :moneybag:

Revenue will flow in through different tiers of NeoCortex, but there are avenues left unexplored in this document that could prove quite lucrative, for example, bespoke contracts to set up Menlo Pis, Cortex, and NeoCortex, or engagements with institutions.

**Free Tier**

NeoCortex will be free to download, but with limited features from the get-go. As developers or teams move into different tiers, they would be able to access more and more functionality, individually or for their team.

**Indie Developer Tier**

The Indie Developer tier will be a step up from the free tier and include model-hardware visualization, a nicer logs view, and potentially something else. This could cost (in USD):

- $20/month
- $200/year (save $40)

**Teams**

This would include everything in the Indie Developer tier plus the ability to fine-tune on-device, collaboration features like comments, reports, sharing dashboards, and inviting viewers, and a sync layer between deployed instances of Cortex and their desired DB. This could cost (in USD):

- Flat monthly fee of $100
- $20/user

**Enterprise**

Everything in Teams plus Service Level Agreements, early access to new features, and more. This could cost (in USD):

- Flat monthly fee of $1000
- $20/user

**Bespoke Engagement**

These could include:

- Hardware-software setup
- Model merging
- Model fine-tuning
- Consulting on how to extract the most out of Cortex

### Model Hub

We want to provide users with a good overview of how models work on different hardware. To do this, we will revamp the Model Hub and add our own flavor of Model Cards called BenchCards. For starters, the hub will provide high-level details on each model via a quick drop-down as follows:

![Pasted image 20250303201120](https://hackmd.io/_uploads/BycyQaXo1g.png)
![Pasted image 20250303201137](https://hackmd.io/_uploads/r15JXaQi1l.png)

The model card would look somewhat like this:

![Pasted image 20250303202551](https://hackmd.io/_uploads/BkTeQa7oJe.png)

### Hardware Integration

## 5. QA

### Tests

The current test suite does not l

### CI/CD

### Hardware Assurance
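As a starting point for the Tests and CI/CD items above, an end-to-end smoke test could be as small as the sketch below. It only reuses CLI commands already shown in this document; the `/v1/models` route (implied by the OpenAI-compatible API), the backgrounding behavior of `cortex start`, and the timeout are assumptions:

```sh
#!/usr/bin/env bash
# Minimal smoke-test sketch: start the server, wait for the API, shut it down.
set -euo pipefail

cortex start --port 39281 &
trap 'cortex stop' EXIT

# Give the server up to ~30 seconds to answer before failing the CI job
for _ in $(seq 1 30); do
  if curl -sf http://127.0.0.1:39281/v1/models > /dev/null; then
    echo "smoke test passed"
    exit 0
  fi
  sleep 1
done

echo "smoke test failed: API server did not come up" >&2
exit 1
```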
## 6. Roadmap & Action Items

**Individual-level focus**

Ramon
- Create content
- `robobench`
  - which will feed information to the new Model Hub
- Model Hub
  - Polish Model-Hardware Visualization
- NeoCortex
  - Design
  - Initial prototype in Tauri

Thien
- Python Engine

Harry
- Improve the testing suite and CI/CD pipelines of Cortex (if you don't have a lot of experience with C++, you can create tests in your favorite language and convert them to C++ [with this tool](https://codingfleet.com/code-converter/python/))

Sang
- Improve the configuration capabilities of Cortex

Akarsham
- Enable Cortex to run models on different GPUs
- Make Cortex go Brrr on CPU

Minh
- Improve the Cortex distribution mechanism, i.e. make it installable via the package managers mentioned in the sections above.

### Milestones

```mermaid
gantt
    title Milestones
    dateFormat YYYY-MM-DD
    axisFormat %d-%m
    excludes weekends

    section M1-Flexibility
    Python Engine             :a1, 2025-03-04, 30d
    Intel GPU                 :a2, after a1, 20d
    AMD GPU                   :a3, after a2, 20d
    Other (Web?)GPU           :after a3, 20d

    section M2-Distribution
    Arch - AUR                :2025-03-10, 2d
    Debian - APT              :2025-03-12, 2d
    Fedora - dnf              :2025-03-17, 2d
    Alpine - apk              :2025-03-19, 2d
    NixOS/MacOS - Nix         :2025-03-24, 2d
    Win - chocolatey          :2025-03-26, 2d
    Win - Scoop               :2025-03-31, 2d
    Mac - Homebrew            :2025-04-02, 2d
    Docker Img Variants       :2025-04-04, 21d
    Tutorials                 :2025-03-05, 14d
    Guides                    :2025-03-14, 21d
    New Docs Site             :2025-03-26, 21d
    New Model Hub             :after b1, 21d

    section M3-Performance
    Benchmarks                :b1, 2025-03-12, 10d
    Metrics                   :b2, 2025-03-12, 24d
    Robust Testing            :after b2, 24d

    section M4-Monetization
    Cortex Platform Prototype :2025-04-15, 60d
```

#### Flexibility

- Run models in
  - different formats
  - quantized and non-quantized
  - different quantization methods
- Run on different platforms
- Run on different GPUs
- Provide a straightforward memory layer (to start) via SQLite or a similar DB
- Provide SDKs that go beyond talking to a model via the OpenAI SDK

#### Distribution

- Make Cortex accessible via package managers
- Make distinct Docker images
- Create better documentation with
  - guides on different hardware
  - tutorials on how to do X with Cortex

#### Performance

- Make it fast
- Make it small
- Enable metrics
- Give users peace of mind with increased testing

#### Monetization

- Launch alpha version of the Cortex Platform

### v2

- NVIDIA alternative
- Wrap all ideas into Cortex Enterprise? Plus, what are the Unix-like libraries that can come together into the main solution?
- Cleaning up the GitHub project
- How do we organize the team to tackle the whole of Cortex
- Cortex positioning for now: firmware on top of hardware

![image](https://hackmd.io/_uploads/SkC9SNVokl.png)