# New Cortex Strategy

> If a device obtains the ability to locally think, perceive, and communicate using Cortex, we call that a robot.

The problem Cortex is solving: deploying AI locally is fragmented, forcing developers to juggle different tools for personal projects, on-device applications (whether a robot or an app), and scalable cluster deployments. Cortex bridges these worlds. It eliminates the friction by providing a single, elegant toolkit designed for developers, whether they're tinkering on a laptop, building intelligent robots, or deploying AI workloads in the cloud. Cortex simplifies local AI execution and management across all scales, enabling powerful, offline-capable intelligence wherever it's needed.

## Table of Contents

1. Goal & Vision
2. Ideal Users
3. Cortex
   1. Architecture
      1. Current
      2. Future
   2. User/Dev Experience
      1. Current
      2. Desired
   3. Competitors
   4. Strategy Moving Forward
4. Monetization
5. Action Plan

## 1. Goal

Cortex's goal and mission is to bring intelligence to metal without relying on an internet connection or a cloud provider. This includes robots :robot_face:, edge devices, all Pis :pie:, personal computers, and more.

## 2. Ideal Users

![image](https://hackmd.io/_uploads/S1_Lda3AJe.png)

- Developers
  - Indie hackers
  - Tinkerers
  - Personal use
  - ...
- Teams looking to
  - Control individual robots with LLMs
  - Deploy intelligence onto devices
  - Create desktop apps where inference happens locally
  - ...
- Cluster
  - Teams deploying Cortex within scalable cloud environments or on-premise servers. This involves packaging Cortex and specific models (e.g., within Docker containers) for reliable, high-throughput inference serving multiple users or applications, potentially managed via orchestration tools like Kubernetes.

## 3. Cortex

The current version of Cortex is written in C++ and provides three main components:

- A wrapper around different model-serving tools, which we call `engines`. As of early January 2025 these were `llama.cpp` (the main one), `onnx` (Windows only), `tensorrt-llm` (development appears to have stopped), and a bespoke Python engine meant to enable the use of any model that can be loaded via a Python library.
- A CLI with a Docker-like style for managing models, engines, and the server.
- A server with an OpenAI-compatible API that enables applications to talk to the models.

### 3.1 Architecture/Blueprint

#### 3.1.1 Current

![image](https://hackmd.io/_uploads/SkM-244yge.png)

#### 3.1.2 Future

**From Thien's document (the rest is intertwined with other parts of this document).**

The future architecture of Cortex will be modular, clearly separating the Server component (handling API requests, user interactions, orchestration) from the Model Inference component (responsible for loading, running, and unloading specific models). This separation allows independent optimization and evolution of each part. Initially, both the Server and Inference components will be implemented in Python, communicating via IPC (e.g., Unix sockets or similar) to maintain separation.
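To make the split concrete, here is a minimal sketch of the two components talking over a Unix socket with a JSON-lines protocol. Everything in it (socket path, message shape, the echo "model") is illustrative only, not a committed interface.

```python=
# Minimal sketch of the Server <-> Model Inference split over a Unix socket.
# Socket path, message format, and the echo "model" are placeholders.
import json
import multiprocessing
import os
import socket
import time

SOCKET_PATH = "/tmp/cortex-inference.sock"  # hypothetical location


def inference_worker() -> None:
    """Model Inference component: would load a model and answer generate requests."""
    if os.path.exists(SOCKET_PATH):
        os.remove(SOCKET_PATH)
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as srv:
        srv.bind(SOCKET_PATH)
        srv.listen(1)
        conn, _ = srv.accept()
        with conn, conn.makefile("rwb") as stream:
            for line in stream:
                request = json.loads(line)
                # A real worker would call llama.cpp/vLLM here; we just echo.
                reply = {"id": request["id"], "text": f"echo: {request['prompt']}"}
                stream.write(json.dumps(reply).encode() + b"\n")
                stream.flush()


def server_send(prompt: str) -> str:
    """Server component: forwards an API request to the inference process."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as cli:
        cli.connect(SOCKET_PATH)
        with cli.makefile("rwb") as stream:
            stream.write(json.dumps({"id": 1, "prompt": prompt}).encode() + b"\n")
            stream.flush()
            return json.loads(stream.readline())["text"]


if __name__ == "__main__":
    worker = multiprocessing.Process(target=inference_worker, daemon=True)
    worker.start()
    time.sleep(0.2)  # crude wait for the socket to appear
    print(server_send("hello from the server process"))
```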
### 3.2 User Experience

#### 3.2.1 Current

Cortex offers a promising developer experience with its Docker-inspired CLI for model management and an OpenAI-compatible API that simplifies integration. However, engine setup can be manual, configuration options via the CLI are limited, and discovering optimal models and hardware settings requires external research. Documentation, while functional, needs more guides and tutorials.

#### 3.2.2 Desired

The goal is a best-in-class, seamless experience. This means effortless installation via standard package managers, an intuitive and comprehensive CLI covering the full lifecycle (run, quantize, evaluate, build, push), robust background server operation (including systemd integration), automatic engine management, rich documentation with practical guides, and potentially a TUI for interactive management and monitoring. The entire process, from discovering a model to running it locally, should feel cohesive and efficient, inspired by tools like uv or Docker.

### 3.3 Competition

Criteria for selecting competitors

- Ollama
  - The good
    - Incredible developer experience
    - Solid model hub
  - The bad
    - Not scalable (with over 20 users it crashes)
    - No packaging solution for a model or an image with a model
- Docker Model Runner
  - The good
    - Large user base
    - Established product
    - Less friction
    - Docker-style plug and play; build an image ready to be deployed
  - The bad
    - Very new
- ZML
  - The good
    - Distribution
    - Packages into ready-to-go Docker images
  - The bad
    - They are writing absolutely eeeeeeverything from scratch
- Nexa AI
  - The good
    - Bespoke, optimized models to run locally
    - Different modalities optimized by them (this is similar to what Thien mentioned)
    - They call their tool an SDK rather than "a tool"
    - The way they showcase their capabilities on their website is quite good
  - The bad
    - The installation process is not straightforward and quite difficult, IMO. Although it is a Python tool, a straight `pip install` won't work
    - Their model hub is not up to date with the latest models
    - The SDK is quite bloated
- llamafile
  - Incredible distribution mechanism
- Cog + Replicate
  - The most interesting thing about Cog is how it abstracts an environment directly through Docker while allowing users to run arbitrary Python code from inside a container
  - The good
    - The parent company owns a popular model hub
    - Allows you to test models before downloading them
- llama.cpp
- Nvidia NIM
- Truss

While other tools focus on specific aspects (like Ollama's ease of use, llamafile's distribution, or Nexa's optimization), Cortex aims to be the unified toolkit for the entire local AI lifecycle on edge devices, with a clear path towards robotics and the cloud. It uniquely combines these individual strengths (ease of use, distribution, and optimization) in a single toolkit.

### 3.4 Strategy

![image](https://hackmd.io/_uploads/SkMSmY4keg.png)

:warning: **Please note that we don't need to keep the name Cortex, but I do like it** :relieved:

> Cortex's core strategy is to deliver the most productive and streamlined developer experience for running generative AI locally, particularly on edge devices and as the intelligent engine for robots.

Essentially, Cortex acts as a sophisticated integration layer, not another bespoke kernel in the llama farm. It wraps the complexities of backend engines (vLLM, SGLang, llama.cpp, ONNX, etc.) behind a unified CLI and API, borrowing ideas from tools like `uv` and `docker`. The initial focus is on top-tier DX, model management, packaging (`docker` build, llamafile build), benchmarking, and multi-modal support. This lets developers use performant backends without the integration headache and move from local development to deployment smoothly.

The initial build will use Python for both the Server and Model Inference components, as this leads to faster development and easy access to the Python ecosystem, and it is simpler to call C/C++ code by gluing it with Python than it is to bring Python into C++. Unless we develop our own tools to read and run models, the gains from C++ don't outweigh the speed-to-market wins of Python.
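What the "integration layer" could look like in code: a single engine interface that thin adapters for llama.cpp, vLLM, and others implement. This is a rough sketch; the class and method names (`Engine`, `load`, `generate`) are placeholders, not a committed API, and the only real library call assumed is `llama-cpp-python`.

```python=
# Sketch of a unified engine interface; names are placeholders, not a final API.
from typing import Protocol


class Engine(Protocol):
    """Anything Cortex can route a prompt to: llama.cpp, vLLM, a remote API, ..."""

    def load(self, model_ref: str) -> None: ...
    def generate(self, prompt: str, **params) -> str: ...
    def unload(self) -> None: ...


class LlamaCppEngine:
    """Adapter over llama-cpp-python (the assumed default backend)."""

    def __init__(self) -> None:
        self._llm = None

    def load(self, model_ref: str) -> None:
        from llama_cpp import Llama  # imported lazily so the extra stays optional

        self._llm = Llama(model_path=model_ref)

    def generate(self, prompt: str, **params) -> str:
        out = self._llm(prompt, max_tokens=params.get("max_tokens", 256))
        return out["choices"][0]["text"]

    def unload(self) -> None:
        self._llm = None


def run(engine: Engine, model_ref: str, prompt: str) -> str:
    """What `cortex run` could reduce to, regardless of the backend chosen."""
    engine.load(model_ref)
    try:
        return engine.generate(prompt)
    finally:
        engine.unload()
```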
For edge devices (ARM SBCs like Menlo Pi) where Python might choke, the plan is iterative porting. We start with Python on PCs and Pis. When components lag or fail on ARM (likely C extensions or pure PyTorch code), we rewrite those specific parts in native code (C++, Rust) with Python bindings. If the native code is faster on PCs too, it stays. This balances dev speed against eventual edge performance. This applies to pre/post-processing too: standard stuff gets native libraries; custom Python logic gets rewritten if it's slow or incompatible on edge.

Cortex targets standard operating systems. Bare-metal or OS-less devices aren't on the roadmap unless a compelling reason emerges.

**Modes of Use**

Cortex will be usable via the **CLI** and **server** in a similar fashion to how it is done today, but with the suite of enhancements highlighted below. In addition, Cortex will enable users to package one of its backends alongside the weights (these can be loaded later too) and deploy it to a scalable cluster as a container, or to many devices.

**Differentiation**

- Packaging models in ready-to-run containers or executables for different hardware and purposes
- Incredible developer experience
- Multi-model loading/usage
  - Can we load more than one local model at a time?
  - Load a local model and set up a remote one to compare responses
  - How should we handle them?
  - Should we implement smart loading and offloading of models?
- Create a model from a regular file
- Enable remote model providers
- Provide a seamless way to collect evals (which will serve as training data for fine-tuning models for robot usage) :robot_face:

**The rest of this strategy section highlights the new Cortex from the developer's perspective.** :wave:

#### Jump to a Specific Section

1. [Installation](#installation)

#### Installation

The minimal installation of Cortex, which uses `llama.cpp` as its backend, will be a straightforward `uvx` or `pipx` away.

```shell=
pipx install cortex-ai

# or
uv tool install cortex-ai

# or run it without installing
uvx cortex-ai
```

To install Cortex with a specific backend, or all of them, we can use:

```shell=
uv tool install "cortex-ai[vllm]" # or sglang

# to install all of them
uv tool install "cortex-ai[all]"
```

#### Backends/Engines

Moving to Python gives us the ability to use best-in-class tools in a seamless fashion to serve models. This means that Cortex will now have the following backends.

- Best for local inference and ease of use
  - `llama.cpp`, ideally via `llama-cpp-python`, as it bundles llama.cpp and installs it on the user's device. **This will be the default engine.**
  - `litellm`, or any SDK that enables us to easily connect to a SOTA cloud-based model
  - `mlx`, great for Apple computers
- Best for scaling in cloud environments
  - `vLLM`
  - `SGLang`
- Edge/mobile devices
  - `executorch`
  - [`onnx` (maybe?)](https://github.com/microsoft/onnxruntime-genai)

If an additional backend has not been downloaded, Cortex could provide functionality similar to that of the current `cortex engines` command (a sketch of this check follows at the end of this section).

```shell=
cortex engines install vllm sglang # and so on

# it could also be
cortex backend install vllm sglang
```

I evaluated [mlc-llm](https://github.com/mlc-ai/mlc-llm) as it seems like a solid option for compiling models for different platforms, but it feels more like a full-blown solution than something we can wrap around. I need to study the framework a bit more, as I might be wrong.
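A minimal sketch of the missing-backend fallback mentioned above, using only the standard library. The mapping of backend names to pip packages is an assumption about how the extras would be laid out.

```python=
# Sketch of the fallback: detect a missing backend and offer to install it.
# The package/import names are assumptions about how the extras would be packaged.
import importlib.util
import subprocess
import sys

BACKENDS = {  # backend name -> (pip package, importable module)
    "llama.cpp": ("llama-cpp-python", "llama_cpp"),
    "vllm": ("vllm", "vllm"),
    "sglang": ("sglang", "sglang"),
    "mlx": ("mlx-lm", "mlx_lm"),
}


def ensure_backend(name: str) -> None:
    """If the backend's package is missing, prompt the user and pip-install it."""
    package, module = BACKENDS[name]
    if importlib.util.find_spec(module) is not None:
        return  # already available
    answer = input(f"You don't have {name} available, download it now (Y/n)? ").strip()
    if answer.lower() in ("", "y", "yes"):
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])


if __name__ == "__main__":
    ensure_backend("vllm")
```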
#### Model Management/Lifecycle

Docker/Ollama-style model management with slight adjustments.

**Pull**

```shell=
# GGUF from HF
cortex pull hf.co/model-author/model:3b

# from blob storage
cortex pull s3://a-bucket/model:3b

# from personal storage
cortex pull https://somewhere.com/model:3b

# in any format
cortex pull hf.co/model-author/model.safetensors
```

**Run**

```shell=
# run continues to pull if not available locally
# run creates a chat
cortex run model-author/model:3b

# detached mode starts the llama.cpp llama-server
cortex run -d model-author/model:3b

# models can run on a selected backend
cortex run --backend vllm -d model-author/model:3b
```

If the backend is unavailable:

```
You don't have vllm available, would you like to download it now (Y/n)? Y

Downloading vllm
3% #######........................................................ 100%
```

**Load and Hold**

Sometimes developers might want to test a model for a day or so to get a feel for it. Instead of leaving a chat open in a terminal for that long, we could allow them to load the model into the CLI rather than the server, like:

```shell=
cortex load model-author/model:3b
```

This would enable them to call the model directly via Cortex, like:

```shell=
cortex "Tell me a dad joke please."
```

Conversations will get appended to the same thread unless the user specifies a new thread, for example:

```shell=
cortex "Tell me a story about a robot. Be sarcastic!" --new
```

In the case above, all new interactions with the model will go to the latest thread. Each thread has a hash attached to it, and these can be viewed with `list`, like:

```shell=
cortex list threads
```

```
┃ > 681bdb8 | Tell me a dad joke please.                    | 2 days ago | model:3b
┃   4381658 | Tell me a story about a robot...              | 2 min ago  | model:3b
┃   f7f65c4 | what's the difference between a Mbit vs a MB? | 7 days ago | claude-3-7-sonnet-20250219
...
```

Models could have an alias:

```shell=
cortex load -m model-author/model:3b --alias coollama

# and be called with the alias
coollama "Tell me a dad joke please."
```

There could be more than one model loaded at once:

```shell=
cortex load \
    -m model-author/model:3b --alias smollama \
    -m model-author/model:14b --alias bigllama

# prompts could be sent to both by default
cortex "What do you think of the movie Batman?"

# or sent to one via its alias
bigllama "Tell me a dad joke please."

# or addressed individually
cortex "Tell me a dad joke please." --smollama --bigllama
```

The loading above will set the groundwork for the evaluations module.

**Model Storage**

The way we store models should follow the same format as other models in the HF hub. This will enable the team to stop worrying about whether a model lands in the `cortex.so` directory or the huggingface one; after all, they will all come from the same place. Instead of:

```
├── models
│   ├── cortex.so
│   │   ├── deepseek-r1-distill-qwen-1.5b
│   │   │   └── 1.5b
│   │   │       ├── metadata.yml
│   │   │       ├── model.gguf
│   │   │       └── model.yml
│   │   ├── gemma3
│   │   │   ├── 1b
│   │   │   │   ├── metadata.yml
│   │   │   │   ├── model.gguf
│   │   │   │   └── model.yml
│   │   │   └── 4b
│   │   │       ├── metadata.yml
│   │   │       ├── model.gguf
│   │   │       └── model.yml
│   │   └── tinyllama
│   │       └── 1b
│   └── huggingface.co
│       └── unsloth
│           └── Phi-4-mini-instruct-GGUF
│               ├── Phi-4-mini-instruct.Q8_0.gguf
│               └── Phi-4-mini-instruct.Q8_0.yml
```

We go to:

```
├── models
│   ├── huggingface.co
│   │   ├── deepseek-r1-distill-qwen-1.5b
│   │   │   └── 1.5b
│   │   │       └── ...
│   │   ├── gemma3
│   │   │   ├── 1b
│   │   │   │   └── ...
│   │   │   └── 4b
│   │   │       └── ...
│   │   └── unsloth
│   │       └── Phi-4-mini-instruct-GGUF
│   │           ├── ...gguf
│   │           └── ....Q8_0.yml
│   └── TheirOwnHub
│       └── ...
```
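However models end up stored or loaded, applications would consume them the same way: through the OpenAI-compatible server started by `cortex run -d`. A minimal consumption sketch, assuming the `openai` Python client; the local port and model name are placeholders, not the final defaults.

```python=
# Talking to a model started with `cortex run -d ...` through the
# OpenAI-compatible server. The port and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:39281/v1",  # assumed local Cortex endpoint
    api_key="not-needed-locally",
)

response = client.chat.completions.create(
    model="model-author/model:3b",
    messages=[{"role": "user", "content": "Tell me a dad joke please."}],
)
print(response.choices[0].message.content)
```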
#### Memory

For meaningful interactions and context-aware behavior (especially in robotics and ongoing conversations), Cortex needs robust state management and memory capabilities.

- The Cortex server needs to manage the immediate state of active models, including loaded parameters, KV caches for ongoing inference, and session information for connected clients. This is typically handled in memory during the server's runtime.
- Local persistence: Cortex should leverage its local database (or something else) and store the user's interactions by default (`~/.cortexcpp` or similar; a minimal sketch follows this section). This database can store:
  - Configuration: persisted model configurations (`model.yaml` overrides), engine settings, user preferences.
  - Interaction history: logs of prompts and responses for specific sessions or models, enabling basic recall.
  - Robot state (via the platform): key-value storage for persisting relevant robot state variables between operations or restarts, managed perhaps via the Robot Platform interacting with Cortex.
- To enable a more robust local testing experience (e.g., building local agents or manipulating robots from a laptop), we could add a memory layer for past interactions across extended periods. Cortex could integrate with long-term memory solutions or implement concepts from MemGPT or mem0, which would involve:
  - Storing interaction history (potentially summarized or embedded) in the local database or a vector store. An LLM would be able to override a memory if necessary. A DB with vector capabilities could be LanceDB, as it is file-based.
  - Implementing mechanisms to retrieve relevant past interactions based on the current context (e.g., semantic search over embeddings of past interactions).
  - Injecting retrieved memories back into the model's prompt context during inference.

Model interactions get recorded in `cortex.db` by default.

```shell=
# to disable this
cortex run llama3:7b --memory false

# to change the storage location
cortex run llama3:7b --memory s3://warehouse-1-bucket....

# embed conversations
cortex run llama3:7b --embed # goes to lancedb
```
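A minimal sketch of the local persistence described above, using the standard-library `sqlite3`. The schema, database path, and function names are made up for illustration; a real implementation would sit behind the server and could layer embeddings or LanceDB on top.

```python=
# Minimal interaction-history store; schema and paths are illustrative only.
import sqlite3
from pathlib import Path

DB_PATH = Path.home() / ".cortexcpp" / "cortex.db"  # assumed default location


def connect() -> sqlite3.Connection:
    DB_PATH.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(DB_PATH)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS interactions (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               thread TEXT,
               model TEXT,
               prompt TEXT,
               response TEXT,
               created_at TEXT DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn


def record(conn: sqlite3.Connection, thread: str, model: str, prompt: str, response: str) -> None:
    """Persist one turn of a conversation."""
    conn.execute(
        "INSERT INTO interactions (thread, model, prompt, response) VALUES (?, ?, ?, ?)",
        (thread, model, prompt, response),
    )
    conn.commit()


def recall(conn: sqlite3.Connection, thread: str, limit: int = 5) -> list[tuple[str, str]]:
    """Fetch the most recent turns of a thread to inject back into the prompt."""
    rows = conn.execute(
        "SELECT prompt, response FROM interactions "
        "WHERE thread = ? ORDER BY id DESC LIMIT ?",
        (thread, limit),
    ).fetchall()
    return list(reversed(rows))
```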
#### Evals

Evals are automatically stored in the SQLite DB inside the `cortexcpp` directory. If the user chooses to, the results can be stored elsewhere (e.g., an S3 bucket, a remote database, etc.).

<details>
<summary>Run Single Eval</summary>

```shell=
cortex evals run -m llama3:1B \
    -p "Move box to the left" \
    -i "Image of the area" \
    -a "action taken successfully" \
    --sql-table "sqlite://..." # save results here
```

</details>

<details>
<summary>Run a Batch Job of Evals</summary>

```shell=
# the batch file expects a specific format; --parallel is optional
cortex evals run -m llama3:1B \
    --batch "my-evals.json" \
    --parallel 4
```

</details>

<details>
<summary>Run Evals on Different Models</summary>

These evals can be run on the same model with different quantization methods, or on completely different models.

```shell=
# the batch file expects a specific format; --parallel is optional
cortex evals run -m llama3:1B -m llama3:8B \
    --batch "my-evals.json" \
    --parallel 4
```

</details>

#### Edge/ARM Strategy

For running the core ML inference efficiently on ARM SBCs (replacing pure PyTorch), Cortex will flexibly employ two main approaches depending on the model and performance needs (a sketch of the first follows this section):

- Automatic conversion: using frameworks like ONNX, ExecuTorch, or NCNN to capture PyTorch computation graphs and run them via optimized, platform-specific runtimes. This is suitable for simpler models or when rapid porting is needed, though fine-grained optimization can be challenging.
- Manual native implementation: rewriting model logic directly in native code (leveraging libraries like ggml, Candle, or custom C++/Rust code) for maximum control and optimization potential, similar to llama.cpp's approach. This requires more effort but allows deep optimization for performance-critical models.

Cortex will not be locked into one method; the best approach will be chosen per model or modality.
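To make the automatic-conversion path concrete, here is a sketch of exporting a tiny PyTorch module to ONNX and running it with `onnxruntime`. A toy module stands in for a real model; exporting an LLM involves considerably more (KV cache, tokenizer), so treat this only as the shape of the workflow.

```python=
# Sketch of the "Automatic Conversion" path: PyTorch -> ONNX -> onnxruntime.
# A toy module stands in for a real model; an LLM export is far more involved.
import onnxruntime as ort
import torch


class TinyNet(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.linear = torch.nn.Linear(16, 4)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.linear(x))


model = TinyNet().eval()
example = torch.randn(1, 16)

# 1) Capture the computation graph.
torch.onnx.export(model, example, "tinynet.onnx", input_names=["x"], output_names=["y"])

# 2) Run it with a platform-specific runtime (CPU provider here; ARM builds exist).
session = ort.InferenceSession("tinynet.onnx", providers=["CPUExecutionProvider"])
(output,) = session.run(["y"], {"x": example.numpy()})
print(output.shape)  # (1, 4)
```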
#### Embeddings

#### Batched Runs

- `cortex batch embeddings 'Hey There!'`
  - "move new boxes to the back" --> [1, 124, 23, 42, 3, 5, 23, 524]
  - "check inventory is correct"
  - ...
  - move-box
- `cortex batch inference list_of_prompts.csv`

#### Tool Usage

Tools can be defined as APIs with a clearly defined JSON schema (an example schema follows this section).

<details>
<summary>Interact with Tools Already Saved</summary>

```shell=
# single tool
cortex models run -m llama3:1B \
    --tools tool-1

# multiple tools, optionally detached
cortex models run -m llama3:1B \
    --tools "[tool-1, tool-2, tool-3, ...]" \
    --detached
```

</details>

<details>
<summary>Dynamic Tool Usage</summary>

```shell=
cortex models run -m llama3:1B \
    --dynamic --tools-table "sqlite://..."
```

A table of tools would look as follows:

| tool_name  | description    | schema          | vectors |
| ---------- | -------------- | --------------- | ------- |
| web_search | search the web | {k:v, k:v, ...} |         |

</details>
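One plausible shape for the `schema` column above is an OpenAI-style function definition, i.e., a JSON schema describing the tool's parameters. The exact storage format is still open; this is only an illustration.

```python=
# One plausible shape for the `schema` column: an OpenAI-style tool definition.
web_search_tool = {
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web and return the top results.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "What to search for."},
                "max_results": {"type": "integer", "minimum": 1, "default": 5},
            },
            "required": ["query"],
        },
    },
}

# Passed straight through the OpenAI-compatible API when a model supports tool calls:
# client.chat.completions.create(model=..., messages=..., tools=[web_search_tool])
```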
#### Quantization

<details>
<summary>Quantize a Single Model</summary>

```shell=
# the model can come from the HF hub (https://hf-hub.....)
cortex quantize -m llama3.safetensor \
    --method "Q4_K_M"
```

</details>

#### Optimization

Imagine a user wants to download a model slightly larger than what's recommended for their current hardware, but further optimizations could happen on-device after the user downloads the model. Different frameworks have different flavors of this approach, but none, in my opinion, does it directly with the specs of the user's hardware in mind (maybe I'm wrong).

One useful method we could combine with the user's hardware specs is quantization-aware fine-tuning:

> Quantization-aware fine-tuning (QAT) is a training technique where a model is fine-tuned with quantization in mind, resulting in a quantized model that retains high accuracy and efficiency.

Some frameworks doing a flavor of this are:

- `bitsandbytes` - [love the name](https://huggingface.co/docs/bitsandbytes/index)
- `torchao` - [very elegant implementation](https://github.com/pytorch/ao)
- `optimum` - [by the HF team](https://huggingface.co/docs/optimum/quicktour); works only with the HF `transformers` library
- `unsloth`
- `torch.fx` - enables fine-grained control of the computational graph, potentially allowing for meticulous optimization of a model

We could do the following for a user that wants sensible defaults picked for them:

```shell=
cortex hardware optimize -m model:14B -o optimized_model:v1
```

The result wouldn't just be a further-quantized model but rather one optimized for the hardware of the user or a robot.

Users who want a higher level of control could pass in as many flags as they want:

```shell=
cortex hardware optimize -m model:14B \
    -o optimized_model:v1 \
    --params '{"key": val, "key": val}' # or each key:val pair could be a flag
```

Building on top of a framework would mean these are treated as components (see the bottom of this document for more info), and they could be downloaded as:

```shell=
cortex install unsloth

# or
cortex component install bitsandbytes
```

#### CLI Tool Generator

An alias creator for models running in detached mode, for example:

```shell=
cortex models run llama4:8b -d --alias coollama

# then (everything after the prompt is optional)
coollama "Hey what kind of llama are you? A cool one?" \
    --ctx_size 4096 \
    --temp 0 \
    ...
```

Then users could check the available running models with:

```shell
cortex ps
```

```
| alias    | model     | status | created at  | last used   |
| -------- | --------- | ------ | ----------- | ----------- |
| coollama | llama4:8b | active | 13-Apr-2025 | 14-Apr-2025 |
```

#### Build

<details>
<summary>Build a Docker Image</summary>

This assumes the host has Docker running on the platform. As an example, MLServer wraps the Docker Python SDK to make this happen in a seamless fashion (a sketch follows this block).

```shell=
cortex build -t my-llama -m llama3:1B \
    --backend vllm # or sglang, llama.cpp, or another
```

</details>
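A rough sketch of what `cortex build` could do behind the scenes with the Docker Python SDK, in the spirit of MLServer. The Dockerfile contents, base image, tag handling, and how weights land in the build context are all placeholders.

```python=
# Sketch: `cortex build -t my-llama -m llama3:1B --backend vllm` via docker-py.
# Base image, Dockerfile contents, paths, and tags are placeholders.
import tempfile
from pathlib import Path

import docker

DOCKERFILE = """\
FROM vllm/vllm-openai:latest
COPY llama3-1b/ /models/llama3-1b/
ENV MODEL_PATH=/models/llama3-1b
"""


def build_image(tag: str, model_dir: Path) -> None:
    client = docker.from_env()  # assumes a running Docker daemon on the host
    with tempfile.TemporaryDirectory() as ctx:
        context = Path(ctx)
        (context / "Dockerfile").write_text(DOCKERFILE)
        # a real build would copy or hard-link the model weights from model_dir;
        # a placeholder keeps the COPY step valid for this sketch
        weights = context / "llama3-1b"
        weights.mkdir()
        (weights / "README").write_text("model weights go here")
        image, _logs = client.images.build(path=str(context), tag=tag)
        print(f"built {image.tags}")


if __name__ == "__main__":
    build_image("my-llama:latest", Path("~/.cortexcpp/models").expanduser())
```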
<details>
<summary>Build a LlamaFile</summary>

We do not need to assume the user has the llamafile requirements; we can package the setup as an add-on, for example:

```shell=
uv pip install "cortex[llamafile]"

# or
cortex install llamafile
```

```shell=
# everything after --name is optional and overrides the defaults
cortex build --llamafile -m llama3:1B --tag "v0.0.1" \
    --name "coollama" \
    --batch_size 1024 \
    --ctx_size 4096 \
    --keep -1 \
    --temp 0 \
    --host 0.0.0.0 \
    ...
```

To run it:

```shell=
./coollama --version
```

```md
v0.0.1
```

</details>

#### Push (or Publish, Update, Share)

The push functionality should work across Docker images, llamafiles, and quantized or non-quantized models equally, for our hub as well as other locations.

```shell=
# model goes to menlo cloud
cortex push -m llama4:8b

# goes to docker hub
cortex image push coollamma:latest

# goes to hf hub
cortex push --llama-file coollamma.llamafile:v0.0.1 --hf-token "..."

# goes to a user-defined location
cortex push -m llama4:8b \
    --url "https://..." \
    --api-key "..."
```

#### GPU Compatibility

Luckily, vLLM, SGLang, and llama.cpp all provide support for different hardware, and this level of development work can be leveraged from those tools.

For mobile devices, we can leverage ExecuTorch or ONNX. The former seems more straightforward to work with than the latter, where every example in the docs seems to be quite old.

#### Benchmark

The benchmark module serves two purposes: to allow users to benchmark models on their desired device, and to provide data for the model hub.

<details>
<summary>CLI</summary>

```shell=
# Standard benchmark
cortex benchmark -m "llama4:8b"

# With detailed metrics
cortex benchmark -m "llama4:8b" --verbose

# Initialization only
cortex benchmark -m "llama4:8b" --type init

# Runtime metrics
cortex benchmark -m "llama4:8b" --type runtime

# Long-running stability test
cortex benchmark -m "llama4:8b" --type stability --stability-duration 24

# Custom benchmark prompts
cortex benchmark -m "llama4:8b" --type workload --prompts my_prompts.json

# Multi-model benchmarking
cortex benchmark -m "llama4:8b" --type advanced \
    --secondary-models "phi4:8b" "qwen2.5:7b"

# Export results
cortex benchmark -m "llama4:8b" --json results.json
```

</details>

#### Support Different Modalities

**Audio**

We should bring Ichigo back to life or start with the Kokoro implementation Thien completed back in January. With the dual-model loading functionality described above, we could:

```shell=
cortex load \
    -m model-author/model:3b --alias smollama \
    -m hexgrad/Kokoro-82M --alias kokoro

cortex -llm smollama \
    -p "Explain general relativity to a 10-year old." \
    -tts kokoro \
    -o "relative.wav"

# or, to run them without keeping them loaded (generate could also be run)
cortex generate -llm model-author/model:3b \
    -p "Explain general relativity to a 10-year old." \
    -tts hexgrad/Kokoro-82M \
    -o "relative.wav"
```

For the server, it could be something like:

```shell=
POST v/tts/{kokoro}/
```

**Image**

For vision, we should prioritize supporting models with vision capabilities, in particular state-of-the-art models like Moondream at their size level.

**Robotics**

These models should ideally come from the research team. For example, AlphaSpace should not stay a paper-writing exercise; it should make it into Cortex.

#### Task/Model Orchestrator

Can Cortex have a delegator model that assigns tasks to the best model for the task? Under specific criteria (e.g., 1B or 7B parameters, dense model, etc.), we could provide SLAs that guarantee, for example, loading time or time-to-first-token on specific hardware. For instance, if an orchestrator model needs to send a task to a smaller model that's already on-device:

- How fast can we load and unload the model?
- What kind of TTFT can Cortex guarantee?
- What other guarantees can we provide?

Note that this would be different from a mixture of experts (MoE), since an MoE requires that all parameters are loaded into memory, even those that are inactive until they are needed.

#### Structured Outputs

This is a crucial component not only for creating reliable applications but also for controlling robots in hyper-specific ways. Models could be created in Python code as follows:

```python=
from typing import Annotated

import instructor
from instructor import llm_validator
from openai import OpenAI
from pydantic import BaseModel, BeforeValidator, model_validator

client = instructor.from_openai(OpenAI())


class RobotMovement(BaseModel):
    finger_1: int
    finger_2: int
    finger_3: int
    finger_4: int
    finger_5: int
    action_strength: float
    self_checkup: Annotated[
        str,
        BeforeValidator(
            llm_validator(
                "don't grab too hard",
                client=client,
                allow_override=True,
            )
        ),
    ]

    @model_validator(mode="after")
    def validate_action(self) -> "RobotMovement":
        # sanity-check the combination of finger states and action strength here
        return self


def baseball_pitcher_robot(text: str) -> RobotMovement:
    return client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": "You are an expert baseball pitcher robot.",
            },
            {
                "role": "user",
                "content": "Grab the ball and throw a fast ball.",
            },
        ],
        response_model=RobotMovement,
    )
```

The code above will be mapped to specifications like the ones in the table below, which I believe the research team is already implementing in Poseless.

| Finger   | Up  | Down | Action Strength |
| -------- | --- | ---- | --------------- |
| Finger 1 | 1   | 0    | 0.8             |
| Finger 2 | 0   | 1    | 0.1             |
| Finger 3 | 0   | 1    | 0.4             |
| Finger 4 | 1   | 0    | 0.1             |
| Finger 5 | 0   | 1    | 0.76            |

#### Components

These components can be additional tooling that is downloaded as add-ons, for example:

```shell=
uv pip install "cortex[tts]"
```

or we could have extensions in a similar fashion to how the [llm](https://github.com/simonw/llm) tool by [Simon Willison](https://simonwillison.net/) does it, for example,

![image](https://hackmd.io/_uploads/Bk25o63Ake.png)

so Cortex could:

```shell=
cortex install tts asr # and so on
```

Some additional components could be:

- text-to-speech
- automatic speech recognition
- structured outputs
  - instructor
  - outlines
  - pydantic-ai
  - guidance
  - ...

#### (Work)Flows

A `cortex flow` takes in a YAML file with actions:

- Step 1: Scan QR codes on boxes
  - code: str
  - box_number: int
  - location_in_warehouse: str
  - ...
- Step 2: Create an email with specs for new boxes
  - email: str
  - date: date
  - ...
- ...

This would look somewhat like this:

```yml=
name: "flow-1"
date-created: 13-12-11
stages:
  step-1:
    prompt: "..."
    response-schema: "some.json"
    tries: 4
    tools:
      - tool-1
      - tool-2
    ...
  step-2:
    prompt: "..."
    response-schema: str
    tries: 2
    tools:
      - tool-1
      - tool-2
```

#### Schedules

A `cortex schedule` command plans to run a flow at predetermined times (requires `systemd` or something else):

```shell=
cortex schedule --name 'flow-1:warehouse-2'
```

#### Distribution

While a Python package, Cortex can, and still should, be available for download via the major OS package managers. My recommendations on this can be found in the first Cortex strategy doc I created at the link below:

[Cortex Roadmap 🧠](/3ZnyJmmmRWe2gsY2y7OyoQ)

#### Model Hub(?)

The new website for the model hub will take some inspiration from the model hubs of OpenRouter and Ollama, as these are quite good and simple. Initial ideas can be found in the first Cortex strategy doc below:

[Cortex Roadmap 🧠](/3ZnyJmmmRWe2gsY2y7OyoQ)

#### Sandbox (Optional)

Potentially an [open-interpreter-style](https://github.com/openinterpreter/open-interpreter) sandbox where models could be tested in a lightweight environment that enables running Python code.

#### Terminal User Interface (Optional)

A simple TUI (leveraging [textual](https://textual.textualize.io/)) could be highly leveraged to give developers and app creators a taste of what it would be like to run, manage, and use models via Cortex. This tool could allow them to manage tables in the SQL database, data models for structured output, the models loaded in Cortex, and the benchmark results of their models.

## 4. Monetization

The purpose of Cortex is to provide a nice on-ramp to the robot platform rather than becoming a money-making machine. It will, however, provide opportunities to test add-ons for which we could charge, for example:

- Provide a seamless API connector for talking to remote models, e.g., `cortex menlo login`, which sends users to our website to sign up for a cloud service.
- After capturing a bunch of evals, users might want to fine-tune their model in different ways. We could provide this as a service with one command.

## 5. Action Plan