# Deploy-Time Modularity for Llama Stack Providers
Jul 31, 2025
**PUBLIC DOCUMENT**
| Contributor(s) | Roland Huß | rhuss@redhat.com |
| :---- | :---- | :---- |
| **Main Github Issue** | [https://github.com/llamastack/llama-stack-k8s-operator/...](https://github.com/llamastack/llama-stack-k8s-operator/...) | |
| **Status** | Draft | |
| **AI Attribution** | [AIA HAb Ce Hin R o3 v1.0](https://aiattribution.github.io/statements/AIA-HAb-Ce-Hin-R-?model=o3-v1.0) (exceptions marked inline) | |
## Motivation / Abstract
Llama Stack (LLS) currently supports a **build-time composition** of providers: all providers (both in-tree and external BYOP) must be declared in a `build.yaml` and baked into the distribution container image, together with a `run.yaml` that references those providers. At runtime, even when overriding the baked-in `run.yaml` with a custom configuration, only the providers originally included in the distribution image can be configured. There is no mechanism to add completely new providers after the image is built. This is a significant limitation for several key use cases:
* **Platform Engineers:** Enterprise platform teams deploying LLS need to integrate custom, company-specific providers. They do not want to rebuild a curated LLS distribution image for each new provider or each release. Instead, they need a way to plug in their own provider implementations at deployment time, without custom image builds.
* **ISV Partners:** Over time the LLS ecosystem will involve many third-party (ISV) providers (e.g. various database vendors providing vector stores) that are not directly part of LLS. We need an easy way for partners to inject their provider code so that LLS can pick it up dynamically. This should happen *during deployment* of LLS, not via upstream code changes or custom image builds for every new integration.
Today, the **only** way to add new providers is to build a custom LLS distribution image (including the provider code) and reference it in the `LlamaStackDistribution` (LLSD) CRD. Creating, testing, and maintaining such custom images is burdensome and undermines user experience. We want to eliminate this overhead by enabling **deploy-time composition** of providers – i.e. allowing additional provider plugins to be introduced when LLS is deployed and started, rather than only at build-time.
## Background
A recent addition to LLS introduced a [*Bring Your Own Provider (BYOP)*](https://llama-stack.readthedocs.io/en/latest/providers/external.html#external-providers-guide) model, encouraging a modular, out-of-tree approach to providers. In other platforms like OpenStack and Kubernetes, pluggable external modules are used to keep the core lean. Similarly, BYOP lets users develop and maintain providers outside the core LLS codebase, improving modularity and maintainability. The goal is to allow flexible extension of Llama Stack without bloating the core or requiring upstream changes for every new provider.
However, current BYOP support is still limited to build-time inclusion. Out-of-tree providers can be packaged separately, but they must be present in the LLS image (or added during image build) to be usable. At deploy-time, one can only configure the providers that were baked in; it is not possible to add new provider code after the fact. In other words, LLS supports external providers in principle, but *all external provider code and dependencies must still be in the distribution image* when the container starts.
This proposal aims to extend BYOP to support deploy-time composition of providers. We intentionally avoid any “hot-plugging” of code into a running process (which would be complex and risky). Instead, we focus on injecting providers at startup/deploy time. The idea is that any new provider and its libraries are added *before* the LlamaStack server fully starts, e.g. during Pod initialization. This avoids the instability of live code loading and keeps runtime operation predictable. With this model, when you deploy or update the LLS instance, you can include additional providers at that time, and LLS will discover them on startup. This approach covers the vast majority of extensibility use cases without the need for true runtime hot-loading.
In summary, we need a mechanism for LlamaStack to set up provider plugins at deploy-time (container startup) rather than requiring them at build time. The rest of this document explores and proposes how to achieve this in a Kubernetes context.
## Proposal Design / Approach
We propose enabling dynamic deploy-time composition of LLS providers via Kubernetes deployment patterns. In particular, we leverage well-known Kubernetes mechanisms (Init Containers and Sidecars) to inject provider code and services into the LLS Pod at startup. We outline three possible approaches, and then describe the recommended combination of these approaches to cover different use cases.
### Design
Here we structure deploy-time composition into *levels* that build on each other. **Level 1** injects a small, pure-Python adapter into the main LLS process that directly talks to the target backend; it minimizes dependencies and works entirely in-process. **Level 2** retains that local Python stub but delegates the heavy lifting to an external adapter service: either a **sidecar within the Pod (Level 2a)** for process isolation without extra networking hops beyond localhost, or an **independent Deployment (Level 2b)** when stronger isolation, separate scaling, or distinct lifecycle is required. Level 1 is intentionally valuable on its own and should be the first milestone; it delivers immediate extensibility with the simplest operational footprint. Level 2 variants are strictly additive: the sidecar or external service exists *because* the Level 1 stub speaks the provider contract and forwards calls.
Our incremental plan addresses first the use cases that are cleanly solved by Level 1; we then tackle the more demanding scenarios that require Level 2a/2b.
#### Level 1 - Library blending with init-containers
This approach uses an **Init Container** in the Kubernetes Pod to “blend in” the provider’s code/libraries into the main LLS container’s filesystem before LLS starts. We mount a shared volume (e.g. an `emptyDir`) into both the init container and the main container at the LLS providers directory (e.g. `~/.llama/providers.d`). The init container is responsible for placing the provider’s files into this volume (and installing any necessary Python packages there), then it exits. The main LLS container then starts, with the volume already populated with the new provider code and spec.
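For illustration, a minimal Pod fragment for this pattern might look like the following sketch; the image names, paths, and the copy command are placeholders, not a finalized convention:
```
apiVersion: v1
kind: Pod
metadata:
  name: llama-stack
spec:
  volumes:
    # Shared scratch volume that carries the provider spec and code
    - name: external-providers
      emptyDir: {}
  initContainers:
    # Copies the provider spec and Python package into the shared volume, then exits
    - name: my-provider-init
      image: quay.io/example/my-provider:1.0              # placeholder provider image
      command: ["sh", "-c", "cp -r /lls-provider/. /providers-target/"]
      volumeMounts:
        - name: external-providers
          mountPath: /providers-target
  containers:
    - name: llama-stack
      image: quay.io/example/llama-stack:latest           # placeholder distribution image
      volumeMounts:
        # LLS scans this directory for external provider specs on startup
        - name: external-providers
          mountPath: /root/.llama/providers.d
```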
**Pros:**
* **Simple model:** It’s easy to understand – effectively we are assembling the needed files in the correct location before launch. LLS then naturally picks up the provider from the filesystem on startup.
* **No runtime overhead:** The init container runs to completion before LLS starts, so there is no ongoing resource cost (unlike sidecars which run in parallel). The main container remains the only long-running process (aside from any sidecars for other purposes).
* **Deterministic startup:** The provider code is added in a controlled way. If something fails in the init step (e.g. a package fails to install), the Pod will fail to start, which is easier to catch and handle.
**Cons:**
* **Pure-Python only:** This works best for providers implemented in pure Python (or at least not requiring compiled binaries outside the base image). The main container’s environment is fixed (OS, Python version), so any native code in the provider must be compatible. Providers that need platform-specific binaries, GPU drivers, etc. cannot be simply copied in – those would likely fail to run if the main image is missing necessary system libraries. (Rule of thumb: use inline providers only for Python-based logic).
* **Dependency conflicts:** All inlined providers share the main Python environment. If two providers (or a provider and the core system) require different versions of the same library, this can cause version clashes. The init container could install one provider’s dependencies that inadvertently override or conflict with another’s. We’d have no isolation in this case. (In practice, careful version pinning or vendoring may be needed, but this complicates usage.)
* **No isolation for failures:** A buggy provider running in-process can affect the entire LLS process (e.g. memory leaks or crashes propagate to LLS). There is no containment beyond what Python itself provides, although LLS’s internal architecture mitigates this to some degree: each API and provider implementation runs in its own coroutine, so a single failure should not affect the program globally. For highly complex or untrusted provider code that might destabilize the main application, this method remains risky.
This approach is ideal for lightweight providers that can be cleanly dropped into LLS’s runtime. It avoids extra moving parts and keeps performance optimal (direct function calls, no networking overhead). But it has limits when providers have heavy or conflicting requirements.
#### Level 2a - Adapter to remote providers via sidecars
In this approach, each new provider runs as an **external service in a sidecar container**, and LLS communicates with it over localhost (e.g. via HTTP or gRPC). The provider is configured in LLS as a `remote` provider with a small in-process adapter stub that forwards requests to the sidecar. Essentially, the provider is treated as a microservice running alongside LLS.
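As a sketch, the corresponding entry in the LLS configuration would point the remote provider at the sidecar over localhost; the provider name, API group, and the `url` config field below are illustrative assumptions:
```
providers:
  vector_io:
    - provider_id: my-vector-store
      provider_type: remote::my-vector-store   # adapter stub injected at deploy time
      config:
        url: http://localhost:5500              # sidecar listening inside the same Pod
```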
**Pros:**
* **Full isolation:** The provider runs out-of-process, so crashes or memory leaks in the provider won’t bring down the main LLS server (even if such failures are rare). The sidecar can also have a completely different environment (OS packages, Python version, native libraries, even a different programming language) without impacting the main container. This makes it suitable for providers that require specialized system setup (GPU drivers, large C++ libs, etc.). Isolation also helps at the data level, since a provider cannot read another provider’s state stored in local databases.
* **No dependency conflicts:** Because the provider’s dependencies are in its own container, there’s no chance of Python package version collisions with LLS or other providers. Each sidecar can use whatever library versions it needs.
* **Flexible technology stack:** Sidecar providers could be written in other languages or use frameworks not available in the LLS image. As long as they expose a network API and conform to LLS’s expectations, they can be integrated. This opens the door to integrating virtually any system or service as a provider (e.g. a Java-based service, a database server, etc.).
* **Managed lifecycle:** The sidecar containers are managed by Kubernetes, which handles starting, restarting on failure, health checking, etc., independently of the main LLS process. This can improve reliability (each sidecar can have its own readiness/liveness probes, scaling, etc., as needed).
**Cons:**
* **Increased resource usage:** Each provider sidecar is a separate container, consuming memory/CPU. Having many providers means a larger Pod with multiple processes. This could be a concern in resource-constrained environments; admins need to allocate resources for each sidecar.
* **Operational complexity:** More containers per Pod means more things to manage (monitoring multiple logs, ensuring all containers are healthy). If dozens of sidecars are added, the deployment becomes complex. There’s also a need to manage port assignments and inter-container coordination (though using localhost and default ports can simplify this).
* **Performance overhead:** Calls to a sidecar involve IPC (typically HTTP over localhost). This adds minor latency and overhead compared to an in-process call. In practice, localhost communication is quite fast, but it’s still a hop with serialization/deserialization cost. For extremely latency-sensitive operations, this could be a factor. If latency is a concern, consider, for example, using gRPC over UNIX domain sockets so that no TCP is involved.
* **Startup ordering:** We must ensure the sidecar service is ready to accept requests when LLS tries to use it. This might require readiness checks or retry logic, since Kubernetes will start containers in parallel after the init phase. It’s a solvable issue (e.g. LLS adapters can handle connection errors by retrying), but it adds another thing to handle in design.
This sidecar approach is powerful and will be the go-to solution for providers that are not suitable to run in-process. It essentially extends LLS with arbitrary services. The trade-off is the additional overhead of maintaining those services.
#### Level 2b - External Pods (Dedicated Pods)
This approach involves deploying providers as entirely separate, dedicated Pods (managed by a Deployment) within the Kubernetes cluster, outside of the LLS Pods. LLS would then communicate with these external provider Pods over the cluster network via Kubernetes Services so that they have a stable DNS name that can be directly used in the LLS configuration.
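For example, a Kubernetes Service in front of the provider Deployment could give LLS a stable address such as `http://my-vector-store.llama-stack.svc.cluster.local:5500`; the names, namespace, and port below are placeholders:
```
apiVersion: v1
kind: Service
metadata:
  name: my-vector-store
  namespace: llama-stack
spec:
  selector:
    app: my-vector-store        # matches the labels of the provider Deployment's Pods
  ports:
    - port: 5500                # port referenced from the LLS configuration
      targetPort: 5500
```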
**Pros:**
* **Maximum Isolation and Scalability**: Each provider runs in its own Pod, offering the highest level of isolation. They can be scaled independently, have their own resource quotas, and be managed by separate Deployments or StatefulSets. This is ideal for very resource-intensive providers or those that need to be scaled differently than the main LLS application.
* **Flexible Deployment and Management**: Providers can be deployed and updated completely independently of the LLS application. This allows for specialized lifecycle management and integration with other cluster-level tools or services. It also allows those Pods to be scaled independently in high-load scenarios.
* **Decoupled Failure Domains:** A failure in an external provider Pod will not directly affect the LLS Pod, improving overall system resilience.
**Cons:**
* **Increased Operational Complexity:** Managing separate Pods for each provider significantly increases the number of Kubernetes resources to track (Deployments, Services, etc.). This makes the overall LLS deployment more complex to monitor and troubleshoot.
* **Network Communication Overhead**: Communication between LLS and external provider Pods goes over the cluster network, which introduces higher latency compared to localhost communication with sidecars. This overhead could be a factor for very high-throughput or low-latency provider interactions.
* **Connection Management:** LLS needs to discover and connect to these external services. This requires more involved setup for connection parameters (e.g., service names, ports) to be propagated to the LLS configuration (e.g., run.yaml). This could involve mechanisms like Kubernetes Service discovery or environment variables set by an operator.
* **Advanced Security Considerations:** Since communication is over the cluster network and not constrained to localhost, more advanced network security measures (e.g., network policies, mTLS) might be required to secure the communication paths between LLS and the provider Pods. This adds another layer of configuration and complexity.
While offering the highest degree of isolation and scalability, deploying providers as external Pods introduces significant operational complexity and additional networking/security considerations compared to sidecars. For initial deploy-time modularity, we will focus on the sidecar approach, as it offers a good balance of isolation and simplicity without requiring complex network configuration or advanced security measures for inter-container communication within the same Pod. The external Pod approach might be considered for future enhancements or specific use cases requiring extreme decoupling and independent scaling.
### Implementation
The **library** (init-container) and **adapter** (sidecar) approaches are not mutually exclusive – in fact, our proposal is to **combine them** to get the best of both. The typical deployment for a dynamic provider will use **both an init container and a sidecar** working in tandem:
* **Provider Image and CRD:** We introduce a way to specify external providers in the `LlamaStackDistribution` CRD, including referencing a container image for the provider. For example, a provider might be defined with an image `my-inference-provider:1.0`. The LLS Operator will recognize this and use that image to augment the LLS deployment (Pod).
* **Init Container (Code Injection):** The operator adds an init container (using the provider’s image) to the LLS Pod. This init container’s responsibility is to **populate the shared volume** with the provider’s spec and code. For instance, the provider image could include a directory (say `/lls-provider/`) containing the provider’s YAML spec and Python package. The init container can copy these files into the mounted volume (e.g. into the appropriate `providers.d` subdirectory for the spec, and into a libs folder for code). It may also perform any installation step (e.g. if the provider’s Python library needs to be placed into a target directory or `PYTHONPATH`). After copying/installing the files, the init container exits. At this point, the main container’s volume has everything needed for LLS to load the provider. For simple providers, this step is all that is needed. For providers with heavier dependencies, an additional sidecar container is added.
* **Sidecar Container (Provider Service):** If needed, the operator also injects a sidecar container using the **same provider image**, but this time it runs the provider’s service. The provider image should have an entrypoint or command to start its service (e.g. a server listening on `localhost:port`). The sidecar container will run concurrently with the LLS container. The main LLS process will communicate with this sidecar via the provider’s adapter (the code that was injected by the init container). For example, if `my-inference-provider` exposes an HTTP API on port 5000, the adapter code (now present in LLS) will know how to send requests to `localhost:5000`. The operator assigns the port and passes it to the sidecar container via an environment variable (e.g. `PROVIDER_PORT`). This is needed so that multiple sidecars do not run into port conflicts (all containers in a Pod share the same network namespace).
The LLS container is configured to use the shared volume and recognize the new provider. Concretely, the operator will mount the shared `emptyDir` volume at LLS’s expected external provider directory (by default `~/.llama/providers.d`). It will also set `PYTHONPATH` (or use `.pth` files) so that any libraries installed in that volume are importable. When LLS starts, it scans `providers.d`, finds the new provider spec, and loads the provider into its registry (just as if it were built-in). Because we used the same image for both init and sidecar, we ensure the code versions match and there’s no drift.
By designing the provider image to contain both the *in-process adapter* code and the *service implementation*, we allow a **single image** to deliver the provider. This simplifies distribution: a vendor or developer only needs to ship one OCI image. The LLS operator will use that image in two ways (init and sidecar) to “blend” it into the deployment. This avoids having to coordinate two separate images for one logical provider. It also ensures compatibility – the adapter and service come from the exact same build.
In summary, the operator orchestrates the following when a new provider is declared in the LLSD CR: it adds a shared volume, an init container to copy/install provider files, and a sidecar container to run the provider’s service. The main LLS container remains untouched except for mounting the volume and environment tweaks. From the user’s perspective, they simply specify the provider image in the CR and the system takes care of wiring it in at deploy time.
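A sketch of the Pod template the operator could generate for a single remote provider follows; the container names, image references, mount paths, and the `PROVIDER_PORT` convention are illustrative and subject to the final operator design:
```
spec:
  volumes:
    - name: external-providers
      emptyDir: {}
  initContainers:
    # Populates the shared volume with the provider spec and adapter code
    - name: my-inference-provider-init
      image: quay.io/example/my-inference-provider:1.0    # same image as the sidecar
      command: ["sh", "-c", "cp -r /lls-provider/. /providers-target/"]
      volumeMounts:
        - name: external-providers
          mountPath: /providers-target
  containers:
    - name: llama-stack
      image: quay.io/example/llama-stack:latest
      env:
        # Makes provider packages copied by the init container importable
        - name: PYTHONPATH
          value: /root/.llama/providers.d/lib
      volumeMounts:
        - name: external-providers
          mountPath: /root/.llama/providers.d
    # Runs the provider's service next to LLS, reachable via localhost
    - name: my-inference-provider
      image: quay.io/example/my-inference-provider:1.0
      env:
        # Port assigned by the operator to avoid clashes between multiple sidecars
        - name: PROVIDER_PORT
          value: "5000"
```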
### Prerequisites / Dependencies
This feature builds on LLS’s existing BYOP mechanism. The current BYOP support, which relies on a `module:` entry in the provider definition stored in LLS, needs to be extended so that provider specifications can also be loaded from files within a directory; each external provider image can then copy its spec file into the shared directory. For example, assume a directory `providers.d` with the following content:
```
providers.d/
|
+-- ramalama.yml
|
+-- redis.yml
```
with `ramalama.yml` containing
```
- provider_id: ramalama
provider_type: remote::ramalama
module: ramalama_stack
```
and the code for the `ramalama_stack` package, including its dependencies, available on the `PYTHONPATH` (the exact location is still tbd). No live injection API is needed – just the files on disk and packages in place before startup.
Deploy-time composition will be implemented via **Kubernetes** constructs (init containers, sidecars, shared volumes). This requires that LLS runs on Kubernetes, which is the primary deployment target for LLS. The cluster must support init containers (any modern Kubernetes does) and the LLS Operator must have permissions to modify the Pod spec accordingly.
Provider developers must build their container images to be compatible with the LLS deployment (a sample image layout follows this list):
* The image should target the same CPU architecture and a similar base OS environment as the LLS container, if the provider code will be used in-process. For instance, if LLS runs on UBI 9 with Python 3.11, the provider image should ideally also use UBI 9 and Python 3.11 for its adapter code. This ensures any compiled wheels or binaries in the provider package will work properly in the main container. Using the same base image for LLS and the provider is a good practice to avoid compatibility issues.
* The image should contain all necessary dependencies for the provider. It will run in an offline context (no internet access in the cluster), so any Python packages or system libraries needed by the provider’s code or service should be included in the image. No runtime downloading should be required (consistent with the goal of self-contained providers and no pip installs at startup).
* The image should include the provider’s spec file and any adapter module code in a known location (so the init container can find and copy them). We may establish a convention (e.g. everything under `/lls-provider/` as mentioned). Documentation will specify this layout.
* The provider’s service in the image should listen on a predefined port (or a configurable port) and ideally only serve localhost (since it’s meant to be used internally by LLS). The protocol (HTTP, gRPC, etc.) and endpoints should be documented as part of the provider’s integration.
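Pulling the requirements above together, a provider image could be laid out as follows; the `/lls-provider/` prefix and the `lib/` directory are the assumed convention from this proposal, not a finalized layout:
```
/lls-provider/
|
+-- providers.d/
|   +-- my-inference-provider.yaml   # provider spec, copied into the shared providers.d
|
+-- lib/
    +-- my_inference_provider/       # adapter package, made importable via PYTHONPATH
        +-- __init__.py
```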
The LlamaStack Operator needs to be enhanced to support this feature. This includes CRD updates (to allow specifying provider images and any config like ports), and logic to construct the Pod spec with the appropriate volume and containers. These operator changes are a dependency for the feature to work (the core LLS alone can’t do this – it requires orchestration).
Cluster administrators should account for the additional resource needs of provider sidecars. For example, if a provider needs 2GB RAM for a vector DB sidecar, the Pod’s resource quota must accommodate that. This isn’t a “dependency” per se, but a prerequisite for successful operation. It should be possible to specify the resource requirements in the `LlamaStackDistribution` CR.
## Integration Checklist
[AIA PAI Nc Hin R o3 v1.0](https://aiattribution.github.io/statements/AIA-PAI-Nc-Hin-R-?model=o3-v1.0)
**CRD Schema Updates:** Update the `LlamaStackDistribution` CRD to allow declaring additional providers to load at deploy-time. This could be a new field such as `spec.externalProviders[]` where each entry includes details like `name`, `image`, and perhaps `mode` (inline or remote) and any specific configuration (like port or env vars). Ensure the CRD documentation clearly defines how to specify providers.
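An illustrative CR snippet, assuming a field named `spec.externalProviders`; the API group/version and field names are placeholders until the schema is finalized:
```
apiVersion: llamastack.io/v1alpha1        # placeholder API group/version
kind: LlamaStackDistribution
metadata:
  name: my-llama-stack
spec:
  externalProviders:
    - name: my-vector-store
      image: quay.io/example/my-vector-store:1.0
      mode: remote                        # "inline" would skip the sidecar
      port: 5500
      env:
        - name: LOG_LEVEL
          value: debug
```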
**Operator Controller Logic:** Extend the LLS Operator’s reconciliation logic to handle the new CRD fields.
When external providers are specified, modify the Deployment (or StatefulSet) for LLS to include:
* A shared `emptyDir` volume mounted at the appropriate path (default `~/.llama/providers.d` in the container, or another configured path if needed).
* An init container for **each** provider. The init container uses the provider’s image and runs a command to copy/install the provider files into the shared volume.
* A sidecar container for each provider that requires one (for inline-only providers, a sidecar is not needed). The sidecar uses the provider image’s default entrypoint (running the provider’s service). It should be named clearly (e.g. include the provider name) for identification.
* Mount the shared volume into each init container (at the source path where it should place files) and into the main LLS container (at the target path where LLS will read them). Sidecars don’t need access to `providers.d` (they run independently), unless the provider service itself needs to read some config – usually not, as configuration would be passed via environment variables or via the communication with LLS.
* Set environment variables in the LLS container: e.g., `PYTHONPATH` to include the path on the shared volume where packages were installed (if not directly under `providers.d`). Also ensure any LLS config (like `external_providers_dir`) is pointing to the mounted directory if it’s different from the default.
* Optionally, configure a *readiness probe* or startup script for LLS container to wait for sidecar readiness. This might be as simple as having LLS retry connections, but could also use Kubernetes probes (for example, mark LLS container ready only after sidecars report ready). The operator could set an env var or script in the LLS container entrypoint to check sidecar status.
**Status Reporting:** Implement logic to update the `status:` of the LLSD CR to reflect provider injection results (an example status snippet follows below):
* After deployment, once the Pod is running, update `status.providers` (for example) with a list of providers successfully loaded. This could include each provider’s name, type (inline/remote), and status (e.g. Loaded, Running, Error).
* If an init container failed (Pod didn’t start), the operator should catch this (possibly via Pod status or events) and surface an error condition in the CR status (e.g. “ProviderInstallFailed” with details of which provider failed). This helps users quickly diagnose issues.
* Indicate sidecar health as well – if a sidecar is crash-looping or not ready, the CR status could reflect that (though Kubernetes itself will show this in the Pod status, mirroring it in the CR is useful for a single pane of glass).
Ensure that removal of providers (if a user edits the CR to remove one) is handled gracefully – the operator should update the Deployment to remove the corresponding sidecar and init container, and the status should drop that provider after the Pod is reconciled.
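For illustration, a resulting status section could look roughly like this (field and condition names are not final):
```
status:
  providers:
    - name: my-vector-store
      type: remote
      status: Running
    - name: my-tool
      type: inline
      status: Loaded
  conditions:
    - type: ProviderInstallFailed
      status: "False"
      lastTransitionTime: "2025-07-31T12:00:00Z"
```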
**Testing & CI:** Add unit tests for the operator changes:
* Verify that given a spec with a provider, the generated Pod template has the correct volume and containers.
* Test that multiple providers are handled (e.g., two providers with different images result in two init containers and two sidecars).
* If possible, write integration tests in the operator’s e2e suite to deploy a dummy provider and verify LLS picks it up (this might require a fake provider image available in test).
* Simulate failure cases (bad image pull, init container error) and verify operator sets CR conditions appropriately.
### Observability
[AIA PAI Nc Hin R o3 v1.0](https://aiattribution.github.io/statements/AIA-PAI-Nc-Hin-R-?model=o3-v1.0)
Deploy-time providers extend Llama Stack without changing core images, but they add new moving parts that operators must understand and monitor. When the **LlamaStack Operator** reconciles an updated `LlamaStackDistribution` (LLSD) CR it rolls the Pod to inject the shared volume, init container(s), and – when required – sidecar containers. From an operational perspective that means adding or removing providers always triggers a short restart window; high-availability installations should use multiple replicas so that rolling updates keep the service available.
Every injected component is surfaced through the **LLSD status**: after a successful reconciliation the operator lists each provider, its mode (`inline` or `remote`), the container name (for sidecars), and a readiness indicator. If an init container fails or a sidecar repeatedly crash-loops, the operator sets a “ProviderDegraded” condition that points directly to the failing container, making `kubectl describe <llsd>` the single entry point for troubleshooting. In addition, the operator emits Kubernetes Events during reconciliation (e.g. *“Injecting provider myvector”*, *“Provider myvector injection failed”*), so `oc get events` shows a live audit trail.
At runtime, **observability is integrated with ordinary Pod inspection**. Each sidecar can expose a liveness/readiness endpoint which the operator wires into the final Deployment. As long as the sidecar stays healthy the overall Pod remains *Ready*; a failing sidecar automatically makes the Pod *Unready*, and the main LLS container continues to retry calls to the sidecar until the sidecar recovers. All containers inherit the cluster’s standard logging and metrics collection, so administrators can scrape sidecar metrics with Prometheus or forward logs to a log aggregator exactly as they do for the main application. If we want the LLS Pod to keep serving even when a provider sidecar is failing, the provider sidecars should always report *Ready* for their readiness probe, while keeping a real liveness probe so that Kubernetes can restart the failing container.
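A sketch of that probe wiring for a provider sidecar (endpoint path, port, and timings are placeholders): the readiness probe always succeeds so a degraded provider does not mark the whole Pod *Unready*, while the liveness probe lets the kubelet restart the sidecar.
```
containers:
  - name: my-vector-store
    image: quay.io/example/my-vector-store:1.0
    ports:
      - containerPort: 5500
    readinessProbe:
      # Always succeeds, so a degraded provider does not make the Pod Unready
      exec:
        command: ["true"]
      periodSeconds: 10
    livenessProbe:
      # Real health check; repeated failures cause the sidecar to be restarted
      httpGet:
        path: /health            # placeholder endpoint exposed by the provider
        port: 5500
      periodSeconds: 15
      failureThreshold: 3
```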
Because **inline providers run inside the LLS process**, they inherit its resource limits and share its failure domain. Any crash in inline code brings down the whole Pod and triggers Kubernetes restart logic; conversely, remote providers in sidecars fail in isolation. Administrators should therefore reserve inline mode for small, trusted helpers and use sidecars when sandboxing or additional system libraries are required. Resource requests and limits for each sidecar are declared in the LLSD manifest, letting capacity planners size Pods predictably.
### Test Plan
[AIA PAI Nc Hin R o3 v1.0](https://aiattribution.github.io/statements/AIA-PAI-Nc-Hin-R-?model=o3-v1.0)
Testing this feature requires covering both the deployment mechanics and the functional outcomes:
* **Unit Tests (Operator):** Develop unit tests for the reconciliation logic that adds volumes/containers. For example, given a CR with one inline provider and one remote provider, verify the resulting Pod spec contains the expected volume, one init container per provider, and one sidecar for the remote provider. Test edge cases like no providers, or duplicate provider entries, to ensure the operator handles them gracefully (likely by rejecting duplicates or overriding a previous spec).
* **Functional Test – Inline Provider:** Create a simple dummy inline provider (pure Python) for testing. For instance, a provider that provides a trivial tool or dummy inference. Package it as an image. Deploy LLS with this provider via the new mechanism. Verify that:
* The init container ran successfully and injected the provider (check logs for any error, check CR status for provider listed).
* The LLS server log or API indicates the provider is registered (for example, if there’s a way to list active providers, it should show up; or a specific behavior of that provider can be invoked to see if it works).
* No negative side effects on LLS startup time or stability (the service should start normally with minimal delay from the init step).
* **Functional Test – Sidecar Provider:** Create a dummy remote provider service (e.g., a small web server that LLS can query). For test purposes, this could be a simple HTTP server that echoes input. Package it with an adapter library. Deploy LLS with this remote provider. Verify:
* The sidecar container comes up and is healthy (readiness passes).
* LLS is able to communicate with it (perhaps simulate an LLS query that the provider would handle and see the correct response).
* Try bouncing the sidecar (kill the process) and ensure it recovers, and that LLS handles the temporary unavailability gracefully (retries or at least logs error without crashing).
* **Multiple Providers Scenario:** Test deploying LLS with **multiple** external providers at once (mix of inline and remote). This ensures the operator can handle several init containers and sidecars together. Verify that all providers initialize correctly. Also intentionally create a scenario with conflicting dependencies in two inline providers to see what happens (likely one will override the other’s library version – ensure this is documented or detected). The expectation might be that whichever runs last “wins”, which could break the first one. While we may not fully solve that in code, the test will confirm the nature of the conflict so we can document it (and in the future, we might add warnings in such cases).
* **Offline/Disconnected Test:** Run a test in an environment with no internet access. Use a provider image that includes all needed wheels (simulate a customer in a restricted environment). Deploy LLS with that provider and ensure that at no point does the system attempt to reach out to external repos. This means the init container should install from local files only. If our procedure accidentally tries a pip install from the internet, that test will catch it.
* **Performance Test:** Measure the overhead introduced by this feature:
* Time added to Pod startup due to init containers (e.g. if installing a large package, how many seconds delay).
* Latency of calls to a sidecar vs an in-process call. We can instrument a simple loop calling a function provided by an inline provider vs the same provided by a sidecar to get a ballpark overhead. This will help validate that using sidecars does not introduce unacceptable latency. The expected result is that localhost calls are very fast (a few milliseconds).
* Resource usage: run stress tests where the sidecar is busy (e.g., doing a heavy computation) and see that it doesn’t degrade the main LLS responsiveness beyond expected (since CPU is partitioned by limits if set).
All new tests will be integrated into the CI/CD pipeline. We will also likely do a **field trial** with an internal or partner-developed provider to ensure the process is smooth end-to-end (from building the provider image, to deploying it with LLS Operator, to using it in an application).
### Documentation
[AIA PAI Nc Hin R o3 v1.0](https://aiattribution.github.io/statements/AIA-PAI-Nc-Hin-R-?model=o3-v1.0)
To support this feature, comprehensive documentation must be added or updated:
* **User Guide for Deploy-Time Providers:** A section in the LlamaStack documentation explaining the concept of deploy-time composition. This will cover when and why to use it (reiterate use cases: custom providers, partner integrations). It should contrast with the old method (building custom images) to show the benefits.
* **Step-by-Step Setup:** Provide a tutorial or example on adding a provider. For instance, “Adding a Custom Vector Store Provider to LLS” – showing how a user would package their provider (maybe using a simple example like a fake provider), build the container, configure the LLSD CR, and deploy. Include YAML snippets and `kubectl` commands to illustrate the process. Be sure to highlight how to verify that it worked (checking CR status, logs, doing a test query).
* **Provider Image Specification:** Document the requirements for provider container images. Define the expected filesystem layout (e.g., “Place your provider spec in `/lls-provider/providers.d/{type}/{name}.yaml` and your Python package in `/lls-provider/lib/` or similar”), how to make the image copy-friendly (perhaps provide a base or an example Dockerfile). Also, define how the sidecar service should behave (e.g., “your container entrypoint should start the service and not exit”). If we have a base image or SDK for building providers, mention it.
* **Configuring Provider Connections:** Explain how configuration values are passed to providers. For inline providers, it might just be via the YAML spec and LLS config. For remote providers, if the sidecar needs certain env vars or command-line args (like to set a port or mode), document how the operator can pass those. Perhaps the CRD allows specifying env for the provider sidecars, or the provider spec has fields.
* **Operational Guidelines:** Document considerations such as resource allocation for sidecars, how to monitor them (point to logs and metrics as in Observability), and how to troubleshoot common issues (e.g., provider not appearing – check init container logs, check that the image was pulled, etc.). Include a FAQ for things like “What if two providers need different versions of X library?” (Answer: they can’t both be inline in that case; use a sidecar for at least one of them).
* **Examples and References:** Maintain a repository or catalog of sample providers (if possible) to help users get started. Even if just on GitHub, having a few reference implementations (like how to wrap an existing service as a provider) would be invaluable. Documentation should link to these.
# Exit Criteria [AIA PAI Nc Hin R o3 v1.0](https://aiattribution.github.io/statements/AIA-PAI-Nc-Hin-R-?model=o3-v1.0)
## Alpha
* **Basic Functionality Complete:** The deploy-time provider injection mechanism is implemented in the LLS Operator and has been tested in a dev/pre-release environment. You can successfully declare an external provider in the CR and see it come alive in the LLS Pod (for at least one provider type, e.g., an inference or vector store provider).
* **Limited Scope:** In alpha, the feature may support only a subset of use cases. For example, we might support only pure-Python (inline) providers initially, or only sidecar mode providers, to get core pieces working. The combination/hybrid approach is designed, but maybe not all variations are fully tested. It’s acceptable in alpha that some manual steps or rough edges exist (e.g., requiring the user to set an environment variable manually).
* **Behind Feature Gate (if applicable):** If we choose to gate this feature, the operator might require an explicit enable flag or annotation on the CR to activate deploy-time providers in alpha. This ensures only intrepid users opt in.
* **Known Limitations Listed:** The alpha release notes should clearly state what is not yet supported (e.g., “conflicting dependencies not handled”, “only works on Kubernetes, not standalone Docker”, etc.).
* **Basic CI Coverage:** At least unit tests and maybe one happy-path e2e test are in place, though comprehensive testing might not yet cover all edge cases. The feature should not break existing functionality (confirmed via regression tests).
## Beta
* **Stabilization and Broader Support:** By beta, the feature is more robust. Support for both inline and sidecar providers is fully implemented and tested. Multiple providers can be injected simultaneously. Most known bugs from alpha have been resolved (e.g., issues with volume mounting, ordering, etc.).
* **Performance Optimizations:** Any performance issues discovered in alpha are addressed. For instance, if alpha revealed that large init-containers were too slow, beta might introduce improvements (like supporting pre-bundled wheels to speed it up).
* **API Refinement:** The CRD and configuration schema might be refined based on alpha feedback. By beta, we aim to finalize the API so that scripts and YAML written against it won’t need changes later. If any breaking changes were needed, they are done at beta start, and thereafter the interface is stable.
* **Documentation Complete:** Full user-facing documentation is published (not hidden as experimental). This includes guides for partners on building provider images. The documentation is reviewed and tested by someone outside the development team to ensure clarity.
* **Broader Testing:** Integration tests now cover more scenarios, including some partner use-cases. We might have worked with one or two design partners (for example, an internal team or a friendly ISV) to integrate their provider using this system and gather feedback. Their success in using it is a good beta exit criterion.
* **Security Review:** A threat model and security review of the feature is completed. Any high-priority security concerns (like privilege escalation via init containers, etc.) are mitigated by beta.
* **Upgrade/Downgrade Handling:** Test and document how the feature behaves on operator upgrade or if a user disables it (perhaps by removing providers). Ensure that disabling the feature (if needed) doesn’t leave stale sidecars running, etc. By beta, it should handle transitions cleanly.
## GA
* **Enterprise Hardened:** The feature is proven in diverse environments and load conditions. GA means it’s ready for production use. We have resolved any remaining significant issues. For instance, if multiple providers were causing memory bloat, by GA that’s fixed or documented with guidance.
* **Full Test Coverage:** Automated tests cover all aspects of the feature (including chaos testing like sidecar crashes). We have also possibly run scale tests (e.g., how does the system behave with, say, 5-10 sidecar providers? Does memory usage scale linearly and remain within limits?).
* **Documentation and Examples Polished:** GA documentation includes troubleshooting sections, best practices, and maybe even a list of known compatible provider images (if we or partners provide some out-of-the-box). All references to “Tech Preview” are removed. If there’s a Quick Start or Demo, it’s updated to include showing off adding a provider.
* **Backward Compatibility:** The feature has been in use during the beta, so GA means we won’t make breaking changes to how it’s configured. Users can rely on it long-term. We ensure that even as LLS core evolves, the external provider interface remains stable (or at least deprecation policies apply).
* **GA Criteria Met:** All acceptance criteria defined (e.g., “provider injection must succeed 100/100 times in test, sidecars start within 5 seconds, etc.”) are met. Essentially, the feature is reliable and stable enough for mission-critical use.
At this point, platform engineers and ISVs should feel confident using deploy-time composition of providers in production. The mechanism will be a fully supported part of LlamaStack.
# Alternatives Considered
* **Continue Build-Time Only (Status Quo):** We considered sticking with the current model (i.e., requiring users to build a custom LLS image for any new providers). This was rejected because it places too high a burden on users and does not scale for an ecosystem of many providers. Every new integration would mean a bespoke image build, which is slow and cumbersome, negating a lot of the flexibility BYOP is supposed to offer.
* **Start-time Installation via pypi.redhat.com**: This method involves the LLS container installing additional provider packages from a package repository (e.g., pypi.redhat.com) at startup, potentially triggered by an entrypoint script reading a `requirements` file. Its advantages include simplicity and being self-contained, as no new images or additional containers are needed. However, it suffers from significant drawbacks such as non-deterministic startup times, reliability issues due to remote downloads, late failure discovery (only at runtime), potential dependency conflicts, and a lack of offline support, requiring network access at each startup. Due to these limitations, it is considered fragile and unsuitable for production LLS deployments.
* **Dynamic Runtime Plugin Loading:** Another alternative was to allow LLS to load provider modules *at runtime* via an API or plugin manager (for example, have the LLS process accept a new provider package and load it without restarting). This approach was deemed too risky and complex. It would require a secure plugin interface, dynamic module loading and unloading, etc. Given the potential for dependency conflicts and instability when injecting code into a running process, we favored a deploy-time model (restart to add), which is much simpler and aligns with Kubernetes best practices. Hot-reloading also complicates state management and upgrade paths. We ruled this out in favor of the safer deploy-time composition.
* **Only Sidecars and external Pods (No Inline Option):** We considered whether we should *only* allow external providers as sidecar services, and disallow the inline injection for simplicity. This would force every custom provider to run as a separate service. We decided against this strict approach because there are many simple use cases (small logic, custom tools, passthrough proxy to backends) that are easier to implement as a quick Python plugin running in-process. Requiring a whole sidecar for a few lines of code would over-complicate those scenarios. Thus, we kept both modes: use inline providers for lightweight cases, and sidecars for heavier cases.
In conclusion, the chosen solution (init containers + sidecars / external deployments) provides a balanced middle ground, avoiding the pitfalls of pure runtime loading and the inflexibility of build-time only. It aligns with Kubernetes patterns and gives us the modular extensibility needed for LlamaStack’s ecosystem. The alternatives above were either too rigid or too complex to meet our goals.