## LLM deployment endpoint availability
### Possible Resources
- Pilot on CS Kubernetes
- Possible explicit resources either in the CS cluster or
### Requirements
- LangChain compatible (ease of use for researchers), ideally OpenAI compatible; see the sketch after this list
- Resource management (CPU/GPU/memory); depending on the model, up to 60 GB per instance
- Authentication (API tokens, or JWT if accessed e.g. via gpt.aalto.fi)
- Usage monitoring (tokens per user/key, etc.) if usage gets high enough that we need some compensation mechanisms.
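If the endpoint is OpenAI compatible, researchers only need to swap the base URL and token in the stock LangChain OpenAI wrapper. A minimal sketch, assuming a hypothetical endpoint behind gpt.aalto.fi, a per-user token in an environment variable, and a placeholder model name:

```python
# Minimal sketch: talking to a self-hosted, OpenAI-compatible endpoint from
# LangChain. Endpoint URL, model name and env variable are placeholders.
import os

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://gpt.aalto.fi/v1",    # hypothetical endpoint URL
    api_key=os.environ["LLM_API_TOKEN"],   # per-user API token (or JWT)
    model="llama-2-13b-chat",              # whatever model the instance serves
)

print(llm.invoke("Summarise what an embedding is in one sentence.").content)
```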
### Possible Frameworks
- OpenLLM
- advantages:
- direct support via LangChain (might be limited though); see the sketch after this list
- multiple models per instance
- direct integration for Hugging Face models
- provides API-key auth out of the box
- disadvantages:
- Not OpenAI compatible
- cannot (I think) load local models; relies on Hugging Face
- Unsure how easy it would be to
- Doesn't quantize models (from my understanding), i.e. requires a lot of memory
- authentication might be problematic via LangChain (not sure if LangChain can set additional auth properly)
- not clear if it can actually work with e.g. Code Llama.
- Absolutely requires a GPU (at least I couldn't get it to run locally without one).
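For reference, a rough sketch of what the LangChain integration mentioned above looks like; the exact class/parameter names and the `openllm start` command depend on the OpenLLM and LangChain versions installed, so treat them as assumptions:

```python
# Rough sketch of the LangChain-side usage against a running OpenLLM server.
# Assumes the server was started separately on its default port 3000, e.g.
#   openllm start facebook/opt-1.3b   (newer versions use "openllm serve")
from langchain_community.llms import OpenLLM

llm = OpenLLM(server_url="http://localhost:3000")  # wrapper around the HTTP API
print(llm.invoke("What is a tokenizer?"))
```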
- privateGPT
- advantages:
- OpenAI-compatible API (see the sketch after this list).
- easy to set up
- Has tools that provide document ingestion (data supplemented searches)
- disadvantages:
- no support for parameters (yet)
- Document support is based solely on local access (i.e. documents are not retrieved from a secondary source but ingested directly into the server)
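As a quick illustration of the OpenAI-compatible API mentioned above, the plain openai client (or LangChain's ChatOpenAI, as in the requirements sketch) can be pointed at a local privateGPT instance. The port and model name below are assumptions about a default local setup:

```python
# Sketch: querying a local privateGPT instance through its OpenAI-compatible
# API. Port 8001 and the model name are assumptions about the default setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="not-needed-locally")

resp = client.chat.completions.create(
    model="private-gpt",
    messages=[{"role": "user", "content": "Give a one-line summary of the ingested documents."}],
)
print(resp.choices[0].message.content)
```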
- llama-cpp-python[server]:
- advantages:
- Very simple and minimal
- Allows parameter usage
- relatively direct access to the interface (not a lot of stuff around it)
- disadvantages:
- needs an auth mechanism via middleware (see the sketch below)
- restricted to llama.cpp-compatible models.
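Since llama-cpp-python's bundled server does not provide authentication itself, the "middleware" mentioned above would have to sit in front of it. A hypothetical sketch of such a token-checking proxy using FastAPI and httpx; the token store, header handling and upstream URL are placeholders, not an existing component:

```python
# Hypothetical sketch of the "auth via middleware" idea: a tiny proxy that
# checks a static API token before forwarding requests to the llama-cpp-python
# server (started e.g. with: python -m llama_cpp.server --model <model>.gguf).
# Run with: uvicorn proxy:app --port 9000
import httpx
from fastapi import FastAPI, HTTPException, Request
from fastapi.responses import Response

UPSTREAM = "http://localhost:8000"            # llama_cpp.server default port
VALID_TOKENS = {"example-token-for-user-a"}   # placeholder; use a real token store

app = FastAPI()

@app.api_route("/{path:path}", methods=["GET", "POST"])
async def proxy(path: str, request: Request) -> Response:
    # Expect "Authorization: Bearer <token>" and reject anything unknown.
    token = request.headers.get("Authorization", "").removeprefix("Bearer ").strip()
    if token not in VALID_TOKENS:
        raise HTTPException(status_code=401, detail="invalid or missing API token")
    # Forward the request body unchanged to the llama.cpp server.
    async with httpx.AsyncClient(timeout=None) as client:
        upstream = await client.request(
            request.method,
            f"{UPSTREAM}/{path}",
            content=await request.body(),
            headers={"Content-Type": request.headers.get("Content-Type", "application/json")},
        )
    return Response(
        content=upstream.content,
        status_code=upstream.status_code,
        media_type=upstream.headers.get("Content-Type"),
    )
```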