## LLM deployment: end-point availability

### Possible Resources

- Pilot on CS Kubernetes
- Possibly explicitly allocated resources, either in the CS cluster or elsewhere

### Requirements

- LangChain compatible, for ease of use by researchers; optimally also OpenAI compatible (see the LangChain sketch at the end of this section)
- Resource management (CPU/GPU/memory); depending on the model, up to 60 GB per instance
- Authentication (API tokens, or JWT if accessed e.g. via gpt.aalto.fi)
- Usage monitoring (tokens per user/key etc.), in case usage gets high enough that we need some compensation mechanism

### Possible Frameworks

- OpenLLM
  - Advantages:
    - Direct support via LangChain (might be limited, though)
    - Multiple models per instance
    - Direct integration of Hugging Face models
    - Provides API-key authentication out of the box
  - Disadvantages:
    - Not OpenAI compatible
    - Cannot (I think) load local models; relies on Hugging Face
    - Unsure how easy it would be to
    - Does not quantize models (from my understanding), i.e. requires a lot of memory
    - Authentication might be problematic via LangChain (not sure whether LangChain can set the additional auth properly)
    - Not clear whether it can actually work with e.g. Code Llama
    - Absolutely requires a GPU (at least I couldn't get it to run locally without one)
- privateGPT
  - Advantages:
    - OpenAI-compatible API
    - Easy to set up
    - Has tools for document ingestion (data-supplemented searches)
  - Disadvantages:
    - No support for parameters (yet)
    - Document support is based solely on local access (i.e. documents are not retrieved from a secondary source but are integrated directly into the server)
- llama-cpp-python[server]
  - Advantages:
    - Very simple and minimal
    - Allows parameter usage
    - Relatively direct access to the interface (not a lot of extra machinery around it)
  - Disadvantages:
    - Needs an auth mechanism added via middleware (see the sketch at the end of this section)
    - Restricted to llama.cpp-compatible models
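The LangChain requirement is easiest to satisfy if the service speaks the OpenAI chat-completions protocol. Below is a minimal sketch of what the researcher-facing usage could look like, assuming an OpenAI-compatible endpoint and the `langchain-openai` package; the URL, model name, and token are placeholders, not existing services.

```python
# Minimal sketch: pointing LangChain at an OpenAI-compatible endpoint.
# The base_url, model name, and token are placeholders (no such service exists yet).
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://llm.cs.example.fi/v1",  # hypothetical pilot endpoint on CS Kubernetes
    api_key="RESEARCHER_API_TOKEN",           # per-user API token or JWT
    model="llama-2-13b-chat",                 # whichever model the instance serves
    temperature=0.2,
)

print(llm.invoke("Summarise the cluster usage policy in one sentence.").content)
```

Both privateGPT and llama-cpp-python[server] expose this protocol, so the same client code should work against either; OpenLLM would instead need its LangChain-specific integration.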
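For llama-cpp-python[server], the missing authentication would have to be added in front of the server, e.g. via a reverse proxy or ASGI middleware. Below is a minimal sketch of a bearer-token check as Starlette middleware; the token store and the stand-in FastAPI app are placeholders, and how the real llama.cpp server app is constructed depends on the installed llama-cpp-python version.

```python
# Sketch: bearer-token authentication as ASGI middleware in front of an
# OpenAI-compatible app. A stand-in FastAPI app is used here because the exact
# way llama-cpp-python constructs its server app differs between versions.
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from starlette.middleware.base import BaseHTTPMiddleware

VALID_TOKENS = {"example-research-group-token"}  # placeholder; load from a secrets store

class BearerAuthMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        token = request.headers.get("Authorization", "").removeprefix("Bearer ").strip()
        if token not in VALID_TOKENS:
            return JSONResponse({"detail": "invalid or missing API token"}, status_code=401)
        return await call_next(request)

app = FastAPI()  # stand-in for the llama.cpp server app
app.add_middleware(BearerAuthMiddleware)

@app.get("/v1/models")  # the real server already provides the OpenAI-style routes
def list_models():
    return {"data": [{"id": "stand-in-model"}]}
```

The same token check would also be a natural place to hook in per-key usage monitoring, should that become necessary.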