# Limitation of llama.cpp server (29-Nov)

### Drogon feature

- Blocking IO: llama.cpp currently uses https://github.com/yhirose/cpp-httplib, which serves requests with blocking IO, whereas Drogon is non-blocking
  ![image](https://hackmd.io/_uploads/Hkd4uzNB6.png)
- Static threading model giving total control over OS threads: https://github.com/drogonframework/drogon/wiki/ENG-FAQ-1-Understanding-drogon-threading-model
- Possibility to extend use cases based on the built-in DB engine
  ![image](https://hackmd.io/_uploads/H1t80zErT.png)

### Model feature

- llama.cpp server does not support in-flight model load and unload

# What an upstream merge from nitro can add

### What we can add in a "llama drogon" PR

- Non-blocking IO
- A static thread-control model (a fixed thread pool will significantly improve webserver performance)
- A bigger framework and built-in toolset (Drogon) to extend the native C++ server capabilities

A minimal sketch of these three points follows the lists below.

### What we might consider

- Model management:
  - Model warmup
  - Model loading
  - Model unloading
  - Model list (not yet)
  - Model folder (not yet)

# How it works in the overall landscape

The contribution model will look like the chart below, except the functionality will follow the exhaustive list above.

![image](https://hackmd.io/_uploads/SJtP7QVSp.png)
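# Appendix: a minimal sketch of the "llama drogon" server loop

To make the proposal concrete, here is a minimal sketch using Drogon's public API. The `/loadmodel` and `/unloadmodel` routes and the `model_loaded` flag are hypothetical illustrations of in-flight model management, not nitro's actual endpoints; a real server would call llama.cpp's model functions (e.g. `llama_load_model_from_file` / `llama_free_model`) behind them.

```cpp
// Sketch only: routes and state are illustrative, not the real nitro API.
#include <drogon/drogon.h>
#include <atomic>

// Hypothetical model state guard; a real server would hold the
// llama.cpp model handle here instead of a flag.
std::atomic<bool> model_loaded{false};

int main() {
    using namespace drogon;

    // Non-blocking handler: the response is handed back through the
    // callback, so the event-loop thread never blocks on IO.
    app().registerHandler(
        "/loadmodel",
        [](const HttpRequestPtr &req,
           std::function<void(const HttpResponsePtr &)> &&callback) {
            model_loaded = true;  // placeholder for the real load call
            auto resp = HttpResponse::newHttpResponse();
            resp->setBody("model loaded");
            callback(resp);
        },
        {Post});

    app().registerHandler(
        "/unloadmodel",
        [](const HttpRequestPtr &req,
           std::function<void(const HttpResponsePtr &)> &&callback) {
            model_loaded = false;  // placeholder for the real unload call
            auto resp = HttpResponse::newHttpResponse();
            resp->setBody("model unloaded");
            callback(resp);
        },
        {Post});

    // Static threading model: a fixed pool of event-loop threads,
    // giving the server full control over OS threads.
    app().setThreadNum(8)
         .addListener("0.0.0.0", 8080)
         .run();
    return 0;
}
```

With `setThreadNum(8)`, Drogon runs a fixed pool of eight event-loop threads, which is the static threading model referenced above; the thread count is the knob a "llama drogon" PR would expose rather than spawning a thread per connection as a blocking-IO server does.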