# Import Models: A Research Scratchpad

---

## Appendix

This document is a more granular spec on how users can import **uncatalogued** models from various sources.

## Questions

- How do we catalog the models from HuggingFace? See the Catalog Options below.

### Catalog Option 1

**Dynamically scrape HF at runtime**, i.e. the user pastes in a URL path like `https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF`, and we scrape the contents of the page to generate a model card.

- Pro: Not as tedious as (2); users see the newest models
- Con: Slower to load results; the user needs to provide a complete & valid URL; we can't suggest all models under TheBloke

### Catalog Option 2

**Mirror HuggingFace and pre-index all models** (e.g. LM Studio)

- Pro: Fast to load results. Users can search for partial terms rather than a full URL, e.g. typing `TheBloke` will suggest all TheBloke models. Builds moat.
- Con: Have to build a daily/hourly HF scraper and overcome HF IP throttling
- See further thoughts: https://hackmd.io/4R2aN42GR-GlT516_uixOQ

### Catalog Option 3

**Curate a Library of recommended, popular models** (e.g. Ollama)

- Pro: Curated, with good tags/descriptions. Builds moat.
- Con: Users might not see the newest models; tedious to maintain

## UX Principles

- Users should be able to import and use a model in fewer than 3 clicks, in under a minute.
- Don't gatekeep new models. Users shouldn't depend on us to "support" new models from TheBloke, for example.
- Maintain a handful of `Recommended Models`: popular open-source models with optimal parameters already configured.

## UX: Model import flows (assuming Option 1)

### 1. User downloads a model via GUI (using URL)

1. User makes a POST request: `POST /models` with parameter `$USER_SOURCE_URL` (see [valid URL paths](#SOURCE_URL))
2. Model Importer infers the Model Format based on `SOURCE_URL`. If it is a custom format, handle it accordingly (see the sketch after the SOURCE_URL list below).
3. User sees a `model card` based on the Model Format (designs pending), i.e. if GGUF, then render the various quantizations
4. User chooses a model/variation to download

**Initial model file**

- It doesn't exist

**Final model file**

```json
{
  "source_url": $USER_SOURCE_URL,
  "state": "ready",
  // Q: How to express the place where the model binary was actually saved?
  // "downloaded_path": "/models/$FOLDERNAME"
  // Alt names: binary, model/binary/file_location
  // Alternate:
  // "entry_point": "./models/$FOLDERNAME/$BINARYFILE", // In case of multiple model binaries
  "metadata": {
    "format": "gguf",
    // For custom supported formats
    "custom_format_tags_here": "tba"
  }
}
```

### 2. User downloads a Recommended Model via GUI (via model.json)

1. Jan ships with a few Recommended GGUF models with deferred download
2. They render as Recommended Model Cards in the UI, and users have the choice of actually downloading the models

**Initial model file**

```json
{
  "state": null,
  "parameters": {...fully_defined}
}
```

**Final model file**

```json
{
  "state": "ready",
  ...
}
```

### 3. (KIV) User imports a model from the local filesystem (using model.json)

- User drags and drops a complete model package (with model.json and binaries) into `/models`
- Handle this later.

### SOURCE_URL

Valid URL paths we handle:

1. `Huggingface/$ORG_NAME/$MODEL_NAME*`
   - https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF
   - https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/tree/main
   - https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/blob/main/llama-2-7b-chat.Q3_K_M.gguf
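As a rough illustration of steps 1–2 of flow 1 and of Catalog Option 1, here is a minimal TypeScript sketch that parses the three SOURCE_URL shapes above, naively infers the model format, and lists a repo's GGUF quantizations via HuggingFace's public model-info endpoint (`GET https://huggingface.co/api/models/{org}/{repo}`, which returns a `siblings` file listing). All function and type names are hypothetical, not an existing Jan API.

```typescript
// Sketch only: parseSourceUrl / inferFormat / listGgufVariants are
// hypothetical names, not Jan's actual implementation.

interface ParsedSourceUrl {
  org: string;
  repo: string;
  file?: string; // present only for .../blob/main/<file> URLs
}

// Accepts the three SOURCE_URL shapes listed above.
function parseSourceUrl(sourceUrl: string): ParsedSourceUrl {
  const url = new URL(sourceUrl);
  if (url.hostname !== "huggingface.co") {
    throw new Error(`Unsupported source: ${sourceUrl}`);
  }
  const [org, repo, kind, _ref, ...rest] = url.pathname
    .split("/")
    .filter(Boolean);
  if (!org || !repo) throw new Error(`Incomplete URL: ${sourceUrl}`);
  return { org, repo, file: kind === "blob" ? rest.join("/") : undefined };
}

// Naive format inference from the repo name or a specific file, per step 2
// of flow 1. Real detection would likely inspect file extensions/contents.
function inferFormat(parsed: ParsedSourceUrl): "gguf" | "unknown" {
  const hint = (parsed.file ?? parsed.repo).toLowerCase();
  return hint.includes("gguf") ? "gguf" : "unknown";
}

// Catalog Option 1: hit HF at request time to build the model card,
// e.g. enumerate the GGUF quantizations available in the repo.
async function listGgufVariants(parsed: ParsedSourceUrl): Promise<string[]> {
  const res = await fetch(
    `https://huggingface.co/api/models/${parsed.org}/${parsed.repo}`
  );
  if (!res.ok) throw new Error(`HF lookup failed: ${res.status}`);
  const info = (await res.json()) as { siblings?: { rfilename: string }[] };
  return (info.siblings ?? [])
    .map((s) => s.rfilename)
    .filter((name) => name.endsWith(".gguf"));
}
```

Under this approach, the model card for a GGUF repo could be rendered directly from `listGgufVariants()`, at the cost of one network round-trip per lookup (the "slower to load results" con noted under Catalog Option 1).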
## Supported Model Formats

The following model formats have custom import logic:

- [GGUF](#GGUF)
- (KIV) AWQ
- (KIV) GPTQ
- (KIV) PyTorch
- (KIV) Safetensors
- (KIV) TensorRT

## Benchmarking

### Ollama

- Import UX: Ollama maintains a Library of supported models.
- Users just do `ollama run mistral:variant` to use a model out of the box
- Supported formats: `GGUF`, `PyTorch`, or `Safetensors`
- Shaping up to be a HF competitor, letting users upload custom models: https://github.com/jmorganca/ollama/blob/main/docs/import.md#publishing-your-model-optional--early-alpha

### LMStudio

- Most nontechnical-user-friendly UX
- They scrape HuggingFace daily and index all models and variants
- Users can search for any terms and get partial matches, e.g. `TheBloke/llama` returns a list of many `TheBloke/*llama*` models

### Faraday

- Similar to Ollama: maintains a curated Library of models.
- Thus not all/the latest models are shown

### Ooba

### SillyTavern

- Depends on you running a separate inference server
- The main, recommended way to use it is via a remote API
- Q: What is @dan-jan referring to when he says ST has a good models experience?

### KoboldCPP