---
library_name: transformers
---

[![HuggingFace](https://img.shields.io/badge/HuggingFace-Model-yellow?logo=huggingface)](https://huggingface.co/wynd/wynd-vidcap-12b) [![Docs](https://img.shields.io/badge/Docs-Read-blue?logo=readthedocs)](https://docs.inference.net/models/wynd-vidcap-12b) [![Inference.net](https://img.shields.io/badge/Inference.net-Launch-orange?logo=rocket)](https://inference.net) [![GitHub](https://img.shields.io/badge/GitHub-Repo-black?logo=github)](https://github.com/wyndlabs/wynd-vidcap-12b) [![Guide](https://img.shields.io/badge/Guide-Read-green?logo=book)](https://docs.inference.net/models/wynd-vidcap-12b) [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

## Introduction & Highlights

Welcome to **Wynd-VidCap-12B**, an **FP8-optimized, Gemma-12B–based** open-weight vision–language model for **frame-level, schema-consistent JSON captioning**. It’s designed for production pipelines that process millions to billions of frames where **strict JSON outputs** and **temporal consistency** matter.

**Highlights**

* **Permissive Apache-2.0 license:** Build and deploy commercially without copyleft risk.
* **Schema-consistent JSON:** Stable fields every frame for reliable search, filtering, and temporal analysis.
* **Production-ready prompts:** Canonical SYSTEM/USER prompts enforce “JSON-only” outputs.
* **FP8 throughput:** Tuned for **RTX 40-series/H100**; **A100 not supported** (no native FP8).
* **Teacher-labeled training:** Distilled from a high-quality teacher on **1M single-frame samples**.
* **Managed inference:** Run serverlessly on **Inference.net** with Group API & webhooks for video pipelines (see **Docs**).

> End-to-end video flow (brief): **Extract keyframes → Group submit (≤50 frames/group) → Poll or webhook → Sort by `metadata.frame_index`**. Details: <a href="https://docs.inference.net/models/wynd-vidcap-12b">docs.inference.net/models/wynd-vidcap-12b</a>.

---

## Output example

```json
{
  "description": "A wooden boardwalk extends through a field of tall green grass. The sky above is blue with scattered white clouds.",
  "objects": ["Wooden boardwalk", "Tall green grass", "Blue sky", "White clouds", "Trees"],
  "actions": [],
  "environment": "Outdoor marshland on a clear day with bright sunlight",
  "content_type": "real-world footage",
  "specific_style": "landscape photography",
  "production_quality": "professional photography",
  "summary": "A wooden boardwalk winds through a vibrant green field under a blue sky.",
  "logos": []
}
```

---

## Benchmarks

**Wynd-VidCap-12B** was trained on the Gemma-12B architecture with 1 million curated video-frame samples.

| Model     | Samples | Judge | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU  |
| --------- | ------- | ----- | ------- | ------- | ------- | ----- |
| Gemma 12B | Normal  | 3.00  | 0.490   | 0.198   | 0.299   | 0.074 |
| Gemma 12B | 100K    | 3.29  | 0.649   | 0.367   | 0.490   | 0.232 |
| Gemma 12B | 1M      | 3.53  | 0.674   | 0.404   | 0.520   | 0.267 |

* **Normal** = base model.
* **100K / 1M** = trained on 100k / 1M **teacher-labeled** frames with a fixed JSON schema prompt.
* **FP8 quantization** showed **no measurable quality loss** vs. bf16 for these metrics.

**Training notes (relevant):**

* **Input granularity:** single **video frame** per request (not multiple frames); see the extraction sketch after this list.
* **Serving knobs:** recommended `temperature=0.1`. If you hit OOM on a 4090, reduce `--max-num-seqs` and `--max-num-batched-tokens` (vLLM).
* **Stop sequences (optional):** string-based guards like `"}\n"` can be used if your serving stack supports them.
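Because the model consumes a single frame per request, a video must first be split into individual frames. Below is a minimal sketch of that step, assuming `opencv-python` is installed; uniform time-based sampling stands in for true keyframe detection, and the paths and sampling rate are illustrative.

```python
import cv2  # assumes opencv-python is installed


def extract_frames(video_path: str, out_dir: str, every_n_seconds: float = 1.0) -> list[str]:
    """Sample one frame every `every_n_seconds` and save it as a JPEG."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS metadata is missing
    step = max(1, int(round(fps * every_n_seconds)))
    paths, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            path = f"{out_dir}/frame_{index:06d}.jpg"
            cv2.imwrite(path, frame)
            paths.append(path)
        index += 1
    cap.release()
    return paths  # each file becomes one single-frame captioning request
```

Keeping the frame index in the filename makes it straightforward to reorder results by `metadata.frame_index` downstream.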
---

## Canonical prompts (use verbatim)

**SYSTEM**

```
You are an image annotation API trained to analyze video keyframes. You must follow the output rules exactly and return only valid JSON.
```

**USER**

```
You must respond with a valid JSON object matching the exact structure below.

{
  "description": "A detailed, factual account of what is visibly happening (max 4 sentences). Only mention concrete, visible elements or actions. Do not include camera/composition commentary. Do not start with 'This image shows...'; write the content directly.",
  "objects": ["object1 with relevant visual details", "object2 with relevant visual details", "..."],
  "actions": ["action1 with participants and context", "action2 with participants and context", "..."],
  "environment": "Detailed factual description of the setting based on visible cues.",
  "content_type": "One of: real-world footage, video game, animation, cartoon, CGI, VTuber, etc.",
  "specific_style": "Specific genre or platform style, e.g., news broadcast, vlog, anime, mobile gameplay.",
  "production_quality": "Visible production level: professional studio, amateur handheld, webcam recording, TV broadcast, etc.",
  "summary": "One clear sentence summarizing the visual content.",
  "logos": ["logo1 with visual description", "logo2 with visual description", "..."]
}

Rules:
- Be literal and specific. Only describe what is explicitly visible.
- No speculation, mood, or artistic analysis.
- Include the language of any visible text (e.g., 'English text').
- Maximum 10 objects and 5 actions.
- If no logos, return [] for 'logos'.
- Output ONLY the JSON object — no extra text.
```

---

## Inference examples

### Transformers

You can use **Wynd-VidCap-12B** with **Transformers**. When using image inputs, pass the **canonical SYSTEM/USER prompts** and the image alongside the chat template (see the processor docs in the repo). To get started, install the necessary dependencies:

```bash
pip install -U transformers torch pillow
```

```python
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import json, torch

model_id = "wynd/wynd-vidcap-12b"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

SYSTEM = "You are an image annotation API trained to analyze video keyframes. You must follow the output rules exactly and return only valid JSON."
USER = "<paste the full USER prompt from above>"

image = Image.open("frame.jpg").convert("RGB")
messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": [
        {"type": "text", "text": USER},
        {"type": "image"},  # the image tensor is supplied via the processor call below
    ]},
]

# Render the chat template to text, then process text and image together.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

gen = model.generate(**inputs, do_sample=True, temperature=0.1, max_new_tokens=2000)
out = processor.batch_decode(gen[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)[0]
result = json.loads(out)  # strict JSON per contract
print(json.dumps(result, indent=2))
```

Alternatively, you can run a server with **Transformers Serve** (OpenAI-compatible webserver):

```bash
transformers serve
transformers chat localhost:8000 --model-name-or-path wynd/wynd-vidcap-12b
```

Learn more about using vision chat models with **Transformers** in the Transformers documentation.
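Whichever serving path you choose, it can help to run a lightweight structural check on each parsed result before indexing it. The helper below is an illustrative sketch (the function name and checks are not part of the model repo); its key set and list limits mirror the canonical USER prompt.

```python
REQUIRED_KEYS = {
    "description", "objects", "actions", "environment", "content_type",
    "specific_style", "production_quality", "summary", "logos",
}
LIST_LIMITS = {"objects": 10, "actions": 5}  # caps from the prompt rules


def check_frame_caption(result: dict) -> list[str]:
    """Return a list of problems; an empty list means the output matches the contract."""
    problems = []
    missing = REQUIRED_KEYS - result.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    for key in ("objects", "actions", "logos"):
        if key in result and not isinstance(result[key], list):
            problems.append(f"'{key}' should be a list")
    for key, limit in LIST_LIMITS.items():
        if isinstance(result.get(key), list) and len(result[key]) > limit:
            problems.append(f"'{key}' exceeds the limit of {limit} items")
    return problems


problems = check_frame_caption(result)  # `result` from the Transformers example above
if problems:
    print("Schema check failed:", problems)
```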
### vLLM

You can use **vLLM** to spin up an OpenAI-compatible server. The following commands download the model and start the server:

```bash
pip install -U vllm

vllm serve wynd/wynd-vidcap-12b \
  --host 0.0.0.0 --port 8000 \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --tensor-parallel-size 1 \
  --trust-remote-code
```

A minimal client-side request sketch against this server is included at the end of this README.

> For production video flows (groups/webhooks/`metadata.frame_index`), see the pipeline guide: <a href="https://docs.inference.net/models/wynd-vidcap-12b">docs.inference.net/models/wynd-vidcap-12b</a>.

---

## Contact Us

If you’re looking to run this model at massive scale or want white-glove support for custom model distillation and serving, our team can help. Reach out to us at **[support@inference.net](mailto:support@inference.net)**.
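As referenced in the vLLM section above, here is a minimal client-side sketch against the OpenAI-compatible server started with `vllm serve`. It assumes the `openai` Python package, a server on `localhost:8000`, and a deployment that accepts base64-encoded `image_url` content; adjust the model name, URL, and file paths to your setup.

```python
import base64, json
from openai import OpenAI

# Point the OpenAI client at the local vLLM server (no real API key needed).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

SYSTEM = "You are an image annotation API trained to analyze video keyframes. You must follow the output rules exactly and return only valid JSON."
USER = "<paste the full USER prompt from above>"

# Encode a single frame as a base64 data URL.
with open("frame.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="wynd/wynd-vidcap-12b",
    temperature=0.1,  # recommended serving temperature
    max_tokens=2000,
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": [
            {"type": "text", "text": USER},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ]},
    ],
)

result = json.loads(response.choices[0].message.content)
print(json.dumps(result, indent=2))
```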