---
library_name: transformers
---
[Model](https://huggingface.co/wynd/wynd-vidcap-12b) · [Docs](https://docs.inference.net/models/wynd-vidcap-12b) · [Inference.net](https://inference.net) · [GitHub](https://github.com/wyndlabs/wynd-vidcap-12b) · [License: Apache-2.0](https://opensource.org/licenses/Apache-2.0)
## Introduction & Highlights
Welcome to **Wynd-VidCap-12B**, an **FP8-optimized, Gemma-12B–based** open-weight vision–language model for **frame-level, schema-consistent JSON captioning**. It’s designed for production pipelines that process millions to billions of frames where **strict JSON outputs** and **temporal consistency** matter.
**Highlights**
* **Permissive Apache-2.0 license:** Build and deploy commercially without copyleft risk.
* **Schema-consistent JSON:** Stable fields every frame for reliable search, filtering, and temporal analysis.
* **Production-ready prompts:** Canonical SYSTEM/USER prompts enforce “JSON-only” outputs.
* **FP8 throughput:** Tuned for **RTX 40-series/H100**; **A100 not supported** (no native FP8).
* **Teacher-labeled training:** Distilled from a high-quality teacher on **1M single-frame samples**.
* **Managed inference:** Run serverlessly on **Inference.net** with Group API & webhooks for video pipelines (see **Docs**).
> End-to-end video flow (brief): **Extract keyframes → Group submit (≤50 frames/group) → Poll or webhook → Sort by `metadata.frame_index`**. Details: <a href="https://docs.inference.net/models/wynd-vidcap-12b">docs.inference.net/models/wynd-vidcap-12b</a>.
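When you reassemble a video from grouped results, the only ordering guarantee comes from `metadata.frame_index`. The snippet below is a minimal sketch of that reassembly step only; the result payload shape shown here is an assumption for illustration, so check the docs above for the actual response format.

```python
# Reassemble per-frame captions that may arrive out of order.
# NOTE: this result dict shape is illustrative, not the documented payload.
results = [
    {"metadata": {"frame_index": 2}, "output": {"summary": "..."}},
    {"metadata": {"frame_index": 0}, "output": {"summary": "..."}},
    {"metadata": {"frame_index": 1}, "output": {"summary": "..."}},
]
ordered = sorted(results, key=lambda r: r["metadata"]["frame_index"])
print([r["metadata"]["frame_index"] for r in ordered])  # [0, 1, 2]
```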
---
## Output example
```json
{
  "description": "A wooden boardwalk extends through a field of tall green grass. The sky above is blue with scattered white clouds.",
  "objects": ["Wooden boardwalk", "Tall green grass", "Blue sky", "White clouds", "Trees"],
  "actions": [],
  "environment": "Outdoor marshland on a clear day with bright sunlight",
  "content_type": "real-world footage",
  "specific_style": "landscape photography",
  "production_quality": "professional photography",
  "summary": "A wooden boardwalk winds through a vibrant green field under a blue sky.",
  "logos": []
}
```
---
## Benchmarks
Wynd-VidCap-12B builds on the Gemma-12B architecture and was trained on 1 million curated video-frame samples.
| Model | Training samples | Judge score | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU |
| --------- | ---------------- | ----------- | ------- | ------- | ------- | ----- |
| Gemma 12B | Normal | 3.00 | 0.490 | 0.198 | 0.299 | 0.074 |
| Gemma 12B | 100K | 3.29 | 0.649 | 0.367 | 0.490 | 0.232 |
| Gemma 12B | 1M | 3.53 | 0.674 | 0.404 | 0.520 | 0.267 |
* **Normal** = base model.
* **100K / 1M** = trained on 100k / 1M **teacher-labeled** frames with a fixed JSON schema prompt.
* **FP8 quantization** showed **no measurable quality loss** vs. bf16 for these metrics.
**Training notes (relevant):**
* **Input granularity:** single **video frame** per request (not multiple frames).
* **Serving knobs:** recommended `temperature=0.1`. If you hit out-of-memory errors on a 4090, reduce vLLM's `--max-num-seqs` and `--max-num-batched-tokens` (see the launch sketch after this list).
* **Stop sequences (optional):** string-based guards like `"}\n"` can be used if your serving stack supports them.
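As a rough illustration of those knobs, a reduced-batching vLLM launch for a single 24 GB GPU might look like the following. The flag values are illustrative starting points, not tuned recommendations:

```bash
vllm serve wynd/wynd-vidcap-12b \
  --max-model-len 8192 \
  --max-num-seqs 8 \
  --max-num-batched-tokens 8192 \
  --trust-remote-code
```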
---
## Canonical prompts (use verbatim)
**SYSTEM**
```
You are an image annotation API trained to analyze video keyframes. You must follow the output rules exactly and return only valid JSON.
```
**USER**
```
You must respond with a valid JSON object matching the exact structure below.
{
"description": "A detailed, factual account of what is visibly happening (max 4 sentences). Only mention concrete, visible elements or actions. Do not include camera/composition commentary. Do not start with 'This image shows...'; write the content directly.",
"objects": ["object1 with relevant visual details", "object2 with relevant visual details", "..."],
"actions": ["action1 with participants and context", "action2 with participants and context", "..."],
"environment": "Detailed factual description of the setting based on visible cues.",
"content_type": "One of: real-world footage, video game, animation, cartoon, CGI, VTuber, etc.",
"specific_style": "Specific genre or platform style, e.g., news broadcast, vlog, anime, mobile gameplay.",
"production_quality": "Visible production level: professional studio, amateur handheld, webcam recording, TV broadcast, etc.",
"summary": "One clear sentence summarizing the visual content.",
"logos": ["logo1 with visual description", "logo2 with visual description", "..."]
}
Rules:
- Be literal and specific. Only describe what is explicitly visible.
- No speculation, mood, or artistic analysis.
- Include the language of any visible text (e.g., 'English text').
- Maximum 10 objects and 5 actions.
- If no logos, return [] for 'logos'.
- Output ONLY the JSON object — no extra text.
```
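If you want a guardrail before writing results downstream, a lightweight check of the fields above catches malformed responses early. This is a minimal sketch; the USER prompt itself is the contract, and this snippet is not an official schema:

```python
import json

# Expected top-level fields and their types, mirroring the USER prompt above.
REQUIRED_FIELDS = {
    "description": str, "objects": list, "actions": list, "environment": str,
    "content_type": str, "specific_style": str, "production_quality": str,
    "summary": str, "logos": list,
}

def validate_caption(raw: str) -> dict:
    data = json.loads(raw)  # raises json.JSONDecodeError if the model emitted extra text
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f"field '{field}' is missing or not a {expected_type.__name__}")
    return data
```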
---
## Inference examples
### Transformers
You can use **Wynd-VidCap-12B** with **Transformers**. For image inputs, render the **canonical SYSTEM/USER prompts** through the chat template and hand the frame to the processor alongside the rendered prompt (see the processor docs in the repo).
To get started, install the necessary dependencies:
```bash
pip install -U transformers torch pillow
```
```python
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import json
import torch

model_id = "wynd/wynd-vidcap-12b"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

SYSTEM = "You are an image annotation API trained to analyze video keyframes. You must follow the output rules exactly and return only valid JSON."
USER = "<paste the full USER prompt from above>"

image = Image.open("frame.jpg").convert("RGB")

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": [
        {"type": "text", "text": USER},
        {"type": "image"},  # placeholder; the pixels are supplied via the processor call below
    ]},
]

# Render the chat template to a prompt string, then tokenize text and image together.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

# do_sample=True makes the recommended temperature=0.1 take effect.
gen = model.generate(**inputs, do_sample=True, temperature=0.1, max_new_tokens=2000)
out = processor.batch_decode(gen[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)[0]

result = json.loads(out)  # strict JSON per contract
print(json.dumps(result, indent=2))
```
Alternatively, you can run a server with **Transformers Serve** (OpenAI-compatible webserver):
```bash
transformers serve
transformers chat localhost:8000 --model-name-or-path wynd/wynd-vidcap-12b
```
Learn more about how to use vision chat models with **Transformers** in their documentation.
### vLLM
You can use **vLLM** to spin up an OpenAI-compatible server. The following command will download the model and start the server.
```bash
pip install -U vllm
vllm serve wynd/wynd-vidcap-12b \
--host 0.0.0.0 --port 8000 \
--dtype bfloat16 \
--max-model-len 8192 \
--tensor-parallel-size 1 \
--trust-remote-code
```
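Once the server is up, you can call it with any OpenAI-compatible client. The snippet below is a minimal sketch using the `openai` Python package and a base64 data URL for the frame; the prompt placeholder and file name are assumptions.

```python
import base64, json
from openai import OpenAI

# Point the client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

SYSTEM = "You are an image annotation API trained to analyze video keyframes. You must follow the output rules exactly and return only valid JSON."
USER = "<paste the full USER prompt from above>"

with open("frame.jpg", "rb") as f:
    frame_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="wynd/wynd-vidcap-12b",
    temperature=0.1,
    max_tokens=2000,
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": [
            {"type": "text", "text": USER},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{frame_b64}"}},
        ]},
    ],
)
print(json.dumps(json.loads(resp.choices[0].message.content), indent=2))
```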
> For production video flows (groups/webhooks/`metadata.frame_index`), see the pipeline guide: <a href="https://docs.inference.net/models/wynd-vidcap-12b">docs.inference.net/models/wynd-vidcap-12b</a>.
---
## Contact Us
If you’re looking to run this model at massive scale or want white-glove support for custom model distillation and serving, our team can help. Reach out to us at **[support@inference.net](mailto:support@inference.net)**.