# Segmentation
## Segmentation Models (Non-Specific)
### Diffusion-based
- DiffSeg [CVPR2024]: https://arxiv.org/pdf/2308.12469
- OVAM [CVPR2024] (paper by Chanyoung's friend): https://openaccess.thecvf.com/content/CVPR2024/papers/Marcos-Manchon_Open-Vocabulary_Attention_Maps_with_Token_Optimization_for_Semantic_Segmentation_in_CVPR_2024_paper.pdf
- ConceptAttention [ICML2025]
- Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers
### CLIP-based
- CLIPSeg
- ZegCLIP: https://arxiv.org/pdf/2212.03588
- ViCLIP:
### DINO (Self-supervised DINO)
- TokenCut [CVPR2022]: https://openaccess.thecvf.com/content/CVPR2022/papers/Wang_Self-Supervised_Transformers_for_Unsupervised_Object_Discovery_Using_Normalized_Cut_CVPR_2022_paper.pdf
- :video_camera: SSL-VOS [WACV2023]: https://openaccess.thecvf.com/content/WACV2023/papers/Ponimatkin_A_Simple_and_Powerful_Global_Optimization_for_Unsupervised_Video_Object_WACV_2023_paper.pdf
- :video_camera: Betrayed by Attention [ECCV2024]: https://arxiv.org/pdf/2311.17893
- :video_camera: VideoCutLER [CVPR2024]: https://openaccess.thecvf.com/content/CVPR2024/papers/Wang_VideoCutLER_Surprisingly_Simple_Unsupervised_Video_Instance_Segmentation_CVPR_2024_paper.pdf
- :video_camera: Dual Prototype Attention for UVOS [CVPR2024]: https://openaccess.thecvf.com/content/CVPR2024/papers/Cho_Dual_Prototype_Attention_for_Unsupervised_Video_Object_Segmentation_CVPR_2024_paper.pdf
## Segmentation Benchmark (Dataset)
- DAVIS
- YouTube-VOS: https://openaccess.thecvf.com/content_ECCV_2018/papers/Ning_Xu_YouTube-VOS_Sequence-to-Sequence_Video_ECCV_2018_paper.pdf
- MOSE: https://arxiv.org/pdf/2302.01872
<!-- ### DAVIS 2017 (Interpretability)
> “CONCEPTATTENTION outperforms a variety of Diffusion, DINO, and CLIP ViT interpretability methods on ImageNet-Segmentation and PascalVOC (Single Class)”
|Model|Video|ViT|DAVIS|
|:--:|:--:|:--:|:--:|
|ViCLIP*||||
|Franca|o|g/14|61.8|
|DINOv2|o|g/14|63.9|
|Web-DINO|o|7B/14|57.2|
|DINOv3|o|7B/16|71.1|
|CrossAttention|x|||
|DAAM|x|||
|OVAM|x|||
|ConceptAttention|x|||
-->
---
# VSS Dataset
- VSPW
- VIPSeg Dataset
- OVIS
- YTVIS19
- YTVIS21
- Cityscapes, CamVid
# VSPW Benchmark
- mIoU metric
- mIoU-present / short / mid (accuracy by temporal range): mIoU computed separately over specific time intervals
- mVC8, mVC16: temporal consistency (label consistency across frames)
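One hedged reading of the mVC_n metric (label consistency over every window of n consecutive frames) can be sketched as below; the exact VSPW definition may differ in detail, so treat this as an illustration rather than a reference implementation:

```python
import numpy as np

def video_consistency(gt: np.ndarray, pred: np.ndarray, n: int) -> float:
    """Hedged sketch of VC_n: for each window of n consecutive frames,
    take the pixels whose ground-truth label is constant across the window
    and measure the fraction whose predicted label is also constant there.
    mVC_n averages this score over all windows.

    gt, pred: (T, H, W) integer label maps.
    """
    T = gt.shape[0]
    scores = []
    for s in range(T - n + 1):
        g = gt[s:s + n]
        p = pred[s:s + n]
        gt_stable = (g == g[0]).all(axis=0)    # GT label constant over the window
        pred_stable = (p == p[0]).all(axis=0)  # prediction constant over the window
        denom = gt_stable.sum()
        if denom > 0:
            scores.append((gt_stable & pred_stable).sum() / denom)
    return float(np.mean(scores)) if scores else 0.0

gt = np.zeros((6, 8, 8), dtype=np.int64)        # toy clip: one class everywhere
perfect = video_consistency(gt, gt.copy(), n=3)  # 1.0 for a perfect prediction
```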
**Zero-Shot Video Semantic Segmentation based on Pre-Trained Diffusion Models** [CVPR 2025]

---
# ViCLIP
## MeViS (Motion)
```
{
"video_path": "/scratch2/mu06363/cvpr2026/datasets/MeViS/Videos_49/00099fdb8d89_1.mp4",
"caption": "Two people are dancing hand in hand in a room.",
"concepts": "person, mirror, bed, shirt, pants, dress",
"camera": "static",
"object": "dancing"
},
```
### Problem
```
video → [T, H, W, 3] → ViCLIP (image encoder) → video_feature (1 × 512)
text → "A dog running" → text_feature (1 × 512)
Cosine Similarity = video feature * text feature
```
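The global scoring step above amounts to a single cosine similarity between two pooled 512-d vectors; a minimal sketch with random stand-in features (the real ones would come from ViCLIP's encoders):

```python
import numpy as np

def cosine_similarity(video_feat: np.ndarray, text_feat: np.ndarray) -> float:
    """Cosine similarity between a pooled video feature and a text feature.

    Both inputs are (512,) vectors, mirroring the 1 x 512 features above.
    """
    v = video_feat / np.linalg.norm(video_feat)
    t = text_feat / np.linalg.norm(text_feat)
    return float(v @ t)

# Toy stand-in features; real ones would come from ViCLIP's encoders.
rng = np.random.default_rng(0)
video_feature = rng.standard_normal(512)
text_feature = rng.standard_normal(512)
score = cosine_similarity(video_feature, text_feature)  # scalar in [-1, 1]
```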
```
[Frame 1] → Frame Features → Similarity → Heatmap 1
[Frame 2] → Frame Features → Similarity → Heatmap 2
...
[Frame T] → Frame Features → Similarity → Heatmap T
```
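The per-frame heatmap pipeline can be sketched like this, assuming the image encoder exposes per-frame patch tokens of shape (T, P, D); the names and shapes are illustrative, not ViCLIP's actual API:

```python
import numpy as np

def frame_heatmaps(patch_feats: np.ndarray, text_feat: np.ndarray, grid: int) -> np.ndarray:
    """patch_feats: (T, P, D) per-frame patch tokens; text_feat: (D,).

    Returns (T, grid, grid) cosine-similarity heatmaps, one per frame,
    by reshaping the P = grid * grid patch scores into a spatial grid.
    """
    p = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    sims = p @ t  # (T, P): cosine similarity of every patch to the text
    return sims.reshape(sims.shape[0], grid, grid)

# Toy shapes: 4 frames, a 14 x 14 patch grid, 512-d features.
rng = np.random.default_rng(1)
T, grid, D = 4, 14, 512
heatmaps = frame_heatmaps(rng.standard_normal((T, grid * grid, D)),
                          rng.standard_normal(D), grid)  # (4, 14, 14)
```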
### Method
- For each concept, build a sentence prompt: "A video of a <concept>"
- Extract the map for the word tagged as `object`, e.g. **dancing** in "Two people are **dancing** hand in hand in a room."
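Prompt construction from a MeViS-style record might look like the sketch below; `concept_prompts` and `object_token_index` are hypothetical helper names, and the whitespace split stands in for a real tokenizer:

```python
def concept_prompts(record: dict) -> list[str]:
    """Turn the comma-separated `concepts` field into sentence prompts
    of the form "A video of a <concept>" (template from the notes above)."""
    concepts = [c.strip() for c in record["concepts"].split(",")]
    return [f"A video of a {c}" for c in concepts]

def object_token_index(record: dict) -> int:
    """Index of the `object` word inside the caption's word list, so its
    similarity map can be pulled out. A simple whitespace split is used
    here; ViCLIP's actual tokenizer would be needed in practice."""
    words = [w.strip(".,").lower() for w in record["caption"].split()]
    return words.index(record["object"].lower())

# Fields taken from the MeViS record shown above.
record = {
    "caption": "Two people are dancing hand in hand in a room.",
    "concepts": "person, mirror, bed, shirt, pants, dress",
    "object": "dancing",
}
```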
---