# Segmentation

## Segmentation Model (Non-Specific)

### Diffusion-based
- DiffSeg [CVPR2024]: https://arxiv.org/pdf/2308.12469
- OVAM [CVPR2024] (paper by a friend of Chanyoung): https://openaccess.thecvf.com/content/CVPR2024/papers/Marcos-Manchon_Open-Vocabulary_Attention_Maps_with_Token_Optimization_for_Semantic_Segmentation_in_CVPR_2024_paper.pdf
- ConceptAttention [ICML2025]
- Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers

### CLIP-based
- CLIPSeg
- ZegCLIP: https://arxiv.org/pdf/2212.03588
- ViCLIP

### DINO (Self-supervised DINO)
- TokenCut [CVPR2022]: https://openaccess.thecvf.com/content/CVPR2022/papers/Wang_Self-Supervised_Transformers_for_Unsupervised_Object_Discovery_Using_Normalized_Cut_CVPR_2022_paper.pdf
- :video_camera: SSL-VOS [WACV2023]: https://openaccess.thecvf.com/content/WACV2023/papers/Ponimatkin_A_Simple_and_Powerful_Global_Optimization_for_Unsupervised_Video_Object_WACV_2023_paper.pdf
- :video_camera: Betrayed by Attention [ECCV2024]: https://arxiv.org/pdf/2311.17893
- :video_camera: VideoCutLER [CVPR2024]: https://openaccess.thecvf.com/content/CVPR2024/papers/Wang_VideoCutLER_Surprisingly_Simple_Unsupervised_Video_Instance_Segmentation_CVPR_2024_paper.pdf
- :video_camera: Dual Prototype Attention for UVOS [CVPR2024]: https://openaccess.thecvf.com/content/CVPR2024/papers/Cho_Dual_Prototype_Attention_for_Unsupervised_Video_Object_Segmentation_CVPR_2024_paper.pdf

## Segmentation Benchmark (Dataset)
- DAVIS
- YouTube-VOS: https://openaccess.thecvf.com/content_ECCV_2018/papers/Ning_Xu_YouTube-VOS_Sequence-to-Sequence_Video_ECCV_2018_paper.pdf
- MOSE: https://arxiv.org/pdf/2302.01872

<!-- ## Table (Davis)
![image](https://hackmd.io/_uploads/BJnBEmLJZx.png)
![image](https://hackmd.io/_uploads/rJPOV7I1Wg.png)

---

![image](https://hackmd.io/_uploads/SyDsq7UJbe.png)
-->

<!-- ### DAVIS 2017 (Interpretability)
> “CONCEPTATTENTION outperforms a variety of Diffusion, DINO, and CLIP ViT interpretability methods on ImageNet-Segmentation and PascalVOC (Single Class)”

| Model | Video | ViT | DAVIS |
|:--:|:--:|:--:|:--:|
| ViCLIP* | | | |
| Franca | o | g/14 | 61.8 |
| DINOv2 | o | g/14 | 63.9 |
| Web-DINO | o | 7B/14 | 57.2 |
| DINOv3 | o | 7B/16 | 71.1 |
| CrossAttention | x | | |
| DAAM | x | | |
| OVAM | x | | |
| ConceptAttention | x | | |
-->

---

# VSS Dataset
- VSPW
  ![image](https://hackmd.io/_uploads/rkrqaQvkbg.png)
  ![image](https://hackmd.io/_uploads/S1jhaXvyWe.png)
- VIPSeg
- OVIS
- YTVIS19
- YTVIS21
- Cityscapes, CamVid

# VSPW Benchmark
- mIoU
- mIoU-present / short / mid: mIoU computed separately per temporal interval (accuracy over different time spans)
- mVC8, mVC16: temporal consistency (label consistency across frames)

**Zero-Shot Video Semantic Segmentation based on Pre-Trained Diffusion Models** [CVPR 2025]
![image](https://hackmd.io/_uploads/Bk6xnY_Jbx.png)

---

# ViCLIP

## MeViS (Motion)
```
{
    "video_path": "/scratch2/mu06363/cvpr2026/datasets/MeViS/Videos_49/00099fdb8d89_1.mp4",
    "caption": "Two people are dancing hand in hand in a room.",
    "concepts": "person, mirror, bed, shirt, pants, dress",
    "camera": "static",
    "object": "dancing"
},
```

### Problem
ViCLIP encodes the whole clip into a single 512-d feature, so the video-text similarity is one scalar with no spatial or temporal localization:
```
video → [T, H, W, 3] → ViCLIP (video encoder) → video_feature (1 × 512)
text → "A dog running" → text_feature (1 × 512)
cosine similarity = video_feature · text_feature
```
What we want instead is a similarity heatmap per frame:
```
[Frame 1] → Frame Features → Similarity → Heatmap 1
[Frame 2] → Frame Features → Similarity → Heatmap 2
...
[Frame T] → Frame Features → Similarity → Heatmap T
```

### Approach
- For each concept, build a sentence prompt: `"A video of a <concept>"`
- Produce a map for the span tagged `object` in the annotation, e.g. **dancing** in "Two people are **dancing** hand in hand in a room."

Sketches of the problem and of this approach follow below.
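To make the problem concrete, a minimal PyTorch sketch. The random tensors are stand-ins for ViCLIP's actual encoder outputs (the real model is not called here); the point is only the shapes: the whole clip collapses to one vector, so the similarity is a single number.

```python
import torch
import torch.nn.functional as F

# Stand-ins for ViCLIP encoder outputs (random tensors, not the real model):
# both encoders project into a shared 512-d space.
video_feature = torch.randn(1, 512)  # one vector for the WHOLE clip [T, H, W, 3]
text_feature = torch.randn(1, 512)   # embedding of "A dog running"

score = F.cosine_similarity(video_feature, text_feature)
print(score.shape)  # torch.Size([1]) -> a single scalar, nothing to localize
```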
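And a hedged sketch of the per-frame heatmap idea, assuming we can read out patch-level features per frame (e.g., the ViT token grid before pooling). Extracting those features from ViCLIP is the open part, so the input tensors below are again random stand-ins, not real encoder outputs.

```python
import torch
import torch.nn.functional as F

def concept_heatmaps(patch_feats: torch.Tensor, text_feats: torch.Tensor,
                     out_hw: tuple[int, int]) -> torch.Tensor:
    """Per-frame, per-concept cosine-similarity heatmaps.

    patch_feats: [T, Hp, Wp, D] patch embeddings per frame (assumed: the
                 ViT token grid before pooling).
    text_feats:  [C, D] one embedding per prompt "A video of a <concept>".
    Returns:     [T, C, H, W] heatmaps upsampled to the frame resolution.
    """
    p = F.normalize(patch_feats, dim=-1)       # unit-norm -> dot = cosine
    t = F.normalize(text_feats, dim=-1)
    sim = torch.einsum("thwd,cd->tchw", p, t)  # [T, C, Hp, Wp]
    return F.interpolate(sim, size=out_hw, mode="bilinear",
                         align_corners=False)

# Toy shapes: 8 frames, 16x16 patch grid, 512-d, 6 concepts from the
# MeViS "concepts" field ("person, mirror, bed, ...").
heatmaps = concept_heatmaps(torch.randn(8, 16, 16, 512),
                            torch.randn(6, 512), out_hw=(224, 224))
print(heatmaps.shape)  # torch.Size([8, 6, 224, 224])
```

The same function covers the second bullet: encode the `object` span ("dancing") as one more prompt and read off its heatmap channel.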
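Circling back to the VSPW metrics listed above: a minimal NumPy sketch of mVC_n under my reading of the definition (within each n-frame window, among pixels whose GT label is stable across the window, the fraction whose prediction matches the GT in every frame). Treat it as a paraphrase to check against the VSPW paper, not a reference implementation.

```python
import numpy as np

def mvc(gt: np.ndarray, pred: np.ndarray, n: int = 8) -> float:
    """Mean video consistency (mVC_n) for one clip.

    gt, pred: [T, H, W] integer label maps.
    """
    T = gt.shape[0]
    scores = []
    for s in range(T - n + 1):
        g, p = gt[s:s + n], pred[s:s + n]
        gt_stable = (g == g[0]).all(axis=0)  # GT label constant in the window
        pred_ok = (p == g).all(axis=0)       # prediction correct (hence stable)
        denom = gt_stable.sum()
        if denom:
            scores.append((gt_stable & pred_ok).sum() / denom)
    return float(np.mean(scores)) if scores else 0.0

# mVC8 / mVC16 as reported on VSPW, on random toy labels:
gt = np.random.randint(0, 5, size=(20, 32, 32))
print(mvc(gt, gt, n=8), mvc(gt, gt, n=16))  # perfect predictions -> 1.0 1.0
```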