Language-based Semantic Segmentation
Open Vocabulary Scene Parsing
What is it?
Model predictions are not limited to a fixed set of categories (COCO: 80 classes), and instead come from a large open dictionary (WordNet: 100,000 synsets).
Example: If the model has never seen tricycle, it should still give a plausible prediction such as vehicle.
They take each class in the ADE20K dataset and relate it with a synset (synonym set) from WordNet, ending up with 2019 unique synsets forming a DAG with entity as the common root.
A concept map is created (the leaves are the specific objects and the inner nodes are general concepts). The root is entity, since everything is an entity.
Problem Settings
Framework overview
A max-margin loss is used to learn the embedding function \(f(\cdot)\) that maps the concept space to the joint embedding space.
They argue that since label retrieval is a ranking problem, negative labels should be introduced to push the scores of positive labels to be larger than those of negative labels.
Initially, they use a max-margin loss for learning the mapping \(g(\cdot)\) from pixel feature space to the joint embedding space, but find that using a softmax in the form of a triplet loss performs better:
\[
\begin{align}
\newcommand\ddfrac[2]{\frac{\displaystyle #1}{\displaystyle #2}}
\mathcal{L}_{image}(x_{i,j}) &= -\log\bigg(\ddfrac{e^{S_{image}(f(y_{i,j}),\ g(x_{i,j}))}}{e^{S_{image}(f(y_{i,j}),\ g(x_{i,j}))} + \sum_{y'_{i,j}}{e^{S_{image}(f(y'_{i,j}),\ g(x_{i,j}))}}}\bigg) \\
\text{where } x_{i,j} &= \text{pixel features of the } (i,j)^{th} \text{ pixel}\\
y_{i,j} &= \text{label of the } (i,j)^{th} \text{ pixel}\\
y'_{i,j} &= \text{negative labels for the } (i,j)^{th} \text{ pixel}
\end{align}
\]
Their 'Image Stream' uses an adapted version of VGG-16 (to make the embeddings have a dimension equal to the word concept embeddings).
Also, in the latent space, they fix the norms of the pixel embeddings to 30 to improve numerical stability, since pixel embeddings are the most specific concepts in the joint embedding space (a small sketch of this is given below).
The 'Concept Stream' is trained first, and the trained word embeddings are used as the initialization for the image-stream training loop.
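A tiny sketch of that norm-fixing step, assuming the image stream outputs a (B, D, H, W) embedding map (names are illustrative):

```python
import torch.nn.functional as F

def fix_pixel_norm(pixel_emb, target_norm=30.0):
    """pixel_emb: (B, D, H, W) image-stream output.
    Rescale each pixel's embedding vector to have L2 norm equal to target_norm."""
    return target_norm * F.normalize(pixel_emb, p=2, dim=1)
```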
Metrics
They use standard metrics (per-pixel accuracy, mean accuracy, mIOU, weighted IOU), along with hierarchical metrics and the information content ratio.
Results/Conclusion
Supervised: They were not able to beat the baseline score of multi-class classification using the same CNN (Softmax).
Interestingly, another baseline (Conditional Softmax), which was specifically designed for hierarchical classification, also performed worse than Softmax.
Only standard metrics (accuracy, mean accuracy, mIOU, wIOU) were used to compare models.
Zero-shot: Here, however, they were able to consistently perform better than the baselines.
They also find that using the asymmetric scoring function gives a significant improvement.
Only hierarchical metrics and information content ratio were used for comparisons here.
Qualitatively, they show that in places where the model is unsure of the specific object, it correctly predicts a more general concept.
For example, in a rocking chair, the top part looks like a chair, so it classifies that correctly; but the bottom part is not like a normal chair, and since it hasn't seen that particular object, it classifies it as 'furniture', which is plausible and human-like.
They also do a 'concept search' in the embedding space to show that though baseline models can learn specific objects equally well, when more abstract terms are 'searched' for in the joint embedding space, their model is still able to detect them in images whereas baseline models aren't.
They also show that objects like chair and bench are close in the joint embedding space, so they hypothesize that by looking in the vicinity of chair they will find 'sittable' objects.
End.
CLIP (Contrastive Language-Image Pretraining)
What is it
They show that transformers trained with a predictive (caption) objective are not good at zero-shot learning. So they improve on this by employing a bag-of-words objective and then a contrastive objective, showing improvements over the purely predictive objective.
They pretrain a large scale model that can perform multiple tasks.
Contrastive Objective
In a batch of \(N\) (image, text) pairs, they take all possible pairings of images and text (\(N^2\)) and train CLIP to predict which of those possible pairings actually occurred.
They do this by maximizing the agreement (via cosine similarity) of the \(N\) correct pairs, and pushing away / reducing the agreement of the \(N^2 - N\) negative pairs.
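A hedged PyTorch-style sketch of this symmetric contrastive objective (the embeddings and the learned temperature `logit_scale` are assumed to be produced elsewhere):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    """image_features, text_features: (N, D) embeddings for the N pairs in a batch.
    logit_scale: learned temperature (a scalar tensor)."""
    # cosine similarity = dot product of L2-normalized embeddings
    img = F.normalize(image_features, dim=-1)
    txt = F.normalize(text_features, dim=-1)
    logits = logit_scale * img @ txt.t()            # (N, N) pairwise similarities
    targets = torch.arange(len(logits), device=logits.device)
    # the i-th image should match the i-th text, and vice versa
    loss_i = F.cross_entropy(logits, targets)       # over texts for each image
    loss_t = F.cross_entropy(logits.t(), targets)   # over images for each text
    return (loss_i + loss_t) / 2
```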
Framework Overview
They train CLIP from scratch on their WebImageText dataset, containing ~400 million (image, text) pairs.
Image Encoder: Because of the wide variety of architectures and designs available, they ended up choosing two architectures.
One is based on ResNet50, with modifications to the layers and with global average pooling replaced by 'attention pooling'.
The other one is based on the recent Vision Transformer (ViT). They make only minor changes to this architecture.
Text Encoder: The text encoder is taken as a transformer with some previously published modifications.
They only scale the width of this encoder, as they find that CLIP is less sensitive to the text encoder's capacity.
Training Details
ResNets: They train 5 models (ResNet50, ResNet101, and the "EfficientNet-style" RN50x4, RN50x16, RN50x64).
ViT: They train 3 models (ViT-B/32, ViT-B/16, ViT-L/14).
Zero-shot performance
They use this CLIP model (pre-trained on the WebImageText dataset) and test its zero-shot transfer ability on other CV datasets like ImageNet, aYahoo, and SUN, showing a significant improvement over Visual N-Grams.
Also, they test it against a fully supervised logistic regression trained on the features of ResNet50 and beat it on 16 out of 27 datasets. They note that CLIP performs worse on more specialized datasets like satellite images, lymph node tumors, traffic sign recognition, etc.
Further, they also compare their zero-shot results with few-shot linear probes and show that they outperform them.
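For reference, a minimal sketch of how such zero-shot classification is typically done with a CLIP-like model; the `encode_image`/`encode_text` methods, tokenizer, and prompt template are assumptions about the interface rather than exact library calls:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(model, image, class_names, tokenizer):
    """Classify one preprocessed image tensor against an arbitrary list of class names.
    `model` is assumed to expose encode_image/encode_text like CLIP does."""
    prompts = [f"a photo of a {name}" for name in class_names]
    text_emb = F.normalize(model.encode_text(tokenizer(prompts)), dim=-1)  # (C, D)
    img_emb = F.normalize(model.encode_image(image.unsqueeze(0)), dim=-1)  # (1, D)
    probs = (100.0 * img_emb @ text_emb.t()).softmax(dim=-1)               # (1, C)
    return class_names[probs.argmax().item()]
```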
Discussion
They discuss natural distribution shift: deep models appear to match or exceed human accuracy on ImageNet, while more robust/better metrics (accuracy under distribution shift) show that this is not the case.
They also use effective robustness and relative robustness, which measure improvements in accuracy under distribution shift (beyond what in-distribution accuracy predicts) and improvements in out-of-distribution accuracy, respectively. They also argue that because a zero-shot model cannot exploit the patterns of a specific dataset/distribution, zero-shot models empirically have higher effective robustness than few-shot models.
They also show that the overlap between their pre-training data and the evaluation datasets is very low (3.2% on average), and the maximum improvement in accuracy due to overlap is only 0.6%, which is in line with other large-scale pre-trained models.
Other than this, they briefly talk about the societal impact and privacy/risk implications of CLIP.
LSeg: Language-Driven Semantic Segmentation
Problem setting: Zero-shot segmentation
One-line approach: Use the text encoder of models like CLIP, and train a separate visual encoder to produce pixel embeddings close to the label embeddings in a joint embedding space.
Advantage: flexibility, i.e. being able to segment different classes within the same image given a different label set. (It can also segment with a label that is close to another label in the embedding space, e.g. given pet as a label, it classifies the dog as pet.)
Framework
They use only the text part of CLIP, discarding its image encoder and training their own image encoder architecture based on Dense Prediction Transformers (DPT).
\(F\) is calculated as the dot product of the image embeddings \(I\) and the label embeddings \(T\):
\[
\begin{align}
F_{i,j,k} &= I_{i,j}\cdot T_{k} \\
I_{i,j} &\in \mathbb{R}^{C},\ \{i,j\}\ \text{indexes the pixels}\\
T_{k} &\in \mathbb{R}^{C},\ k \in \{1..N\} \\
F_{i,j} &\in \mathbb{R}^{N}
\end{align}
\]
So, they want to maximize the dot product \(F_{i,j,k}\) for those pixels \(\{i,j\}\) where \(y_{i,j} = k\) (the ground-truth label). They do this by applying a softmax over \(k\) on \(F_{i,j,k}\) and taking a CrossEntropy loss.
For the final step, the softmaxed feature block \(F\) (equivalent to the predictions) is 'spatially regularized' using a DepthwiseBlock (depthwise conv) or a BottleneckBlock (depthwise conv augmented with max-pooling), and is upsampled to the input image's resolution using bilinear interpolation.
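A sketch of the per-pixel correlation and loss described above; tensor shapes follow the equations, the spatial-regularization block is omitted, and all names are illustrative:

```python
import torch
import torch.nn.functional as F

def lseg_loss(pixel_emb, label_emb, gt, image_size):
    """pixel_emb:  (B, C, H, W) image-encoder output I
    label_emb:  (N, C)       text-encoder output T for the N labels
    gt:         (B, H, W)    ground-truth label index per pixel (long tensor)
    image_size: (H_in, W_in) resolution to upsample predictions to."""
    # F_{i,j,k} = I_{i,j} . T_k  ->  (B, N, H, W) logits
    logits = torch.einsum('bchw,nc->bnhw', pixel_emb, label_emb)
    loss = F.cross_entropy(logits, gt)   # softmax over k + cross-entropy
    # predictions are upsampled to the input resolution with bilinear interpolation
    upsampled = F.interpolate(logits, size=image_size, mode='bilinear', align_corners=False)
    return loss, upsampled.argmax(dim=1)
```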
Training Details: They use weights pretrained on ImageNet for the ResNet and ViT image encoders, and take random initialization for DPT. They freeze the text encoder (the one from CLIP's ViT-B/32 model) while training.
They show results that are comparable with the 1-shot state of the art (HSNet) and significantly higher than previous zero-shot models on PASCAL-5i and COCO-20i. They outperform HSNet on FSS-1000.
They use different text encoders from CLIP and compare them. (The text encoder is always a simple Transformer; the difference is the image encoder it is co-trained with in the CLIP pretraining step.)
Qualitative Analysis
Related but unseen labels
They show that LSeg is able to predict objects belonging to unseen classes whose labels are close to related points in the embedding space.
They show the same behavior with hierarchical unseen labels (i.e. it is able to predict correctly when a parent category is present in the label set instead of the specific object).
Failure Cases
They mention that since LSeg is trained only on positive examples of classes (unlike CLIP, which had a contrastive objective), it can sometimes give wrong predictions. For example, in their example image, it predicts the dog as toy (when only toy and grass are provided as labels), because a dog is probably closer to a toy than to grass, both visually and semantically.
RegionCLIP
Extract regions and their text descriptions from images and use language-image training similar to CLIP on these (contrastively).
Need?
According to the authors, we cannot directly apply CLIP to image regions and have it work well, because there is a major domain shift (?) and thus performance is unsatisfactory. This is because CLIP is trained to match an image with its image-level description, and does not know about the alignment between local image regions and text descriptions of those regions.
Problems:
Solution:
Bootstrap from a pretrained language-vision model (CLIP) and fill in the missing region descriptions and then align them with proposed regions based on a metric.
Framework
They make region descriptions by filling 'object concepts' (from a concept pool) into prompts, and then use a teacher model \(\mathcal{V}_t\) (from CLIP) to score how well each region (proposed by the pretrained Region Proposal Network, RPN) aligns with each region description, assigning to each region the description that aligns with it the most.
Once these region-text pairs are generated, the new encoder can be contrastively trained on them, similar to CLIP's contrastive language-image pretraining.
They use RoIAlign to extract a region's visual features from the encoder \(\mathcal{V}\); it pools regional features from the image's feature map using interpolation.
\(\mathcal{V}\) takes its initial weights from \(\mathcal{V}_t\) for a good start in the visual-semantic space.
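A rough sketch of this pseudo-labeling step, assuming the RoIAlign-pooled teacher features and the prompted concept embeddings are already computed (names are illustrative):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def assign_region_descriptions(teacher_region_feats, concept_text_embs):
    """teacher_region_feats: (R, D) RoIAlign-pooled features of R proposed regions
                             from the teacher visual encoder V_t.
    concept_text_embs:       (K, D) embeddings of the K prompted concepts
                             (e.g. 'a photo of a kite') from the CLIP text encoder.
    Returns each region's best-matching concept index and its similarity score."""
    v = F.normalize(teacher_region_feats, dim=-1)
    l = F.normalize(concept_text_embs, dim=-1)
    sim = v @ l.t()                    # (R, K) cosine similarities
    scores, labels = sim.max(dim=-1)   # best concept per region
    return labels, scores
```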
Details
The CC3M (Conceptual Captions) dataset was used for training.
The region descriptions are made by filling the concepts from the concept pool into prompts, e.g. kite is filled into the prompt "a photo of a ...." to make the description "a photo of a kite". These are then passed through the pretrained language encoder (from CLIP) to get the semantic text embeddings.
Cosine similarity is used as the metric of how much a proposed region aligns with a region description, for the contrastive loss between region-text pairs \(L_{cntrst}\).
Losses
They use a distillation loss \(L_{dist}\) in addition to the contrastive loss, defined as:
\[ L_{dist} = \frac{1}{N} \sum_{i}{L_{KL}(q_i^t, q_i)} \]
where \(q_i^t\) is a 'soft target' \(= softmax_j(distance(v_i^t, l_j))\), \(q_i\) is computed analogously from \(v_i\), \(v_i^t\) is the region's visual features from the teacher \(\mathcal{V}_t\), and \(v_i\) is the region's visual features from \(\mathcal{V}\).
They also add the contrastive loss at the image level, \(L_{cntrst-img}\) (with negative samples being the labels of different images), like CLIP, to their final loss. So, the final loss is
\[ L = L_{cntrst} + L_{dist} + L_{cntrst-img} \]
Extensions to object detection and open-vocabulary object detection
They extend this framework to object detection by simply using the RPN to generate regions, finding which one matches the target object class the most, and outputting that as the localization/bounding box for the object. However, no work is done on segmenting the target object.
For open-vocabulary object detection, they evaluate the model on 48 base and 17 novel categories for COCO, and 866 base and 337 novel categories from LVIS (general classes are termed 'base' and specific object classes are termed 'novel').
OPEN-SET RECOGNITION: A GOOD CLOSED-SET CLASSIFIER IS ALL YOU NEED?
ICLR '22
They show that closed-set accuracy is highly correlated with open-set performance.

Performed multiple experiments using a variety of models: ViT, ResNet, EfficientNet, VGG.
ViT doesn't overfit its representation to the training classes and outperforms the other methods.
Good closed-set performance => Better OSR
To enhance closed-set performance, they leverage existing techniques from image recognition:
They also try changing the open-set scoring rule to the Maximum Logit Score (MLS).
Using MLS gives better performance in OSR, but softmax normalization is better on the combined metric (OSCR), because softmax normalization cancels the effect of the feature norm.
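A small sketch contrasting the two open-set scoring rules (maximum softmax probability vs. maximum logit score); names are illustrative:

```python
import torch

def open_set_scores(logits, rule="mls"):
    """logits: (B, C) closed-set classifier outputs.
    Returns a per-sample score; low scores suggest 'unknown' (open-set) samples."""
    if rule == "mls":
        # Maximum Logit Score: keeps the feature-norm information
        return logits.max(dim=-1).values
    # Maximum Softmax Probability: normalization cancels the effect of the norm
    return logits.softmax(dim=-1).max(dim=-1).values

# A sample is flagged as unknown if its score falls below a chosen threshold.
```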
Extract Free Dense Labels from CLIP
ECCV '22
Using CLIP features for dense prediction
They investigate using CLIP for segmentation tasks.
DeepLabv2 in conjunction with CLIP's text embeddings fails to segment novel classes.
MaskCLIP
Doesn't modify the CLIP feature space
Comparative Analysis between MaskCLIP and Our Results
Base Class Performance
Novel Class Performance
Summary: