# Notes on "[Cost-Effective REgion-based Active Learning for Semantic Segmentation](http://bmvc2018.org/contents/papers/0437.pdf)" ###### tags: `notes` `segmentation` `active-learning` BMVC '18 paper; no official code release Author: [Akshay Kulkarni](https://akshayk07.weebly.com/) ## Brief Outline This work proposes an active learning strategy for semantic segmentation by 1. using an information measure combined with an estimate of human annotation effort (inferred from a learned cost model). 2. exploiting the spatial coherency of an image. ## Introduction * Active Learning (AL) aims to query only the data for annotation which is more likely to lead to more accurate models when used for training than any other data. * This work is one of the first for active learning in general multi-class semantic segmentation. Other previous works were for bio-medical binary segmentation. * They aim to iteratively find the *minimal* set of highly *informative* data while minimizing the annotation effort, in order to achieve a desired high quality performance with minimal cost. * The proposed framework reduces the labeling effort by * utilizing spatial estimates about annotation costs inferred from a learned cost prediction CNN. * focusing on image regions promising high information content and low annotation costs in a global context. ## Methodology * Typical pool-based AL scenario: 1. A large unlabelled pool of data exists, from which a small, randomly sampled subset (*seed set*) is initially extracted and labeled by the *oracle*. 2. Model is trained on currently labeled pool. 3. Some measure of information on each individual unlabeled sample is computed. 4. A subset of pre-specified amount of elements maximizing the *acquisition* function is annotated by the oracle, and added to the labeled pool. 5. This process is repeated until labeling budget or desired performance is reached. * Acquisition functions aim to select samples maximizing not only information content but also diversity. This work focuses on the acquisition process. * They propose to design acquisition to focus on image regions inside the entire unlabeled pool and further to not only consider information during region selection, but also annotation costs. ![Visual Overview](https://i.imgur.com/JtXWnby.png) ### Training * They uniformly sample $n$ images for the seed set to be labeled fully by the oracle. They learn 2 deep CNNs * First, the semantic segmentation model based on FCN8s ([CVPR '15](https://ieeexplore.ieee.org/document/7298965)). * Second, a cost model based on this work ([CVPR '17](https://arxiv.org/abs/1611.08272)). It utilizes the semantic segmentation network's learned knowledge as prior information to estimate the clicks an annotator would have needed to execute for densely annotating an image. ### Information extraction * They consider two classical heterogenous information measures for each pixel-level location individually given the output softmax probabilities. * Define $P^{(u, v)} = P^{(u, v)}(f_\theta(x))$ as the probability class distribution at pixel $(u, v)$ from a model $f$ parameterized by $\theta$ for given image $x$. A specific class out of a set of considered classes is denoted by $c$. * The resulting *information map* contains the information content for each pixel of an image at a current acquisition step. #### Entropy ([Shannon 2001](https://dl.acm.org/doi/10.1145/584091.584093)) * Widely used information measure in AL literature. 
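Since there is no official code release, here is a minimal NumPy sketch of how the two information maps could be computed from the softmax output. The array shapes, the `eps` smoothing term, and the function names `entropy_map` / `vote_entropy_map` are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def entropy_map(probs, eps=1e-12):
    """Eq. (1): per-pixel entropy of the softmax output.

    probs: array of shape (C, H, W) holding P_c^{(u, v)} for every pixel.
    Returns an (H, W) information map.
    """
    return -np.sum(probs * np.log(probs + eps), axis=0)

def vote_entropy_map(mc_probs, eps=1e-12):
    """Eq. (2): per-pixel vote entropy of a Monte-Carlo dropout committee.

    mc_probs: array of shape (N_E, C, H, W), one softmax output per
    stochastic forward pass (dropout kept active at inference time).
    Returns an (H, W) information map.
    """
    n_classes = mc_probs.shape[1]
    votes = np.argmax(mc_probs, axis=1)  # (N_E, H, W) hard votes, i.e. D(., c)
    # Fraction of committee members voting for class c at every pixel.
    vote_frac = np.stack([(votes == c).mean(axis=0) for c in range(n_classes)])
    return -np.sum(vote_frac * np.log(vote_frac + eps), axis=0)
```

For the MC-dropout committee, `mc_probs` would be collected by running $N_E$ forward passes of the segmentation network with its dropout layers left active.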
### Cost extraction
* They approximate cost by the number of clicks necessary to annotate an image. Cityscapes ([CVPR '16](https://www.cityscapes-dataset.com/)) was, at the time of writing, essentially the only dataset that provides such information.
* This information is unknown for the unlabeled data. Thus, they train a *cost model* on the click data produced by human annotators at previous acquisition steps.
* During actual cost extraction, they perform a forward pass through the cost model for each image in the unlabeled pool to retrieve an estimate of the clicks. The result is denoted as the *cost map*.

### Region aggregation and fusion
* They leverage the varying information content and cost of regions within an unlabeled pool to query the highest-density samples for labeling by the oracle. They aim to maximize the trade-off such that maximum performance is achieved at minimum cost.
* They use a sliding-window approach to select the most informative regions in the acquired *information maps* for each image in the unlabeled pool.
* At each sliding-window location $(u, v)$, they accumulate the values in the information map under the window and store this density in a matrix denoted as the *region information map*, which has the same dimensionality as the considered image.
* Similarly, *region cost maps* are generated from the estimated *cost maps*. Then, these maps are linearly scaled w.r.t. the whole dataset, restricting the values to $[0, 1]$.
* They fuse the region information map $I$ and the region cost map $C$ using 3 different techniques:

$$
g_1 = \frac{I}{1 + C} \\
g_2 = (1 - C)\, I \\
g_3 = \alpha I + (1 - \alpha)(1 - C) \tag{3}
$$

* where $\alpha$ is a hyperparameter that sets the trade-off between both region maps.
* After fusing for all images in the current unlabeled pool, they perform non-maximum suppression (NMS) to retrieve fixed-size region candidates for each individual image and store them in a *region proposal* pool.

### Acquisition
* From the region proposal pool, they extract as many top-scoring regions as correspond, in total number of pixels, to $m$ full images; this keeps the comparison to image-based acquisition of labels fair. (A sketch of this region scoring and selection pipeline follows below.)
* Instead of a human annotator, the oracle uses the ground truth annotations of the considered training dataset. Then, they update the labeled and unlabeled pools and learn the segmentation model and the cost model from scratch.
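No reference implementation exists, so the NumPy sketch below only illustrates how the region aggregation, the fusion functions of Eq. (3), NMS over fixed-size windows, and the pixel-budget acquisition could be wired together. The window size, the greedy NMS scheme, and the names `region_map`, `fuse`, `nms_fixed_size`, and `select_regions` are assumptions for illustration.

```python
import numpy as np

def region_map(pixel_map, win_h, win_w):
    """Sliding-window accumulation of a per-pixel map via an integral image.

    Returns window sums indexed by the window's top-left corner; positions
    where the window would leave the image are set to -inf.
    """
    ii = np.pad(pixel_map, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    sums = (ii[win_h:, win_w:] - ii[:-win_h, win_w:]
            - ii[win_h:, :-win_w] + ii[:-win_h, :-win_w])
    out = np.full(pixel_map.shape, -np.inf)
    out[:sums.shape[0], :sums.shape[1]] = sums
    return out

def fuse(I, C, mode="g1", alpha=0.5):
    """Eq. (3): fuse the region information map I and region cost map C,
    both assumed linearly scaled to [0, 1] over the whole unlabeled pool."""
    if mode == "g1":
        return I / (1.0 + C)
    if mode == "g2":
        return (1.0 - C) * I
    return alpha * I + (1.0 - alpha) * (1.0 - C)

def nms_fixed_size(score_map, win_h, win_w, max_regions):
    """Greedy NMS: keep the best-scoring window, suppress every window
    position overlapping it, and repeat."""
    scores = score_map.copy()
    kept = []
    while len(kept) < max_regions and np.isfinite(scores).any():
        u, v = np.unravel_index(np.argmax(scores), scores.shape)
        kept.append((float(scores[u, v]), int(u), int(v)))
        u0, v0 = max(0, u - win_h + 1), max(0, v - win_w + 1)
        scores[u0:u + win_h, v0:v + win_w] = -np.inf
    return kept

def select_regions(proposal_pool, m, img_h, img_w, win_h, win_w):
    """Take top-scoring proposals (score, img_id, u, v) from all images until
    their total pixel count equals that of m fully labeled images."""
    budget = m * img_h * img_w
    chosen, used = [], 0
    for score, img_id, u, v in sorted(proposal_pool, reverse=True):
        if used + win_h * win_w > budget:
            break
        chosen.append((img_id, u, v))
        used += win_h * win_w
    return chosen
```

In this sketch, each image's fused map would be passed through `nms_fixed_size`, the resulting (score, image, position) triples pooled over the whole unlabeled set, and `select_regions` applied to that pool; the integral image keeps each window sum at constant cost.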
## Conclusion
* This work proposed a novel method for cost-effective active learning for semantic segmentation, tailored to fully convolutional neural networks.
* They show that combining information content and cost estimates is a suitable approach to cost-effectively build new training datasets.