# TallyQA Model
###### tags: `VQA`, `paper`
Paper link: https://arxiv.org/abs/1810.12440
There are two types of counting questions:
1. Simple: require only object detection.
2. Complex: require reasoning about relationships between objects and identifying attributes.
- VQA algorithms that treat the task as classification perform poorly on open-ended counting tasks.
- Because complex questions are rare in existing datasets, they need to be analyzed separately to determine a model's capacity for answering them.
#### Contributions of the paper:
1. Describe TallyQA, the world's largest open-ended counting dataset, designed to study both simple and complex questions.
2. Propose the Relational Counting Network (RCN).
3. Show that RCN surpasses the state of the art for open-ended counting on both the TallyQA and HowMany-QA benchmarks.
## RCN Model
- It is a modified relation network (RN) (Santoro et al. 2017)
- Uses the question to guide processing of $n$ foreground proposals $O = \{ o_1, o_2, \dots, o_n \}$ and $m$ background regions $B = \{ b_1, b_2, \dots, b_m \}$, with $o_i \in \mathbb{R}^k$ and $b_j \in \mathbb{R}^k$.
- RCN is a combination of two RN subnetworks (a code sketch follows the equations below):
$Count(O, B, Q) = h_{\gamma}(RN(O, O) \oplus RN(O, B))$
- $RN(O, O)$ is responsible for inferring the relations between the foreground regions in the context of the question $Q$:
$RN(O, O) = f_{{\phi}_1}(\sum_{i,j} g_{{\theta}_1} (o_i, o_j, s_{ij}, Q))$
here $f_{\phi_1}$ and $g_{\theta_1}$ are neural networks that each output a vector, and the vector $s_{ij}$ encodes spatial information about the $i$-th and $j$-th proposals.
- Like the original RN model, the sum is computed over all $n^2$ pairwise combinations.
- $RN(O, B)$ is responsible for inferring the relations between the background and foreground proposals:
$RN(O, B) = f_{{\phi}_2}(\sum_{i,j} g_{{\theta}_2} (o_i, b_j, s_{ij}, Q))$
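To make the two-subnetwork structure concrete, here is a minimal PyTorch sketch. Only the overall structure $Count(O, B, Q) = h_{\gamma}(RN(O, O) \oplus RN(O, B))$ follows the paper; the MLP depths, hidden sizes, and the count-classification head are my assumptions.

```python
import torch
import torch.nn as nn

class RNSubnetwork(nn.Module):
    """f_phi( sum_{i,j} g_theta(a_i, b_j, s_ij, Q) ) over all pairs."""
    def __init__(self, obj_dim, spatial_dim, q_dim, hidden_dim=256):
        super().__init__()
        pair_dim = 2 * obj_dim + spatial_dim + q_dim
        self.g = nn.Sequential(nn.Linear(pair_dim, hidden_dim), nn.ReLU(),
                               nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        self.f = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU())

    def forward(self, A, B, S, q):
        # A: (n, k), B: (m, k), S: (n, m, spatial_dim), q: (q_dim,)
        n, m = A.size(0), B.size(0)
        a = A.unsqueeze(1).expand(n, m, -1)        # repeat rows: (n, m, k)
        b = B.unsqueeze(0).expand(n, m, -1)        # repeat cols: (n, m, k)
        qq = q.expand(n, m, -1)                    # broadcast the question
        pairs = torch.cat([a, b, S, qq], dim=-1)   # (n, m, pair_dim)
        return self.f(self.g(pairs).sum(dim=(0, 1)))  # sum over all n*m pairs

class RCN(nn.Module):
    def __init__(self, obj_dim, spatial_dim, q_dim, hidden_dim=256, max_count=15):
        super().__init__()
        self.rn_oo = RNSubnetwork(obj_dim, spatial_dim, q_dim, hidden_dim)
        self.rn_ob = RNSubnetwork(obj_dim, spatial_dim, q_dim, hidden_dim)
        # h_gamma: classify the concatenated subnetwork outputs into a count
        self.h = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
                               nn.Linear(hidden_dim, max_count + 1))

    def forward(self, O, Bg, S_oo, S_ob, q):
        # Count(O, B, Q) = h_gamma( RN(O, O) (+) RN(O, B) )
        return self.h(torch.cat([self.rn_oo(O, O, S_oo, q),
                                 self.rn_ob(O, Bg, S_ob, q)], dim=-1))

# Example with 8 foreground proposals, 4 background regions (sizes illustrative):
model = RCN(obj_dim=2048, spatial_dim=16, q_dim=1024)
logits = model(torch.randn(8, 2048), torch.randn(4, 2048),
               torch.randn(8, 8, 16), torch.randn(8, 4, 16), torch.randn(1024))
```

Since the sum over pairs happens before $f_{\phi}$, each subnetwork's output is invariant to the ordering of the proposals.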
- However, RCN has two major innovations over the original RN approach.

1. RCN uses region proposals, whereas RN used raw CNN feature maps. The problems with raw feature maps:
- They worked well on CLEVR but work poorly on real-world VQA datasets, which require processing higher-resolution images.
- Computational cost: as input, the RN model used the $d^2$ elements of a $d \times d$ convolutional feature map, each tagged with its spatial coordinates. This means it computed $d^4$ pairwise relations (see the quick arithmetic after this list).
2. Explicit incorporation of the background: some questions require knowledge of the background to be answered.
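As a quick check on the cost claim in point 1, here is the pair-count arithmetic with illustrative sizes (the specific $d$ and $n$ below are my choices, not the paper's):

```python
# Pairwise-relation counts: feature-map cells vs. region proposals.
d = 14                 # side of a d x d conv feature map (illustrative)
n = 100                # number of region proposals (illustrative)
print((d * d) ** 2)    # d^4 = 38416 pairwise relations over feature-map cells
print(n ** 2)          # n^2 = 10000 pairwise relations over proposals
```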
- A note on $s_{ij}$:
- It is critical in ensuring each object is counted only once during prediction.
- It enables RCN to learn non-maximum suppression to cope with overlapping proposals.
$$s_{ij} = [l_i, l_j, \xi_{ij}, IoU_{ij}, \frac{IoU_{ij}}{A_i}, \frac{IoU_{ij}}{A_j}]$$
where,
$l_i$ and $l_j$ encode spatial information of each proposal individually.
$l_i = [\frac{x_{min}}{W}, \frac{y_{min}}{H}, \frac{x_{max}}{W}, \frac{y_{max}}{H}, \frac{x_{max} - x_{min}}{W}, \frac{y_{max} - y_{min}}{H}]$
where,
$(x_{min}, y_{min})$ and $(x_{max}, y_{max})$ represent the top-left and bottom-right corners of proposal $i$, and $W$ and $H$ are the width and height of the image respectively.
$A_i$ and $A_j$ are the areas of proposals $i$ and $j$.
$\xi_{ij}$ is the dot product between the two proposals' CNN features, modelling how visually similar they are. (In my view, this helps cope with overlapping proposals: duplicate proposals of the same object have similar features, while overlapping proposals of different objects do not.)
$IoU_{ij}$ is the intersection over union between the two proposals.
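Putting the definitions above together, here is a NumPy sketch of how $s_{ij}$ could be assembled for one pair of proposals. The box encoding and IoU follow the formulas above; whether $A_i$ and $A_j$ are normalised by the image area is not specified here, so this sketch leaves them in pixels.

```python
import numpy as np

def box_encoding(box, W, H):
    """l_i: the 6-dim normalised box encoding defined above."""
    x0, y0, x1, y1 = box
    return np.array([x0 / W, y0 / H, x1 / W, y1 / H,
                     (x1 - x0) / W, (y1 - y0) / H])

def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def spatial_code(box_i, box_j, feat_i, feat_j, W, H):
    """s_ij = [l_i, l_j, xi_ij, IoU_ij, IoU_ij/A_i, IoU_ij/A_j]."""
    l_i, l_j = box_encoding(box_i, W, H), box_encoding(box_j, W, H)
    xi = float(np.dot(feat_i, feat_j))  # visual similarity of the proposals
    iou_ij = iou(box_i, box_j)
    A_i = (box_i[2] - box_i[0]) * (box_i[3] - box_i[1])
    A_j = (box_j[2] - box_j[0]) * (box_j[3] - box_j[1])
    return np.concatenate([l_i, l_j, [xi, iou_ij, iou_ij / A_i, iou_ij / A_j]])

# Example: two overlapping boxes in a 640x480 image, random CNN features.
rng = np.random.default_rng(0)
f1, f2 = rng.standard_normal(2048), rng.standard_normal(2048)
s = spatial_code((10, 10, 50, 60), (30, 20, 70, 80), f1, f2, W=640, H=480)
print(s.shape)  # (16,): 6 + 6 + 4 components
```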
NOTE: Experiments, Conclusion and Discussion remaining