# TallyQA Model
###### tags: `VQA`, `paper`
Paper link: https://arxiv.org/abs/1810.12440
There are two types of counting questions:
1. Simple: require only object detection.
2. Complex: require reasoning about relationships between objects and identifying attributes.
- VQA algorithms that treat the task as classification perform poorly on open-ended counting tasks.
- Because complex questions are rare in existing datasets, they need to be analyzed separately to determine a model's capacity for answering them.
#### Contributions of the paper:
1. Describe TallyQA, the world's largest open-ended counting dataset, designed to study both simple and complex questions.
2. Propose the Relational Counting Network (RCN).
3. Show that RCN surpasses the state of the art for open-ended counting on both the TallyQA and HowMany-QA benchmarks.
## RCN Model
- It is a modified relation network (RN) (Santoro et al. 2017)
- Uses the question to guide processing of $n$ foreground proposals $O = \{ o_1, o_2, \dots, o_n \}$ and $m$ background regions $B = \{ b_1, b_2, \dots, b_m \}$, with $o_i \in \mathbb{R}^k$ and $b_j \in \mathbb{R}^k$.
- RCN is a combination of two RN subnetworks (a code sketch follows the equations below):
$Count(O, B, Q) = h_{\gamma}(RN(O, O) \oplus RN(O, B))$
- $RN(O, O)$ is responsible for inferring the relations between the foreground regions in the context of the question $Q$:
$RN(O, O) = f_{{\phi}_1}(\sum_{i,j} g_{{\theta}_1} (o_i, o_j, s_{ij}, Q))$
here $f_{\phi_1}$ and $g_{\theta_1}$ are neural networks that each output a vector, and the vector $s_{ij}$ encodes spatial information about the $i$-th and $j$-th proposals.
- Like the original RN model, the sum is computed over all $n^2$ pairwise combinations.
- $RN(O, B)$ is responsible for inferring the relations between the background and foreground proposals:
$RN(O, B) = f_{{\phi}_2}(\sum_{i,j} g_{{\theta}_2} (o_i, b_j, s_{ij}, Q))$
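To make the two-subnetwork structure concrete, here is a minimal PyTorch sketch. Only the overall structure $Count(O, B, Q) = h_{\gamma}(RN(O, O) \oplus RN(O, B))$ follows the paper; the MLP depths, hidden sizes, and the count-classification head are my assumptions.

```python
import torch
import torch.nn as nn

class RNSubnetwork(nn.Module):
    """f_phi( sum_{i,j} g_theta(a_i, b_j, s_ij, Q) ) over all pairs."""
    def __init__(self, obj_dim, spatial_dim, q_dim, hidden_dim=256):
        super().__init__()
        pair_dim = 2 * obj_dim + spatial_dim + q_dim
        self.g = nn.Sequential(nn.Linear(pair_dim, hidden_dim), nn.ReLU(),
                               nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        self.f = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU())

    def forward(self, A, B, S, q):
        # A: (n, k), B: (m, k), S: (n, m, spatial_dim), q: (q_dim,)
        n, m = A.size(0), B.size(0)
        a = A.unsqueeze(1).expand(n, m, -1)        # repeat rows: (n, m, k)
        b = B.unsqueeze(0).expand(n, m, -1)        # repeat cols: (n, m, k)
        qq = q.expand(n, m, -1)                    # broadcast the question
        pairs = torch.cat([a, b, S, qq], dim=-1)   # (n, m, pair_dim)
        return self.f(self.g(pairs).sum(dim=(0, 1)))  # sum over all n*m pairs

class RCN(nn.Module):
    def __init__(self, obj_dim, spatial_dim, q_dim, hidden_dim=256, max_count=15):
        super().__init__()
        self.rn_oo = RNSubnetwork(obj_dim, spatial_dim, q_dim, hidden_dim)
        self.rn_ob = RNSubnetwork(obj_dim, spatial_dim, q_dim, hidden_dim)
        # h_gamma: classify the concatenated subnetwork outputs into a count
        self.h = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
                               nn.Linear(hidden_dim, max_count + 1))

    def forward(self, O, Bg, S_oo, S_ob, q):
        # Count(O, B, Q) = h_gamma( RN(O, O) (+) RN(O, B) )
        return self.h(torch.cat([self.rn_oo(O, O, S_oo, q),
                                 self.rn_ob(O, Bg, S_ob, q)], dim=-1))

# Example with 8 foreground proposals, 4 background regions (sizes illustrative):
model = RCN(obj_dim=2048, spatial_dim=16, q_dim=1024)
logits = model(torch.randn(8, 2048), torch.randn(4, 2048),
               torch.randn(8, 8, 16), torch.randn(8, 4, 16), torch.randn(1024))
```

Since the sum over pairs happens before $f_{\phi}$, each subnetwork's output is invariant to the ordering of the proposals.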
- However, RCN has two major innovations over the original RN approach.

1. RCN uses region proposals, whereas RN used raw CNN feature maps. The problems with raw feature maps:
- They worked well on CLEVR but work poorly on real-world VQA datasets, which require processing higher-resolution images.
- Computational cost: as input, the RN model used the $d^2$ elements of a $d \times d$ convolutional feature map, each tagged with its spatial coordinates. This means it computed $d^4$ pairwise relations (see the quick arithmetic after this list).
2. Explicit incorporation of the background: some questions require knowledge of the background to be answered.
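As a quick check on the cost claim in point 1, here is the pair-count arithmetic with illustrative sizes (the specific $d$ and $n$ below are my choices, not the paper's):

```python
# Pairwise-relation counts: feature-map cells vs. region proposals.
d = 14                 # side of a d x d conv feature map (illustrative)
n = 100                # number of region proposals (illustrative)
print((d * d) ** 2)    # d^4 = 38416 pairwise relations over feature-map cells
print(n ** 2)          # n^2 = 10000 pairwise relations over proposals
```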
- A note on $s_{ij}$:
- It is critical in ensuring each object is counted only once during prediction.
- It enables RCN to learn non-maximum suppression to cope with overlapping proposals.
$$s_{ij} = [l_i, l_j, \xi_{ij}, IoU_{ij}, \frac{IoU_{ij}}{A_i}, \frac{IoU_{ij}}{A_j}]$$
where,
$l_i$ and $l_j$ encode spatial information of each proposal individually.
$l_i = [\frac{x_{min}}{W}, \frac{y_{min}}{H}, \frac{x_{max}}{W}, \frac{y_{max}}{H}, \frac{x_{max} - x_{min}}{W}, \frac{y_{max} - y_{min}}{H}]$
where,
$(x_{min}, y_{min})$ and $(x_{max}, y_{max})$ represent the top-left and bottom-right corners of proposal $i$, and $W$ and $H$ are the width and height of the image respectively.
$A_i$ and $A_j$ are the areas of proposals $i$ and $j$.
$\xi_{ij}$ is the dot product between the two proposals' CNN features, modelling how visually similar they are. (In my view, this helps cope with overlapping proposals: duplicate proposals of the same object have similar features, while overlapping proposals of different objects do not.)
$IoU_{ij}$ is the intersection over union between the two proposals.
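Putting the definitions above together, here is a NumPy sketch of how $s_{ij}$ could be assembled for one pair of proposals. The box encoding and IoU follow the formulas above; whether $A_i$ and $A_j$ are normalised by the image area is not specified here, so this sketch leaves them in pixels.

```python
import numpy as np

def box_encoding(box, W, H):
    """l_i: the 6-dim normalised box encoding defined above."""
    x0, y0, x1, y1 = box
    return np.array([x0 / W, y0 / H, x1 / W, y1 / H,
                     (x1 - x0) / W, (y1 - y0) / H])

def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def spatial_code(box_i, box_j, feat_i, feat_j, W, H):
    """s_ij = [l_i, l_j, xi_ij, IoU_ij, IoU_ij/A_i, IoU_ij/A_j]."""
    l_i, l_j = box_encoding(box_i, W, H), box_encoding(box_j, W, H)
    xi = float(np.dot(feat_i, feat_j))  # visual similarity of the proposals
    iou_ij = iou(box_i, box_j)
    A_i = (box_i[2] - box_i[0]) * (box_i[3] - box_i[1])
    A_j = (box_j[2] - box_j[0]) * (box_j[3] - box_j[1])
    return np.concatenate([l_i, l_j, [xi, iou_ij, iou_ij / A_i, iou_ij / A_j]])

# Example: two overlapping boxes in a 640x480 image, random CNN features.
rng = np.random.default_rng(0)
f1, f2 = rng.standard_normal(2048), rng.standard_normal(2048)
s = spatial_code((10, 10, 50, 60), (30, 20, 70, 80), f1, f2, W=640, H=480)
print(s.shape)  # (16,): 6 + 6 + 4 components
```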
NOTE: Experiments, Conclusion and Discussion remaining