# [Watching the News: Towards VideoQA Models that can Read](https://arxiv.org/abs/2211.05588)
The authors propose a VideoQA model that also reads the textual information embedded in videos, targeting question answering on news videos.
### Proposed Method :

- The authors propose their own NewsVideoQA dataset with 3083 videos in English from news channels around the world.
- The proposed model builds upon the SINGULARITY model.
- It contains three components: a vision encoder, a language encoder, and a multimodal encoder.
- It is pretrained on images/videos paired with their corresponding captions; the multimodal encoder applies cross-attention to collect information from the visual representations, conditioned on the text.
- Three training objectives are used:
- Vision-Text Contrastive - aligns the vision and text representations
- Masked Language Modelling - predicts masked text tokens from the surrounding visual and textual context
- Vision-Text Matching - predicts the matching score of a vision-text pair with the multimodal encoder
- The proposed method uses OCR tokens instead of captions, i.e. it is trained on image/video + OCR-token pairs (a minimal sketch of this flow follows the list)
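A minimal PyTorch sketch of this flow, where the text stream (question words plus OCR tokens) queries the visual features inside the multimodal encoder. The class name, dimensions, and single-block layout are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class MultimodalEncoderBlock(nn.Module):
    """Illustrative block: text tokens (question + OCR) attend over visual features."""
    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, text_feats, visual_feats):
        # Self-attention over the text stream (question words + OCR tokens).
        x = text_feats
        x = x + self.self_attn(self.norm1(x), self.norm1(x), self.norm1(x))[0]
        # Cross-attention: the text features act as queries over the visual
        # features, so the block collects visual information relevant to the text.
        x = x + self.cross_attn(self.norm2(x), visual_feats, visual_feats)[0]
        x = x + self.ffn(self.norm3(x))
        return x

# Toy usage: one frame -> 196 patch features, question + OCR -> 32 text tokens.
visual_feats = torch.randn(2, 196, 768)   # from the vision encoder
text_feats = torch.randn(2, 32, 768)      # from the language encoder
block = MultimodalEncoderBlock()
fused = block(text_feats, visual_feats)   # shape (2, 32, 768)
```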
**Loss:**
- The training losses are the three pretraining objectives listed above: vision-text contrastive, masked language modelling, and vision-text matching (a minimal combined-loss sketch follows).
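A minimal sketch of how the three objectives could be combined into one pretraining loss. The symmetric InfoNCE form of the contrastive term, the temperature, the `-100` ignore index for unmasked positions, and the equal loss weights are assumptions for illustration, not values from the paper:

```python
import torch
import torch.nn.functional as F

def vision_text_contrastive(v_emb, t_emb, temperature: float = 0.07):
    """Symmetric InfoNCE over a batch of pooled vision/text embeddings (assumed form)."""
    v = F.normalize(v_emb, dim=-1)
    t = F.normalize(t_emb, dim=-1)
    logits = v @ t.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def total_pretraining_loss(v_emb, t_emb, mlm_logits, mlm_labels, vtm_logits, vtm_labels):
    # Vision-Text Contrastive: align pooled vision and text representations.
    vtc = vision_text_contrastive(v_emb, t_emb)
    # Masked Language Modelling: predict masked text tokens given the text and
    # visual context (unmasked positions carry the ignore_index -100).
    mlm = F.cross_entropy(mlm_logits.view(-1, mlm_logits.size(-1)),
                          mlm_labels.view(-1), ignore_index=-100)
    # Vision-Text Matching: binary matched / not-matched score from the multimodal encoder.
    vtm = F.cross_entropy(vtm_logits, vtm_labels)
    return vtc + mlm + vtm   # equal weighting assumed for this sketch
```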
### Experiments :
- The dataset used for training is NewsVideoQA
- The evaluation metrics are accuracy and Average Normalised Levenshtein Similarity (ANLS); a sketch of ANLS follows this list
- The model is compared against several heuristic and upper-bound (UB) baselines:
- Majority answer - the most frequent answer in the train set is predicted for every test question
- Biggest OCR token - the biggest OCR token is predicted as the answer
- Vocabulary Upper Bound - the answer is picked from a vocabulary of the most common answers in the train set
- OCR substring of single frame UB - the vocabulary is restricted to the list of OCR tokens of the frame on which the question was defined
- OCR substring of all frames - the answer is a substring of the list of OCR tokens from the video's frames
- The vocabulary UB gives the best results, and the UB methods achieve much higher accuracy than the heuristics
- The model was compared with BERT-QA, SINGULARITY and M4C
- The BERT-QA and M4C models were also tested in single-frame and 2-frame settings, in addition to the 12-frame setting used for all models.
- The models achieve better accuracy with fewer frames (i.e. when given only the frame(s) from which the question was defined)
- The proposed model achieves SOTA results.
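A short sketch of the ANLS metric together with the majority-answer heuristic from the baselines above. The 0.5 similarity threshold is the usual ANLS convention but is assumed here, and all answer strings are made-up examples:

```python
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def anls(predictions, references, threshold: float = 0.5) -> float:
    """Average Normalised Levenshtein Similarity; scores below the threshold count as 0."""
    scores = []
    for pred, ref in zip(predictions, references):
        pred, ref = pred.strip().lower(), ref.strip().lower()
        if not pred and not ref:
            scores.append(1.0)
            continue
        sim = 1.0 - levenshtein(pred, ref) / max(len(pred), len(ref))
        scores.append(sim if sim >= threshold else 0.0)
    return sum(scores) / len(scores)

# "Majority answer" heuristic: answer every test question with the most
# frequent answer from the train set (all strings below are made-up examples).
train_answers = ["bbc", "cnn", "bbc", "reuters"]
test_references = ["bbc", "downing street", "cnn"]
majority = Counter(train_answers).most_common(1)[0][0]
predictions = [majority] * len(test_references)
print(anls(predictions, test_references))
```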