# [Watching the News: Towards VideoQA Models that can Read](https://arxiv.org/abs/2211.05588)

The authors propose a VideoQA model that also considers textual information (scene text) for the News VideoQA task.

### Proposed Method :

![](https://hackmd.io/_uploads/S1InLbp92.png)

- The authors propose their own NewsVideoQA dataset with 3,083 videos from English-language news channels around the world.
- The proposed model builds upon the SINGULARITY model.
- It contains three components: a vision encoder, a language encoder, and a multimodal encoder.
- It is pretrained on images/videos paired with corresponding captions, and the multimodal encoder applies cross-attention so the text representation can gather information from the visual features.
- Three training objectives are used:
    - Vision-Text Contrastive - aligns the vision and text representations (a loss sketch is given at the end of this note).
    - Masked Language Modelling - predicts masked text tokens from the surrounding text and the visual context.
    - Vision-Text Matching - predicts the matching score of a vision-text pair with the multimodal encoder.
- The proposed method adds OCR tokens instead of captions, i.e. it is trained on image/video + OCR-token pairs.
- **Loss:**
    - The losses assessed in the paper are perceptual loss and pixel difference.
    - Perceptual loss is used because pixel difference makes it difficult to reconstruct intra-subject variations of a template.

### Experiments :

- The dataset used for training is NewsVideoQA.
- The evaluation metrics are accuracy and Average Normalised Levenshtein Similarity (ANLS); a metric sketch is given at the end of this note.
- The model is compared against several heuristic and upper-bound baselines:
    - Majority answer - the most frequent answer in the train set is predicted for all questions in the test set.
    - Biggest OCR token - the largest OCR token in the video is the answer.
    - Vocabulary Upper Bound - the answer is picked from a vocabulary of the most common answers in the train set.
    - OCR substring of single frame UB - the vocabulary is restricted to the list of OCR tokens of the frame on which the question was defined.
    - OCR substring of all frames - the answer is a substring of the list of OCR tokens from all frames of the video.
- The Vocabulary Upper Bound gives the best results, and the upper-bound methods have much higher accuracy than the heuristics.
- The model was compared with BERT-QA, SINGULARITY and M4C.
- The BERT-QA and M4C models were tested in single-frame and 2-frame settings, in addition to the 12-frame setting used for all models.
- The models achieve better accuracy with fewer frames (i.e., when restricted to the frame(s) the question was defined on).
- The proposed model achieves SOTA results.
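
As referenced in the training-objectives list above, the sketch below shows a symmetric, InfoNCE-style vision-text contrastive loss of the kind used by SINGULARITY-like models. This is a minimal illustration under my own assumptions (pooled per-sample embeddings, in-batch negatives, a 0.07 temperature), not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def vision_text_contrastive_loss(vision_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style vision-text contrastive loss (illustrative).

    vision_emb, text_emb: (batch, dim) pooled embeddings from the vision and
    language encoders for matched pairs. The i-th vision embedding is the
    positive for the i-th text embedding; other in-batch pairs are negatives.
    """
    # L2-normalise so the dot product is a cosine similarity
    v = F.normalize(vision_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix scaled by the temperature
    logits = v @ t.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # vision-to-text and text-to-vision cross-entropy, averaged
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return (loss_v2t + loss_t2v) / 2
```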
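
The experiments report ANLS alongside accuracy. Below is a minimal sketch of ANLS as it is commonly defined for text-based QA benchmarks (score = 1 - normalised edit distance, zeroed below a 0.5 threshold); the exact threshold and normalisation used in the paper's evaluation may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming (single-row version)."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (a[i - 1] != b[j - 1]))      # substitution
            prev = cur
    return dp[n]

def anls(predictions, ground_truths, threshold=0.5):
    """Average Normalised Levenshtein Similarity over a set of questions.

    predictions: list of predicted answer strings.
    ground_truths: list of lists of acceptable answer strings per question.
    Per-question scores below the threshold are set to 0 (treated as wrong).
    """
    scores = []
    for pred, answers in zip(predictions, ground_truths):
        best = 0.0
        for gt in answers:
            d = levenshtein(pred.lower().strip(), gt.lower().strip())
            nl = d / max(len(pred), len(gt), 1)   # normalised edit distance
            best = max(best, 1.0 - nl)
        scores.append(best if best >= threshold else 0.0)
    return sum(scores) / max(len(scores), 1)
```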