## Multimodal classification of textbook illustrations
### Context
The MALIN research project (“MAnuels scoLaires Inclusifs”, i.e. inclusive textbooks) aims to build a computational system that automatically adapts textbooks to make them accessible to children with disabilities. Textbooks should take these children's difficulties into account and be adapted into a digital format, without changing the content of the activities or their instructional intent.
### Objective
Textbook activities and lessons are often associated with images that can play different roles. Some are necessary for understanding or solving an activity (1). In contrast, others are purely illustrative and serve no pedagogical purpose (2); these should be removed from the adaptation for children with disabilities, in order to keep the adapted interface as simple as possible. Finally, the distinction between useful and illustrative illustrations is sometimes subtler: an illustration may not be necessary for the activity but may still have an informative purpose, for example by giving clues on how to solve the exercise or by depicting a concept unknown to the pupil (3).
*Example illustrations for cases (1), (2), and (3).*
The objective of the internship is to classify images into three categories: necessary, illustrative, and informative illustrations. To this end, several research directions can be followed. The first relies on a semantic comparison between the text of an activity and its illustration, to evaluate the extent to which the semantics carried by the image are redundant with those carried by the text. Several recent cross-modal models and resources can be explored for this purpose: CLIP [1], which projects text and images into a shared representation space and can thus estimate their semantic proximity; the LAION-5B dataset [2], designed for training large-scale image-text models such as text-guided image generators; and, conversely, image-captioning models [3]. Finally, the estimation of correspondences between text and images can also serve as a starting point for textbook illustration classification [4, 5].
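As a concrete starting point, the sketch below shows how an off-the-shelf CLIP model from the Hugging Face transformers library could score the semantic proximity between an activity's text and its illustration. The file name and activity text are hypothetical placeholders, and such a redundancy score is only one possible signal for the classification, not a prescribed method.

```python
# Minimal sketch: score text-image semantic proximity with CLIP.
# The image path and activity text below are hypothetical placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("activity_illustration.png")                 # hypothetical file
text = "Count the apples in the basket and write the total."    # hypothetical activity text

inputs = processor(text=[text], images=image, return_tensors="pt",
                   padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Cosine similarity between the projected embeddings: higher values indicate
# stronger semantic overlap (redundancy) between the illustration and the text.
img_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
txt_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
similarity = (img_emb @ txt_emb.T).item()
print(f"text-image similarity: {similarity:.3f}")
```

This redundancy score could then be combined with other features (e.g. entailment or completion signals discussed below) to decide among the necessary, illustrative, and informative classes.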
However, as mentioned earlier, while the difference between necessary and illustrative images may be captured by estimating the semantic redundancy within a text-image pair, distinguishing necessary from informative illustrations is more challenging, since the information carried by the two modalities is not strictly redundant: the image aims to add extra information. In this case, two lines of research can be explored. The first relies on estimating the multimodal entailment between the activity and its illustration. Multimodal entailment has been the subject of recent research to identify the extent to which an image is necessary to understand the text associated with it, mainly in the context of fact checking [6] and visual question answering systems. The second notion that can be explored is semantic completion, introduced in [7]. Intuitively, considering paired vision and text data as two views of the same semantic information, the idea behind the semantic completion learning task is that the missing semantics of masked data can be completed by capturing information from the other modality.
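To make the semantic completion intuition more concrete, here is a minimal, illustrative PyTorch sketch (not the actual architecture of [7]): text token embeddings are partially masked and a small cross-attention layer tries to reconstruct them from image features, so the reconstruction loss reflects how much of the text's missing semantics the image can supply.

```python
# Toy "semantic completion" objective, for illustration only.
import torch
import torch.nn as nn

class SemanticCompletionHead(nn.Module):
    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.decoder = nn.Linear(dim, dim)

    def forward(self, text_tokens, image_tokens, mask):
        # mask: (batch, seq) boolean, True where a text token is masked out
        masked_text = text_tokens.masked_fill(mask.unsqueeze(-1), 0.0)
        # Try to recover the masked text semantics by attending to image features
        completed, _ = self.cross_attn(masked_text, image_tokens, image_tokens)
        reconstructed = self.decoder(completed)
        # Loss only on masked positions: a low loss suggests the image carries
        # the semantics that were removed from the text.
        return nn.functional.mse_loss(reconstructed[mask], text_tokens[mask])

# Dummy tensors standing in for encoder outputs (e.g. CLIP text/vision towers)
text_tokens = torch.randn(2, 16, 512)    # (batch, text_seq, dim)
image_tokens = torch.randn(2, 50, 512)   # (batch, image_patches, dim)
mask = torch.zeros(2, 16, dtype=torch.bool)
mask[:, ::4] = True                      # mask every fourth text token

loss = SemanticCompletionHead()(text_tokens, image_tokens, mask)
print(loss.item())
```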
### Profile and Skills Required
• Completed or in-progress Master's degree in Computer Science
• Proficiency in Python programming (especially Deep Learning libraries)
• Knowledge in Natural Language Processing or Image/Video Processing
• Autonomy and analytical skills
• Good written and spoken English proficiency
**Duration:** 4-6 months
**Supervision:** [Pr. Shin'ichi Satoh](http://www.satoh-lab.nii.ac.jp/) and [Dr. Camille Guinaudeau](https://sites.google.com/view/camille-guinaudeau/en?authuser=0)
### References
[1] Radford et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, 2021.
[2] Schuhmann et al. LAION-5B: An open large-scale dataset for training next generation image-text models. In Proceedings of the 36th Conference on Neural Information Processing Systems, 2022.
[3] Huang et al. Attention on Attention for Image Captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019.
[4] Sheng et al. Learning Hierarchical Semantic Correspondences for Cross-Modal Image-Text Retrieval. In Proceedings of the 2022 International Conference on Multimedia Retrieval, 2022.
[5] Kansal et al. Hierarchical Attention Image-Text Alignment Network for Person Re-Identification. In Proceedings of the IEEE International Conference on Multimedia & Expo Workshops, 2021.
[6] Patwa et al. Benchmarking Multi-Modal Entailment for Fact Verification. In Proceedings of De-Factify: Workshop on Multimodal Fact Checking, 2022.
[7] Ji et al. Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning. arXiv preprint arXiv:2211.13437, 2022.