# Multimodal Datasets
1. [Challenges in Representation Learning: Multi-modal Learning \| Kaggle](https://www.kaggle.com/c/challenges-in-representation-learning-multi-modal-learning/data) - Small ESP Game Dataset 2013
- 100k images; each image has a set of associated tags (each tag is a single word), with 12 tags per image in the training set.
- A separate private dataset is available
- Task: given an image and two sets of tags, choose which set of tags is true for the image (see the sketch below).
- Cons:
- In many cases, a method may predict tags that are relevant to the image but missing from the ground truth, because the tag lists are incomplete and generally cover only a subset of the relevant tags.
- The data may be biased by the collection method (a crowdsourcing game).
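To make the evaluation setup concrete, here is a minimal Python sketch of the two-alternative tag choice. The function name, signature, and the pluggable relevance scorer are illustrative assumptions, not part of the competition's actual kit or data schema.

```python
from typing import Callable, List

def choose_tag_set(
    image_path: str,
    tag_set_a: List[str],
    tag_set_b: List[str],
    tag_scorer: Callable[[str, str], float],  # hypothetical image-tag relevance model
) -> List[str]:
    """Return whichever candidate tag set the scorer deems more relevant to the image."""
    score_a = sum(tag_scorer(image_path, tag) for tag in tag_set_a) / len(tag_set_a)
    score_b = sum(tag_scorer(image_path, tag) for tag in tag_set_b) / len(tag_set_b)
    return tag_set_a if score_a >= score_b else tag_set_b
```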
2. [SVCL - Cross-Modal Multimedia Retrieval](http://www.svcl.ucsd.edu/projects/crossmodal/) 2014
- Wikipedia articles, available in full or small versions:
- Full - 2,866 multimedia documents (image + text) plus features in MATLAB format (1.4 GB)
- Small - just the feature files (MATLAB format; see the loading sketch below)
- The articles generally have multiple sections and pictures; the dataset authors split each article into sections based on section headings and assigned each image to the section in which the article's author(s) placed it.
- Each section contains a single image and at least 70 words.
- The final corpus contains 2,866 multimedia documents. The median text length is 200 words.
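Since both versions ship their features as MATLAB files, a natural first step is to inspect them with SciPy. The file path and variable keys below are placeholders rather than the names used in the actual release; check the archive's README for the real ones.

```python
from scipy.io import loadmat

# Hypothetical file name; the release ships one or more .mat files.
mat = loadmat("wikipedia_dataset/raw_features.mat")

# List the stored variables (skipping loadmat's metadata entries).
print(sorted(k for k in mat if not k.startswith("__")))

# Once the variable names are known, pull out the arrays, e.g.:
# image_feats = mat["I_tr"]   # illustrative key, check the README
# text_feats  = mat["T_tr"]   # illustrative key, check the README
```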
3. [ImageCLEF IAPR TC-12](https://www.imageclef.org/photodata) 2006
- 20,000 images taken at locations around the world; there are also images of actions and objects.
- Each image is associated with captions in up to 3 languages (English, German, and Spanish).
- Example annotation:
- Image ID: annotations/00/25.eng
- Title: Plaza de Armas
- Description: a yellow building with white columns in the background; two palm trees in front of the house; cars are parking in front of the house; a woman and a child are walking over the square;
- Notes: The Plaza de Armas is one of the most visited places in Cochabamba. The locals are very proud of the colourful buildings.
- Location: Cochabamba, Bolivia
- Date: 1 February 2002
- Originator: Michael Grubinger
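The annotations are plain-text records with SGML-style fields like the example above. A rough parsing sketch follows; the exact tag names are an assumption inferred from the example record, so verify them against the actual .eng annotation files.

```python
import re

# Fields taken from the example annotation above; tag names are assumed.
FIELDS = ["TITLE", "DESCRIPTION", "NOTES", "LOCATION", "DATE", "IMAGE"]

def parse_annotation(record_text: str) -> dict:
    """Extract the known fields from one annotation record into a dict."""
    record = {}
    for field in FIELDS:
        match = re.search(rf"<{field}>(.*?)</{field}>", record_text, flags=re.S | re.I)
        record[field.lower()] = match.group(1).strip() if match else None
    return record
```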
4. [Attribute Discovery Dataset](http://tamaraberg.com/attributesDataset/index.html) 2010
- 37,795 images in four classes: shoes, handbags, earrings, and ties
- Images were collected from over 150 web sources.
- The text associated with each image is its description, which can be unconstrained and noisy (i.e., the text might not accurately describe the image).
5. [Multimodal Document Intent Dataset](https://www.ksikka.com/document_intent.html) 2019
- Paper link: https://web.stanford.edu/~jurafsky/document_intent.pdf
- 1,299 public Instagram posts (a relatively small dataset).
- Each sample contains one image and its caption.
- Each sample has 3 kinds of labels (a record sketch follows this list):
- Post’s intent (advocative, promotive, exhibitionist, expressive, informative, entertainment, provocative)
- Contextual relationship (minimal, close, transcendent)
- Semiotic relationship (divergent, parallel, additive)
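A typed record for one sample might look like the sketch below, using the three label taxonomies listed above. The field names and validation are illustrative assumptions; the released data may use different keys.

```python
from dataclasses import dataclass

# Label vocabularies copied from the taxonomy listed above.
INTENT = {"advocative", "promotive", "exhibitionist", "expressive",
          "informative", "entertainment", "provocative"}
CONTEXTUAL = {"minimal", "close", "transcendent"}
SEMIOTIC = {"divergent", "parallel", "additive"}

@dataclass
class MDIDSample:
    image_path: str   # path to the post's image
    caption: str      # the post's caption text
    intent: str       # one of INTENT
    contextual: str   # one of CONTEXTUAL (image-caption contextual relationship)
    semiotic: str     # one of SEMIOTIC (image-caption semiotic relationship)

    def __post_init__(self) -> None:
        # Basic sanity checks that the labels come from the expected vocabularies.
        assert self.intent in INTENT
        assert self.contextual in CONTEXTUAL
        assert self.semiotic in SEMIOTIC
```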