# [The Amazing Mysteries of the Gutter: Drawing Inferences Between Panels in Comic Book Narratives](https://arxiv.org/pdf/1611.05118.pdf)
The authors introduce a new COMICS dataset with nearly 1 million panels.
They design four deep learning models and evaluate them on three novel cloze-style tasks.
### Dataset:
Panels are characterized by four intrapanel categories describing the relationship between words and pictures:
- Word-specific, 4.4%: The pictures illustrate, but do not significantly add to a largely complete text.
- Picture-specific, 2.8%: The words do little more than add a soundtrack to a visually-told sequence.
- Parallel, 0.6%: Words and pictures seem to follow very different courses without intersecting.
- Interdependent, 92.1%: Words and pictures go hand-in-hand to convey an idea that neither could convey alone.
The interpanel transitions are classified into five categories:
- Moment-to-moment, 0.4%: Almost no time passes between panels, much like adjacent frames in a video.
- Action-to-action, 34.6%: The same subjects progress through an action within the same scene.
- Subject-to-subject, 32.7%: New subjects are introduced while staying within the same scene or idea.
- Scene-to-scene, 13.8%: Significant changes in time or space between the two panels.
- Continued conversation, 17.7%: Subjects continue a conversation across panels without any other changes.
**Tasks**
*Text Cloze*:
- In this task, the model is required to predict what text out of a set of candidates belongs in a particular textbox, given both context panels (text and image) as well as the current panel image.
- The text of the final panel is blacked out, and only panels containing a single textbox are used as prediction targets.
*Visual Cloze*:
- Similar to text cloze, but the model must pick the correct image of the final panel from a set of candidate images; the text of the final panel is not provided.
*Character Coherence*:
- Given a jumbled set of texts from a panel's textboxes, the model must match each candidate text to its corresponding textbox.
- This task is restricted to panels with exactly two textboxes (see the matching sketch below).
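Since character coherence only ever involves two textboxes, the matching can be framed as picking the better of the possible assignments. A minimal, hypothetical sketch; the `score` function and representation names are assumptions, not the paper's code:

```python
import itertools

def match_textboxes(textbox_reps, candidate_reps, score):
    """Assign each jumbled candidate text to a textbox by choosing the
    permutation with the highest total compatibility score."""
    best_perm, best_total = None, float("-inf")
    for perm in itertools.permutations(range(len(candidate_reps))):
        total = sum(score(textbox_reps[i], candidate_reps[j])
                    for i, j in enumerate(perm))
        if total > best_total:
            best_perm, best_total = perm, total
    return best_perm  # best_perm[i] = index of the candidate assigned to textbox i
```

With only two textboxes per panel, this reduces to comparing the two possible orderings.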
Text cloze and visual cloze are each run in two settings, easy and hard: in the easy setting the negative candidates are drawn at random from the corpus, while in the hard setting they come from nearby pages and are therefore much harder to distinguish from the correct answer.
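A rough sketch of how negative candidates could be drawn in the two settings; the panel indexing and window size are illustrative assumptions, not the dataset's actual construction:

```python
import random

def sample_negatives(panels, target_idx, setting, n=2, window=20):
    """Draw negative candidates for a cloze example.

    easy: negatives come from anywhere in the corpus.
    hard: negatives come from panels near the target (nearby pages), so they
          share characters, style, and vocabulary with the correct answer.
    """
    if setting == "easy":
        pool = [i for i in range(len(panels)) if i != target_idx]
    else:  # "hard"
        lo, hi = max(0, target_idx - window), min(len(panels), target_idx + window)
        pool = [i for i in range(lo, hi) if i != target_idx]
    return [panels[i] for i in random.sample(pool, n)]
```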
### Proposed Method:

*Text Only:*
- The text-only model encodes each textbox and combines the encodings of textboxes within the same panel using an intrapanel LSTM.
- These panel-level representations are fed into an interpanel LSTM, and its final hidden state is taken as the context representation.
- For text cloze, the answer candidates are encoded with a word-embedding sum; for visual cloze, the 4096-d fc7 features of VGG-16 are projected to the word-embedding dimensionality with fully connected layers (see the sketch below).
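A condensed PyTorch sketch of this text-only hierarchy; the dimensions, the word-embedding-sum encoder, and the dot-product scorer are assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn

class TextOnlyEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hidden=256):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, emb_dim, mode="sum")  # word-embedding sum per textbox
        self.intra = nn.LSTM(emb_dim, hidden, batch_first=True)      # over textboxes in one panel
        self.inter = nn.LSTM(hidden, hidden, batch_first=True)       # over panels in the context

    def forward(self, panels):
        # panels: list over context panels; each panel is a list of textbox token-id tensors
        panel_reps = []
        for textboxes in panels:
            box_embs = torch.stack([self.emb(t.unsqueeze(0)).squeeze(0) for t in textboxes])
            _, (h, _) = self.intra(box_embs.unsqueeze(0))
            panel_reps.append(h[-1].squeeze(0))
        _, (h, _) = self.inter(torch.stack(panel_reps).unsqueeze(0))
        return h[-1].squeeze(0)  # context representation

def score_candidates(context_rep, candidate_reps):
    # inner product between the context and each encoded answer candidate
    return candidate_reps @ context_rep
```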
*Image Only:*
- The image-only model feeds the fc7 features of each context panel into an LSTM and uses the same scoring function as the text-only model.
- For visual cloze, both the context and answer representations are projected to 512-d with fully connected layers before scoring (see the sketch below).
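A similar sketch for the image-only path, assuming precomputed 4096-d fc7 features per panel and the 512-d projection mentioned above; layer names are hypothetical:

```python
import torch
import torch.nn as nn

class ImageOnlyEncoder(nn.Module):
    def __init__(self, fc7_dim=4096, hidden=512):
        super().__init__()
        self.inter = nn.LSTM(fc7_dim, hidden, batch_first=True)  # LSTM over context panel fc7 features
        self.ctx_proj = nn.Linear(hidden, 512)    # project the context for visual-cloze scoring
        self.cand_proj = nn.Linear(fc7_dim, 512)  # project candidate images likewise

    def forward(self, context_fc7, candidate_fc7):
        # context_fc7: (num_panels, 4096), candidate_fc7: (num_candidates, 4096)
        _, (h, _) = self.inter(context_fc7.unsqueeze(0))
        context_rep = self.ctx_proj(h[-1].squeeze(0))   # (512,)
        cands = self.cand_proj(candidate_fc7)           # (num_candidates, 512)
        return cands @ context_rep                      # one score per candidate
```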
*Image-Text:*
- The two previous models are combined by concatenating the output of the intrapanel LSTM with the fc7 features, passing the result through a fully connected layer, and feeding it into the interpanel LSTM (sketched below).
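A sketch of the fusion step, reusing panel-level text encodings from the intrapanel LSTM above; the layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class ImageTextFusion(nn.Module):
    """Concatenate each panel's text encoding with its fc7 features,
    pass through a fully connected layer, then run an interpanel LSTM."""
    def __init__(self, text_dim=256, fc7_dim=4096, hidden=512):
        super().__init__()
        self.fuse = nn.Linear(text_dim + fc7_dim, hidden)
        self.inter = nn.LSTM(hidden, hidden, batch_first=True)

    def forward(self, panel_text_reps, panel_fc7):
        # panel_text_reps: (num_panels, text_dim) from the intrapanel LSTM
        # panel_fc7:       (num_panels, 4096)     from VGG-16
        fused = torch.relu(self.fuse(torch.cat([panel_text_reps, panel_fc7], dim=-1)))
        _, (h, _) = self.inter(fused.unsqueeze(0))
        return h[-1].squeeze(0)  # multimodal context representation
```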
*NC Image-Text:*
- This "no context" variant is used for text cloze and character coherence; the model has no access to the context panels.
### Analysis:

- The models are trained on COMICS under the settings described above.
- Image-text models dominate text cloze, but for visual cloze the additional text input does not improve performance.
- The reduced performance in the hard setting shows there is room for improvement; the authors hypothesize that transfer learning methods could help.
- All models fall far behind the human baseline.