# [Towards Flexible Multi-modal Document Models](https://arxiv.org/abs/2303.18248)
The proposed network, FlexDM, solves various design tasks with a single transformer-based model.
### Proposed Method :

- Given an incomplete document X in which some fields are replaced by [MASK] tokens as context, the model predicts values for all masked fields to produce a complete document.
- The multi-modal elements are processed as an unordered set, without any positional embeddings.
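The masked-field formulation above can be sketched with a simple dict-based document. The field names and values here are illustrative placeholders, not the paper's actual schema:

```python
# A [MASK]-ed document: each element is a set of attribute fields,
# some of which are hidden and must be predicted by the model.
MASK = "[MASK]"

document = [
    {"type": "text", "position": MASK, "size": (120, 40), "color": MASK},
    {"type": MASK, "position": (10, 20), "size": MASK, "color": "#ff0000"},
]

def masked_fields(doc):
    """Return (element_index, attribute) pairs the model must predict."""
    return [(i, k) for i, el in enumerate(doc)
            for k, v in el.items() if v == MASK]
```

Because the elements are treated as an unordered set, the list order above carries no meaning to the model.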
- The architecture has three modules: an encoder, stacked transformer blocks, and a decoder.
- The encoder takes the document input X and embeds it into latent vectors ($\theta$ denotes the model parameters).

- The transformer blocks are stacked to model complex inter-element relations.

- The decoder maps the transformer output back into a document.
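The encoder-transformer-decoder flow can be sketched in a few lines of numpy. Everything here is a toy stand-in (random weights, attention-only blocks, made-up dimensions), intended only to show the shape of the computation:

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_model = 8, 16                       # toy feature / latent sizes
W_enc = rng.normal(size=(d_in, d_model))    # encoder: field features -> latent
W_dec = rng.normal(size=(d_model, d_in))    # decoder: latent -> field features

def self_attention(H):
    """Single-head self-attention over the element set (no positional info)."""
    scores = H @ H.T / np.sqrt(H.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ H

def forward(X):
    H = X @ W_enc                  # encoder
    for _ in range(2):             # stacked transformer blocks (attention only)
        H = H + self_attention(H)
    return H @ W_dec               # decoder back to field space

X = rng.normal(size=(5, d_in))     # 5 elements, treated as an unordered set
Y = forward(X)                     # reconstructed field features, shape (5, 8)
```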

- A loss is computed for each element-attribute field x<sub>i</sub><sup>k</sup> (where i indexes the element and k the attribute): softmax cross-entropy for categorical attributes and mean squared error (MSE) for numerical attributes.
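A minimal sketch of this per-field loss, assuming logits for categorical attributes and raw vectors for numerical ones (not the paper's exact implementation):

```python
import numpy as np

def field_loss(pred, target, is_categorical):
    """Softmax cross-entropy for categorical fields, MSE for numerical fields."""
    if is_categorical:
        # pred: logits over classes, target: ground-truth class index
        logits = pred - pred.max()                       # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum())
        return float(-log_probs[target])
    # pred, target: numeric vectors
    return float(np.mean((np.asarray(pred) - np.asarray(target)) ** 2))
```

The total training loss would then sum `field_loss` over all masked fields.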

- The model is pretrained on the same in-domain dataset.
- The authors employ a multi-task learning scheme: a task is randomly sampled from the target tasks, a document is sampled, and the model is trained with the masking pattern associated with that task.
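The multi-task sampling step can be sketched as follows. The task names and their masking patterns are hypothetical examples, not the paper's exact task list:

```python
import random

random.seed(0)

# Hypothetical masking patterns: each task masks a characteristic set of fields.
TASK_MASKS = {
    "element_filling": lambda el: set(el),           # mask every attribute
    "position_prediction": lambda el: {"position"},  # mask only the position
    "attribute_prediction": lambda el: {"color"},    # mask a style attribute
}

def sample_training_example(doc):
    """Sample a task, then apply its masking pattern to one element of doc."""
    task = random.choice(list(TASK_MASKS))
    i = random.randrange(len(doc))
    masked = {k: ("[MASK]" if k in TASK_MASKS[task](doc[i]) else v)
              for k, v in doc[i].items()}
    return task, i, masked
```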
### Experiments :
- The datasets used for training are Rico and Crello.
- The model is evaluated on element filling and attribute prediction tasks.
- The scoring function is based on cosine similarity between the predicted and ground-truth attribute features.
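One plausible reading of this metric is the cosine similarity between predicted and ground-truth attribute features, averaged over the evaluated fields. The aggregation below is an illustrative sketch; the paper's exact score may differ:

```python
import numpy as np

def cosine_score(pred_feats, gt_feats):
    """Mean cosine similarity between paired predicted / ground-truth features."""
    sims = []
    for p, g in zip(pred_feats, gt_feats):
        p, g = np.asarray(p, float), np.asarray(g, float)
        sims.append(p @ g / (np.linalg.norm(p) * np.linalg.norm(g)))
    return float(np.mean(sims))
```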

- The benchmark models are BERT, BART, CanvasVAE, and CVAE.
- Three setups are used for the proposed model:
  - IMP: 15% of fields are randomly masked during training.
  - EXP: all tasks are explicit and jointly trained by sampling the masks corresponding to each task.
  - EXPFT: the IMP model's weights are used for initialization, then fine-tuned in the same way as EXP.
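The difference between the IMP and EXP masking strategies can be sketched as two sampling functions. The field and task structures are hypothetical placeholders:

```python
import random

random.seed(0)

def imp_mask(fields, ratio=0.15):
    """IMP: mask a random 15% of all fields (BERT-style random masking)."""
    n = max(1, round(ratio * len(fields)))
    return set(random.sample(fields, n))

def exp_mask(fields, task_masks):
    """EXP: sample a task and mask exactly the fields that task targets."""
    task = random.choice(list(task_masks))
    return {f for f in fields if f in task_masks[task]}
```

EXPFT would then start from weights trained with `imp_mask` and continue training with `exp_mask`.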
- The EXPFT model outperforms most of the benchmarks on both datasets.