# [Towards Flexible Multi-modal Document Models](https://arxiv.org/abs/2303.18248)

The proposed network, FlexDM, solves various design tasks with a single transformer model.

### Proposed Method :

![](https://hackmd.io/_uploads/Byk88uWKh.png)

- Given an incomplete document X containing [MASK] tokens as context, the model predicts values for all fields filled with [MASK] to generate a complete document.
- The multi-modal elements are processed as an unordered set, without any positional embeddings.
- The architecture has three modules: encoder, transformer, and decoder.
- The encoder takes the document input X and embeds it into a latent vector ($\theta$ denotes the model parameters).
  ![](https://hackmd.io/_uploads/SkuPPdWFh.png)
- The transformer blocks are stacked to process complex inter-element relations.
  ![](https://hackmd.io/_uploads/H1KRw_bF2.png)
- The decoder decodes the transformer output back into a document.
  ![](https://hackmd.io/_uploads/SJcxd_-t3.png)
- A loss is computed for each element-attribute pair (x<sub>i</sub><sup>k</sup>), where i is the element and k is the attribute, using softmax cross-entropy for categorical variables and MSE for numerical variables.
  ![](https://hackmd.io/_uploads/BJkKdOWYn.png)
- The model is pre-trained on the same in-domain dataset.
- The authors employ a multi-task learning scheme: a task is randomly sampled from the target tasks, a document is sampled, and the model is then trained using the masking pattern associated with that task.

### Experiments :

- The datasets used for training are Rico and Crello.
- The model is tested on the tasks of Element Filling and Attribute Prediction.
- The scoring function uses cosine similarity, and the score is given by
  ![](https://hackmd.io/_uploads/ryDoFdbKh.png)
- The benchmark models are BERT, BART, CanvasVAE, and CVAE.
- Three setups are used for the proposed model:
    - IMP: 15% of the fields are randomly masked during training.
    - EXP: All tasks are trained explicitly and jointly by sampling the masking pattern that corresponds to each task.
    - EXPFT: The model is initialized with the IMP weights and then fine-tuned in the same way as EXP.
- The EXPFT model outperforms most of the benchmarks on the datasets.
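
To make the masking and loss ideas above concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. The function names, the dictionary keyed by `(element, attribute)`, and the exact shape of each task's mask are assumptions made for illustration. It also assumes the loss is summed over masked fields only, which these notes do not state explicitly.

```python
import math
import random


def softmax_cross_entropy(logits, target_index):
    """Softmax cross-entropy for one categorical field (log-sum-exp form)."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[target_index]


def mse(pred, target):
    """Mean squared error for one numerical field."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def random_field_mask(num_elements, attributes, rate, rng):
    """IMP-style mask: each field is masked independently with probability `rate`."""
    return {(i, k): rng.random() < rate
            for i in range(num_elements) for k in attributes}


def element_filling_mask(num_elements, attributes, element):
    """Assumed pattern for Element Filling: hide every attribute of one element."""
    return {(i, k): i == element
            for i in range(num_elements) for k in attributes}


def attribute_prediction_mask(num_elements, attributes, attribute):
    """Assumed pattern for Attribute Prediction: hide one attribute everywhere."""
    return {(i, k): k == attribute
            for i in range(num_elements) for k in attributes}


def masked_loss(predictions, targets, mask, categorical):
    """Sum the per-field loss over masked fields (assumption: unmasked fields are skipped)."""
    total = 0.0
    for field, is_masked in mask.items():
        if not is_masked:
            continue
        _, attribute = field
        if attribute in categorical:
            total += softmax_cross_entropy(predictions[field], targets[field])
        else:
            total += mse(predictions[field], targets[field])
    return total


# Toy document with two elements and two hypothetical attributes.
attributes = ["type", "position"]
mask = element_filling_mask(2, attributes, element=1)
predictions = {(1, "type"): [0.0, 0.0, 0.0, 0.0], (1, "position"): [1.0, 2.0]}
targets = {(1, "type"): 2, (1, "position"): [1.0, 4.0]}
loss = masked_loss(predictions, targets, mask, categorical={"type"})

# An IMP-style mask with the 15% rate mentioned in the setups above.
imp_mask = random_field_mask(2, attributes, rate=0.15, rng=random.Random(0))
```

In this sketch the three setups differ only in how the mask is sampled: IMP would call `random_field_mask`, EXP would pick a task and call the matching task-specific function, and EXPFT would run the first regime and then the second.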