## LipNet implementation with Keras

### Overview

Lipreading is the task of decoding text from the movement of a speaker’s mouth. LipNet is a model that maps a sequence of video frames to text using spatiotemporal convolutions (STCNNs), a recurrent network (RNN), and the connectionist temporal classification (CTC) loss, trained entirely end-to-end. LipNet was first proposed in the paper ['LipNet: End-to-End Sentence-level Lipreading'](https://arxiv.org/pdf/1611.01599.pdf) by Yannis M. Assael, Brendan Shillingford, Shimon Whiteson, and Nando de Freitas in 2016.

### Implementation

* LipNet architecture

![](https://i.imgur.com/hOmwV54.png)

My version of LipNet is implemented with reference to this GitHub repo: https://github.com/rizkiarm/LipNet. However, that model was built with TensorFlow 1.x, so the task at hand is to adapt the code to TensorFlow 2.0+; rough sketches of the ported model, loss, and decoding step are given at the end of this document.

### Dataset

Following the paper, this model uses the GRID corpus dataset (http://spandh.dcs.shef.ac.uk/gridcorpus). The dataset contains 34,000 videos: 34 speakers with 1,000 videos each, where every video is 3 seconds long and 75 frames. Each video comes with an align file that marks the timing of each word and the words being spoken in that video; a parsing sketch is included at the end as well.

### Goals

* Rebuild the architecture
* Create a working baseline model
* Make predictions with the pre-trained weights
* Build a Flask app for lipreading in real time
* Try to train the model to produce new weights
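
### TF 2 code sketches

As a starting point for the port, here is a minimal Keras sketch of the architecture in the figure above: three STCNN + pooling stages, two bidirectional GRUs, and a per-frame softmax over the characters. Layer sizes follow the paper and the rizkiarm repo; the input shape (75 frames of 50×100 RGB mouth crops) and the 28-way output (27 characters plus the CTC blank) are assumptions to verify against the preprocessing pipeline.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_lipnet(frames=75, height=50, width=100, channels=3, output_size=28):
    """Sketch of LipNet: 3 x (STCNN -> pool) -> 2 x Bi-GRU -> softmax."""
    inputs = layers.Input(shape=(frames, height, width, channels), name='video')

    # Spatiotemporal convolutions: kernels span both time and space.
    x = layers.Conv3D(32, (3, 5, 5), strides=(1, 2, 2), padding='same', activation='relu')(inputs)
    x = layers.MaxPooling3D(pool_size=(1, 2, 2), strides=(1, 2, 2))(x)
    x = layers.SpatialDropout3D(0.5)(x)

    x = layers.Conv3D(64, (3, 5, 5), strides=(1, 1, 1), padding='same', activation='relu')(x)
    x = layers.MaxPooling3D(pool_size=(1, 2, 2), strides=(1, 2, 2))(x)
    x = layers.SpatialDropout3D(0.5)(x)

    x = layers.Conv3D(96, (3, 3, 3), strides=(1, 1, 1), padding='same', activation='relu')(x)
    x = layers.MaxPooling3D(pool_size=(1, 2, 2), strides=(1, 2, 2))(x)
    x = layers.SpatialDropout3D(0.5)(x)

    # Flatten the spatial dimensions of each frame into one feature vector.
    x = layers.TimeDistributed(layers.Flatten())(x)

    # Two bidirectional GRUs model the temporal dynamics of the mouth.
    x = layers.Bidirectional(layers.GRU(256, return_sequences=True))(x)
    x = layers.Bidirectional(layers.GRU(256, return_sequences=True))(x)

    # Per-frame distribution over 27 characters + the CTC blank token.
    outputs = layers.Dense(output_size, activation='softmax', name='char_probs')(x)
    return Model(inputs, outputs, name='lipnet')
```

The model itself only emits per-frame character probabilities; the alignment with the transcript is handled entirely by the CTC loss below, which is what makes the pipeline end-to-end trainable.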
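
The TF 1.x-era repo wires CTC in through a `Lambda` layer around `K.ctc_batch_cost`; under TF 2 the same cost can be used as a plain Keras loss function. This is a sketch under the assumption that every sample uses the full 75 frames and that labels are padded to a fixed length; real training code should pass the true per-sample label lengths instead.

```python
import tensorflow as tf

def ctc_loss(y_true, y_pred):
    """CTC loss for model.compile(). Assumes fixed-length inputs and
    labels padded to a fixed length (see the caveat above)."""
    batch = tf.shape(y_true)[0]
    input_len = tf.fill([batch, 1], tf.shape(y_pred)[1])  # frames per sample
    label_len = tf.fill([batch, 1], tf.shape(y_true)[1])  # padded label length
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_len, label_len)

model = build_lipnet()
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss=ctc_loss)
```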
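
On the dataset side, each GRID align file holds one `<start> <end> <word>` triple per line, including `sil` (silence) and `sp` (short pause) markings that are not part of the spoken sentence. A small parser sketch; the timestamps are kept in GRID's own alignment units and left uninterpreted here:

```python
def load_alignments(path):
    """Parse a GRID .align file into (start, end, word) tuples,
    skipping silence/short-pause markers."""
    words = []
    with open(path) as f:
        for line in f:
            start, end, word = line.split()
            if word in ('sil', 'sp'):
                continue
            words.append((int(start), int(end), word))
    return words

def sentence_from_alignments(path):
    """Join the aligned words into the target sentence for CTC."""
    return ' '.join(word for _, _, word in load_alignments(path))
```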
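
Finally, for the prediction goal, the per-frame softmax output has to be collapsed into text. A greedy CTC decode via `tf.keras.backend.ctc_decode` is the simplest option (pass `greedy=False` for beam search); the vocabulary string is an assumption matching the 28-way output above:

```python
import tensorflow as tf

# Assumed character set: a-z plus space; the remaining index is the CTC blank.
VOCAB = "abcdefghijklmnopqrstuvwxyz "

def decode_predictions(y_pred):
    """Greedily decode (batch, frames, 28) softmax output into sentences."""
    input_len = tf.fill([tf.shape(y_pred)[0]], tf.shape(y_pred)[1])
    decoded, _ = tf.keras.backend.ctc_decode(y_pred, input_len, greedy=True)
    # decoded[0] is a dense (batch, max_len) tensor padded with -1.
    return [''.join(VOCAB[i] for i in seq if 0 <= i < len(VOCAB))
            for seq in decoded[0].numpy()]
```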