# PyMT5: Multi-mode translation of natural language and Python code with transformers

Clement, Colin B., et al. "PyMT5: multi-mode translation of natural language and Python code with transformers." *arXiv preprint arXiv:2010.03150 (2020)*.

Ricardo Jongerius (4379500)
Marco Schouten (5352908)

This summary might be too ML-focused for Never Work in Theory!, but if he wants to put it up, feel free to do so!

## TL;DR

PyMT5 is a tool that boosts developer productivity and hence reduces development costs. It operates in the field of Machine Learning, aiming to automate software development. It is a single model that performs two distinct operations: (a) it predicts the source code of a method given a docstring as input, and (b) it summarizes source code into a docstring (natural language documentation). The innovation lies in multi-mode training, that is, training the model on different combinations of three chosen method features: signatures, docstrings, and bodies. PyMT5 yields better results than GPT-2 (a similar model trained individually on the docstring or method generation task). PyMT5 produces 92.1% syntactically correct method bodies on the CodeSearchNet test set.

## Paper Summary

### Motivation and goal of the paper

Automated software development has been the subject of extensive research in the last few years. Researchers have been trying to gain a better understanding of software development in order to build tools that boost developer productivity, i.e. tools that help engineers write more efficient code, reduce development time, and make code less prone to errors. To this end, intelligent source code documentation, code completion, and code analysis have a societal impact both in terms of costs and developer satisfaction.

In 2017, "Attention Is All You Need" by Vaswani et al. proposed an innovative architecture called the Transformer, which revolutionised the field of NLP. It is a deep neural network that is the current state of the art in NLP for sequence-to-sequence tasks, based on a self-attention mechanism that deals with long-term dependencies.

Clement et al. discuss in their paper the results of PyMT5: a tool based on Transformers that translates natural language to Python and vice versa. In other words, this model can generate source code bodies from a docstring, and it can generate docstrings from source code bodies. This sets a milestone that pushes the boundaries of automated software development powered by Machine Learning.

### Methodology

In this section, the different aspects of the authors' methodology are discussed. We take a look at what the input and output look like for PyMT5, as well as how the model is trained.

#### Input and output

For a given method, we define three features: signature, body, and docstring. PyMT5 was trained on all possible combinations of these features as input and output. Note that since only about one fifth of the methods in the dataset had docstrings, multi-mode training lets the model take advantage of whichever features happen to be present. Additionally, argument and method names can have a huge impact on body prediction; because they are such a dominant signal, a model trained on only one combination might learn to ignore the docstring. As a result, training is multi-modal: the body of the function is not generated exclusively from the docstring, but rather from a combination of multiple features.

![](https://i.imgur.com/gCpx117.png)

![](https://i.imgur.com/Qm6wJBb.png)

#### Dataset and preprocessing

The dataset consisted of 27GB of raw text from GitHub repositories with at least 10 stars. For each repository, the data was split: 90% for training, 5% for validation, and 5% for testing. The authors extracted individual and class methods from the Abstract Syntax Tree (AST) of each Python file (using the standard Python `ast` library, together with `2to3` and `autopep8`, to deal with different styles and tab conventions). Lastly, they converted each method AST back into source code with `astunparse`.
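To make this concrete, here is a minimal sketch of such an extraction step using the standard `ast` module and `astunparse`. This is our own illustration of the idea, not the authors' actual pipeline: the `2to3` and `autopep8` normalisation steps are omitted, and it assumes undecorated, single-line signatures.

```python
import ast
import astunparse  # third-party round-tripper, also used by the authors

def extract_methods(source: str):
    """Yield (signature, docstring, body) triples for each function or
    method found in a Python source file."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            code = astunparse.unparse(node).strip()  # AST -> source text
            yield (
                code.splitlines()[0],     # the "def f(...):" line
                ast.get_docstring(node),  # None when no docstring exists
                code,
            )
```

Each yielded triple corresponds to one method, from which the multi-mode training examples can then be assembled.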
#### Training the network

The majority of the Python code lacked docstrings, so the authors relied on unsupervised pre-training of a generic language model to build a baseline understanding of the language. This practice has become common in NLP and usually yields state-of-the-art results, as reported by Lewis et al., 2019. The architecture is an encoder-decoder T5 transformer (introduced by Raffel et al., 2019). Pre-training is done by first tokenizing the input, then replacing some spans of tokens (up to three tokens at a time) with a "mask" token, and then letting the model learn to correctly restore the missing tokens.

![](https://i.imgur.com/Rodx8yG.png)
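To illustrate this denoising objective, here is a minimal sketch of span corruption over a token sequence. The maximum span length and the `<extra_id_n>` sentinel names follow T5 conventions, but the sampling itself is our own simplification rather than the authors' exact masking schedule.

```python
import random

def corrupt_spans(tokens, mask_rate=0.15, max_span=3, seed=0):
    """Replace random spans of up to `max_span` tokens with numbered
    sentinel tokens, returning (encoder_input, decoder_target)."""
    rng = random.Random(seed)
    corrupted, target = [], []
    i = sentinel = 0
    while i < len(tokens):
        if rng.random() < mask_rate:
            mask = f"<extra_id_{sentinel}>"
            span = rng.randint(1, max_span)
            corrupted.append(mask)
            target.append(mask)
            target.extend(tokens[i:i + span])  # the hidden tokens to restore
            i += span
            sentinel += 1
        else:
            corrupted.append(tokens[i])
            i += 1
    target.append(f"<extra_id_{sentinel}>")    # closing sentinel
    return corrupted, target

# The model sees `src` and must reproduce `tgt`.
src, tgt = corrupt_spans("def add ( a , b ) : return ( a + b )".split())
```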
### Results

PyMT5 sets new state-of-the-art results for both method generation and docstring generation. To put these results in context, and to test the performance of any system, one first needs a baseline. In this case, for all results, the authors used two GPT-2 models as baselines: one initialised randomly, and one pre-trained on English. In addition to the test set they created themselves, they also used some pre-existing ones. Firstly, the test set from the CodeSearchNet Python challenge[^CSN]. Secondly, the test set from Barone and Sennrich[^barone], a parallel corpus of Python functions and documentation strings created specifically for automated code documentation and code generation tasks. The use of a variety of test sets helps to bolster the claims made, and shows the authors did not simply overfit to their own dataset.

#### Metrics

The authors further expanded their analysis by employing different metrics. We provide a short explanation of each of the metrics used, so that the results can be interpreted more easily.

##### Syntax

Syntax is probably the easiest metric to understand. It simply refers to the fraction of generated methods that have valid Python syntax. Not all generated methods are actually valid Python code, as the model does not validate this in any way. The syntax metric is only used for method generation, as it is not applicable to docstrings.

##### BLEU

The BLEU score, which stands for bilingual evaluation understudy, was also used. This score is a little too convoluted to explain fully here, but the central idea is that "the closer a machine translation is to a professional human translation, the better it is."[^bleu] In the context of this paper, the professional human translations are the ground truth from the test sets. Normally, the BLEU score lies between 0 and 1, but in this paper the scores have been scaled to lie between 0 and 100.

##### ROUGE

Finally, the authors used ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, the most widely used set of metrics for evaluating automatic summarization and machine translation software in NLP. Several evaluation metrics are part of this 'family', of which the authors use three:

1. ROUGE-1 refers to the overlap of unigrams (single tokens) between the prediction and the ground truth.
2. ROUGE-2, as you might expect, refers to the overlap of bigrams between the prediction and the ground truth.
3. ROUGE-L is a Longest Common Subsequence (LCS) based statistic. The longest common subsequence naturally takes sentence-level structure similarity into account, and automatically identifies the longest co-occurring sequence of tokens.

All of these metrics have their own precision, recall, and F1-scores, calculated as usual.
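As an aside, the syntax metric is cheap to reproduce: a generated method counts as valid if the standard parser accepts it. A minimal sketch (our own illustration, not the paper's evaluation code):

```python
import ast

def syntax_score(generated_methods):
    """Fraction of generated method strings that parse as valid Python."""
    valid = 0
    for source in generated_methods:
        try:
            ast.parse(source)
            valid += 1
        except SyntaxError:
            pass
    return valid / len(generated_methods)

print(syntax_score(["def f(x):\n    return x + 1\n",  # valid
                    "def g(:\n    pass\n"]))          # invalid -> 0.5
```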
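The ROUGE-1 scores can likewise be reduced to clipped unigram-overlap counts. A sketch under the simplifying assumption of whitespace tokenisation (real evaluations typically use a dedicated library such as `rouge-score`):

```python
from collections import Counter

def rouge_1(prediction: str, reference: str):
    """Unigram-overlap precision, recall, and F1 between two strings."""
    pred, ref = Counter(prediction.split()), Counter(reference.split())
    overlap = sum((pred & ref).values())  # clipped unigram matches
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f1
```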
#### Method generation

Below are the results for the comparison between the two baselines and PyMT5 on the task of method generation from a signature and natural language docstring. The highest scores are in bold.

| Model | Syntax | BLEU | | ROUGE-1 | ROUGE-2 | ROUGE-L |
| ------------------------- |:---------:|:--------:| ----- |:--------:|:--------:|:--------:|
| GPT-2 random | 85% | 5.60 | Prec. | 25.8 | 12.3 | 26.8 |
| | | | Rec. | 26.7 | 12.1 | 25.9 |
| | | | F1 | 21.8 | 10.6 | 22.5 |
| GPT-2 English | 86% | 5.63 | Prec. | 25.4 | 12.1 | 26.3 |
| | | | Rec. | 26.9 | 12.2 | 26.1 |
| | | | F1 | 21.7 | 10.6 | 22.5 |
| PyMT5 | **93.6%** | **10.6** | Prec. | **33.8** | **21.5** | **33.6** |
| | | | Rec. | **44.1** | **25.0** | **43.8** |
| | | | F1 | **35.1** | **21.5** | **32.2** |
| *CSN test set:* | | | | | | |
| GPT-2 random | 77.2% | 2.80 | Prec. | **32.3** | 11.8 | **33.7** |
| | | | Rec. | 19.6 | 7.0 | 19.3 |
| | | | F1 | 20.9 | 7.6 | 21.9 |
| PyMT5 | **92.1%** | **8.59** | Prec. | 25.6 | **12.5** | 25.3 |
| | | | Rec. | **40.2** | **18.3** | **39.6** |
| | | | F1 | **28.4** | **13.5** | **24.8** |
| *Barone et al. test set:* | | | | | | |
| PyMT5 | 91.1% | **20.2** | Prec. | 41.3 | 28.5 | 40.7 |
| | | | Rec. | 52.2 | 34.7 | 51.3 |
| | | | F1 | 43.2 | 29.8 | 39.7 |
| Barone et al. | - | 10.9 | | - | - | - |

As these results show, PyMT5 quite clearly outperforms the baselines in basically all cases. Its results are significantly better, sometimes almost doubling the BLEU score. PyMT5 achieves state-of-the-art performance for this task.

#### Docstring generation

Below are the results for the comparison between the two baselines and PyMT5 on the task of natural language docstring generation from a signature and method body. The highest scores are in bold.

| Model | BLEU | | ROUGE-1 | ROUGE-2 | ROUGE-L |
| ------------------------- |:--------:| ----- |:--------:|:--------:|:--------:|
| GPT-2 random | 19.4 | Prec. | 32.6 | 19.3 | 33.6 |
| | | Rec. | 36.2 | 19.4 | 34.7 |
| | | F1 | 31.0 | 18.3 | 31.6 |
| GPT-2 English | 19.6 | Prec. | 33.1 | 19.4 | 33.9 |
| | | Rec. | 36.4 | 19.5 | 34.8 |
| | | F1 | 31.4 | 18.3 | 31.8 |
| PyMT5 | **25.2** | Prec. | **42.1** | **23.7** | **41.3** |
| | | Rec. | **50.4** | **27.0** | **49.3** |
| | | F1 | **43.3** | **24.4** | **39.8** |
| *CSN test set:* | | | | | |
| GPT-2 random | 9.50 | Prec. | 30.6 | 13.3 | 31.4 |
| | | Rec. | 31.1 | 12.9 | 29.8 |
| | | F1 | 26.3 | 11.5 | 27.2 |
| PyMT5 | **16.3** | Prec. | **38.0** | **19.2** | **36.8** |
| | | Rec. | **52.7** | **24.5** | **51.0** |
| | | F1 | **41.3** | **20.4** | **36.7** |
| *Barone et al. test set:* | | | | | |
| PyMT5 | **17.4** | Prec. | 39.6 | 26.0 | 38.7 |
| | | Rec. | 53.6 | 33.7 | 52.1 |
| | | F1 | 43.1 | 27.8 | 39.1 |
| Barone et al. | 13.8 | | - | - | - |

As these results show, PyMT5 quite clearly outperforms the baselines in every single instance. Its results are significantly better, with scores up to roughly 50% higher. PyMT5 achieves state-of-the-art performance for this task as well.

### Implications

The product of this paper is PyMT5, a tool that boosts developer productivity by automatically generating either full methods or docstrings. The concurrent modelling of source code and natural language has several enticing applications in automated software development. It can greatly improve a developer's efficiency, and aid in the automation of trivial tasks.

On the other hand, a few factors might pose a threat to the validity of this paper. Firstly, the training required to achieve the presented results is quite expensive. The authors pre-trained PyMT5 on 27GB of raw source code in total, for three weeks on sixteen 32GB Tesla V100 GPUs. Even their GPT-2 baselines required four Tesla V100 GPUs with 16GB of memory each, for seven days. These numbers are not insignificant, and make reproducing the results very costly.

### Future Developments

For the future, the authors have several plans to continue their work on PyMT5. They intend to extend the functionality of PyMT5 to fully-fledged code documentation, as well as method generation from general natural language statements instead of merely docstrings. Additionally, they want to focus on developing more model evaluation criteria, in order to leverage the distinct characteristics of source code more effectively.

Our own suggestion would be to attempt to extend this architecture to different programming languages, to see whether it could also work on different syntaxes. Python is generally regarded as a language with a simple syntax, so it would be interesting to see whether these results transfer to more complex settings.

## Appendix

### General encoder-decoder Transformer

This model is composed of two blocks: an encoder and a decoder. The encoder takes words as inputs and retrieves a numerical representation for each word (usually a vector). This numerical representation (a combination of vectors) holds the meaning of the sequence. The decoder receives as input the output of the encoder and a sequence. When prompting the decoder for an output for the first time, we have to give it the start of the sequence. The decoder decodes the sequence and outputs the first word (WORD_1). At this point the encoder is no longer needed, because the decoder is used in an auto-regressive manner: it receives as input the same encoded representation from the encoder plus its previous output word (WORD_1), and generates the next word (WORD_2). The decoder repeats this process until it outputs a stopping value like ".".

The strength of this system lies in the following:

* the encoder and decoder are separate entities, which often do not share weights;
* the decoder can be specialised in a different language (e.g. French) or a different modality, like speech;
* it can manage sequence-to-sequence tasks, e.g. translation and summarization;
* the input and output distributions can be different.

Ref. [^huggingface1], [^huggingface2]

![](https://i.imgur.com/eXeV28f.jpg)

### T5 Transformer

Training is done in a self-supervised manner. 15% of the tokens that make up the input are replaced by sentinel ("mask") tokens in order to form a corrupted sequence. The encoder receives the corrupted sequence as input, while the decoder's target consists of the masked-out tokens, each span preceded by its sentinel. For example, if the input sentence is `My cat is very cute .`, and the tokens "cat", "is" and "cute" get masked, then the encoder input becomes `My <x> very <y> .` and the decoder target becomes `<x> cat is <y> cute <z>`.
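Written out in code, this example looks as follows. This is a sketch using the `<x>`, `<y>`, `<z>` placeholders from the text rather than T5's real `<extra_id_n>` sentinels; the `mask_spans` helper and its hard-coded span positions are our own illustration.

```python
def mask_spans(tokens, spans, sentinels=("<x>", "<y>", "<z>")):
    """Replace the given (start, length) spans with sentinel tokens and
    build the corresponding decoder target, T5-style."""
    corrupted, target = [], []
    i = 0
    for sentinel, (start, length) in zip(sentinels, spans):
        corrupted.extend(tokens[i:start])
        corrupted.append(sentinel)
        target.append(sentinel)
        target.extend(tokens[start:start + length])
        i = start + length
    corrupted.extend(tokens[i:])
    target.append(sentinels[len(spans)])  # closing sentinel
    return " ".join(corrupted), " ".join(target)

tokens = "My cat is very cute .".split()
src, tgt = mask_spans(tokens, [(1, 2), (4, 1)])
print(src)  # My <x> very <y> .
print(tgt)  # <x> cat is <y> cute <z>
```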
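Finally, the auto-regressive loop described in the encoder-decoder subsection above can be sketched in a few lines. Here `encoder` and `decoder` stand in for the two trained blocks and are assumptions for illustration, not an actual framework API.

```python
def greedy_decode(encoder, decoder, source_tokens, bos="<s>", eos=".", max_len=50):
    """Encode the source once, then repeatedly feed the decoder its own
    output until it emits a stopping token."""
    memory = encoder(source_tokens)           # encoder runs a single time
    output = [bos]                            # seed with start-of-sequence
    while output[-1] != eos and len(output) < max_len:
        next_token = decoder(memory, output)  # most likely next word
        output.append(next_token)
    return output[1:]
```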
"A parallel corpus of Python functions and documentation strings for automated code documentation and code generation." arXiv preprint arXiv:1707.02275 (2017). [^bleu]: Papineni, K. Roukos, S. Ward, T. Zhu, W. J. "BLEU: a method for automatic evaluation of machine translation." ACL-2002: 40th Annual meeting of the Association for Computational Linguistics. pp. 311–318. (2002). [^CSN]: https://github.com/github/CodeSearchNet [^huggingFAce1]: https://huggingface.co/blog/encoder-decoder [^hugginface2]: https://huggingface.co/transformers/model_summary.html#seq-to-seq-models!