# Research proposal
Ricardo Jongerius (4379500)
Marco Schouten (5352908)
## TL;DR
Writing efficient, concise and bug-free software is of great importance in today's world. Software quality directly affects companies' costs and, through energy consumption, also has implications for the environment.
The goal of this research is to build a system capable of generating syntactically and semantically correct methods from general natural language statements. To this end, we propose to extend PyMT5, a state-of-the-art model that generates methods from docstrings. Our contribution is two-fold:
1. to build a T5 Transformer that learns mappings from natural language statements to docstrings, which will then be fed as input to the PyMT5 block;
2. to build a Generative Adversarial Network to augment a hand-written dataset of English-docstring pairs.
## Introduction
Software is ubiquitous nowadays, billions of people benefit from it every day using various services and devices. The development of this software is a non-trivial task. When writing source code, developers frequently have to consult documentation pages or look up how to achieve certain things. Therefore, intelligent AI-based code assistant tools can be incredibly useful to increase the efficiency of developers.
The field of NLP has seen huge advances over the last few years, some of which can readily be exploited for software engineering-related problems. Text generation, and code generation by extension, is a perfect example of this. Generative models have become remarkably more capable over the past few years, sometimes producing output that is indistinguishable from content written by humans[^turing].
There has been impressive progress in the field of code generation, which we will discuss in more detail in the Related Work section. One model that stands out is PyMT5[^PyMT5]. This is a state-of-the-art model that can generate code segments from docstring input. However, would it not be nice to have a less restricted form of input for generating code segments? Wouldn't it be easier to simply describe what you want to achieve?
Effectively generating code based on some natural language input is an interesting prospect, and has the potential to dramatically improve the workflow of software developers. To this end, we aim to answer two main research questions in this study:
- **RQ1:** Can we generate syntactically and semantically correct docstrings from a natural language input string, and thereby extend the PyMT5 pipeline with a new form of input?
- **RQ2:** Can GANs be leveraged to augment the dataset used to train this generator network?
## Background
Neural networks, and especially recurrent neural networks (RNNs), used to be at the core of many approaches to NLP tasks, from language generation and machine translation to question answering. RNNs were innovative in that they operate on sequential data[^rnn]: they were among the first architectures able to remember previous inputs through an internal memory state. This memory is what made them perform so well on ML problems involving sequential data.
The Transformer, introduced in “Attention Is All You Need” by Vaswani et al.[^att], is a novel neural network architecture based on a self-attention mechanism. It surpassed RNNs (and other architectures) on almost all NLP tasks and has since become the basis for the majority of state-of-the-art language models. Below is an overview of the Transformer architecture, taken from the original paper.

This architecture can handle long-range dependencies effectively. It is fully reliant on self-attention to compute representations of its input and output, without using sequence-aligned RNNs or convolution[^att]. Transformers have since become a basic building block of many new state-of-the-art models, which we will expand on in the following section.
## Related work
Even though the field of NLP for code completion and automation is relatively young, there is a plethora of literature that is closely related to the subject. In this section, some of this related work is presented in order to provide an overview of the entire research area.
##### GPT-n
GPT stands for Generative Pre-trained Transformer and describes a family of autoregressive language models that use deep learning to produce any type of text data. The family initially garnered notoriety with the release of GPT-2, when Radford et al.[^gpt2] refused to release the trained model "Due to our concerns about malicious applications of the technology."[^1] The next generation, GPT-3, has even more (~175 billion) trainable parameters[^gpt3] and therefore performs even better than its predecessor. The strength of these GPT models lies in the fact that they perform well on essentially any NLP task[^gpt2][^gpt3]. For example, GPT-3 can produce news articles, poetry, stories and comments from only a limited amount of input. Additionally, it excels at automated conversational tasks, producing believable context-aware responses. Essentially, GPT models can be used to generate anything with a text structure, including source code. These capabilities make GPT extremely interesting for our area of research, and also explain why these models are often used as baselines.
##### PyMT5
PyMT5, introduced by Clement et al.[^PyMT5], was the main inspiration for our proposal. PyMT5 is a tool based on transformers that translates natural language to Python and vice versa. In other words, this model can generate source code bodies from a docstring, and it can generate docstrings from source code bodies. Clement et al. argue that it is possible to treat Python and its docstrings as essentially one natural language, representing both source code and natural language docstrings as sequences of tokens sharing the same vocabulary. PyMT5 was therefore inspired by the text-to-text transfer transformer T5[^T5]. This assumption appears to hold, as their results proved to be state-of-the-art, beating the GPT-2-based baselines. This research is very closely related to ours: it also provides a method to convert natural language to code, albeit with a limited form of input, namely docstring-like language.
##### GitHub Copilot
Another related system is the recently announced GitHub Copilot[^ghcp]. This is a code assistant with several features that can help developers work more efficiently. Its features include converting comments to code, auto-filling repetitive code patterns, generating tests and showing alternative implementations of a given piece of functionality. GitHub Copilot is powered by a system from OpenAI, which presumably means it is powered by GPT-3 in some form, especially as GitHub is owned by Microsoft, which holds an exclusive GPT-3 license. Unfortunately, GitHub has not released a scientific paper on Copilot, only a rather brief blog post[^ghblog] analysing some of its initial suggestions.
##### Other related approaches
IntelliCode Compose, introduced by Svyatkovskiy et al.[^intelli], is a general-purpose multilingual code completion tool. It can predict code token sequences of any type, allowing it to generate entire lines of syntactically correct code. It is built around a multi-layer generative pre-trained transformer model for code, called GPT-C, which is a variant of GPT-2 trained from scratch on source code data. It relates quite closely to our research, but focuses much more directly on single-line code completion. IntelliCode Compose also uses the beginning of a line as its input, instead of a general natural language statement. It is interesting to note, however, that the GPT family of models returns once again in this area of research, reiterating that GPT appears capable of handling almost any NLP-related task.
An earlier system from Svyatkovskiy et al. is Pythia[^pythia], an AI-assisted code completion system. Pythia ranks method and API completion suggestions using a neural model trained on code contexts extracted from abstract syntax trees.
Wang et al. introduced TranS^3^, a transformer-based reinforcement learning framework for code summarization. As code summarization is the inverse of what we would like to achieve, their approach might also be of interest to us. TranS^3^ employs an actor-critic network, in which the actor network encodes the collected code snippets via a tree-transformer-based encoder to generate their comments[^trans3].
Alon et al.[^c2s] introduced code2seq, a model that leverages the syntactic structure of multiple programming languages to better encode source code. A code snippet is represented through its abstract syntax tree (AST), which describes the structure of the method, and code2seq uses attention to select the relevant AST paths while decoding. The code2seq model can also be used for code captioning, where a natural language sentence describing a given (C#) snippet is predicted. This last capability is interesting for our research, as it essentially performs the inverse of what we would like to achieve. Alon et al. show that a sequence-to-sequence approach is capable of tasks such as ours.
##### Generative adversarial networks
In 2014, Goodfellow et al. introduced the Generative Adversarial Network (GAN)[^gans]: a new framework for estimating generative models through an adversarial process in which two models are trained simultaneously. The first is a generative model G that captures the data distribution; the second is a discriminative model D that predicts whether a sample came from the training data rather than from G. This framework results in a minimax two-player game, in which each model tries to beat the other. We would like to leverage this architecture in our research to augment our training data with convincing synthetic English-docstring pairs.
##### Self-attention GANs
Building on GANs, an extension to this architecture was proposed by Zhang et al.[^sagan] in 2019: the Self-Attention Generative Adversarial Network (SAGAN). SAGANs introduce a self-attention mechanism into convolutional GANs. The self-attention module is complementary to convolutions and helps model long-range, multi-level dependencies[^sagan]. In the paper, SAGANs are applied to images, but the same idea could also be exploited in an NLP setting; after all, the attention mechanism lies at the heart of transformer-based architectures.
## Research method
The goal of this study is to build a system capable of generating Python code given a natural language description in English.
To be more specific, we are going to extend the PyMT5 architecture proposed by Clement et al.[^PyMT5].
PyMT5 is a Transformer-based architecture trained to generate Python code from formal docstring input.
The core idea of this research is to extend the PyMT5 pipeline by adding a network that generates a docstring from a natural language description; that output (the docstring) is subsequently given as input to PyMT5, which then generates the Python code.
This pipeline schematic is summarized in the figure below:

In order to train such a network (the yellow box in the figure above), we need an appropriate dataset. Unfortunately, we are not aware of a sufficiently large dataset of mappings between natural language and docstrings. We therefore aim to build a system capable of generating such English-docstring pairs, which will then be used to train the network described above.
To summarize:
* RQ1: Can we generate syntactically and semantically correct docstrings from a natural language description, thus extending the PyMT5 pipeline by giving it a new input?
* RQ2: Can we augment existing datasets using GANs to train this system?
#### RQ2 model: Generative Adversarial Networks

We believe the model architecture best suited to answer this research question is the Generative Adversarial Network: a state-of-the-art class of generative models (which model the data distribution implicitly) with a wide range of applications.
> “Generative Adversarial Networks is the most interesting idea in the last 10 years in Machine Learning”, Yann LeCun
The idea is to sample from a simple distribution (e.g. white noise) and then learn the transformation (the generator) from that distribution to the data distribution.

The architecture comprises two inner networks: the generator and the discriminator.
* The generator's goal is to fool the discriminator by generating realistic-looking samples (images, in the original paper).
* The discriminator's goal is to distinguish between real and generated (fake) samples.
Both networks compete against each other in a zero-sum game.
The discriminator tunes its parameters so that D(x) (the likelihood that a real sample x is classified as real) is close to 1, whereas D(G(z)) (the likelihood that the generated sample G(z) is classified as real) is close to 0.
The generator, on the other hand, optimises its parameters so that D(G(z)) is close to 1, thus fooling the discriminator into believing that the generated sample is real.
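Formally, this competition corresponds to the minimax objective from Goodfellow et al.[^gans]:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]$$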
This inner working makes GANs desirable for data augmentation: once the GAN is sufficiently trained, the generator block can be extracted from the network and used to generate new English-docstring pairs.
We plan to write 1000 English-to-docstring pairs by hand and use this dataset as training data for the GAN. In this way we will be able to generate hundreds of thousands of new pairs.
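As a rough illustration of this adversarial training procedure, below is a minimal sketch in PyTorch (a tooling assumption on our side). Purely for readability it represents each English-docstring pair as a fixed-size embedding vector rather than as raw text; a real text GAN would need sequence-level generator and discriminator networks.

```python
# Minimal GAN training sketch (PyTorch assumed). Each "sample" is a toy
# fixed-size embedding standing in for an English-docstring pair.
import torch
import torch.nn as nn

EMB_DIM = 128    # size of a (hypothetical) pair embedding
NOISE_DIM = 64   # size of the latent noise vector z

generator = nn.Sequential(
    nn.Linear(NOISE_DIM, 256), nn.ReLU(),
    nn.Linear(256, EMB_DIM),
)
discriminator = nn.Sequential(
    nn.Linear(EMB_DIM, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_batch):
    n = real_batch.size(0)
    ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)

    # Discriminator step: push D(x) towards 1 and D(G(z)) towards 0.
    fake_batch = generator(torch.randn(n, NOISE_DIM)).detach()
    d_loss = bce(discriminator(real_batch), ones) + bce(discriminator(fake_batch), zeros)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: push D(G(z)) towards 1, i.e. fool the discriminator.
    g_loss = bce(discriminator(generator(torch.randn(n, NOISE_DIM))), ones)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()

# Toy usage: random "real" embeddings stand in for the 1000 hand-written pairs.
for _ in range(10):
    train_step(torch.randn(32, EMB_DIM))
```

After training, only `generator` would be kept and sampled repeatedly to produce the synthetic pairs described above.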

#### RQ1 model: T5 Transformer

Once the dataset is sufficiently large, we can leverage this large amount of data to train a Transformer network that learns a mapping from an English description to a formal Python docstring. To solve this task, we can use the state-of-the-art sequence-to-sequence models for NLP: Transformers.
##### A Transformer Network in general
In general, a Transformer network is composed of two blocks: an encoder and a decoder.
The encoder takes words as input and computes a numerical representation for each word (usually a vector). This combination of vectors holds the meaning of the sequence.
The decoder receives as input the output of the encoder together with a sequence. When prompting the decoder for an output for the first time, we give it a start-of-sequence token.
The decoder decodes this and outputs the first word (WORD_1). From this point on, the encoder does not need to be run again, because the decoder works auto-regressively: it takes the (already computed) encoded representation together with its previous output word (WORD_1) to generate the next word (WORD_2). The decoder repeats this process until it outputs a stopping value, such as an end-of-sequence token.
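To make this concrete, the sketch below queries a pretrained T5 model for our natural-language-to-docstring task via the HuggingFace `transformers` library (a tooling assumption on our side). The public `t5-small` checkpoint is used only so the snippet runs as-is; it is of course not fine-tuned for docstring generation, and the prompt is a made-up example.

```python
# Encoder-decoder inference with a pretrained T5 (HuggingFace `transformers`).
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The encoder reads the whole English description at once; the decoder then
# generates the output token by token, feeding each generated token back in.
description = "return the largest element of a list of integers"
inputs = tokenizer(description, return_tensors="pt")
output_ids = model.generate(**inputs, max_length=64, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

In our pipeline, the same call pattern would be used with a checkpoint fine-tuned on the English-docstring pairs produced for RQ2.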
##### T5 Transformer
Specific to T5 is its pre-training objective.
Training uses corrupted input: 15% of the tokens are randomly dropped and replaced with individual sentinel tokens.
The encoder receives the corrupted sentence as input, and the decoder is trained to produce the dropped-out tokens, delimited by their sentinel tokens.
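The snippet below is a minimal sketch of this corruption scheme. It assumes whitespace tokenisation purely for readability; the real T5 operates on SentencePiece subword tokens and merges consecutive dropped tokens into single spans.

```python
# Toy T5-style token corruption: drop ~15% of tokens, replace each with a
# sentinel, and build the corresponding decoder target.
import random

def corrupt(sentence, rate=0.15):
    tokens = sentence.split()
    n_drop = max(1, round(rate * len(tokens)))
    dropped = set(random.sample(range(len(tokens)), n_drop))
    enc_input, target, sid = [], [], 0
    for i, tok in enumerate(tokens):
        if i in dropped:
            sentinel = f"<extra_id_{sid}>"
            enc_input.append(sentinel)        # corrupted encoder input
            target.extend([sentinel, tok])    # target: sentinel followed by the dropped token
            sid += 1
        else:
            enc_input.append(tok)
    return " ".join(enc_input), " ".join(target)

print(corrupt("Thank you for inviting me to your party last week"))
# e.g. ('Thank you for <extra_id_0> me to your party <extra_id_1> week',
#       '<extra_id_0> inviting <extra_id_1> last')
```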
## Execution plan
### (1) Get a large dataset for the GAN
As described earlier, we are going to write 1000 docstring-English pairs by hand, train the GAN on them, and then extract the generator part to produce around 500,000 pairs.

### (2) Train T5 transformer in combination with PyMT5
In order to train the PyMT5 block, we are going to collect Python code from 10-star GitHub repositories that had at least one commit in the past five years. As reported by Clement et al.[^PyMT5], this yields roughly 27GB of raw text.
We are going to extract source class methods from the abstract syntax tree (AST), using the native Python `ast` library, `2to3` and `autopep8`.
Once the subtree of the AST containing the method is extracted, it is converted back to raw text before being fed into the Transformer network.
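A minimal sketch of this extraction step, using only the standard-library `ast` module, is shown below (`ast.unparse` requires Python 3.9+; the `2to3` and `autopep8` normalisation steps mentioned above are omitted here).

```python
# Extract each method, its docstring, and its raw source text from the AST.
import ast

source = '''
class Greeter:
    def greet(self, name):
        """Return a friendly greeting."""
        return f"Hello, {name}!"
'''

tree = ast.parse(source)
for node in ast.walk(tree):
    if isinstance(node, ast.FunctionDef):
        docstring = ast.get_docstring(node)   # the docstring, if present
        method_text = ast.unparse(node)       # the method converted back to raw text
        print(docstring)
        print(method_text)
```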
We will then give an English text description as input to the T5 network, and its output will in turn be given as input to the PyMT5 network.
### Training on Distributed Systems
Training a Transformer network is a very resource-demanding task. For this reason, we are going to train the model on Google Cloud using Kubernetes.
> Kubernetes is an open-source container-orchestration system for automating computer application deployment, scaling, and management.
We will therefore write a JSON configuration file containing all the information needed for training, such as the batch size and learning rate.
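A hypothetical example of such a configuration is sketched below; the field names are our own assumption rather than a fixed schema.

```python
# Write a (hypothetical) training configuration to a JSON file.
import json

config = {
    "model": "t5-base",
    "batch_size": 32,
    "learning_rate": 1e-4,
    "epochs": 10,
    "max_input_length": 128,
    "max_output_length": 64,
}

with open("train_config.json", "w") as f:
    json.dump(config, f, indent=2)
```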
We will then rely on Habitat to find the most cost-effective distributed setting and hardware for our problem instance.
> Habitat: A Runtime-Based Computational Performance Predictor for Deep Neural Network Training, https://www.usenix.org/conference/atc21/presentation/yu
## Evaluation
We provide a short explanation of each of the metrics we are going to use so that the results can be interpreted more easily. All these metrics are also used in the original PyMT5 paper.
##### Syntax
Syntax is probably the easiest metric to understand: it is simply the fraction of generated methods that have valid Python syntax. Not every generated method is valid Python code, as the model does not enforce this in any way. The syntax metric is only used for method generation, as it is not applicable to docstrings.
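In practice this check can be implemented with the standard-library `ast` module, as in the minimal sketch below.

```python
# Fraction of generated methods that parse as valid Python.
import ast

def syntax_score(generated_methods):
    valid = 0
    for code in generated_methods:
        try:
            ast.parse(code)
            valid += 1
        except SyntaxError:
            pass
    return valid / len(generated_methods)

print(syntax_score(["def f(x):\n    return x + 1\n",   # valid
                    "def g(:\n    pass\n"]))           # invalid -> score 0.5
```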
##### BLEU
The BLEU score, which stands for bilingual evaluation understudy, is also used. The score is somewhat involved to explain in full, but the central idea is that "the closer a machine translation is to a professional human translation, the better it is."[^bleu]
In the context of this study, the "professional human translations" are the ground truths from the test sets. Normally, the BLEU score lies between 0 and 1, but, as in the PyMT5 paper, we scale it to a score between 0 and 100.
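The sketch below computes such a score with NLTK's BLEU implementation (one possible choice; any implementation, e.g. sacrebleu, would work), scaled to the 0-100 range used above.

```python
# Sentence-level BLEU between a prediction and its ground truth, scaled to 0-100.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "return the sum of two numbers".split()      # ground truth
candidate = "return the sum of the two numbers".split()  # model prediction

score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(round(100 * score, 2))
```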
##### ROUGE
We are going to use ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, metrics. This is the most widely used set of metrics for evaluating automatic summarization and machine translation software in NLP. Several evaluation metrics are part of this 'family', of which we will use the same three as the PyMT5 authors:
1. ROUGE-1 refers to the overlap of unigrams (each token) between the prediction and the ground truth.
2. ROUGE-2, as you might expect, refers to the overlap of bigrams between the prediction and the ground truth.
3. ROUGE-L is a longest common subsequence (LCS) based statistic. The LCS naturally takes sentence-level structure similarity into account and automatically identifies the longest co-occurring in-sequence n-grams.
For all of these metrics, precision, recall and F1-scores are calculated as usual.
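The sketch below computes these three variants with the `rouge-score` package (an assumed implementation choice; the PyMT5 paper does not prescribe one) for a toy prediction/ground-truth pair.

```python
# ROUGE-1, ROUGE-2 and ROUGE-L with precision, recall and F1 per metric.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "Return the sum of two numbers.",        # ground truth
    "Returns the sum of the two numbers.",   # prediction
)
for name, s in scores.items():
    print(name, round(s.precision, 2), round(s.recall, 2), round(s.fmeasure, 2))
```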
[^c2s]: Alon, U., et al. (2018). code2seq: Generating sequences from structured representations of code. arXiv preprint arXiv:1808.01400.
[^PyMT5]: Clement, C. B., et al. (2020). PyMT5: Multi-mode translation of natural language and Python code with transformers. arXiv preprint arXiv:2010.03150.
[^T5]: Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2019). Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
[^gpt2]: Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.
[^gpt3]: Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
[^1]: https://openai.com/blog/better-language-models/
[^intelli]: Svyatkovskiy, A., Deng, S. K., Fu, S., & Sundaresan, N. (2020, November). Intellicode compose: Code generation using transformer. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (pp. 1433-1443).
[^ghcp]: https://copilot.github.com/
[^ghblog]: https://docs.github.com/en/github/copilot/research-recitation
[^gans]: Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... & Bengio, Y. (2014). Generative adversarial nets. Advances in neural information processing systems, 27.
[^sagan]: Zhang, H., Goodfellow, I., Metaxas, D., & Odena, A. (2019, May). Self-attention generative adversarial networks. In International conference on machine learning (pp. 7354-7363). PMLR.
[^trans3]: Wang, W., Zhang, Y., Zeng, Z., & Xu, G. (2020). TranS^3: A transformer-based framework for unifying code summarization and code search. arXiv preprint arXiv:2003.03238.
[^pythia]: Svyatkovskiy, A., Zhao, Y., Fu, S., & Sundaresan, N. (2019, July). Pythia: AI-assisted code completion system. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 2727-2735).
[^bleu]: Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311-318).
[^turing]: Elkins, K., & Chun, J. (2020). Can GPT-3 Pass a Writer’s Turing Test?. Journal of Cultural Analytics, 1(1), 17212.
[^att]: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).
[^rnn]: Elman, J. L. (1990). Finding structure in time. Cognitive science, 14(2), 179-211.