---
title: 'When Deep Neural Net meets Code Review'
disqus: hackmd
---

When Deep Neural Net meets Code Review
===

<style type="text/css"> p { text-align: justify; } li { text-align: justify; } </style>

Online Interactive Post: https://hackmd.io/@DeepestCodeReview/code_review_with_dnn
Code Available at: https://gitlab.com/deepestcodereview/deepest_code_review

**Abstract** - Peer code review is an essential step in the development of large software projects. However, the manual process is often repetitive and even tedious. The community has developed various tools to automate the process, most of them linters/static analysis tools. They solve part of the problem but have several limitations. In this project, we employ Deep Neural Networks (DNNs) to automate more of the process. Specifically, we conduct proof-of-concept work to examine the relationship between code changes and review comments. Our initial attempts show that our models fail to find a useful link from code changes to review comments. However, we are able to generate interesting results in the reverse direction, from review comments to code changes. Future work includes further investigation of DNNs for review comment prediction; more sophisticated tuning of our successful models will also be beneficial.

<!-- // Give a high-level summary of your results. Make sure you summarize your results at the beginning and also at the end of your report. You can refer back to the "Results" section for elaborations. In particular emphasize any results that were surprising. -->

[TOC]

## Introduction

Code review is a critical process in software development that helps developers discover issues they may have overlooked themselves. Interestingly, a large portion of the review process is often repetitive, with over 75% related to common issues such as documentation, style, and structure [^1]. Therefore, we see a lot of potential for improvement by automating parts of the code review process.

[^1]: Mäntylä, Mika V., and Casper Lassenius. "What types of defects are really discovered in code reviews?." *IEEE Transactions on Software Engineering* 35, no. 3 (2008): 430-448.

Currently, the most popular approach in industry is to use linters/static analysis tools. Linters are made up of rules that encode coding best practices. Whenever these practices are violated, the linter raises a warning in the IDE (Integrated Development Environment). However, linters require a significant amount of manual customization and are often inefficient [^2]. Most reviews still need to be done manually.

[^2]: Gupta, Anshul, and Neel Sundaresan. "Intelligent code reviews using deep learning." (2018).

The case below shows a typical issue that a linter is able to check.

![](https://i.imgur.com/FQ9ZJWW.png)

> [color=#907bf7] The example linter is CheckStyle[^3], which is mainly used to enforce coding style checks. The error shows a coding style violation where the `{` symbol in the `Java` code should be placed at the end of the previous line. This check is relatively easy to write in CheckStyle with the help of the [Abstract Syntax Tree (AST)](https://en.wikipedia.org/wiki/Abstract_syntax_tree). In fact, most linters rely on the AST to examine the structure of the code.

[^3]: [CheckStyle](https://checkstyle.sourceforge.io/) is a static analysis tool to enforce coding style checks.

However, a linter cannot give a comment like the one below.
![](https://i.imgur.com/J3XR1Hy.png)

> [color=#907bf7] The reviewer talks about a best practice in `Java`: using the [Builder Pattern](https://refactoring.guru/design-patterns/builder) to replace a monstrous constructor. Note that the constructor is not monstrous yet, but the reviewer foresees that it will become one in the future. To give this comment, one needs
> - Overall knowledge of the project (e.g., how the class will be used in the future, so that the Builder Pattern is a suitable choice)
> - Decent software engineering knowledge (e.g., knowing that using the Builder Pattern rather than a monstrous constructor is a best practice)
>
> Theoretically, since the code is structured, we could examine the AST to count the arguments of the constructor and report a Builder Pattern violation when the count exceeds a certain threshold. However, this would produce many false positives/negatives; at least in the case shown above, any threshold larger than two would result in a false negative. In short, a fixed threshold is not flexible enough.

In brief, we see that linters/static analysis tools cannot produce review comments on code changes that require a deeper understanding of a project. As a result, significant effort still goes into the manual review process. Therefore, in this project, we want to know whether we can automate the manual review process, specifically by finding out whether there is any relationship between code review comments and code changes beyond what linters/static analysis tools can check.

> [color=#907bf7] From this point on, the term "review comments" refers to all manual reviews given by humans (not the automated reviews generated by linters/static analysis tools).

- If there is a relationship from code changes to review comments, we could build an automated tool, smarter than a linter, to alleviate the manual review process. Even if the produced comments are not 100% correct, the generated messages could give contributors hints for code improvement before they ask for a manual review.
- More realistically, even if we are not able to generate the correct review comment due to the limited abilities of our models or a lack of training data, a score for the code changes would at least tell contributors how good their changes are.
- More interestingly, if there is a reverse relationship from review comments to code changes, we could build a sample code generator that helps new contributors get on board with a project. These samples would show new contributors the good and bad practices specific to the project.

There are few existing attempts at the tasks described above. Researchers at Microsoft ran an excellent experiment on building a deep neural network to classify the relevance of code reviews [^2]. They also conducted a user study on the classifier to assess the effectiveness of their models. Inspired by their work, we decide to use deep neural networks (implemented in [PyTorch](https://pytorch.org/)) to pursue the three goals mentioned above.

<!-- // Give a clear and complete statement of the problem. Don't describe methods or tools yet. Where does the data come from, what are its characteristics? Include informal success measures (e.g. accuracy on cross-validated data, without specifying ROC or precision/recall etc) that you planned to use. Include background material as appropriate: who cares about this problem, what impact it has, what implications better solutions might have.
Included references to any related work you know about. -->

## Code Review from GitHub

Specifically, we decide to examine the relationship between code reviews and code changes in **Java** for **one** specific project; our models will therefore be project-dependent. The project we choose is **[TEAMMATES](https://github.com/TEAMMATES/teammates)**, for several reasons.

- Java is a very mature and structured programming language that has been developed for decades. There are plenty of best practices in Java, and the language is commonly used in software engineering classes. Hence, we expect a significant impact on the language community if our model succeeds.
- [TEAMMATES](https://github.com/TEAMMATES/teammates) is an open-source project hosted at the School of Computing (SoC), National University of Singapore (NUS), intended to be a model project for software engineering classes. Therefore, we expect the code reviews to be accurate and representative. There are around 3K pull requests with an average of 10 comments per pull request. The review comments are not generated by linters/static analysis; all of them come from senior developers.
- One of our team members is the team lead and a frequent code reviewer for [TEAMMATES](https://github.com/TEAMMATES/teammates). Therefore, we can conduct user studies to test the effectiveness of our models, as was done in the Microsoft paper [^2].

### Data Preparation

We use the [GitHub Comment API](https://developer.github.com/v3/pulls/comments/#list-comments-in-a-repository) to fetch all review comments from TEAMMATES.

:::info
The code to retrieve comments can be found in [`fetch_from_github.py`](https://gitlab.com/deepestcodereview/deepest_code_review/-/blob/master/exploratory_data_analysis/raw_github/fetch_from_github.py).
:::

There are two fields we care most about: the `diff_hunk` field, which contains the code changes in [Git Diff](https://git-scm.com/docs/git-diff) format, and the `body` field, which is the code review comment.

```
--- AccountsLogic.java
+++ AccountsLogic.java
@@ -162,6 +164,16 @@ public InstructorAttributes joinCourseForInstructor(String encryptedKey, String
         return instructor;
     }

+    private void validateInstructorInstitute (InstructorAttributes instructor, String institute)
+            throws InvalidParametersException {
+        assert instructor != null : "Should have been checked in validateInstructorJoinRequest() method.";
+        AccountAttributes account = getAccount(instructor.email);
```
> There are two problems here:
> - `getAccount` accepts `googleId` rather than `email`
> - If the instructor does not have an account stored in the database (e.g., it is the first time he/she uses TEAMMATES), the account will be null. In this case, they can still modify the institute name and bypass the check.
>
> [name=Pu Xiao (Code Reviewer)]

The two boxes above show a typical diff hunk and its associated review comment. Because the diff hunk is [generated by Git](https://git-scm.com/docs/git-diff), we can also view it visually on [code_diff.html](https://deepestcodereview.gitlab.io/visualization/code_diff.html).

![](https://i.imgur.com/2NhH1Yh.png)

### Exploratory Data Analysis (EDA)

<!---// Start with data preparation. If you used an off-the-shelf dataset, you don't need to say much.
But if you did any non-standard preparation, or if you spent any time iterating on your data prep., please describe it thoroughly.--->

After running the [`comment fetching script`](https://gitlab.com/deepestcodereview/deepest_code_review/-/blob/master/exploratory_data_analysis/raw_github/fetch_from_github.py), we get `22984` pairs of diff hunks and review comments. We filter out all comments that are replies to other comments, as we do not want to introduce complicated context understanding into our task. Hence, after data cleaning, we have `10201` records, which we split into training (`74.8%`), validation (`13.2%`), and test (`12%`) datasets. We plotted the distributions of the lengths of the `diff_hunk` and `body` fields.

:::info
The raw data (a `JSON` array) from the GitHub API can be found in the folder [`data_teammates`](https://gitlab.com/deepestcodereview/deepest_code_review/-/tree/master/exploratory_data_analysis/data_teammates). The notebook for EDA can be found in [eda_teammates.ipynb](https://gitlab.com/deepestcodereview/deepest_code_review/-/blob/master/exploratory_data_analysis/eda_teammates.ipynb).
:::

![](https://i.imgur.com/uvU1IXD.png =350x)![](https://i.imgur.com/81UKRPE.png =340x)

The median length of `diff_hunk` is `697` characters, with a lower quartile of `387`. The median length of `body` (the review comment) is `82` characters. Given these distributions, and without losing most of the information, our later models truncate the diff hunk to its first `300`/`400` characters. Similarly, we truncate the review comment to at most its first `100` words.

### Sentiment Analysis on Review Comment

The raw data does not contain a numeric review score for the code changes. Therefore, we decide to augment the data by giving each diff hunk a rating. In particular, we first run sentiment analysis on the review comments. Since the code reviews in the `body` field (in the raw data) are natural language, we can use pre-trained sentiment analysis tools to produce sentiment classes or sentiment scores.

We found four popular sentiment analysis tools: [`nltk vader`](https://www.nltk.org/howto/sentiment.html), [`textblob`](https://textblob.readthedocs.io/en/dev/), [`NLP flair`](https://github.com/flairNLP/flair) and [`sentiCR`](https://github.com/senticr/SentiCR). `sentiCR` is trained specifically on the code review domain, which we expected to be very helpful for our scoring purposes. However, `sentiCR` is trained on only around one thousand examples, so we also expect it will not give very representative scores.

`nltk vader` and `textblob` output continuous scores. We make a short comparison of their outputs below. As shown in the distributions, both methods give a continuous score in the range `[-1, 1]`.

![](https://i.imgur.com/Jq5larl.png =340x)![](https://i.imgur.com/SPthPrB.png =350x)

The two methods usually give different scores for the same review. However, they rarely show opposite attitudes.

:::info
The analysis and visualization can be found in the sentiment score comparison section of [`code_to_score.ipynb`](https://gitlab.com/deepestcodereview/deepest_code_review/-/blob/master/code_to_score.ipynb).
:::

The other two methods, `sentiCR` and `flair`, give sentiment labels: `[negative, positive]`. Their results do not overlap much: `sentiCR` labels almost `86%` of our review records as `positive`, while `flair` labels only `44%` of them as `positive`.
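To illustrate how these off-the-shelf scorers are queried, the short sketch below scores a single review comment with `textblob` and `nltk vader`. It is a minimal example with a made-up comment; the scoring of the full dataset happens in `code_to_score.ipynb`.

```python
# Minimal sketch: score one review comment with textblob and nltk vader.
# Requires: pip install textblob nltk
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from textblob import TextBlob

nltk.download("vader_lexicon", quiet=True)

comment = "I don't think we should skip validation checks just because we did it in one place."

textblob_score = TextBlob(comment).sentiment.polarity                             # float in [-1, 1]
vader_score = SentimentIntensityAnalyzer().polarity_scores(comment)["compound"]   # float in [-1, 1]

print(f"textblob: {textblob_score:+.3f}, vader: {vader_score:+.3f}")
```

The two tools disagree on magnitude more often than on sign, which matches the distributions shown above.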
## Task 1: Predict Sentiment Score from Diff Hunk

Our first attempt is to predict sentiment scores from diff hunks. This task can be viewed as an intermediate step towards Task 2. In addition, we believe the predicted sentiment score of a diff hunk is useful on its own, as it can serve as an indicator for contributors of how good their code changes are.

### Regression on Sentiment Score

Among the sentiment analysis methods mentioned in the exploratory data analysis, we first choose `textblob` as our scoring method for regression.

:::info
We also tried `vader` for regression with similar models, presented in Appendix [A2 - Task 1: Other Score Methods](#A2---Task-1-Other-Score-Methods). We obtained similar results.
:::

Because the code in `diff_hunk` is an unstructured snippet, we cannot use an Abstract Syntax Tree (AST) to parse or tokenize it. Therefore, we built a character-level vocabulary for the diff hunk, consisting of the `100` most common symbols.

We use a Transformer Encoder[^transformer] to extract hidden states from the diff hunk and several affine layers with non-linearities to project the result to a single score for the regression task. The figure below shows the overall architecture of the model.

[^transformer]: The Transformer Encoder/Decoder implementations are PyTorch modules from [HuggingFace](https://huggingface.co/transformers/model_doc/bert.html).

![](https://i.imgur.com/ti4OPvB.png)

We choose the Transformer Encoder mainly because of its self-attention over the input. We believe the self-attention mechanism will help us extract useful information from the diff hunk.

```python=
code_to_score_model = CodeToScoreModelTransformer(
    code_max_length=300,
    code_characters_size=len(i2c),
    hidden_size=400,
    num_hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=500)
```

For the encoder, we have `4` hidden layers with `4` attention heads each. The hidden size is set to `400` and the intermediate size[^intermediate] is set to `500`.

[^intermediate]: `intermediate_size` is the dimensionality of the feed-forward layer in the Transformer encoder.

As shown in the EDA section, all `textblob` scores lie in the range between `-1` and `1`. Therefore, MSE (Mean Squared Error) might not be an appropriate loss, as the squared errors would become very small. Instead, we use the **L1** loss.

:::info
The model file can be found in [`code_to_score_transformer.py`](https://gitlab.com/deepestcodereview/deepest_code_review/-/blob/master/code_to_score/models/code_to_score_transformer.py). The training notebook can be found in [`code_to_score.ipynb`](https://gitlab.com/deepestcodereview/deepest_code_review/-/blob/master/code_to_score.ipynb).
:::

The optimizer is `Adam` with a learning rate of `0.0002`. We also set `weight_decay` to `0.2` for regularization. Furthermore, an exponential learning rate decay is used with `gamma=0.995`: the learning rate is multiplied by `gamma` after each epoch to stabilize the loss.

![](https://i.imgur.com/qKaiaCh.png)

> [color=#907bf7] The dashed lines indicate epoch boundaries. The training visualizations in the rest of the report follow the same convention.

The figure above shows the training, validation, and test loss for `15` epochs with a batch size of `64`. Overall, there is no obvious overfitting. In addition, we get an L1 error of around `0.1`, which means the predicted sentiment score is on average within `0.1` of the ground truth.
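Before digging into the loss curve, here is roughly what the training scheme described above looks like in PyTorch. This is a hedged sketch, not the notebook's exact code: the model interface and a `train_loader` yielding `(encoded_diff_hunk, textblob_score)` batches are assumptions.

```python
import torch
import torch.nn as nn

# Assumed to exist: code_to_score_model (see the snippet above) and a DataLoader
# `train_loader` that yields (encoded_diff_hunk, textblob_score) batches.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = code_to_score_model.to(device)

criterion = nn.L1Loss()  # scores live in [-1, 1], so L1 instead of MSE
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, weight_decay=0.2)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.995)

for epoch in range(15):
    model.train()
    for diff_hunks, scores in train_loader:
        diff_hunks, scores = diff_hunks.to(device), scores.to(device)
        optimizer.zero_grad()
        predictions = model(diff_hunks).squeeze(-1)  # one score per diff hunk
        loss = criterion(predictions, scores)
        loss.backward()
        optimizer.step()
    scheduler.step()  # multiply the learning rate by gamma once per epoch
```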
The loss curve does not look like a typical neural network training run. Usually this points to a poorly chosen learning rate. However, we have tried other learning rates, such as `0.01` and `0.001`, and they end with similar results. Therefore, we believe this is the best we can do for this model by tuning the learning scheme alone. To understand what the model is outputting, we plot the distributions of the predicted scores and the ground truths.

![](https://i.imgur.com/0m1Yl33.png)

We see that our model "learns" to predict values around zero for all inputs, which explains why the loss gets stuck after several steps. The phenomenon might also indicate that our model finds no relationship between the diff hunk and the sentiment score; hence, it outputs the mode value `0` to minimize the L1 loss. However, it is too early to draw this conclusion. We find the same behavior on the test data.

![](https://i.imgur.com/q3WEfd0.png)

Nevertheless, despite the poor performance, one interesting thing we can do is inspect what the encoder is attending to. We have built an interactive [attention visualizer](https://deepestcodereview.gitlab.io/visualization/attention_viz.html?data=code_to_score_attention_regression.json).

![](https://i.imgur.com/L1HQySA.png)

Sadly, there is no distinct attention paid to different characters, which is somewhat expected since our model does not perform well on the prediction.

:::info
We have also tried a pre-trained encoder for the diff hunk. That encoder shows a more interesting attention pattern, documented in Appendix [A3 - CodeBert](#A3---CodeBert).
:::

:::info
The code to generate the attention visualization can be found in [`code_to_score_viz.ipynb`](https://gitlab.com/deepestcodereview/deepest_code_review/-/blob/master/code_to_score/code_to_score_viz.ipynb).
:::

We suspect the poor performance might be due to the uneven distribution of scores; a weighted L1 loss might help.

:::warning
Due to time constraints, we did not add a weighted L1 loss to our model and leave it for future work. However, we did add a weighted cross-entropy loss in the classification task below.
:::

### Classification on Sentiment Class

We also tried classification on the sentiment classes. We expect the result to be better than the regression task, as Deep Neural Networks (DNNs) are known to be very good at classification problems. Here, we choose the `flair` labels, which have an almost even positive (`44%`) and negative (`56%`) class distribution.

The model remains almost the same as in the regression task. However, we change the final affine layer to output probabilities for the two classes `0` and `1`. We also change the loss to a weighted cross-entropy loss.

:::info
We also tried the `sentiCR` scoring method, documented in Appendix [A2 - Task 1: Other Score Methods](#A2---Task-1-Other-Score-Methods). The class distribution for `sentiCR` is uneven (`86%` vs. `14%`). Even with the help of the weighted loss, the model still performs poorly.
:::

```python=
code_to_score_model = CodeToScoreModelTransformer(
    is_classification=True,
    class_weight=[1, 1.3],
    code_max_length=300,
    code_characters_size=len(i2c),
    hidden_size=400,
    num_hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=500)
code_to_score_model = code_to_score_model.to(code_to_score_model.device)
```

The `class_weight` prevents the model from always predicting the negative class (the majority). Generally, we want the positive class to carry more weight in the loss so that the model is forced to get more positive examples right; we tuned the weights (`[1, 1.3]`) so that the model does not become a trivial predictor.
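As a small illustration (not the exact code in `code_to_score_transformer.py`), a weighted cross-entropy loss can be set up in PyTorch as follows; the weight vector simply makes mistakes on the positive class cost more.

```python
import torch
import torch.nn as nn

# Class 0 = negative, class 1 = positive; the positive class gets weight 1.3.
class_weight = torch.tensor([1.0, 1.3])
criterion = nn.CrossEntropyLoss(weight=class_weight)

# Toy batch: logits for 4 examples over 2 classes, plus their true labels.
logits = torch.tensor([[2.0, -1.0], [0.5, 0.2], [-0.3, 1.1], [0.0, 0.4]])
labels = torch.tensor([0, 1, 1, 0])

print(criterion(logits, labels).item())
```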
The hyperparameters remain the same: the optimizer is `Adam` with a learning rate of `0.0002`, `weight_decay` is set to `0.2`, and the learning rate decay keeps `gamma=0.995`.

![](https://i.imgur.com/v2ySgAs.png)

:::info
The classification problem shares the same model ([`code_to_score_transformer.py`](https://gitlab.com/deepestcodereview/deepest_code_review/-/blob/master/code_to_score/models/code_to_score_transformer.py)) and the same notebook ([`code_to_score.ipynb`](https://gitlab.com/deepestcodereview/deepest_code_review/-/blob/master/code_to_score.ipynb)) as the regression problem.
:::

![](https://i.imgur.com/eT596Ls.png)

The result is better than for the regression problem. However, we see a lot of fluctuation in the accuracy plot, and the test accuracy is the same as that of the baseline model. The baseline here is a dummy predictor that always outputs the majority (negative) label, which achieves an accuracy of around `56%` (the fraction of negative examples).

![](https://i.imgur.com/ZuY5Zoy.png)

The ROC (Receiver Operating Characteristic) curve also shows that the model overfits, which is not reflected in the loss graph. The AUC (Area Under the Curve) is shown in the legend. (A minimal sketch of how such a curve can be computed is given at the end of this section.)

However, from the distribution of the predicted scores, we can see that, at least with the weighted cross-entropy loss, the model does not always predict the same class (unlike in the regression problem).

![](https://i.imgur.com/oFbcFV5.png)

![](https://i.imgur.com/iPBEIdw.png)

We also constructed [a sample attention visualization](https://deepestcodereview.gitlab.io/visualization/attention_viz.html?data=code_to_score_attention_classification.json) for this task. In most cases, it faces the same situation as the regression task: the attention to every character is the same. However, there are also some interesting findings.

![](https://i.imgur.com/AfMmISq.png)

As shown in this visualization, some attention is paid to the whitespace in the diff hunk.

In brief, we made several attempts to make the predictions work but failed to tune the model to a satisfying level. The classification setting looks the most promising. We believe it is too early to conclude that there is no link from the diff hunk to the sentiment scores, and we leave further exploration for future work.

<!-- // Describe the tools that you used and the reasons for their choice. Justify them in terms of the problem itself and the methods you want to use. // Describe your baseline model. Give a graphical representation of it. Describe your final model. Explain the evolutionary process to get to it. What other models did you try? What worked and what didn't? // Give detailed results for baseline and final models. Make sure you're very clear about what dataset was used, how train/validate/test partitioning was done, what measures were used. Explore surprises. Whenever a method didn't behave as expected, explore and try to figure out why. Please use visualizations whenever possible. Include links to interactive visualizations if you built them. -->
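For reference, the ROC curve and AUC mentioned above can be computed from the predicted positive-class probabilities with `scikit-learn`. The sketch below uses made-up labels and scores, not our model's actual outputs.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Hypothetical ground-truth labels and predicted P(positive) for a few examples.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.5, 0.7, 0.6, 0.3])

fpr, tpr, _ = roc_curve(y_true, y_score)
print("AUC:", auc(fpr, tpr))
```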
## Task 2: Predict Review Comment from Diff Hunk

In task 1, we observe that we do not get good performance when predicting sentiment scores from the code diff. Nevertheless, in this task, we build on top of task 1 and go one step further to predict the review comment itself. Note that in task 1 we rely on other tools for sentiment analysis, which might contain errors and confuse our models.

We can view the problem as a Natural Language Processing (NLP) translation task: essentially, we want to "translate" the diff hunk into a review comment. Therefore, we apply the state-of-the-art Transformer Encoder and Decoder models[^transformer]. The model is essentially the same as the one illustrated in the original ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762) paper [^attention_paper] (the architecture is shown below), with differences in the number of layers and their sizes. The inputs are the diff hunks; the outputs are the review comments.

[^attention_paper]: Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." In *Advances in neural information processing systems*, pp. 5998-6008. 2017.

![](https://i.imgur.com/s9vEb4N.png =400x)

*Source: Model illustration from the ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762) paper.*

:::info
We have also tried using an LSTM (Long Short-Term Memory) network as the decoder, documented in Appendix [A4 - Task 2: LSTM as Decoder](#A4---Task-2-LSTM-as-Decoder). We present the Transformer Encoder-Decoder model here because of its relatively good performance.
:::

Similar to Task 1, we constructed a dictionary of symbols (`100` total) so that we can embed the diff hunks before feeding them into the encoder. In addition, we built our own word vocabulary (the most common `400` words in the training set, tokenized with [`nltk.word_tokenize`](https://www.kite.com/python/docs/nltk.word_tokenize)) so that we can embed the review texts. An example of an encoded text is shown below. It contains some `UNK` tokens since we keep only `400` words in the vocabulary, but this is a compromise for efficiency, as a larger vocabulary would be infeasible to train on a single GPU.

```
Original: I wonder if it will be better if we just returned <code>getInstructorFeedbackEditCopyActionLink</code>, instead of making a new field?\n

After Encoding: I UNK if it will be better if we just returned < code > UNK < /code > , instead of making a new field ? PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD
```

:::info
The model can be found in [code_to_review_transformer.py](https://gitlab.com/deepestcodereview/deepest_code_review/-/blob/master/code_to_review/models/code_to_review_transformer.py). Both the tokenization and the training notebook can be found in [code_to_review.ipynb](https://gitlab.com/deepestcodereview/deepest_code_review/-/blob/master/code_to_review.ipynb).
::: ```python= code_to_review_model = CodeToReviewModelTransformer(startI_w=startI_w, padI_w=padI_w, code_characters_size=len(c2i), review_vocab_size=len(w2i), encoder_hidden_size=384, encoder_num_layers=4, encoder_num_attention_heads=4, encoder_intermediate_size=256, decoder_hidden_size=384, decoder_num_layers=4, decoder_num_attention_heads=4, decoder_intermediate_size=256) ``` We set the `hidden_size`, `num_layers`, and `attention_heads` for the encoder and decoder as shown in the code snippet above. The learning optimizer is set to `Adam` with a learning rate of `0.0002`. We also set the `weight_decay` to `0.0020` to avoid overfitting (We have tried that lower `weight_decay` will result in significant overfitting). Similar to the settings before, we also provided learning decay with `gamma=0.995`. The Figure below shows the loss for the training, validation, and test datasets, we managed to reduce the cross-entropy loss on the probability of words quite a bit. ![](https://i.imgur.com/m5DjC8z.png) From the loss graph above, we can see that loss gets stabilized after `30` epoch with `minibatch=64`. Thanks to the regularisation, we do not get obvious overfitting for the training data. We have the test loss, as long as the validation loss, to be around `3.5`. Below, we generate several texts from the training and test set using beam search (documented in Appendix [A1 - Beam Search](#A1---Beam-Search)). The `beam_size` is set to `4` and we only select the first in the `topK` candidates. The length to generate is set to `30`. **Train:** ``` Generated: <START> UNK UNK ( UNK ) UNK ( UNK ( UNK ( ) ) ) ) ) UNK ( UNK ( UNK ( UNK ( UNK ( ) ) ) ) Original: Is there any reason this is not sanitized ? PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD Generated: <START> UNK UNK ( UNK ) UNK ( UNK ( UNK ( ) ) ) ) ) UNK ( UNK ( UNK ( UNK ( UNK ( ) ) ) ) Original: Revert these as well . PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD Generated: <START> UNK UNK ( UNK ) UNK ( UNK ) UNK ( UNK ( UNK ( UNK ) ) ) ) ) UNK ( UNK ( UNK ( ) ) ) Original: Remove lines UNK instead and let this import be . You 'll also need to add UNK . to all the method calls in this file . PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD ``` **Test:** ``` Generated: <START> UNK UNK ( ) UNK ( ) ) UNK ( UNK ( UNK ( ) ) ) ; UNK ( UNK ( UNK ( UNK ( UNK ) ) ) Original: Can rename this class UNK - > UNK now that it has multiple uses ? 
PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD Generated: <START> UNK UNK ( ) UNK ( UNK ( UNK ) ) ) UNK ( ) ) UNK ( UNK ( UNK ( UNK ( UNK ) ) ) ) ) Original: I do n't think we should UNK validation checks just because we UNK it in one place . PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD Generated: <START> UNK UNK ( UNK ) UNK ( UNK ) UNK ( UNK ( UNK ( UNK ) ) ) ) ) UNK ( UNK ( UNK ( ) ) ) Original: Now this method has no difference from UNK ( ... ) . You can remove the file parameter and remove the three UNK tests in the method body , and name the method like how you did previously . Add : while you 're at it , maybe add a comment on why the tests are UNK . PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD ``` Sadly, we are facing the same issue as task 1. It looks like the model repeatedly generates the same text (with little variation), regardless of training or test datasets. We somehow expect the result as we know the task is inherently tricky based on the experience in task 1. Again, it might also be possible that there is no relationship between the diff hunk and the review comment under our simple Transformer Encoder-Decoder model. Nevertheless, we also apply the same visualization technique to know the masked self-attention in the decoder layer when generating the repeated text. :::info The [attention visualization on the encoder side](https://deepestcodereview.gitlab.io/visualization/attention_viz.html?data=code_to_review_encoder_attention.json) faces the same issue as task 1 where the attention to every character is almost the same. The code to generate the attention on the encoder side can be found in [`code_encoder_viz.ipynb`](https://gitlab.com/deepestcodereview/deepest_code_review/-/blob/master/code_to_review/code_encoder_viz.ipynb). ::: :::info The code to generate the attention on the decoder side can be found in [`code_to_review_transformer_viz.ipynb`](https://gitlab.com/deepestcodereview/deepest_code_review/-/blob/master/code_to_review/code_to_review_transformer_viz.ipynb). ::: ![](https://i.imgur.com/6YMB86c.png) From the [above visualization](https://deepestcodereview.gitlab.io/visualization/attention_viz.html?data=code_to_review_decoder_attention.json), we first notice that indeed for the generation task, the encoder has masked self-attention (only attention to the word that has been generated). Also, we find different attention pattern in different heads. We found that there are many `UNK` symbols. We suspect the code segment in the review comment result in many `UNK` symbols and thus result in predicting a lot of `UNK`. Therefore, we remove all the code segments in the training data and train another model with the same hyperparameters. 
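As a concrete illustration of this cleaning step (the actual preprocessing lives in the training notebooks, and the exact rules there may differ), code spans in a review comment can be stripped with a small regular expression:

```python
import re

def strip_code_segments(body: str) -> str:
    """Remove code spans wrapped in backticks from a review comment."""
    body = re.sub(r"`[^`]+`", " ", body)      # drop anything wrapped in backticks
    return re.sub(r"\s+", " ", body).strip()  # tidy up the remaining whitespace

comment = "I wonder if it will be better if we just returned `getInstructorFeedbackEditCopyActionLink`, instead of making a new field?"
print(strip_code_segments(comment))
# -> "I wonder if it will be better if we just returned , instead of making a new field?"
```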
> [color=#907bf7] Sample Code Segment in the Review Comment: > > ![](https://i.imgur.com/jeS2zRC.png) ![](https://i.imgur.com/siYXe6b.png) The loss remains the same. However, we get different generated text this time. We can see the general sentence structure in the generated text. **Test:** ``` Generated: <START> I 'm UNK the UNK of the UNK of UNK UNK UNK UNK , but I think you can UNK the UNK of the UNK of the UNK of the Original: UNK add one more case for ? PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD Generated: <START> This can be UNK to UNK the UNK . UNK the UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK Original: UNK ( ) sounds good . Can change to that PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD Generated: <START> This can be UNK to UNK the UNK . UNK the UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK Original: session 's UNK PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD ``` We can now view a more meaningful [visualization](https://deepestcodereview.gitlab.io/visualization/attention_viz.html?data=code_to_review_decoder_clean_attention.json) below. ![](https://i.imgur.com/9qVNhPG.png) However, our model for task 2 also suffers from the problem where the model always are predicting the same thing. We have manually verified several generations. ## Task 3: Predict Diff Hunk from Review Comment Our group does not get a good performance on the models designed for task 1 and 2. Our last attempt is to predict the diff hunk from the review comment itself. The task is very similar to task 2. The main difference is that we swap the prediction order. We also decide to use a pre-trained model on the review comments (natural languages), instead of using a plain Transformer Encoder-Decoder model. The encoder becomes the BERT model for NLP problem. For the decoder side, we decide to use LSTM (Long Short-Term Memory) to generate diff hunk. :::info We also tried to use the Transformer Decoder for the decoding side. The detail is shown in Appendix [A5 - Task 3: Transformer as Decoder](#A5---Task-3-Transformer-as-Decoder). ::: Here we present the architecture of using LSTM as a generator for diff hunk. ![](https://i.imgur.com/4EGh1f2.png) We use a pre-trained BERT model built by [HuggingFace](https://huggingface.co/transformers/model_doc/bert.html). The weights of the BERT are fixed and we merely use BERT as an encoder for the initial hidden state for the LSTM. We do not fine-tune BERT for three reasons. 
- With a batch size larger than `32`, the BERT model is too big to fit on the GPU instance we rent on the Google Cloud Platform.
- Our dataset of review comments (`7629` records) is not big enough to fine-tune the BERT model.
- We believe the pre-trained BERT model is already sophisticated enough to extract the hidden information from the review comment.

```python
review_to_code_model = ReviewToCodeModelLSTM(
    startI=c2i['<START>'],
    code_characters_size=len(c2i),
    feature_dim=768 * review_max_length,
    hidden_dim=2048,
    review_pretrained_weights='bert-base-uncased',
    character_vec_dim=256)
```

In the LSTM, we use a hidden dimension of `2048` and a character embedding of dimension `256`. We also implement [beam search](#A1---Beam-Search) to generate the diff hunk, as shown in the model architecture. The loss at the LSTM is the cross-entropy loss over the predicted likelihood of each character in the diff hunk.

:::info
The model is written in [`review_to_code_lstm.py`](https://gitlab.com/deepestcodereview/deepest_code_review/-/blob/master/review_to_code/models/review_to_code_lstm.py). The training code is written in [`review_to_code.ipynb`](https://gitlab.com/deepestcodereview/deepest_code_review/-/blob/master/review_to_code.ipynb).
:::

For the training process, we use the `Adam` optimizer with a learning rate of `0.0002` and a `weight_decay` of `0.0003`. We also apply an exponential learning rate decay after each epoch (`gamma=0.995`).

> [color=#907bf7] The hyperparameters have been fine-tuned to avoid overfitting and achieve a stabilized loss.

One important thing we do here is to also remove the code segments (anything surrounded by `` ` ``) from the training data (review comments). Since doing so for task 2 gave better text generation, we believe this data cleaning will also help the model understand the semantic information in the review comment.

![](https://i.imgur.com/RznZ9Vs.png)

The figure above shows the loss for training, validation, and test with `25` epochs and a minibatch size of `64`. It looks like we chose the right number of epochs, and the model does not overfit the data. The loss itself is not very informative, as it does not tell us whether the model is good or not. Therefore, we use a test example below to show what the model is doing and how good it is.

Suppose a new contributor wants to know for what kind of changes a reviewer in TEAMMATES would say, "I think this is a good implementation".

:::info
In the generation section of the training notebook ([`review_to_code.ipynb`](https://gitlab.com/deepestcodereview/deepest_code_review/-/blob/master/review_to_code.ipynb)), there are more generation examples. We just show one of them here.
:::

First, the sentence is tokenized by BERT, which attends to the different words. In the visualization below, we see that the state-of-the-art BERT model indeed pays attention (self-attention) to the essential words. For example, it knows that `this` is usually followed by `is`.

![](https://i.imgur.com/U7YNpYR.png =400x)

:::info
The attention visualization can be found in [`review_body_bert_viz`](https://gitlab.com/deepestcodereview/deepest_code_review/-/blob/master/review_to_code/review_body_bert_viz.ipynb).
:::

After that, the hidden states from BERT are projected to the initial hidden state of the LSTM. We feed the `<START>` token to the LSTM and let it generate the diff hunk sequentially with beam search. We use `beam=4` with `max_length=400` to generate the diff hunk.
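The sketch below illustrates this encode-then-generate idea with a frozen `bert-base-uncased` encoder feeding a character-level LSTM. It is a simplified stand-in for `ReviewToCodeModelLSTM`: the projection from the `[CLS]` state, the greedy decoding loop, and the character vocabulary size are illustrative assumptions, not the repository's exact design (which uses the full BERT hidden states and beam search).

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class ReviewToCodeSketch(nn.Module):
    """Frozen BERT encodes the review; an LSTM generates the diff hunk characters."""

    def __init__(self, num_chars, char_dim=256, hidden_dim=2048):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        for p in self.bert.parameters():          # keep the encoder frozen
            p.requires_grad = False
        self.to_h0 = nn.Linear(self.bert.config.hidden_size, hidden_dim)
        self.char_embed = nn.Embedding(num_chars, char_dim)
        self.lstm = nn.LSTM(char_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_chars)

    @torch.no_grad()
    def generate(self, input_ids, attention_mask, start_id, max_length=400):
        enc = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Use the [CLS] hidden state to initialise the LSTM (a simplification).
        h0 = torch.tanh(self.to_h0(enc.last_hidden_state[:, 0])).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        token = torch.full((input_ids.size(0), 1), start_id, dtype=torch.long)
        state, chars = (h0, c0), []
        for _ in range(max_length):               # greedy decoding here; the report uses beam search
            out, state = self.lstm(self.char_embed(token), state)
            token = self.out(out[:, -1]).argmax(dim=-1, keepdim=True)
            chars.append(token)
        return torch.cat(chars, dim=1)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("I think this is a good implementation", return_tensors="pt")
model = ReviewToCodeSketch(num_chars=100)
generated = model.generate(inputs["input_ids"], inputs["attention_mask"], start_id=1)
print(generated.shape)  # (1, 400) character indices
```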
:::info
In Appendix [A1 - Beam Search](#A1---Beam-Search), we discuss the effectiveness of beam search.
:::

Below is the generated diff hunk; we can see that, at the very least, the model is very effective at remembering the structure of a diff hunk.

```
--- InstructorFeedbackEditPageDataTest.java
+++ InstructorFeedbackEditPageDataTest.java
@@ -0,0 +1,135 @@
+package teammates.test.cases.ui.pagedata;
+
+import static org.testng.AssertJUnit.assertEquals;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Comparator;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+im
```

We can use the same [Diff Hunk Visualizer](https://deepestcodereview.gitlab.io/visualization/code_diff.html) to view it.

![](https://i.imgur.com/ASV3JhB.png)

Now the contributor knows that the changes made to `InstructorFeedbackEditPageDataTest.java` are generally good. Looking at the diff hunk, it is likely that the reviewer gave positive comments when this file was added to the project.

> [color=#907bf7] We also confirmed this with the reviewer from TEAMMATES. He says that a senior developer introduced these page changes with good code quality.

We can also use the same "attention" visualizer to view this generation. In particular, the attention for the LSTM becomes the connectivity[^attention_lstm].

[^attention_lstm]: Madsen, Andreas. "Visualizing memorization in RNNs." *Distill* 4, no. 3 (2019): e16. https://distill.pub/2019/memorization-in-rnns/

$$
\text{connectivity}(t, \tilde{t}) = \left\| \frac{\partial \left( h_{L}^{\tilde{t}} \right)_{k}}{\partial x^{t}} \right\|_{2}
$$

The connectivity proposed by Madsen[^attention_lstm] is the L2 norm of the gradient of the hidden state at generation step $\tilde{t}$ with respect to the input at an earlier step $t$ ($t < \tilde{t}$). In other words, we want to know which previous positions are important for generating the current character. (A small autograd sketch of this computation is given at the end of this section.)

Using the [same "attention" (connectivity) visualization tool](https://deepestcodereview.gitlab.io/visualization/attention_viz.html?data=review_to_code_connectivity.json), we can view the connectivity.

> Note that we reuse the User Interface (UI) from the previous attention visualization in the interactive demo. There is no HEAD concept in the LSTM connectivity.

:::info
The code to generate the connectivity can be found in [`review_to_code_lstm_viz.ipynb`](https://gitlab.com/deepestcodereview/deepest_code_review/-/blob/master/review_to_code/review_to_code_lstm_viz.ipynb)
:::

![](https://i.imgur.com/kCR8twK.png)

We can see that characters such as the `;` and the `+` on the previous lines are essential for generating the current character `+` (highlighted in orange). The example also shows that the LSTM has some trouble remembering long-term dependencies.

The paragraphs above give only one generation example. An important thing to note, however, is that we do not suffer from the problem in tasks 1 and 2 where the model always predicts the same thing. For example, the statement "this is bad practice" gives the following diff hunk. We have manually verified several generations.

```
--- InstructorFeedbackAbstractAction.java
+++ InstructorFeedbackAbstractAction.java
@@ -0,0 +1,158 @@
+package teammates.ui.controller;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import
```

In short, we believe our attempt in task 3 is a good proof of concept for building a sample code generator.
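As referenced above, the connectivity can be computed directly with autograd: embed the input sequence, run it through the recurrent model, back-propagate from one component of the hidden state at step $\tilde{t}$, and take the L2 norm of the gradient at each earlier position. The sketch below uses a small randomly initialized LSTM rather than our trained model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, embed_dim, hidden_dim, seq_len = 100, 16, 32, 12
embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

tokens = torch.randint(0, vocab_size, (1, seq_len))
x = embedding(tokens)          # (1, seq_len, embed_dim)
x.retain_grad()                # keep gradients for this non-leaf tensor
out, _ = lstm(x)               # (1, seq_len, hidden_dim)

t_tilde, k = seq_len - 1, 0    # generation step and hidden unit of interest
out[0, t_tilde, k].backward()

# connectivity(t, t_tilde): L2 norm over embedding dims of d h_k(t_tilde) / d x(t)
connectivity = x.grad[0].norm(dim=-1)
print(connectivity)            # one value per input position t
```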
<!-- // Describe the tools that you used and the reasons for their choice. Justify them in terms of the problem itself and the methods you want to use. // Describe your baseline model. Give a graphical representation of it. Describe your final model. Explain the evolutionary process to get to it. What other models did you try? What worked and what didn't? // Give detailed results for baseline and final models. Make sure you're very clear about what dataset was used, how train/validate/test partitioning was done, what measures were used. Explore surprises. Whenever a method didn't behave as expected, explore and try to figure out why. Please use visualizations whenever possible. Include links to interactive visualizations if you built them. -->

## Conclusion and Future Work

In conclusion, we found it hard to predict the sentiment score or the review body from the diff hunk; several experiments failed to achieve good performance. However, we hesitate to conclude that there is no relationship from diff hunks to review comments. Our tuning attempts on the sentiment classification task show relatively good results, but due to time constraints, we did not have time to explore them further.

In the other direction, from review comments to diff hunks, the results are promising. Our attempts show that the model not only learns to generate valid diff hunks in [Git format](https://git-scm.com/docs/git-diff) but is also able to figure out which diff hunks will result in good/bad review comments. We believe that, with further tuning and more training data, we could build a sample code generator for production.

There are several improvements we could make to our project. For tasks 1 and 2, we see several directions for optimization:

- The diff hunk alone might not be enough to predict the review comment. We could augment the data with more context, such as the whole changed files. In addition, in the case of unbalanced training data, one should employ a weighted loss to prevent the model from picking the easiest strategy and outputting the same thing every time.
- The training data is not enough to train the Transformer model, so we need more of it. In fact, we have already crawled the review comments for [elasticsearch](https://github.com/elastic/elasticsearch) (another open-source Java project), which has seven times as much training data. However, due to time constraints, we did not have time to train on it.
- We could further improve the model design and hyperparameter tuning. For example, there should be a better tokenization for the diff hunk. One could pre-train a model like [`CodeBert`](#A3---CodeBert) on the target repository (e.g., TEAMMATES) and fine-tune it.

In addition, we think there is room for improvement in task 3. Although the file names (the `+++` and `---` chunk) already give interesting information according to the reviewer from TEAMMATES, most generations do not provide much meaningful content beyond a bunch of changes to `import` statements. We would expect method or statement changes, which carry more information. This might be because our training data contains many `import` chunks. Therefore, future work could emphasize cleaning the data or obtaining more data related to method and statement changes.
## Team Contributions

- Pu Xiao (Taking the class as **Graded**) (55% of total work)
    - Wrote 80% of the report
    - Built the attention visualization
    - Implemented beam search
    - Wrote the models for task 3
    - Integrated `CodeBert` into tasks 1 and 2
    - Wrapped up all models for tasks 1, 2 and 3
    - Did the final hyperparameter tuning on tasks 1, 2 and 3
- Joey Lou (Taking the class as **P/NP**) (20% of total work)
    - Wrote 10% of the report
    - Did the initial analysis on `CodeBert`
    - Did the initial EDA on the dataset
    - Wrote the initial models for task 2
- Ao Liang (Taking the class as **P/NP**) (25% of total work)
    - Wrote 10% of the report
    - Conducted the sentiment analysis
    - Wrote the initial models for task 1

<!-- // This information must be in a section called Team Contributions at the end of your report. Please give a percentage breakdown of the effort from each team member, and what they worked on. Please discuss this within your team to make sure every member agrees with the breakdown. -->

## Appendix

### A1 - Beam Search

We believe that beam search is essential for our sequence generation, and therefore we implement it for both the LSTM and the Transformer.

![](https://i.imgur.com/NvtkxmJ.png)

The figure illustrates the effectiveness of beam search for the LSTM. Instead of greedily choosing the boxed node at each step, we expand the search tree and try to find the globally maximal total probability. In the case shown above, the path highlighted in red gives the largest total probability. Note that, in the actual implementation ([`lstm_beam_search.py`](https://gitlab.com/deepestcodereview/deepest_code_review/-/blob/master/util/lstm_beam_search.py)), we use log probabilities for numerical stability.

<br/>

![](https://i.imgur.com/RReeXhb.png)

The implementation of beam search for the Transformer Decoder ([`transformer_beam_search.py`](https://gitlab.com/deepestcodereview/deepest_code_review/-/blob/master/util/transformer_beam_search.py)) is a bit different because the generation processes of the LSTM and the Transformer Decoder differ. In the Transformer Decoder, we first feed a sequence filled with `<pad>` tokens and then replace the tokens one by one with the character of maximal probability. Again, instead of being greedy here, we can use the total probability (all step probabilities multiplied together) and a priority queue to pop the candidate with the largest total probability.
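To make the procedure concrete, here is a small, generic beam-search sketch over a step function that returns log-probabilities for the next token. It mirrors the idea behind `lstm_beam_search.py` but is not the repository's code; the toy step function at the bottom is a stand-in for a real model.

```python
import torch
import torch.nn.functional as F

def beam_search(step_fn, start_token, beam_size=4, max_length=30):
    """Keep the `beam_size` prefixes with the highest total log-probability
    and extend each of them at every step."""
    beams = [([start_token], 0.0)]                 # (prefix, total log-prob)
    for _ in range(max_length):
        candidates = []
        for prefix, score in beams:
            log_probs = step_fn(prefix)            # log-probs over the next token
            top = torch.topk(log_probs, beam_size)
            for logp, token in zip(top.values.tolist(), top.indices.tolist()):
                candidates.append((prefix + [token], score + logp))
        # keep only the best `beam_size` candidates
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]                             # topK=1: return the best prefix

# Stand-in step function: a fixed random "model" over a 10-token vocabulary.
torch.manual_seed(0)
fake_logits = torch.randn(10, 10)
def toy_step(prefix):
    return F.log_softmax(fake_logits[prefix[-1] % 10], dim=-1)

print(beam_search(toy_step, start_token=0, beam_size=4, max_length=5))
```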
Here we show a concrete example to illustrate the effectiveness of the LSTM beam search. For task 3, we generate diff hunks for the sentences "can you change this", "this is bad practice" and "test convention".

With beam search (`beam=4`, `topK=1`), the LSTM generates:

```
--- InstructorCourseStudentDetailsPageDataTest.java
+++ InstructorCourseStudentDetailsPageDataTest.java
@@ -0,0 +1,19 @@
+package teammates.test.cases.ui.pagedata;
+
+import static org.testng.AssertJU

--- InstructorFeedbackAbstractAction.java
+++ InstructorFeedbackAbstractAction.java
@@ -0,0 +1,158 @@
+package teammates.ui.controller;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import

--- InstructorFeedbackAbstractAction.java
+++ InstructorFeedbackAbstractAction.java
@@ -0,0 +1,165 @@
+package teammates.ui.controller;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import
```

Without beam search, the LSTM generates:

```
--- InstructorFeedbackResultsPageData.java
+++ InstructorFeedbackResultsPageData.java
@@ -25,69 +55,139 @@
     public String groupByTeam = null;
     public String showStats = null;
     public int s

--- InstructorFeedbackResultsPageData.java
+++ InstructorFeedbackResultsPageData.java
@@ -25,69 +55,139 @@
     public String groupByTeam = null;
     public String showStats = null;
     public int s

--- InstructorFeedbackResultsPageData.java
+++ InstructorFeedbackResultsPageData.java
@@ -25,69 +55,139 @@
     public String groupByTeam = null;
     public String showStats = null;
     public int s
```

Note that the three diff hunks generated without beam search are identical! Greedy decoding is therefore not a good choice for the LSTM.

:::warning
Note that in our sequence generation, for simplicity, we do not introduce an `<END>` token. Therefore, the beam search only terminates once the maximal length has been reached. Generating dummy characters just to meet the maximal length requirement may hurt the total probability, so we should introduce an `<END>` token in future work.
:::

### A2 - Task 1: Other Score Methods

In task 1, we specifically present `textblob`. In fact, we also tried the other scoring methods and obtained results similar to `textblob`.

| Performance<br/>/Method | `textblob` | `vader` |
|:----------------------- |:------------------------------------ |:------------------------------------ |
| Loss | ![](https://i.imgur.com/qKaiaCh.png) | ![](https://i.imgur.com/AXtE2mi.png) |
| Output Range | ![](https://i.imgur.com/q3WEfd0.png) | ![](https://i.imgur.com/P2rOk3a.png) |

| Performance<br/>/Method | `flair` | `sentiCR` |
|:----------------------- |:------------------------------------ |:------------------------------------ |
| Loss | ![](https://i.imgur.com/v2ySgAs.png) | ![](https://i.imgur.com/hJrnFCe.png) |
| Accuracy | ![](https://i.imgur.com/9LfaVOh.png) | ![](https://i.imgur.com/jgJHzw1.png) |
| Output Range | ![](https://i.imgur.com/oFbcFV5.png) | ![](https://i.imgur.com/jIRpz8u.png) |

:::info
The model is the same and can be found in [code_to_score_transformer.py](https://gitlab.com/deepestcodereview/deepest_code_review/-/blob/master/code_to_score/models/code_to_score_transformer.py). The training notebook can be found in [code_to_score.ipynb](https://gitlab.com/deepestcodereview/deepest_code_review/-/blob/master/code_to_score.ipynb) (to use the other methods, set the `score_method` variable accordingly).
:::

### A3 - CodeBert

In this report, we mention that because the diff hunk is unstructured, we cannot use a code parser to tokenize the code itself, and we therefore use character-level encoding. Besides the character-level encoding, we also tried a pre-trained, BERT-like Transformer encoder on the diff hunk: [CodeBert](https://huggingface.co/huggingface/CodeBERTa-small-v1) from HuggingFace, which has been trained on the [CodeSearchNet](https://github.blog/2019-09-26-introducing-the-codesearchnet-challenge/) dataset from GitHub. We applied it to tasks 1 and 2, but we did not get improved results.
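For reference, extracting hidden states from this pre-trained code encoder looks roughly like the sketch below, using the Hugging Face `transformers` auto classes. How these hidden states are wired into our task 1/2 models is omitted here, and the exact preprocessing in the notebooks may differ.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load the pre-trained CodeBERTa checkpoint referenced in this section.
tokenizer = AutoTokenizer.from_pretrained("huggingface/CodeBERTa-small-v1")
model = AutoModel.from_pretrained("huggingface/CodeBERTa-small-v1")

diff_hunk = """+    private void validateInstructorInstitute(InstructorAttributes instructor, String institute)
+            throws InvalidParametersException {
+        AccountAttributes account = getAccount(instructor.email);"""

inputs = tokenizer(diff_hunk, return_tensors="pt", truncation=True, max_length=300)
with torch.no_grad():  # the encoder is used frozen, as in tasks 1 and 2
    hidden_states = model(**inputs).last_hidden_state

print(hidden_states.shape)  # (1, num_tokens, hidden_size)
```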
An important note on the pre-trained model is that it was trained on multiple programming languages (`go`, `java`, `javascript`, `php`, `python` and `ruby`) across various projects. Therefore, it is very hard to know what the hidden states generated by the model represent. There is [one simple successful application](https://huggingface.co/huggingface/CodeBERTa-language-id) that uses `CodeBert` to predict the programming language of a given code segment. We believe that our tasks 1 and 2 are much more complicated than that, so the hidden states `CodeBert` extracts from our diff hunks might not be helpful. Indeed, we see almost the same performance.

:::info
Here, we show an example of applying `CodeBert` to task 1. The training notebook can be found in [code_to_score_code_bert.ipynb](https://gitlab.com/deepestcodereview/deepest_code_review/-/blob/master/code_to_score_code_bert.ipynb). We also tried `CodeBert` on task 2; that training notebook can be found in [code_to_review_code_bert.ipynb](https://gitlab.com/deepestcodereview/deepest_code_review/-/blob/master/code_to_review_code_bert.ipynb).
:::

| | Task 1 | Task 1 (with `CodeBert`) |
| ------------ |:------------------------------------ |:------------------------------------ |
| Loss | ![](https://i.imgur.com/qKaiaCh.png) | ![](https://i.imgur.com/xx4o6vF.png) |
| Output Range | ![](https://i.imgur.com/volWMWl.png) | ![](https://i.imgur.com/taccKOp.png) |

In the comparison above, we observe that the two models have the same performance.

![](https://i.imgur.com/fYkA8K3.png)

Here we show the [attention visualization](https://deepestcodereview.gitlab.io/visualization/attention_viz.html?data=code_bert_attention.json) to examine the self-attention `CodeBert` pays. We see a lot of strange symbols in the tokenization, and it is hard to interpret the meaning of each token.

:::info
The visualization for `CodeBert` can be found in [code_bert_viz.ipynb](https://gitlab.com/deepestcodereview/deepest_code_review/-/blob/master/code_to_review/code_bert_viz.ipynb).
:::

Suspecting that `CodeBert` might not produce the hidden states we want out of the box, we also tried to fine-tune it. However, this attempt was not successful, as shown in the loss figure and output comparison below.

![](https://i.imgur.com/U5OVwVS.png =330x)![](https://i.imgur.com/BkMPklb.png =330x)

This result partially explains why we do not fine-tune the BERT model in Task 3: we believe our dataset is too small to tune the `CodeBert` model with its 84M parameters.

### A4 - Task 2: LSTM as Decoder

On the decoder side of task 2, we use the Transformer Decoder to generate text. We also built a model that uses an LSTM to generate text.

:::info
The model can be found in [`code_to_review_lstm`](https://gitlab.com/deepestcodereview/deepest_code_review/-/blob/master/code_to_review/models/code_to_review_lstm.py). The training process shares the same code in [`code_to_review.ipynb`](https://gitlab.com/deepestcodereview/deepest_code_review/-/blob/master/code_to_review.ipynb).
The visualization for the LSTM Decoder can be found in [code_to_review_transformer_viz.ipynb](https://gitlab.com/deepestcodereview/deepest_code_review/-/blob/master/code_to_review/code_to_review_transformer_viz.ipynb). ::: ```python= code_to_review_model = CodeToReviewModelLSTM(startI_w=startI_w, code_characters_size=len(c2i), code_max_length=400, review_vocab_size=len(w2i), hidden_size=384, num_hidden_layers=4, num_attention_heads=4, intermediate_size=256, word_embed_dim=256, hidden_dim=256, lstm_layers=1) ``` The training process is left unchanged. The code segment in the review body is also removed. We see that there is one sudden jump in the training process. However, with more and more iterations, the loss gets stabilized. The test and validation loss is a little bit higher than the model with Transformer Encoder. ![](https://i.imgur.com/UZbTxmG.png) Here we show some sample generated text. **Test:** ``` Generated: Same here . I UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK Original: Not too UNK of the double UNK UNK here , but it could be UNK to just PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD Generated: Same here . I UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK Original: good UNK : - ) UNK from above : In other UNK , the main purpose of each test : : testing the UNK of HTML file If the UNK test is to test whether the UNK and the UNK of values work properly , should n't they be tested UNK too ? UNK them UNK the test , does n't it ? Also , I think we should UNK this UNK of `` testing UNK and UNK of values '' in comments so that later UNK UNK remove this test thinking it is a UNK . PAD PAD PAD Generated: Same here . I UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK Original: Similarly here . Should in the UNK method UNK ? PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD ``` We get a similar result as the Transformer Decoder model, where the model keeps outputting the same text. 
### A5 - Task 3: Transformer as Decoder

On the decoder side of task 3, we use an LSTM to generate the diff hunk. We also built a model that uses the Transformer Decoder to generate the diff hunk.

:::info
The model can be found in [`review_to_code_transformer`](https://gitlab.com/deepestcodereview/deepest_code_review/-/blob/master/review_to_code/models/review_to_code_transformer.py). The training process shares the same code in [`review_to_code.ipynb`](https://gitlab.com/deepestcodereview/deepest_code_review/-/blob/master/review_to_code.ipynb). The visualization for the Transformer Decoder can be found in [review_to_code_transformer_vis.ipynb](https://gitlab.com/deepestcodereview/deepest_code_review/-/blob/master/review_to_code/review_to_code_transformer_vis.ipynb).
:::

The training process is the same, and the losses on all datasets are almost identical.

```python=
review_to_code_model = ReviewToCodeModelTransformer(
    startI=c2i['<START>'],
    padI=c2i['PAD'],
    code_characters_size=len(c2i),
    review_pretrained_weights='bert-base-uncased')
```

![](https://i.imgur.com/Fuefkaf.png)

Here we show the same example, where we want to generate a diff hunk for the review comment "I think this is a good implementation" (`beam_size=4`, `topK=1`).

```
<START>--- StudentProfileAction.java
+++ StudentProfileAttributesTest.java
@@ -0,0 +1,11 @@
+package teammates.test.cases.ui.pagedata;
+
+
+import java.util.ArayList;
+import java.util.ArayList;
+import java
```

![](https://i.imgur.com/aDq54Lp.png)

This time, we get a different generated diff hunk. The reviewer from TEAMMATES says that the model seems confused in this case, as the file change from `StudentProfileAction.java` to `StudentProfileAttributesTest.java` does not make sense.