When Deep Neural Net meets Code Review

Online Interactive Post: https://hackmd.io/@DeepestCodeReview/code_review_with_dnn

Code Avaliable on: https://gitlab.com/deepestcodereview/deepest_code_review

Abstract-Peer code review is an essential step in the development of large software. However, the manual process is sometimes repetitive and even tedious. The community has developed various tools to automate the process. Most of them are linter/static analysis tools. They solve parts of the problems but have several limitations. In this project, we intend to employ Deep Neural Networks (DNN) to automate the process. Specifically, we conduct some proof-of-concept works to examine the relationship between the code changes and the review comment. Our initial attempt shows that our models fail to find a useful link from code changes to review comments. However, we are able to generate something interesting reversely from review comments to code changes. Further work includes more investigations on the DNN for review comment prediction. In addition, more sophisticated tuning on our successful models will also be beneficial.

When Deep Neural Net meets Code Review

Introduction

Code review is a critical process in software development to help developers discover issues that may have been overlooked by themselves. Interestingly, a large portion of the review process is often repetitive, with over 75% related to common issues such as documentation, style, and structure ^[1]. Therefore, we see a lot of potential for improvements in code review by automating parts of the process.

Currently, in industry, the most popular method is using linters/static analysis tools. Linters are made up of rules related to coding best practices. Whenever these practices are violated, linters are activated in IDE (Integrated Development Environment). However, linters require a significant amount of manual customization and are often inefficient ^[2]. Most of the reviews still need to be manual.

The case below shows a typical scenario that a linter is able to check.

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

The example linter is called CheckStyle^[3], which is mainly used to enforce coding style checks. The error shows a coding style violation where the { symbol in the Java code should start at the previous line. This check in CheckStyle is relatively easy to write with the help of Abstract Syntax Tree (AST). In fact, most linters rely on AST to examine the structure of the code.

However, a linter cannot give the following comment below.

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

The reviewer talks about the best practice in Java, which is to use Builder Pattern to replace a monstrous constructor. Note that right now it is not a monstrous constructor but the reviewer foresees the occurrence in the future. To give this comment, one should have

The overall knowledge of a project (e.g., how the class will be used (in the future) so that the Builder Pattern is suitable to be used)

A decent Software Engineering knowledge (e.g., Using Builder Pattern rather than monstrous constructors is a best practice)

Theoretically, as the code is structured, by examining the AST tree, we can know how many arguments are there in the constructor and fire violations of Builder Pattern when they exceed a certain threshold. However, this will result in many false positive/negative as at least in the case shown above, a threshold bigger than two will result in a false negative. In a word, the setting of fixed threshold is not flexible enough.

In brief, we see that linters/static analysis tools cannot produce review comments on code changes that require a deeper understanding of a project. As a result, there are still significant efforts put into the manual review process. Therefore, in this project, we want to know if we can automate the manual review process, specifically by knowing if there is any relationship between the code review comments and the code changes beyond what linters/static analysis tools can check.

The term "review comments" referred from this point will be all manual reviews given by humans (not the automated reviews generated by linter/static analysis tools).

If there is a relationship from code changes to review comments, we could build an automated tool, which will be smarter than linters, to alleviate the manual review process. Even the produced comments might not be 100% correct, the generated message could give contributors some hints for code improvements before asking for manual review.
More realistically, even we are not able to generate the correct review comment due to limited abilities of models or lack of training data, if we could score the code changes, the generated score will at least tell contributors how good his/her changes are.
More interestingly, if there is a reverse relationship from review comments to code changes, we could build a sample code generator for new contributors to get on board to a project. These samples will help new contributors to know what are the good and bad practices specific to the project.

There are not many attempts for the tasks we described above. Researchers in Microsoft had an excellent experiment on building a deep neural network to classify the relevance of code reviews^[2:1]. Also, they did a user study on the classifier to know the effectiveness of their models. Inspired by their work, we decide to continue using the deep neural network (PyTorch) to accomplish the above three mentioned goals.

Code Review from GitHub

Specifically, we decide to examine the relationship between code review and code changes in Java for one specific project. Therefore, our models will be project-dependent.

The project we choose is TEAMMATES and there are several reasons why we choose it.

Java is a very mature and structured programming language which has been developed for decades. There are tons of best practices in Java, and the language is usually used in Software Engineering classes. Hence we expect a significant impact of our model in the language community if we succeed.
TEAMMATES is an open-source project hosted at School of Computing (SoC), National University of Singapore (NUS), to be a model for Software Engineering classes. Therefore, we expect the code review to be accurate and representative. There are around 3K pull requests with an average of 10 comments per pull request. The review comments are not generated by linter/static analysis. All of them are from senior developers.
One of our team members is the team lead and a frequent code reviewer for TEAMMATES. Therefore, we can conduct some user studies to test the effectiveness of our model, same as the Microsoft Paper ^[2:2].

Data Preparation

We use GitHub Comment API to fetch all review comments from TEAMMATES.

The code to retrieve comments can be found in fetch_from_github.py.

There are two fields that we care most about: the diff_hunk field which is the code changes in the format of Git Diff and the body which is code review comment.

--- AccountsLogic.java
+++ AccountsLogic.java
@@ -162,6 +164,16 @@ public InstructorAttributes joinCourseForInstructor(String encryptedKey, String
return instructor;
}

+    private void validateInstructorInstitute (InstructorAttributes instructor, String institute)
+            throws InvalidParametersException {
+        assert instructor != null : "Should have been checked in validateInstructorJoinRequest() method.";
+        AccountAttributes account = getAccount(instructor.email);

There are two problems here:

getAccount accepts googleId rather than email

If instructor does not have account stored in the database (e.g. it is the first time he/she use TEAMMATES), account will be null. In this case, they can still modify the institute name and bypass the check.

Pu Xiao (Code Reviewer)

The above two boxes show a typical diff hunk and its associated review comment.

Because the diff hunk is generated by Git. We can also view it visually on code_diff.html

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Exploratory Data Analysis (EDA)

After running the comment fetching script, we get 22984 pairs of diff hunk and review comments. We filter all comments that are replying to other comments as we do not want to introduce the complicated context understand in our task.

Hence, after data cleaning, we have 10201 records of data and we split them into training (74.8%), validation (13.2%), and test datasets (12%). We plotted the distribution of length of diff_hunk and body fields.

The raw data (in JSON array) from GitHub API can be found in folder data_teammates.

The notebook for EDA can be found in eda_teammates.ipynb.

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

The medium of the length of diff_hunk is 697. The quartile is 387. The medium of the length of body (review comment) is 82. Due to these distributions, without losing most of the information, in our later models, we cut off the length of the diff hunk to contains only the first 300/400 characters. Similarly, we also set the length of the review comment to contain the first 100 words maximumly.

Sentiment Analysis on Review Comment

The raw data does not contain a review score for code changes obtained. Therefore, we decide to augment the data to give a rating to the diff hunk. Particularly, we are going to do a sentiment analysis of review comments firstly.

Since the code reviews in the body field (in raw data) are natural languages, we can use pre-trained sentiment analysis tools to produce sentiment classes or sentiment scores. We have found four popular sentiment analysis tools: nltk vader, textblob, NLP flair and sentiCR. The sentiCR is specifically trained on code review domain, which is expected to be very helpful for our scoring purposes. However, sentiCR is trained only with around one thousand of data; we also expect it will not give a very representative score.

Tools nltk vader and textblob will output scores. We make a short comparison of scores output by them below. As shown in the distribution, the two methods give a continuous score range from [-1,1].

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

The two methods usually give different scores on one review. However, they hardly show opposite attitudes.

The analysis and visualization can be found in the sentiment score comparison section of code_to_score.ipynb.

The other two methods, sentiCR and flair, give sentiment labels: [negative, positive]. The result of them do not overlap much. sentiCR labels almost 86% of our review records as positive while flair labels only 44% of our review records as positive.

Task 1: Predict Sentiment Score from Diff Hunk

Our first attempt is to predict sentiment scores from diff hunks. This task is viewed as an intermediate task for Task 2. In addition, we believe that the predicted sentiment score from diff hunks will also be beneficial as it can serve as an indicator for contributors to know how good his/her code changes are.

Regression on Sentiment Score

Here, among the sentiment analysis methods mentioned in the exploratory data analysis, we first choose textblob as our scoring method for regression.

We also tried vadar for regression using similar models, which will be presented in Appendix A2 - Task 1: Other Score Methods. We obtained similar results.

For diff hunk, because the code in diff_hunk is unstructured, we cannot use the Abstract Syntax Tree (AST) to parser it or tokenize it. Therefore, we built a vocabulary library at the character-level for diff hunk. We have constructed a dictionary of most common symbols (100 total) for diff hunk.

We use Transformer Encoder^[4] to extract hidden states from the diff hunk and several affine non-linear layers to project the result to a single score for the regression task. The Figure below shows the overall architecture of the model.

The reason why we choose Transformer Encoder is mainly because of its self-attention on the input. We believe the self-attention mechanism will help us extract useful information from the diff hunk.




code_to_score_model = CodeToScoreModelTransformer(
    code_max_length=300, code_characters_size=len(i2c),
    hidden_size=400, num_hidden_layers=4, num_attention_heads=4, 
    intermediate_size=500)

For the encoder, we have 4 hidden layers with 4 attention heads each. The hidden size is set to 400 and the intermediate size^[5] is set to 500.

As shown in the EDA section, we found that all textblob scores lay on the range between -1 and 1. Therefore, using MSE (Mean Square Loss) might not be appropriate as the loss will be too small. Instead, we use L1 loss.

The model file can be found in code_to_score_transformer.py.

The training notebook can be found in code_to_score.ipynb.

The learning optimizer is set to Adam with a learning rate of 0.0002. We also set weight_decay to 0.2 for regularization. Furthermore, an exponential learning rate decay is used with gamma=0.995. The learning rate will be multiplied by gamma after each epoch to stabilize the loss.

The dash line indicates the epoch. The training visualization in the rest of the report follows the same convention.

The above Figure shows the training, validation, and test loss for 15 epoch with 64 batch size. Overall, there is no obvious overfitting in the case. In addition, we get around 0.1 L1 error on the prediction, which means the predicted sentiment score is within 0.1 from the ground truth.

The loss does not look normal for a typical neural network training. Typically, it is because of a mediocre learning rate. However, we have tried other learning rates, such as 0.01 and 0.001, and found that they end with similar results. Therefore, we believe this is the best effort we can make on the model by tuning the learning scheme.

To know what the model is outputting, we plot the distribution of predicted scores and the ground truths.

We see that our model "learns" to predict some values around zero for all inputs, which explains why the loss gets fixed after several steps. The phenomenon might also indicates that our model finds there is no relationship between the diff hunk and the sentiment score. Hence, it decides to output the mode value 0 to minimize the L1 loss. However, it is too early to conclude this.

We also find the same behavior for test data.

Nevertheless, despite the poor performance, one interesting thing we can do is to know what attention the encoder is paying to. We have built an interactive attention visualizer.

Sadly, there is no distinct attention pays to different characters, which is somehow expected as our model does not perform well on the prediction.

We have tried to use a pre-trained encoder for diff hunk. The Encoder has a more interesting attention pattern, which has been documented in Appendix A3 - CodeBert.

The code to generate the attention visualization can be found on code_to_score_viz.ipynb.

We guess the poor performance might due to the uneven distribution of scores. We might need to add weighted L1 loss.

Due to time constraints, we do not have time to add weighted L1 loss to our model and decide to left it for future work. However, we manage to add weighted cross-entropy loss in the classification task below.

Classification on Sentiment Class

We also tried classification on the sentiment classes. We would expect the result to be better than the regression task as we know that Deep Neural Network (DNN) is really good at classification problems.

Here, we choose flair method, which has almost even positive (44%) and negative (56%) class distribution. The model remains almost the same, as shown in the regression task. However, we change the final affine layer to output probability of two classes 0 and 1. We also change the loss to be weighted cross-entropy loss.

We also tried sentiCR scoring method, which will be documented in Appendix A2 - Task 1: Other Score Methods. The classes distribution in senticr is uneven (86% v.s. 14%). Even with the help of weighted loss, the model still performs poorly.






code_to_score_model = CodeToScoreModelTransformer(
    is_classification=True, class_weight=[1, 1.3],
    code_max_length=300, code_characters_size=len(i2c),
    hidden_size=400, num_hidden_layers=4, num_attention_heads=4, 
    intermediate_size=500)
code_to_score_model = code_to_score_model.to(code_to_score_model.device)

The class_weight is there to prevent the model from keep predicting negative class (the majority). Generally, we want the positive class to have more weights on the loss so that the model is force to get more positive classes right. We have tuned the weights ([1, 1.3]) to avoid the model becomes a trivial predictor.

The hyperparameters remain the same. The optimizer is set to Adam with a learning rate of 0.0002. The weight_decay is set to 0.2. The learning rate decay remains as gamma=0.995.

The training code for the classification problem shares the same model (code_to_score_transformer.py) and same notebook (code_to_score.ipynb) as the regression problem.

The result is relatively good than the regression problem. However, we see a lot of fluctuations in the accuracy plot. The test accuracy is the same as the baseline model. The baseline model can be a dummy predictor which always gives a positive label. It should have accuracy around 56% (the fraction of the positive classes).

The ROC (Receiver Operating Characteristic) curve also shows the overfitting of the model, which does not reflect in the loss graph. The AUC (Area Under the Curve) is shown in the legend.

However, from the distribution of the predicted scores, we can see, at least with weighted cross-entropy loss, the model does not always choose to predict the same class (unlike the regression problem).

We also constructed a sample attention visualization for this task. In most of cases, it faces the same situation as the regression task: the attention to every character is the same.

However, there are also some interesting findings.

As shown in this visualization, some attentions have been paid to the whitespaces in the diff hunk.

In brief, we make some attempts to make the predictions work but fail to tune it towards the best. Specifically, we find it is likely to get the model to work for classification problems. We believe it is too early to conclude there is no link from diff hunk to the sentiment scores and leave further explorations for future work.

Task 2: Predict Review Comment from Diff Hunk

In task 1, we observe that we do not get a good performance in prediction from code diff to sentiment scores.

Nevertheless, in this task, we still decide built on top of task 1 and go one step further to predict the review comment itself. Note that in task 1, we rely on other tools for sentiment analysis, which might contain some errors and confuse our models.

We can view the problem as a Natural Language Processing (NLP) translation task. Essentially, we want to “translate” the diff hunk to a review comment. Therefore, we choose to apply the state-of-the-art Transformer Encoder and Decoder models^[4:1].

The model is essentially the same as illustrated in the original "Attention Is All You Need" paper ^[6] (the architecture is shown below), with differences in layer numbers and sizes. We have, for the inputs, the diff-hunks. For the outputs, we have review comments.

Source: Model Illustration from "Attention Is All You Need" paper.

We have also tried to use LSTM (Long Short-Term Memory) as decoder, which will be documented in Appendix A4 - Task 2: LSTM as Decoder. We choose to present the Transformer Encoder-Decoder model here because of its relatively good performance.

Similar to Task 1, we have constructed a dictionary of symbols (100 total) such that we can embed our diff hunk when we feed them into the encoder. Besides, we also built our own vocabulary of words (the most common 400 words from all training dataset with the help of nltk.word_tokenize), so that we can embed the review texts.

An example of encoded text is shown below. As shown, it has some UNK as we have only 400 words in vocab, but it is a compromise for efficiency as a larger vocab will be infeasible to train on a single GPU.

Original:
I wonder if it will be better if we just returned <code>getInstructorFeedbackEditCopyActionLink</code>, instead of making a new field?\n

After Encoding:
I UNK if it will be better if we just returned < code > UNK < /code > , instead of making a new field ? PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD

The model can be found in code_to_review_transformer.py.

Both the tokenization and the training notebook can be found in code_to_review.ipynb.






code_to_review_model = CodeToReviewModelTransformer(startI_w=startI_w, padI_w=padI_w, 
                                                    code_characters_size=len(c2i), review_vocab_size=len(w2i), 
                                                    encoder_hidden_size=384, encoder_num_layers=4, 
                                                    encoder_num_attention_heads=4, encoder_intermediate_size=256,
                                                    decoder_hidden_size=384, decoder_num_layers=4, 
                                                    decoder_num_attention_heads=4, decoder_intermediate_size=256)

We set the hidden_size, num_layers, and attention_heads for the encoder and decoder as shown in the code snippet above. The learning optimizer is set to Adam with a learning rate of 0.0002. We also set the weight_decay to 0.0020 to avoid overfitting (We have tried that lower weight_decay will result in significant overfitting). Similar to the settings before, we also provided learning decay with gamma=0.995.

The Figure below shows the loss for the training, validation, and test datasets, we managed to reduce the cross-entropy loss on the probability of words quite a bit.

From the loss graph above, we can see that loss gets stabilized after 30 epoch with minibatch=64. Thanks to the regularisation, we do not get obvious overfitting for the training data. We have the test loss, as long as the validation loss, to be around 3.5.

Below, we generate several texts from the training and test set using beam search (documented in Appendix A1 - Beam Search). The beam_size is set to 4 and we only select the first in the topK candidates. The length to generate is set to 30.

Train:

Generated:

<START> UNK UNK ( UNK ) UNK ( UNK ( UNK ( ) ) ) ) ) UNK ( UNK ( UNK ( UNK ( UNK ( ) ) ) )


Original:

Is there any reason this is not sanitized ? PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD


Generated:

<START> UNK UNK ( UNK ) UNK ( UNK ( UNK ( ) ) ) ) ) UNK ( UNK ( UNK ( UNK ( UNK ( ) ) ) )


Original:

Revert these as well . PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD


Generated:

<START> UNK UNK ( UNK ) UNK ( UNK ) UNK ( UNK ( UNK ( UNK ) ) ) ) ) UNK ( UNK ( UNK ( ) ) )


Original:

Remove lines UNK instead and let this import be . You 'll also need to add UNK . to all the method calls in this file . PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD

Test:

Generated:

<START> UNK UNK ( ) UNK ( ) ) UNK ( UNK ( UNK ( ) ) ) ; UNK ( UNK ( UNK ( UNK ( UNK ) ) )


Original:

Can rename this class UNK - > UNK now that it has multiple uses ? PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD


Generated:

<START> UNK UNK ( ) UNK ( UNK ( UNK ) ) ) UNK ( ) ) UNK ( UNK ( UNK ( UNK ( UNK ) ) ) ) )


Original:

I do n't think we should UNK validation checks just because we UNK it in one place . PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD


Generated:

<START> UNK UNK ( UNK ) UNK ( UNK ) UNK ( UNK ( UNK ( UNK ) ) ) ) ) UNK ( UNK ( UNK ( ) ) )


Original:

Now this method has no difference from UNK ( ... ) . You can remove the file parameter and remove the three UNK tests in the method body , and name the method like how you did previously . Add : while you 're at it , maybe add a comment on why the tests are UNK . PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD

Sadly, we are facing the same issue as task 1. It looks like the model repeatedly generates the same text (with little variation), regardless of training or test datasets. We somehow expect the result as we know the task is inherently tricky based on the experience in task 1. Again, it might also be possible that there is no relationship between the diff hunk and the review comment under our simple Transformer Encoder-Decoder model.

Nevertheless, we also apply the same visualization technique to know the masked self-attention in the decoder layer when generating the repeated text.

The attention visualization on the encoder side faces the same issue as task 1 where the attention to every character is almost the same.

The code to generate the attention on the encoder side can be found in code_encoder_viz.ipynb.

The code to generate the attention on the decoder side can be found in code_to_review_transformer_viz.ipynb.

From the above visualization, we first notice that indeed for the generation task, the encoder has masked self-attention (only attention to the word that has been generated). Also, we find different attention pattern in different heads.

We found that there are many UNK symbols. We suspect the code segment in the review comment result in many UNK symbols and thus result in predicting a lot of UNK. Therefore, we remove all the code segments in the training data and train another model with the same hyperparameters.

Sample Code Segment in the Review Comment:

The loss remains the same. However, we get different generated text this time. We can see the general sentence structure in the generated text.

Test:

Generated:

<START> I 'm UNK the UNK of the UNK of UNK UNK UNK UNK , but I think you can UNK the UNK of the UNK of the UNK of the


Original:

UNK add one more case for ? PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD


Generated:

<START> This can be UNK to UNK the UNK . UNK the UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK


Original:

UNK ( ) sounds good . Can change to that PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD


Generated:

<START> This can be UNK to UNK the UNK . UNK the UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK


Original:

session 's UNK PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD

We can now view a more meaningful visualization below.

However, our model for task 2 also suffers from the problem where the model always are predicting the same thing. We have manually verified several generations.

Task 3: Predict Diff Hunk from Review Comment

Our group does not get a good performance on the models designed for task 1 and 2. Our last attempt is to predict the diff hunk from the review comment itself.

The task is very similar to task 2. The main difference is that we swap the prediction order. We also decide to use a pre-trained model on the review comments (natural languages), instead of using a plain Transformer Encoder-Decoder model.

The encoder becomes the BERT model for NLP problem. For the decoder side, we decide to use LSTM (Long Short-Term Memory) to generate diff hunk.

We also tried to use the Transformer Decoder for the decoding side. The detail is shown in Appendix A5 - Task 3: Transformer as Decoder.

Here we present the architecture of using LSTM as a generator for diff hunk.

We use a pre-trained BERT model built by HuggingFace. The weights of the BERT are fixed and we merely use BERT as an encoder for the initial hidden state for the LSTM. We do not fine-tune BERT for three reasons.

With batch size more than 32, the BERT models are too big to fit in the GPU instance we rent on Google Cloud Platform.
Our dataset (review comments) (7629) is not big enough to fine-tune the BERT model.
We believe that the BERT model is sophisticated enough to get the hidden information from the review comment.

review_to_code_model = ReviewToCodeModelLSTM(
          startI=c2i['<START>'], code_characters_size=len(c2i),
          feature_dim=768 * review_max_length,
          hidden_dim=2048, review_pretrained_weights='bert-base-uncased',
          character_vec_dim=256)

In the LSTM, we use a hidden dimension of 2048 and character embedding of dimension 256. We also implement the beam search to generate diff hunk, as shown in the model architecture. The loss at the LSTM is the cross-entropy loss where we predict the likelihood of each character in the diff hunk.

The model is written in review_to_code_lstm.py.

The training code is written in review_to_code.ipynb.

For training process, we use Adam optimizer with learning rate 0.0002 and weight_decay of 0.0003. We also set an exponetional learning rate decay for each epoch (gamma=0.995).

The hyperparameters have been fine-tuned to avoid overfitting and achieve stabilized loss.

One important thing we have done here is that we also remove the code segment (anything surrounded by `) in the training data (review comment). As we have done so for task 2 and get better text generation, we believe that this data cleaning will also help the model understand the semantic information in the review comment.

The above Figure shows the loss for training, validation, and test with 25 epoch and minibatch of 64. It looks like we get the right number of epoch and the model also does not overfit the data.

The loss itself is not interesting as it does not tell whether the model is good or not. Therefore, we use a test example below to shows what the model is doing and how good the model is.

Suppose a new contributor wants to know we kind of changes a reviewer in TEAMMATES will say, "I think this is a good implementation".

In the generation section of the training notebook (review_to_code.ipynb), there are more generation examples. We just show one of them here.

Firstly, the sentence will be tokenized by BERT with appropriate attentions to different words. In the visualization below, we see that the state-of-the-art BERT model indeed pay attentions (self-attentions) to different essential words. For example, it knows that this is usually followed by is.

The attention visualization can be found in review_body_bert_viz.

After that, the hidden states from BERT will be projected to the hidden states in the LSTM. We feed the <START> token to the LSTM and let it generates the diff hunk sequentially with beam search. We use beam=4 with max_length=400 to generate diff hunk.

In Appendix A1 - Beam Search, we will discuss the effectiveness of beam search.

The below is the diff hunk generated; we can see at least the model is very effective to remember to the structure of the diff hunk.

--- InstructorFeedbackEditPageDataTest.java
+++ InstructorFeedbackEditPageDataTest.java
@@ -0,0 +1,135 @@
+package teammates.test.cases.ui.pagedata;
+
+import static org.testng.AssertJUnit.assertEquals;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Comparator;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+im

We can use the same Diff Hunk Visualizer to view it.

Now the contributor knows that the changes made to InstructorFeedbackEditPageDataTest.java are generally good. By viewing the diff hunk, we can see that it is likely that, when the file gets added to the project, the reviewer gives positive comments.

We also confirm with the reviewer from TEAMMATES. He says that a senior developer introduces the page changes with a good code quality.

We can also use the same "attention" visualizer to view the change. In particular, the attention for LSTM becomes the connectivity^[7].

connectivity (t, \tilde{t}) = {‖ \frac{\partial {(h_{L}^{\tilde{t}})}_{k}}{\partial x^{t}} ‖}_{2}

The connectivity proposed by Madsen^[7:1] essentially wants to know the L2-norm of the gradient of the previous character at

t (t < \tilde{t})

when the LSTM generates character at

\tilde{t}

. In other words, we want to know what are the positions that will be important to generate the current character.

Using the same "attention" (connectivity) visualization tool, we can view the connectivity.

Note that we reuse the User Interface (UI) from the previous attention visualization in the interactive demo. There is no HEAD concept in the LSTM connectivity.

The code to generate the connectivity can be found on review_to_code_lstm_viz.ipynb

We can see that characters such as the ; and the + in the previous lines are essential to generate the current character + (highlighted in orange). The example also shows that LSTM has some troubles in remembering long term dependencies.

The above paragraphs only give one generation example. However, an important thing to note here is that we do not suffer from the problem in task 1 and task 2 where the model always predicts the same thing. For example, the statement "this is bad practice" will give the following diff hunk. We have verified several generations.

--- InstructorFeedbackAbstractAction.java
+++ InstructorFeedbackAbstractAction.java
@@ -0,0 +1,158 @@
+package teammates.ui.controller;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import

In a word, we believe our attempt in task 3 is a good proof-of-concept work to build a sample code generator.

Conclusion and Future Work

In conclusion, we found that it is hard to predict the sentiment score or the review body from the diff hunk. Several experiments fail to achieve good performance. However, we hesitate to conclude that there is no relationship from diff hunk to review comments. Our tuning attempts in sentiment classification task show a relatively good result, but due to time constraints, we do not have time to further explore them.

However, we find that, from review comments to diff hunk, there are promising results. Our attempts show that not only the model learns to generate valid diff hunk in Git Format but also is able to figure out what diff hunk will result in good/bad review comments. Our group believes that, with proper further tuning and training with more data, we could build a sample code generator for production.

There are several improvements we can make to our project. For task 1 and 2, our group believes that there are several directions to optimize them:

The attribute diff hunk might not be enough to help us predict the review comment. We could manually augment the data by giving more context, such as the whole changed files. In addition, in case of unbalanced training data, one should employ weighted loss to prevent the model from picking the easiest strategy and outputs the same thing every time.
The training data is not enough to train the Transformer Model. Therefore, we need more training data. In fact, we have crawled the review comments for elasticsearch (another open-source Java project), which has seven times training data. However, due to time constraints, we do not have time to train them.
We could further improve the model design and hyperparameter tuning for our models. For example, there should be better tokenization for diff hunk. One could pre-train a model like (CodeBert) on the target repository (e.g., TEAMMATES) and fine-tune it.

In addition, we also think there is room for improvement for task 3. Although the file name itself (the +++ and --- chunk) somehow gives interesting information according to the reviewer from TEAMMATES. Most of generations do not give meaningful information as they just provide a bunch of changes for import packages. We would expect something like method changes or statement changes that could provide more information. It might be due to our training data that contains a lot of import chunk. Therefore, future work could emphasize on cleaning the data or obtaining more data related to method changes or statement changes.

Team Contributions

Pu Xiao (Taking the class as Graded) (55% of total work)
- Write 80% of the report
- Build attention visualization
- Implement beam search
- Write models for task 3
- Integrate CodeBert to task 1 and 2
- Wrap up all models for task 1, 2 and 3
- Do final hyperparameter tuning on task 1, 2 and 3
Joey Lou (Taking the class as P/NP) (20% of total work)
- Write 10% of the report
- Do initial analysis on CodeBert
- Do initial EDA on dataset
- Write initial models for task 2
Ao Liang (Taking the class as P/NP) (25% of total work)
- Write 10% of the report
- Conduct sentiment analysis
- Write initial models for task 1

Appendix

A1 - Beam Search

We believe that beam search is essential in our sequence generation, and therefore, we implement beam search for both LSTM and Transformer.

The Figure illustrates the effectiveness of the beam search for LSTM. Instead of being greedy to choose the node with the purpose box, we could expand the search tree and try to find the global maximal total probability. In the case shown above, the path highlighted in red gives the biggest total probability. Note that, in the actual implementation (lstm_beam_search.py), we use log probability for numerical stability.

The implementation of beam search (transformer_beam_search.py) for the Transformer Decoder is a bit different due to different generation processes between LSTM and Transformer Decoder. In Transformer Decoder, we feed a sequence with <pad> token first, and then replace the token one by one with the character with the maximal probability. Again, instead of being greedy here, we can use the total probability (all probabilities time together) and use a priority queue to pop out the one with the biggest total probability.

Here we show a concrete example to illustrate the effectiveness of the LSTM beam search. For task 3, if we want to generate diff hunk for sentences: "can you change this", "this is bad practice" and "test convention".

With beam search (beam=4, topK=1), the LSTM will generate:

--- InstructorCourseStudentDetailsPageDataTest.java
+++ InstructorCourseStudentDetailsPageDataTest.java
@@ -0,0 +1,19 @@
+package teammates.test.cases.ui.pagedata;
+
+import static org.testng.AssertJU


--- InstructorFeedbackAbstractAction.java
+++ InstructorFeedbackAbstractAction.java
@@ -0,0 +1,158 @@
+package teammates.ui.controller;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import


--- InstructorFeedbackAbstractAction.java
+++ InstructorFeedbackAbstractAction.java
@@ -0,0 +1,165 @@
+package teammates.ui.controller;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import

Without beam search, the LSTM will generate:

--- InstructorFeedbackResultsPageData.java
+++ InstructorFeedbackResultsPageData.java
@@ -25,69 +55,139 @@
     public String groupByTeam = null;
     public String showStats = null;
     public int s


--- InstructorFeedbackResultsPageData.java
+++ InstructorFeedbackResultsPageData.java
@@ -25,69 +55,139 @@
     public String groupByTeam = null;
     public String showStats = null;
     public int s


--- InstructorFeedbackResultsPageData.java
+++ InstructorFeedbackResultsPageData.java
@@ -25,69 +55,139 @@
     public String groupByTeam = null;
     public String showStats = null;
     public int s

Note that the above three diff hunks (without beam search) are the same! Therefore, being greedy in the LSTM generation is not a good thing.

Note that in our sequence generation, for simplicity, we do not introduce the <END> token. Therefore, the beam search will only terminate after the maximal length has reached. We believe that generating dummy characters in order to meet the maximal length requirement might hinder the total probability. We should introduce <END> token for future work.

A2 - Task 1: Other Score Methods

We choose textblob specifically in task 1. In fact, we also tried other scoring methods and got similar results as textblob.

Performance /Method	`textblob`	`vader`
Loss
Output Range

Performance /Method	`flair`	`sentiCR`
Loss
Accuracy
Output Range

The model is the same and can be found in code_to_score_transformer.py.

The training notebook can be found in code_to_score.ipynb (To use other methods, one needs to set the score_method variable properly).

A3 - CodeBert

In this report, we mention that due to the unstructured diff hunk, we cannot use code parse to tokenize the code itself, and therefore we use character level encoding. In fact, beside the character level encoding, we also tried another pre-trained BERT-like Transformer Encoder on the diff hunk.

The encoder is called CodeBert from HuggingFace. It has been trained on the CodeSearchNet dataset from GitHub. We have tried to apply the model for task 1 and task 2. However, we do not get an improved result.

An important note on the pre-trained model is that the model was trained on multiple programming languages (go, java, javascript, php, python and ruby) on various projects. Therefore, it is very hard to know what the hidden states generated by the model will be. There is one trivial success application using CodeBert to predict programming languages from a given code segment.

We believe that our task 1 and 2 are much complicated than that. Therefore, the hidden states extracted from our diff hunk by CodeBert might not be helpful. Indeed, we see almost the same performance.

Here, we show an example of applying CodeBert to task 1. The training notebook can be found in code_to_score_code_bert.ipynb.

We also tried CodeBert on task 2. The training notebook can be found in code_to_review_code_bert.ipynb.

	Task 1	Task 1 (with `CodeBert`)
Loss
Output Range

In the comparison above, we observe the two models have the same performance.

Here we show the attention visualization to examine the self-attention CodeBert pays to. We see a lot of strange symbols in the tokenization. It is hard to interpret the meaning of each token and the strange symbols.

The visualization for CodeBert can be found in code_bert_viz.ipynb.

If we think CodeBert will not produce the hidden states we want, we also try to fine-tune it. However, this is not a successful attempt as shown in the loss Figure and output comparison below.

The result partially proves that why we do not tune the BERT model in Task 3. We believe that our dataset is too small to tune CodeBert model with 84M parameters.

A4 - Task 2: LSTM as Decoder

In the decoder side of task 2, we use the Transformer Decoder to generate text. We also built a model to use LSTM to generate text.

The model can be found in code_to_review_lstm.

The training process shares the same code in code_to_review.ipynb.

The visualization for the LSTM Decoder can be found in code_to_review_transformer_viz.ipynb.






code_to_review_model = CodeToReviewModelLSTM(startI_w=startI_w, code_characters_size=len(c2i), 
                                             code_max_length=400, 
                                             review_vocab_size=len(w2i),
                                             hidden_size=384, num_hidden_layers=4,
                                             num_attention_heads=4, intermediate_size=256,
                                             word_embed_dim=256, hidden_dim=256, lstm_layers=1)

The training process is left unchanged. The code segment in the review body is also removed. We see that there is one sudden jump in the training process. However, with more and more iterations, the loss gets stabilized. The test and validation loss is a little bit higher than the model with Transformer Encoder.

Here we show some sample generated text.

Test:

Generated:

Same here . I UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK


Original:

Not too UNK of the double UNK UNK here , but it could be UNK to just PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD


Generated:

Same here . I UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK


Original:

good UNK : - ) UNK from above : In other UNK , the main purpose of each test : : testing the UNK of HTML file If the UNK test is to test whether the UNK and the UNK of values work properly , should n't they be tested UNK too ? UNK them UNK the test , does n't it ? Also , I think we should UNK this UNK of `` testing UNK and UNK of values '' in comments so that later UNK UNK remove this test thinking it is a UNK . PAD PAD PAD


Generated:

Same here . I UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK


Original:

Similarly here . Should in the UNK method UNK ? PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD

We get a similar result as the Transformer Decoder model, where the model keeps outputting the same text.

A5 - Task 3: Transformer as Decoder

In the decoder side of task 3, we use LSTM to generate diff hunk. We also built a model to use the Transformer Decoder to generate diff hunk.

The model can be found in review_to_code_transformer.

The training process shares the same code in code_to_review.ipynb.

The visualization for the Transformer Decoder can be found on review_to_code_transformer_vis.ipynb.

The training process is the same and the losses for all datasets are almost the same.




review_to_code_model = ReviewToCodeModelTransformer(
    startI=c2i['<START>'], padI=c2i['PAD'],
    code_characters_size=len(c2i),
    review_pretrained_weights='bert-base-uncased')

Here we show the same example where we want to generate diff hunk for review comment "I think this is a good implementation" (beam_size=4, topK=1).

<START>--- StudentProfileAction.java
+++ StudentProfileAttributesTest.java
@@ -0,0 +1,11 @@
+package teammates.test.cases.ui.pagedata;
+
+
+import java.util.ArayList;
+import java.util.ArayList;
+import java

This time, we get a different diff hunk generated. The reviewer from TEAMMATES says that it looks like the model gets confused for this case as the file change from StudentProfileAction.java to StudentProfileAttributesTest.java does not make sense.

Mäntylä, Mika V., and Casper Lassenius. "What types of defects are really discovered in code reviews?." IEEE Transactions on Software Engineering 35, no. 3 (2008): 430-448. ↩︎
Gupta, Anshul, and Neel Sundaresan. "Intelligent code reviews using deep learning." (2018). ↩︎ ↩︎ ↩︎
CheckStyle is a static analysis tool to enforce coding style check. ↩︎
The Transformer (Encoder/Decoder) are PyTorch libraries from HuggingFace. ↩︎ ↩︎
intermediate_size is the dimensionality of the feed-forward layer in the Transformer encoder. ↩︎
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." In Advances in neural information processing systems, pp. 5998-6008. 2017. ↩︎
Madsen, Andreas. "Visualizing memorization in RNNs." Distill 4, no. 3 (2019): e16. https://distill.pub/2019/memorization-in-rnns/ ↩︎ ↩︎