# DSSE (Fall 2021) Final Report of Team 9

> Authors[^1]: Zixuan Chen, Xuehan Cheng, Wei Luo, Zhenwei Zhang, Yuxin Xiao

[^1]: Team members are arranged according to the order of appearance in the Final Presentation.

# Introduction

In this project, we use `pydriller` as our data mining tool to collect commit information from 4 repositories (for example, Gson and Vue) and analyze possible patterns between modifications and code duplication. We use `pydriller` to collect the raw data, and write our own parsers and classes to extract the necessary information from each repository. We apply `Moss` to detect code duplication and organize the results. After running polynomial regression, we draw some conclusions on the relationship between hunk size and the likelihood of code duplication.

This is a self-driven project without any paper to guide us, so along the way we produced some highlights and also ran into difficulties. We changed our strategies back and forth, but we overcame many of those difficulties. We enjoyed this project a lot, especially discussing with teammates, trying different methods, and finally working something out.

In this report, we will first present the background information and then our proposed approach. Based on that approach, we will show our results and analysis. Finally, we will give a quick summary and discuss future improvements and lessons learned.

# Background

Nowadays, many companies use Agile development, which requires relatively fast iterations. Programmers have a lot of action items and need to develop faster. As a project becomes complex and the number of developers grows, project management becomes hard. It is a challenge to catalog, archive, and retrieve reusable resources due to organization and administration issues, so it is difficult to share reusable code among teammates. Under these circumstances, some programmers tend to copy and paste because it saves time.
However, copy and paste is not a good habit, and sometimes it can even be harmful:

* Lengthy and bulky code
* Shotgun surgery
* Decreased code quality
* Increased security risk

# Traditional Approach

Traditionally, people try to find code duplication through code review, which relies on their experience and memory. Depending only on experience and memory is neither accurate nor reliable, since nobody can be familiar with every detail of a project. In reality, people are responsible for different parts of a project, and developers may leave due to resignation or job-hopping.

For this reason, some teams use tools to detect duplication rather than the naked eye. However, for small teams or companies, tools can be too costly: most tools for commercial use must be purchased. Moreover, duplication detection tools usually involve time-consuming steps, such as preprocessing the code or building an abstract syntax tree, which means a longer wait for results, especially on large projects. Such teams therefore not only have to purchase the tools but also wait for detection results on every commit.

# Software Engineering Problem

Based on this background, we want to solve the following problem: find a common pattern for commits that contain high code duplication. Once we find the pattern, people can run duplication detection tools only when they encounter that pattern, which is more efficient and saves resources.

After some discussion, we think the hunk is a simple and efficient attribute. We define a *hunk* as a piece of continuous added code in a commit, and *hunk size* as the lines of code (LOC) in a hunk. In our project, we therefore focus on finding the relationship between duplicate code and the hunk size of the added code in a commit.
![](https://i.imgur.com/2LYlH1F.png)

# Procedure

Our project development process mainly involves 4 steps:

1. The first step is data mining, where we collect the commit information from the repositories.
2. The second is filtering and processing the information in each commit to produce the data for duplicate detection.
3. The third step is to run a code duplication detection tool to get the duplication information.
4. The last step is to analyze the results.

We list the details below.

## Data Mining

- Tool: PyDriller, a Python framework for analyzing git repositories: https://pydriller.readthedocs.io/en/latest/index.html
- APIs: we mainly used two PyDriller APIs to collect the data:
    - **Git**: a wrapper for the most common Git utilities, such as checkout and reset. `Git.files` provides the list of files present in the repo at the current commit, which helps us collect all the code in the repository.
    - **Commit**: a `Commit` object has all the information of a Git commit, and more, which helps us get all the modified files and code under the current commit.
- Mining results:
    - Collected the raw data of all the files in the repository under each commit.
    - Collected all the `Commit` objects in the repository.

## Data Processing

### Goal

Our goal in processing the data is to obtain the hunk information and all the other code in the repository, then compare these two parts to check whether duplication exists.

![](https://i.imgur.com/9HmEJsp.png)

### Two steps

- Get hunk data
    - To get the hunk information, we traverse all the commits, and for each commit we check the modified files under it. Each modified file has an attribute called `diff_parsed`, a dictionary of the added and deleted code together with the line numbers in the file. We use this information to collect the hunk data.
    - The structure of the `diff_parsed` dictionary is as follows:
      ```python
      {
          "added": [
              (54, "procedure Count_Evens(Data : in Integer_Array) is"),
              (55, "   type My_Int is range 0 .. 100;"),
              (56, "   Count : My_Int := 0;"),
              (59, "   pragma Loop_Invariant (Count + 1 in My_Int'First .. My_Int'Last)"),
          ],
          "deleted": [
              (53, "return Count;"),
          ],
      }
      ```
    - This `diff_parsed` dictionary, obtained through data mining, corresponds to the highlighted modified code of a commit on GitHub. We extract hunks from the dictionary by finding blocks of consecutive line numbers; each block represents a hunk. For example, lines 54 to 56 form one hunk. Once we find a hunk, we store all of its information in a `Hunk` object:
      ```python
      class Hunk:
          def __init__(self, commit, file_path, source_code, size,
                       start_line, end_line, other_code_in_current_file):
              ...
      ```
- Get all the other code except the hunk
    - Collect all the other code in the repo to compare against the hunk code and check the duplication rate. This includes three main steps:
        - Use the Git API to get a `Git` object representing the repository.
        - Checkout the repository at the current hunk's commit and get all the files in the repository.
        - Merge all the code in those files except the file that contains the current hunk.
    - Merge the code collected in the above step with the `other_code_in_current_file` stored in the hunk.

### Result

We get one file with the hunk source code and another file with all the other code in the repo for comparison.

## Code Duplication Detection

We store the *Hunk Code* as one file and the *Other Code* as another file. We can then use a code duplication detection tool to get the duplication rate and the duplicated code.
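The hunk-extraction step described above (grouping `diff_parsed["added"]` entries into runs of consecutive line numbers) can be sketched as a small Python helper. This is a minimal sketch: `split_hunks` is a hypothetical name, and the input mimics PyDriller's `diff_parsed["added"]` list of `(line_number, code)` pairs.

```python
def split_hunks(added):
    """Group diff_parsed['added'] pairs into hunks.

    A hunk is a maximal run of consecutive line numbers; hunk size
    is simply the number of pairs in that run.
    """
    hunks = []
    current = []
    prev_no = None
    for line_no, code in sorted(added):
        # A gap in line numbers closes the current hunk and opens a new one.
        if prev_no is not None and line_no != prev_no + 1:
            hunks.append(current)
            current = []
        current.append((line_no, code))
        prev_no = line_no
    if current:
        hunks.append(current)
    return hunks


# The example entries from the diff_parsed dictionary shown earlier:
added = [
    (54, "procedure Count_Evens(Data : in Integer_Array) is"),
    (55, "   type My_Int is range 0 .. 100;"),
    (56, "   Count : My_Int := 0;"),
    (59, "   pragma Loop_Invariant (Count + 1 in My_Int'First .. My_Int'Last)"),
]
hunks = split_hunks(added)
# Lines 54-56 form one hunk of size 3; line 59 starts a second hunk of size 1.
```

In our pipeline, the same grouping runs over every modified file of every commit, and each resulting run is wrapped in a `Hunk` object.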
![](https://i.imgur.com/9WyCiIA.png)

There are different types of duplication detection tools, each with pros and cons, as listed below:

- Duplication detection tools
    - **Text** (character by character)
        - Quick
        - Low accuracy
    - **NLP** (Word2Vec, LCS (Longest Common Subsequence), SWLR (Shingling with Location Relationship))
        - Efficient for paper plagiarism detection
        - Cannot reflect code duplication well: matches may not be semantic duplicates
    - **AST** (constructs an Abstract Syntax Tree based on the code structure)
        - Takes a long time to analyze
        - Better reflects duplication for code
        - More commonly used in industry duplication detection

Based on that, we decided to use MOSS, which is AST-based and better reflects code duplication in industry.

- MOSS (**M**easure **O**f **S**oftware **S**imilarity): based on the *Abstract Syntax Tree*

![](https://i.imgur.com/PdKZnOx.png)

- MOSS procedure

**Definition of duplication rate**: the proportion of duplicated code within the hunk.

- First, we run MOSS on the two files we prepared before. MOSS then returns an HTML file with the duplication rates for file 1 and file 2.
- Then we wrote a Python parser for MOSS's HTML file; the parser returns a Python list containing the duplication rates for file 1 and file 2.

![](https://i.imgur.com/Ep1OhB9.png)

# Analysis

With the procedures above, we started our data mining and analysis. We chose Gson because it is a popular and well-maintained repository, with more than 1600 commits over the last 13 years. We got the following result:

![](https://i.imgur.com/DV0uKic.png)

We can see from the figure that:

+ The percentage of commits containing duplicate code is very small for small hunk sizes.
+ But for large hunk sizes, nearly 100% of the commits contain duplicate code (fully blue).
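Our HTML parser for the MOSS results page works along these lines. This is a minimal sketch under an assumption: MOSS's result page links each submitted file with text like `file (NN%)`, and `parse_moss_rates` is a hypothetical name, not part of any MOSS API.

```python
import re

def parse_moss_rates(html):
    """Return (file_name, rate) pairs scraped from a MOSS result page.

    ASSUMPTION: match links carry text of the form 'name (NN%)';
    the real page layout should be checked before relying on this.
    """
    pattern = re.compile(r">\s*(\S+)\s+\((\d+)%\)\s*<")
    return [(name, int(pct) / 100) for name, pct in pattern.findall(html)]


# A hand-written snippet in the assumed result-page format:
sample = ('<TR><TD><A HREF="#">hunk.java (37%)</A>'
          '<TD><A HREF="#">other.java (2%)</A>')
parse_moss_rates(sample)  # -> [('hunk.java', 0.37), ('other.java', 0.02)]
```

In our pipeline, the first entry (the hunk file's rate) is the duplication rate we record for that hunk.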
Unfortunately, for hunk sizes greater than 150, we got a very different result: almost none of these commits contain duplication.

![](https://i.imgur.com/cUdbpUw.png)

Recalling what we learned from Assignment 1, we realized that the dataset might be biased, so we decided to mine more repositories and to draw plots to check the data distribution. To minimize the bias in the data, we selected a variety of repositories by the following standards:

- Mature repositories:
    - Established more than 1 year ago
    - More than 1000 commits
    - More than 50 contributors
    - More than 10k stars
- Immature repositories:
    - Ongoing projects
    - Fewer than 50 contributors
    - Established within 1 year

Now we have 4 repositories with 4000 commits, and 5710 samples in total:

| Repo | Initial Commit Date | Latest Commit Date | Year Span | Total Commit Counts | Number of Contributors | Number of Stars |
|:----------:|:-------------------:|:------------------:|:---------:|:-------------------:|:----------------------:|:---------------:|
| Vue | Apr 10, 2016 | Nov 22, 2021 | 5 years | 3215 | 404 | 191K |
| Gson | Aug 31, 2008 | Nov 28, 2021 | 13 years | 1605 | 121 | 20.3K |
| React | May 29, 2013 | Nov 30, 2021 | 8 years | 14661 | 1529 | 178K |
| React-Vant | July 18, 2021 | Nov 25, 2021 | 5 months | 1031 | 14 | 251 |

We find that most hunks are less than 100 lines:

![](https://i.imgur.com/9aUeStL.png)

Based on the domain knowledge that human beings are unlikely to write hundreds or thousands of lines of code in one commit, we treat very large hunks as likely auto-generated, so we decided to keep only 90% of the samples. We find a very interesting result: more than 90% of the commits have hunks within 70 lines.

![](https://i.imgur.com/BXmlSJf.png)

# Conclusion

![](https://i.imgur.com/G8dffzJ.png)

We drew the line chart based on this 90% of the data, as shown in the figure. The blue line represents the result of those data.
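The trimming-and-fitting procedure used here (drop the largest 10% of hunks, then fit a polynomial to duplication rate versus hunk size) can be sketched with NumPy. This is a minimal sketch on synthetic data; in our pipeline, the inputs would be the per-hunk sizes and MOSS duplication rates collected earlier.

```python
import numpy as np

# Hypothetical samples standing in for our mined (hunk size, duplication rate)
# pairs; real values come from the MOSS parsing step.
rng = np.random.default_rng(0)
hunk_sizes = rng.integers(1, 300, size=500)
dup_rates = rng.random(size=500)

# Keep only the smallest 90% of hunks; very large hunks are treated
# as likely auto-generated and excluded.
cutoff = np.percentile(hunk_sizes, 90)
mask = hunk_sizes <= cutoff
x, y = hunk_sizes[mask], dup_rates[mask]

# Fit a degree-3 polynomial, the k = 3 choice made in this report.
coeffs = np.polyfit(x, y, deg=3)
trend = np.poly1d(coeffs)  # duplication rate as a function of hunk size
```

Solving `trend(size) = 0.5` (e.g. numerically, or via `(trend - 0.5).roots`) then yields the hunk-size range where the predicted duplication rate exceeds 50%.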
The horizontal axis is the hunk size and the vertical axis is the probability of duplication. We then fit polynomial regressions of degree k to these blue lines, shown as the yellow lines. We can see from the diagrams:

- At k = 1, the yellow line is underfitting.
- At k = 6, the yellow line is overfitting.

In these line charts, we drew the polynomial regression for k = 1 to k = 6. A software engineering company could choose one of the yellow lines as its standard according to the diagrams we provide. In this report, we choose the polynomial regression with k = 3.

![](https://i.imgur.com/1bCz65w.png)

As we mentioned before, our purpose is to:

- Find a threshold size of inserted code.
- When the hunk size is over this threshold, engineers can run other tools to check for duplicate code.

Therefore, according to the polynomial regression shown above, we can obtain the relationship between hunk size and duplication rate (the trend shown by the yellow line). For example, if a company wants to fix code with a duplication rate over 50%, it can draw a horizontal line at 0.5, find the intersection points with the yellow line, and obtain the range of hunk sizes where the probability is over 0.5. The company can then run duplication detection tools only within this range of hunk sizes. Our result helps them run the tools with high accuracy instead of spending too much money and time.

Finally, we will quickly discuss the challenges, future improvements, and lessons learned.

# Challenges

- Due to modern development practices, large hunks of code are uncommon, so we are dealing with possibly biased data.
- There was no paper or guide for us to follow. We found the problem in the SE field and figured out the solution all by ourselves; this is a self-driven project.
- We had to consider many corner cases in the commits. For example, a later commit may contain the inserted code of an earlier commit, and we needed to handle that.
- We had to choose a good duplication detection tool for our project.
# Lessons Learned

- Plan better before starting the project to avoid going back and forth.
- Biased data exists in the dataset; we should reduce its effect as much as we can.
- Be thoughtful when choosing a metric for PyDriller to use in data mining.

# Improvements

- Some changes may not be written by developers, so we need to filter them out:
    - Auto-generated code
    - Configurations
- We should have a better way to select repositories, and compare in more dimensions (in our project, we focused on only one dimension, hunk size):
    - Diversity of programming languages
    - The distance between two hunks
    - Number of contributors