# CES - Quality Improvement (ICSME 2020)
## Abstract
__ _In this section, summarize the objectives, methods and findings._ __
## Keywords
- Software Testing
- Mocks
- Test Doubles
- Software Maintenance
## Introduction
##### SHAKER
In this session, the instability of software tests was addressed by Silva et al. in the presentation of their study. As a software product and its test suite grow with increasingly complex tests, test results tend to become non-deterministic: the outcome of a particular test might change even when no changes were made to the code base or architecture. Such instability may lead software developers to stop trusting the effectiveness of regression testing, or even to ignore its results. This can have devastating consequences for software quality: fewer bugs detected (and corrected), which leads to more crashes in the deployed build. Large tech companies, such as Google, Facebook and Microsoft, have prioritized this issue internally.
Tests with this non-deterministic behaviour are called _flaky_ tests. Besides bugs in the code, _flakiness_ is caused by several factors, such as microservice architectures with unstable external APIs, or concurrency. Tests may need to be adjusted to tolerate these unstable variables, so that bugs are the only cause of failure in a test.
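To make the concept concrete, here is a minimal, invented JUnit example (not taken from the paper) of a test whose outcome depends on thread scheduling rather than on the code under test:

```java
import org.junit.Test;
import java.util.concurrent.atomic.AtomicInteger;
import static org.junit.Assert.assertEquals;

public class CounterTest {

    // Hypothetical example: the outcome depends on thread scheduling,
    // not on the code under test, so the test is flaky.
    @Test
    public void incrementsCounterInBackground() throws InterruptedException {
        AtomicInteger counter = new AtomicInteger(0);

        Thread worker = new Thread(() -> {
            // Simulates asynchronous work (e.g., handling an event).
            counter.incrementAndGet();
        });
        worker.start();

        // Fragile synchronization: waits a fixed amount of time instead of
        // joining the thread. On a stressed machine the worker may not have
        // run yet, and the assertion fails non-deterministically.
        Thread.sleep(10);

        assertEquals(1, counter.get());
        // A deterministic version would call worker.join() before asserting.
    }
}
```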
Silva et al. focus on the latter factor - concurrency - and propose SHAKER, a lightweight technique to detect flaky tests. Test environments are usually less stressed than deployment environments, and the technique addresses that by introducing noise into the test environment to reveal tests that may become unstable. It is an alternative to other similar approaches (ReRun, DeFlaker), and the authors show that it detects more flaky tests in a shorter total amount of time.
##### Modern Code Review on Software Degradation
In this session, an extensive study by Uchoa et al. focused on the impact of modern code review on software degradation. Modern code review is a lightweight, informal, asynchronous, and tool-assisted technique aimed at detecting and removing issues introduced during development tasks. It is an iterative process in which a code change proposed by one developer is reviewed by other developers; the change may be refined (through revisions) and is ultimately either integrated into the codebase or rejected [9, 31, 32].
As stated before, the intent of a code change varies a lot: it ranges from a simple bug fix to structural changes in the codebase. Moreover, while trying to fix an issue, other design problems may be introduced into the system, which ultimately impact quality attributes such as maintainability and extensibility. Most studies treat the presence of code smells as symptoms of design degradation: fine-grained (FG) smells, which indicate structural degradation in the scope of methods and code blocks, and coarse-grained (CG) smells, which indicate structural degradation related to object-oriented principles (abstraction, encapsulation, modularity, and hierarchy). An example of an FG smell is a long method, which impacts comprehensibility and modifiability. An example of a CG smell is a large and complex class that accumulates many responsibilities.
Other studies have also looked at the impact of modern code review on design degradation. This one goes a step further by also analysing how degradation evolves during the lifecycle of a software project and which code review practices most influence design degradation.
##### Assessing Mock Classes: An Empirical Study
In this session, an empirical study on mock classes by Gustavo Pereira and Andre Hora was presented, whose main objective is to understand when, why, and how mock classes are created. Mock objects are objects created during test activities to mimic dependencies (domain objects, external dependencies, web services, etc.), so that tests become easier to implement and execute (making them faster, isolated, and deterministic).
For this study, the authors analyzed 12 of the most popular open-source Java projects on GitHub, in which they detected 604 mock classes and assessed their content, design, and usage.
This study builds on past research that showcased the use of mocking frameworks. Mocking frameworks are largely adopted by software projects to help developers create unit tests. However, that is not the topic of this study: it focuses on mock classes, i.e., classes that developers create on their own to help them build their tests.
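For illustration only (the interface and class names below are hypothetical, not taken from the studied projects), a hand-written mock class typically implements a production interface and returns canned data so that tests avoid a slow or unreliable real dependency:

```java
import java.util.List;

// Production dependency that would normally call a remote web service.
interface ExchangeRateService {
    double latestRate(String from, String to);
    List<String> supportedCurrencies();
}

// Hand-written mock: no mocking framework involved, just a plain class
// returning fixed values so tests stay fast, isolated and deterministic.
class MockExchangeRateService implements ExchangeRateService {

    @Override
    public double latestRate(String from, String to) {
        return 1.25;                                // canned answer
    }

    @Override
    public List<String> supportedCurrencies() {
        return List.of("EUR", "USD", "GBP");        // fixed test data
    }
}
```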
A few research questions set the direction of the study:
- RQ1: What is the content of mock classes?
- RQ2: How are mock classes designed?
- RQ3: How are mock classes used by developers?
##### Improving Testing by Mimicking User Behavior
Improving Testing by Mimicking User Behavior is a paper by Qianqian Wang and Alessandro Orso that presents a new technique, called Replica, created by the researchers. When developers build a program, they tend to write the tests they consider crucial; nevertheless, in-house tests typically exercise only a tiny fraction of all possible software behaviours. Although there are many techniques and strategies to help mitigate this problem (e.g., beta testing, staged rollouts, EvoSuite), there is still a gap between the tests created and the field usage of the program. Replica was created to tackle the difficulty of writing tests that cover all possible user inputs and all possible usages of a program.
Replica makes in-house tests more representative by mimicking observed but untested user behaviour: after detecting untested behaviour, it generates new tests that exercise it. The technique differs from others in two ways: it incorporates field-execution information into the input generation process, and it only collects information for relevant executions (executions that are not yet tested).
##### Who (Self) Admits Technical Debt?
Self-Admitted Technical Debt (SATD) comments are comments left by developers in the source code, or elsewhere, describing source code that is "not ready yet" (technical debt, TD).
Although the name "Self-Admitted Technical Debt" suggests otherwise, TD may not be admitted by the same person who wrote the source code. It may happen that a developer notices technical debt in source code written, or recently modified, by somebody else, and decides to leave a comment.
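A typical, invented example of such a comment in Java source code:

```java
public class ConfigParser {

    // TODO: this is a temporary hack; it only handles the "key=value" happy
    // path and silently breaks on malformed lines (technical debt, needs
    // proper validation before release).
    public static String[] parseLine(String raw) {
        return raw.split("=", 2);
    }
}
```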
The main purpose of this paper is to understand the extent to which SATD comments are introduced by authors different from those who made the last changes to the related source code and, when this happens, what level of ownership those developers have over the commented source code.
To address this goal, two research questions were established:
- RQ1: Is technical debt admitted by developers who changed the affected source code?
- RQ2: What is the proportion of ownership when adding a TD comment into the code?
In short, the goal of this study is to analyze the authorship of SATD-related comments, to determine whether they have been introduced by a developer who also changed the related source code or, possibly, by somebody else who noticed a likely occurrence of TD in the source code.
## Methodology
##### SHAKER
Regarding SHAKER, the lightweight technique to detect flaky tests, Silva et al. propose a methodology that reveals flaky tests by running them in a noisy environment.
It is a multi-step approach. First, it finds the configurations for the noise generator by running a sample of known flaky tests. The authors focused on the CPU and memory options of the machine running the tests. They discard the heavier options that cause the test environment to crash, and select those that lead to the highest probability of detecting flakiness in the sample tests. They also propose several strategies to select the best subset of configurations, but the differences between them were not statistically significant. It is important to note that this step is run only once per machine and can be done offline.
The second step consists of running the test suite under the noise configurations found in the offline step, a specified number of times. Under these conditions, any divergence in a test's outcomes marks it as flaky.
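As a rough sketch of this idea (my own simplified illustration, not the authors' implementation; `runTestOnce` and the noise parameters are placeholders), one can stress the CPU with background threads, rerun a test several times, and flag it as flaky if its outcome ever diverges:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

public class NoiseRerunSketch {

    /** Reruns a test under CPU noise and reports whether its outcome diverges. */
    public static boolean looksFlaky(Supplier<Boolean> runTestOnce, int reruns, int noiseThreads)
            throws InterruptedException {
        List<Thread> noise = startCpuNoise(noiseThreads);
        try {
            Set<Boolean> outcomes = new HashSet<>();
            for (int i = 0; i < reruns; i++) {
                outcomes.add(runTestOnce.get());   // true = pass, false = fail
            }
            // Divergent outcomes for the same code indicate flakiness.
            return outcomes.size() > 1;
        } finally {
            noise.forEach(Thread::interrupt);
            for (Thread t : noise) t.join();
        }
    }

    /** Busy-looping threads that emulate a stressed (noisy) environment. */
    private static List<Thread> startCpuNoise(int threads) {
        List<Thread> noise = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            Thread t = new Thread(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    Math.sqrt(Math.random());      // burn CPU cycles
                }
            });
            t.setDaemon(true);
            t.start();
            noise.add(t);
        }
        return noise;
    }
}
```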
##### Modern Code Review on Software Degradation
Regarding the study on the impact of modern code review on software degradation, Uchoa et al. performed an extensive mining of code review data from two large open source communities: Eclipse and Couchbase. Both are hosted in the Code Review Open Platform (CROP), an open-source dataset that links code review data with the respective code changes. Since all systems in CROP employ Gerrit as their code review tool, the authors had access to a rich dataset of source code changes.
They performed an extensive data preparation work, which included detection and characterisation of degradation symptoms, extraction of metrics that measure the code review practices and activity, and manual classification of the intent of each code review that was analysed.
To detect degradation symptoms, in the form of code smells, the authors automated the process using the DesigniteJava tool, whose main limitation is that the selected projects had to be written in Java. In order to detect the introduction of smells solely by means of code revisions, the authors analysed the system before and after each submitted revision. The code smells considered in this study are in __TABLE X__.
The dependent variables in this study comprise design degradation characteristics, computed for each of the previously mentioned code smells: the _density_ of the symptoms in the codebase and the _diversity_ of the types of symptoms in the codebase.
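To make these two variables concrete, here is an illustrative sketch (my own reading; the paper's exact formulas may differ): density can be thought of as smell instances normalized by the number of analysed classes, and diversity as the number of distinct smell types present.

```java
import java.util.List;

public class DegradationMetrics {

    record Smell(String type, String location) {}

    /** Smell instances per analysed class (illustrative normalization). */
    static double density(List<Smell> smells, int analysedClasses) {
        return analysedClasses == 0 ? 0.0 : (double) smells.size() / analysedClasses;
    }

    /** Number of distinct smell types present in the codebase. */
    static long diversity(List<Smell> smells) {
        return smells.stream().map(Smell::type).distinct().count();
    }

    public static void main(String[] args) {
        List<Smell> smells = List.of(
                new Smell("LongMethod", "OrderService#process"),
                new Smell("LongMethod", "ReportBuilder#render"),
                new Smell("GodClass", "OrderService"));
        System.out.println(density(smells, 50));   // 0.06 smells per class
        System.out.println(diversity(smells));     // 2 distinct smell types
    }
}
```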
Regarding the detection and measurement of code review practices, the authors used _Product_ and _Process_ metrics, which reflect the modifications made to the codebase: number of files under review, and number of lines of code added and removed. In addition, metrics that reflect the code review activity were also extracted: _Review Intensity_, which measures the scrutiny in each code review (number of revisions, modifications between revisions, and comments by reviewers); _Review Participation_, which measures how active other developers were in the review process (number of developers who participated, number of votes); and _Reviewing Time_, which measures the duration of the code review.
Finally, the authors manually analysed and classified each review according to its intent and discussion. Regarding intent, reviews were classified as either _design-related_ or _design-unrelated_, depending on whether the intent was explicitly related to design. In addition, reviews whose discussion showed awareness of the impact of the change on the system's design were classified as such. For this extensive and subjective manual analysis, the two authors who performed it needed to stay in sync, which required independent analysis by each of them, followed by discussions until a consensus was reached.
##### Assessing Mock Classes: An Empirical Study
For this study, Gustavo Pereira and Andre Hora started by selecting 12 of the most popular open source Java projects on GitHub as the subjects of the paper (the selection criterion used to assess "the most popular" software was the number of GitHub stars). This group covers several software domains, such as web frameworks (Spring Boot and Spring Framework), search engines (Elasticsearch and Lucene-Solr), asynchronous/event libraries (RxJava and EventBus), HTTP clients (OkHttp and Retrofit), a Java support library (Guava), an RPC framework (Dubbo), an integration framework (Camel), and an analytics engine (Spark).
The next step was to identify the mock classes inside those 12 projects. First, they extracted all the classes in the Java files, including nested ones. They then removed the classes that use mocking frameworks (since those are not the subject of the study) and applied a filter to assess which classes are indeed mock classes: a class is considered a mock class when its name includes words such as "mock", it does not use mocking frameworks, and it is not a test class. This process yielded a total of 604 mock classes across the 12 projects.
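A rough sketch of such a filter, based on my reading of the described heuristics (the framework list and helper predicates are assumptions, not the authors' exact implementation):

```java
import java.util.List;
import java.util.Locale;

public class MockClassFilter {

    // Frameworks whose usage disqualifies a class from being a hand-written mock.
    private static final List<String> MOCKING_FRAMEWORKS =
            List.of("org.mockito", "org.easymock", "org.jmock", "org.powermock");

    /**
     * Approximation of the paper's heuristic: the class name mentions "mock",
     * the class does not rely on a mocking framework, and it is not itself a test.
     */
    public static boolean isMockClass(String className, List<String> imports) {
        String name = className.toLowerCase(Locale.ROOT);
        boolean nameMentionsMock = name.contains("mock");
        boolean usesFramework = imports.stream()
                .anyMatch(imp -> MOCKING_FRAMEWORKS.stream().anyMatch(imp::startsWith));
        boolean isTestClass = name.endsWith("test") || name.endsWith("tests");
        return nameMentionsMock && !usesFramework && !isTestClass;
    }
}
```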
With the data (the mock classes) ready to be analysed, the researchers addressed the three main questions: what type of dependency mock classes simulate, how mock classes are designed, and whether mock classes are used in the wild.
For the first question, the authors reused the categories from the previous related study on mocking frameworks to divide the different types of mocked dependencies. In the end, the mock classes were divided into Domain objects, Database, Native Java Libraries, Web services, External Dependencies, Test support, and Network services.
The next question led the researchers to evaluate the classes regarding their design. The three main aspects analysed were whether the classes inherited from other classes or implemented an interface, whether the classes were public or private, and the number of methods in mock classes compared to the number of methods in other classes.
For the last question, an ultra-large dataset called Boa was used to detect whether mock classes are used in the wild. The researchers queried all Java systems looking for import statements containing the term "mock" and removed those that include the names of the most popular Java mocking frameworks (such as "Mockito").
##### Improving Testing by Mimicking User Behavior
The Replica technique improves test quality by mimicking observed but untested user behaviour and creating tests based on the data it receives.
Replica is composed of three main components: the Execution Mimicker and two helper components, the Execution Instrumenter and the Execution Processor.
The Execution Instrumenter receives information about the tests created in-house and the inputs of field executions. From this data, the component creates what the authors call execution traces (essentially, files with ordered sequences of instructions that reach a specific part of the program of interest).
The Execution Processor then compares both sets of execution traces to obtain invariant-violating sequences, i.e., the differences between the traces of the in-house tests and the traces of field executions of the program. Finally, the Execution Mimicker uses the marked traces as guidance to generate inputs that mimic the natural use of the application in order to reach a specific point (the technique tries different combinations of actions to reach a specific location in the code). Once it is able to reach that point, Replica can generate new tests that cover the cases that were not previously covered.
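As a much-simplified illustration of the trace-comparison step (not the authors' actual trace representation; here a trace is just an ordered list of executed method names), field traces whose behaviour never appears in the in-house traces are the ones worth mimicking:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class TraceComparison {

    /**
     * Keeps only the field execution traces whose behaviour (approximated here
     * by the exact sequence of executed methods) never appears in-house.
     */
    public static List<List<String>> untestedFieldTraces(
            List<List<String>> inHouseTraces, List<List<String>> fieldTraces) {

        Set<List<String>> covered = Set.copyOf(inHouseTraces);
        return fieldTraces.stream()
                .filter(trace -> !covered.contains(trace))  // behaviour missed by tests
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<String>> inHouse = List.of(List.of("login", "viewCart"));
        List<List<String>> field = List.of(
                List.of("login", "viewCart"),
                List.of("login", "applyCoupon", "checkout"));  // untested behaviour

        // Prints only the coupon/checkout trace: a candidate for a new test.
        System.out.println(untestedFieldTraces(inHouse, field));
    }
}
```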
##### Who (Self) Admits Technical Debt?
To achieve this goal, the authors picked a set of known SATD instances in five Java open-source projects.
Then, they identified the SATD comment lines in the source code and traced them back to their introduction and/or last changes using git blame. After that, based on the comment location, they attached each comment to source code elements (methods, blocks, or single statements) and checked whether any related source code line had also been modified together with the comment. When the comment was added by a developer who was not the last author of the related source code, they highlighted it as an instance of TD admitted by "somebody else".
It can still happen that developers comment on source code not recently modified by them while still having good knowledge of the fragment they are commenting on. To account for this, the authors analyzed the level of "ownership" the comments' authors have over the source code fragment, relying on the proportion of source code changes made by an author.
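A minimal sketch of that ownership measure, under the assumption (mine, consistent with the description above) that ownership is the fraction of a file's recorded changes authored by the commenting developer:

```java
import java.util.Map;

public class Ownership {

    /**
     * Ownership of a developer over a file: changes authored by that developer
     * divided by all changes to the file (assumed definition, for illustration).
     */
    public static double ownership(String developer, Map<String, Integer> changesPerAuthor) {
        int own = changesPerAuthor.getOrDefault(developer, 0);
        int total = changesPerAuthor.values().stream().mapToInt(Integer::intValue).sum();
        return total == 0 ? 0.0 : (double) own / total;
    }

    public static void main(String[] args) {
        // e.g. alice made 12 of the 20 recorded changes to the commented file
        Map<String, Integer> changes = Map.of("alice", 12, "bob", 5, "carol", 3);
        System.out.println(ownership("alice", changes));   // 0.6 -> a major contributor
    }
}
```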
## Findings
##### Who (Self) Admits Technical Debt?
Regarding the first question, the authors concluded that between 0 and 16 percent of SATD comments are admitted by somebody different from the developers authoring the related lines of code. Most of these are newly-added SATD comments, while a minority (between 0 and 31 percent) concern changes to existing comments.
Regarding the second question, they concluded that, in most cases, whoever adds or modifies SATD comments on source code last changed by somebody else is a major contributor to that source code file.
##### SHAKER
Regarding SHAKER, the technique proved not only to detect more flaky tests, but also to do so in less time than other popular alternatives. Even though there is an extra step to find the noise configurations for the test environment, it is run only once per machine. The flakiness of a test is detected by rerunning it several times and looking for divergences in the test outcome. Under load conditions, a single execution of a test is expected to take longer. However, Silva et al. showed that detecting flakiness requires significantly fewer reruns of the test suite, so the total execution time is significantly lower than that of the alternatives. In addition, this technique detected more flaky tests, some of which had not even been marked as such by the owners of the code.
##### Modern Code Review on Software Degradation
The extensive study of Uchoa et al. on the impact of modern code review on software degradation demonstrated to what extent it occurs. While the majority of merged reviews have no impact on the density of code smells (i.e., on design degradation), the remaining ones lead to a negative impact more often than a positive one, particularly for fine-grained code smells. On the other hand, smell diversity is not affected, since removing one type of code smell would require removing all of its instances from the codebase, and smell types tend to be introduced in the early stages of a project.
This study had very interesting results regarding the intent of the reviews. Those with design-related intents tended to have a more positive (or at least neutral) impact on design degradation than other types of review. However, the same did not hold for the mere presence of design awareness in a code review discussion.
A single review process consists of a series of revisions until the code change is ultimately merged. The authors did not find evidence that the design impact changes across revisions, even for reviews with design-related concerns.
Design-related reviews showed limited impact on coarse-grained smells, even when there was design feedback; in fact, these smells were mostly aggravated. Their removal requires complex refactorings, and they often represent more severe problems. Fine-grained smells, on the other hand, are simpler to remove and refactor, and most of the time they represent smaller readability and understandability problems.
Most code reviews have a mixed impact, i.e., even though they fix one degradation symptom, other issues may be introduced. Through careful analysis of the dataset, the authors theorized that, even though the net impact is mostly positive, developers are still unable to see all the ramifications of the impact of their changes.
The final findings of this study concern the impact of code review practices on software degradation. Interestingly, long code review discussions are often associated with a higher risk of software degradation. In such discussions, the participants are not necessarily concerned with the structural quality of the code, but mostly with some functionality, and they tend to lead to further code changes, which ultimately increases degradation. Moreover, when there is a high rate of disagreement among participants, the risk of degradation also increases. The duration of the review is also associated with an increased risk of degradation, but for different reasons: after manual analysis, the authors showed that long review durations are often a sign of lack of attention by reviewers. A code review practice that leads to a positive impact is the number of active reviewers engaged in code reviews. When many reviewers are actively engaged, there is a decreasing effect on the degradation risk, particularly for coarse-grained smells. Again, these smells are usually more complex, and their removal benefits from a large and active pool of reviewers with a better understanding of the codebase.
##### Assessing Mock Classes: An Empirical Study
The results of this study answer each of the three main research questions and lead to some conclusions regarding the different types of mocks present in real-world software.
Regarding the content of the mock classes, the authors observed that the most common category is domain objects (35%), followed by external dependencies (23%) and web services (15%). Mock classes mimicking domain objects are indeed the most frequent, with 211 classes in total. From this result we can assume that developers tend to mock the same types of dependencies regardless of whether they use mocking frameworks or create mock classes themselves, since the results of this empirical study are very similar to those of the study on mocking frameworks.
Concerning the design of the mock classes, class extension is more common than interface implementation, and the majority of mock classes are public, so any part of the code can use them (reuse can be considered an important advantage of creating mock classes). As for the number of methods, both mock classes and regular classes have a median of 3 methods, which suggests that the effort to maintain both types of classes is similar.
Regarding the last topic, the external use of mock classes, mock classes are largely used by client projects to help create their own tests (in the Boa dataset, 6,444 mock classes are used 147,433 times), and web service mock classes are the most emulated dependencies.
##### Improving Testing by Mimicking User Behavior
To assess the effectiveness of Replica, the authors performed an empirical evaluation in which they examined whether the technique could indeed add tests for behaviours observed in the field and missed by the in-house tests, whether it could detect faults revealed in the field and missed by the in-house tests (i.e., whether it can generate tests that kill the same mutants killed by the observed field executions), and whether it is more effective than a vanilla way of generating new tests.
To answer these questions, the researchers selected four real, widely used open source programs with developer-written in-house test suites available. With the subjects of the empirical study ready, the authors ran the experiments and applied their metrics, and the results were very positive regarding the performance of Replica. On average, Replica was able to generate tests exercising 56% of the behaviours missed by the in-house tests. Regarding mutants, 26% of the mutants killed by Replica's tests had not previously been killed by the in-house tests. Although Replica was not able to exercise all the behaviours missed by in-house tests and exercised by field executions, it was able to automatically exercise over half of them on average.
## Conclusions
##### Assessing Mock Classes: An Empirical Study
The results of this empirical study are quite promising, and several conclusions can be drawn from them. Whether using mocking frameworks or creating mock classes by hand, developers tend to mock the same types of dependencies. Another conclusion is that mock classes are a significant part of the software testing domain, since 604 such classes were found in only 12 projects. These classes are mostly created to mock domain objects, are usually part of a hierarchy and public, and are largely consumed by external client projects.
##### Improving Testing by Mimicking User Behavior
Tests created by developers rarely reflect the way software is actually used day to day, and Replica is a very useful technique to tackle this problem. The results of the empirical study show that Replica was successful in testing behaviours that were not covered by the in-house tests.
__ _Conclusion for the whole session, e.g., highlight areas of research, methods, results, and future of the field._ __
## References
* ICSME 2020 Program (**Quality Improvement II**): https://icsme2020.github.io/program/schedule.html#quality2
* Shake It! Detecting Flaky Tests Caused by Concurrency with Shaker. Denini Silva, Leopoldo Teixeira and Marcelo d'Amorim (Research Track)
https://ieeexplore.ieee.org/document/9240694
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9240694
* Assessing Mock Classes: An Empirical Study. Gustavo Pereira and Andre Hora (Research Track)
https://ieeexplore.ieee.org/document/9240675
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9240675
* D. Spadini, M. Aniche, M. Bruntink, and A. Bacchelli, “To mock or not to mock? an empirical study on mocking practices,” in International Conference on Mining Software Repositories, 2017, pp. 402–412.
* Improving Testing by Mimicking User Behavior. Qianqian Wang and Alessandro Orso (Research Track)
https://ieeexplore.ieee.org/document/9240614
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9240614
* How Does Modern Code Review Impact Software Design Degradation? An In-depth Empirical Study. Anderson Uchôa, Caio Barbosa, Willian Oizumi, Publio Blenílio, Rafael Lima, Alessandro Garcia and Carla Bezerra (Research Track)
https://ieeexplore.ieee.org/document/9240657
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9240657
* Who (Self) Admits Technical Debt?. Gianmarco Fucci, Fiorella Zampetti, Alexander Serebrenik and Massimiliano Di Penta (New Ideas Track)
https://ieeexplore.ieee.org/document/9240605
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9240605