"Many ideas grow better when transplanted into another mind
than the one where they sprang up." (Oliver Wendell Holmes)
Interdependence is certainly more valuable than independence.
These publications represent the results of several fruitful collaborations
with excellent students and great colleagues around the world.
Without their support, I would never have been able to produce all the
papers listed below. Thanks to all!
Authors are usually listed in alphabetical order. When this rule is not followed, authors are listed by seniority (students first). Authors marked with * are (former) students I (co-)advised.
Modern video games are extremely complex software systems and, as such, they might suffer from several types of post-release issues. A particularly insidious issue is represented by drops in the frame rate (i.e., stuttering events), which might have a negative impact on the user experience. Stuttering events are frequently documented in the millions of hours of gameplay videos shared by players on platforms such as Twitch or YouTube. From the developers’ perspective, these videos represent a free source of documented “testing activities”. However, especially for popular games, the quantity and length of these videos make their manual inspection impractical. We introduce HASTE, an approach for the automatic detection of stuttering events in gameplay videos that can be exploited to generate candidate bug reports. HASTE first splits a given video into visually coherent slices, with the goal of filtering out those that do not represent actual gameplay (e.g., navigating the game settings). Then, it identifies the subset of pixels in the video frames that actually show the game in action, excluding additional on-screen elements such as the logo of the YouTube channel, on-screen chats, etc. In this way, HASTE can exploit state-of-the-art image similarity metrics to identify candidate stuttering events, namely subsequent frames that are almost identical in the pixels depicting the game. We evaluate the different steps behind HASTE on a total of 105 videos, showing that it can correctly extract video slices with 76% precision and can correctly identify the slices related to gameplay with a recall and precision higher than 77%. Overall, HASTE achieves 71% recall and 89% precision for the identification of stuttering events in gameplay videos.
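The core idea of the last step, flagging nearly identical consecutive frames in the gameplay region, can be pictured with a minimal sketch. This is not the HASTE implementation: the similarity measure (a normalized pixel difference), the function name, and the threshold below are illustrative assumptions, whereas HASTE relies on state-of-the-art image similarity metrics.

# Minimal sketch (not the HASTE implementation): flag candidate stuttering
# events by measuring how similar consecutive gameplay frames are.
import numpy as np

def find_stutter_candidates(frames, threshold=0.999):
    """frames: list of HxWx3 numpy arrays (already cropped to the gameplay region).
    Returns the indices of frames that are almost identical to the previous one."""
    candidates = []
    for i in range(1, len(frames)):
        prev = frames[i - 1].astype(np.float32)
        curr = frames[i].astype(np.float32)
        # Normalized similarity in [0, 1]: 1 means pixel-wise identical frames.
        similarity = 1.0 - np.mean(np.abs(prev - curr)) / 255.0
        if similarity >= threshold:
            candidates.append(i)
    return candidates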
Large Language Models (LLMs) have demonstrated effectiveness in tackling coding tasks, leading to their growing popularity in commercial solutions like GitHub Copilot and ChatGPT. These models, however, may be trained on proprietary code, raising concerns about potential leaks of intellectual property. A recent study indicates that LLMs can memorize parts of the source code, rendering them vulnerable to extraction attacks. However, it relied on white-box attacks, which assume that adversaries have partial knowledge of the training set. This paper presents a pioneering effort to conduct a black-box attack (reconstruction attack) on an LLM designed for a specific coding task, i.e., code summarization. The results reveal that, while the attack is generally unsuccessful (with an average BLEU score below 0.1), it succeeds in a few instances, reconstructing versions of the code that closely resemble the original.
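As a purely illustrative sketch of how a reconstructed snippet could be scored against the original code with BLEU (the metric reported above), one might compute a sentence-level score as follows; the whitespace tokenization and the smoothing choice are assumptions, not necessarily the paper's setup.

# Illustrative only: scoring a reconstruction attempt against the original code.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def reconstruction_bleu(original_code: str, reconstructed_code: str) -> float:
    reference = [original_code.split()]      # whitespace tokenization (assumption)
    hypothesis = reconstructed_code.split()
    return sentence_bleu(reference, hypothesis,
                         smoothing_function=SmoothingFunction().method1)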
Docker is the de facto standard for software containerization. A Dockerfile contains the requirements to build a Docker image containing a target application. There are several best practice rules for writing Dockerfiles, but developers do not always follow them. Violations of such practices, known as Dockerfile smells, can negatively impact the reliability and performance of Docker images. Previous studies showed that Dockerfile smells are widely diffused, and there is a lack of automatic tools that support developers in fixing them. However, it is still unclear which Dockerfile smells get fixed by developers and to what extent developers would be willing to fix smells in the first place. The aim of our study is twofold. First, we want to understand which Dockerfile smells receive more attention from developers, i.e., are fixed more frequently in the history of open-source projects. Second, we want to check whether developers are willing to accept changes aimed at fixing Dockerfile smells (e.g., generated by an automated tool), to understand if they care about them. We evaluated the survivability of Dockerfile smells on a total of 53,456 unique Dockerfiles, where we manually validated a large sample of smell-removing commits to understand (i) if developers performed the change with the intention of removing bad practices, and (ii) if they were aware of the removed smell. In the second part, we used a rule-based tool to automatically fix Dockerfile smells. Then, we proposed such fixes to developers via pull requests. Finally, we quantitatively and qualitatively evaluated the outcome after a monitoring period of more than 7 months. The results of our study show that most developers pay more attention to changes aimed at improving the performance of Dockerfiles (image size and build time). Moreover, they are willing to accept the fixes for the most common smells, with some exceptions (e.g., missing version pinning for OS packages).
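To make the notion of a rule-based check concrete, here is a minimal, assumption-laden sketch in the spirit of the smell mentioned above (missing version pinning for OS packages): it flags apt-get installs whose packages are not pinned to a version. The function name and the simplistic line parsing are illustrative; real detectors such as hadolint implement far more precise rules, and the fixing tool used in the study is not shown.

# Hypothetical rule-based check: report apt-get packages without version pinning.
import re

PIN_PATTERN = re.compile(r"^\S+=\S+$")  # expected form: package=version

def unpinned_apt_packages(dockerfile_text: str):
    smells = []
    for lineno, line in enumerate(dockerfile_text.splitlines(), start=1):
        if "apt-get install" in line:
            packages = line.split("apt-get install", 1)[1].split()
            for pkg in packages:
                if pkg.startswith("-") or pkg in ("&&", "\\"):
                    continue  # skip flags and shell continuation tokens
                if not PIN_PATTERN.match(pkg):
                    smells.append((lineno, pkg))
    return smells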
Voice-based virtual assistants are becoming increasingly popular. Such systems provide frameworks to developers for building custom apps. End-users can interact with such apps through a Voice User Interface (VUI), which allows the user to use natural language commands to perform actions. Testing such apps is not trivial: The same command can be expressed in different semantically equivalent ways. In this article, we introduce VUI-UPSET, an approach that adapts chatbot-testing approaches to VUI-testing. We conducted an empirical study to understand how VUI-UPSET compares to two state-of-the-art approaches (i.e., a chatbot testing technique and ChatGPT) in terms of (i) correctness of the generated paraphrases, and (ii) capability of revealing bugs. To this aim, we analyzed 14,898 generated paraphrases for 40 Alexa Skills. Our results show that VUI-UPSET generates more bug-revealing paraphrases than the two baselines; ChatGPT, however, is the approach generating the highest percentage of correct paraphrases. We also tried to use the generated paraphrases to improve the skills, by including in their voice interaction models (i) only the bug-revealing paraphrases or (ii) all the valid paraphrases. We observed that including only bug-revealing paraphrases is sometimes not sufficient to make all the tests pass.
Strokes constitute a major cause of both mortality and disability, carrying significant economic implications for healthcare systems. Evaluating the quality of gait in post-stroke patients during rehabilitation is essential for providing effective care. The Dynamic Gait Index (DGI) is a valuable metric for evaluating gait quality. However, the assessment of such an index typically requires invasive tests or specialized sensors. In this paper, we introduce a machine learning-based approach for estimating the DGI exclusively from video recordings. Our research encompasses a comprehensive set of experiments, including data preprocessing, feature selection, and the application of various machine learning algorithms. To ensure the robustness of our findings, we employ the Leave 1 Subject Out (L1SO) cross-validation method. Our results underscore the challenge of accurately estimating the DGI using solely video data: we achieved an R-squared (R^2) value of only 0.19 and a mean absolute error (MAE) of 2.2. We also observed that our approach yielded notably poorer results for a specific subset of three patients. Upon excluding this subset, the R^2 increased to 0.30 and the MAE improved to 1.9. This observation suggests that incorporating patient-specific features into the model may hold the key to enhancing its overall accuracy.
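The validation scheme mentioned above (leave one subject out) can be sketched with scikit-learn as follows; the regressor, the feature matrix, and the error aggregation are placeholders chosen for illustration, not the exact pipeline used in the paper.

# Sketch of a Leave-1-Subject-Out evaluation with a placeholder regressor.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def l1so_evaluate(X, y, subject_ids):
    """X: numpy array of gait features per recording, y: DGI scores,
    subject_ids: patient identifier for each row (used as the group)."""
    errors = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subject_ids):
        model = RandomForestRegressor(random_state=0)
        model.fit(X[train_idx], y[train_idx])
        errors.append(mean_absolute_error(y[test_idx], model.predict(X[test_idx])))
    return np.mean(errors)  # average MAE across held-out subjects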
We present 2ViTA-B Cognitive, an advanced virtual assistant designed to effectively address emotional functioning disorders and enhance emotional well-being. The system is carefully crafted to engage with users and positively influence their emotions, supporting cognitive rehabilitation in both real and virtual environments. A controlled experiment has been conducted to evaluate the benefits of the proposed system. The results provide valuable insights into the potential benefits of immersive virtual reality interventions for improving emotional well-being and cognitive functions. These findings suggest promising avenues for advancements in therapeutic practices within this field.
The video game industry has experienced continuous growth in the last decades. In such a competitive market, it is fundamental to ensure a great gaming experience for the player by avoiding, for example, bugs. However, video game testing is an extremely challenging activity, especially considering the extensive number of gaming scenarios that modern video games support (e.g., 3D worlds to explore). Thus, more often than not, numerous bugs are discovered only once the game is released and played by millions of users. For this reason, recent work in the literature suggested exploiting gameplay videos to support developers in identifying possible bugs missed during testing: given the large amount of gameplays posted every day on streaming platforms (> 2M hours), these gameplays are likely to document failures experienced by the player. Empirical evidence shows the ability of these techniques to identify parts of the gameplay in which the failure was experienced. However, it could still be difficult for game developers to reproduce the bug. In this paper, we propose the idea of developing a technique able to automate this process, providing the game developer with all the actions performed by the player to reach the faulty state shown in the gameplay. We present a simple approach which leverages the on-screen controls overlay available in some gameplay videos. We show that such an approach can replicate 47.2% of the gameplays in our preliminary study run on a racing game. We discuss the strong limitations of this first attempt, listing directions for future work we plan to pursue in order to overcome them.
Dockerfiles can be affected by bad design choices, known as Dockerfile smells. Hadolint is currently the reference tool able to detect them, and it is widely used both by researchers and practitioners. The literature shows that these smells are commonly diffused in Dockerfiles, but it is still not clear to what extent developers perceive them as bad practices. This paper aims to investigate the relevance of the Dockerfile smells captured by hadolint from the perspective of expert Dockerfile developers. We first perform a mining study in which we extract the change history of Dockerfiles maintained by experts to understand which smells have been more frequently introduced in their history. Next, we run a survey in which we ask expert Dockerfile developers to evaluate Dockerfiles affected by different smells. We obtained 94 responses for 17 smells, representative of 24 Dockerfile smells. We found that experts prioritize a small part of the evaluated smells over others. Besides, they report additional bad practices not mapped as smells in any existing catalog. Thus, we propose a ranked catalog containing 26 additional Dockerfile smells, which can be used as a guide for novices to understand which aspects to focus on to write good-quality Dockerfiles.
QualAI is a two-year project that aims to define a set of recommenders to continuously monitor, assess, and improve the quality of AI-based systems, with a particular focus on ML-based systems. Quality assurance will be guaranteed from different perspectives and during both the development and operations phases. We will define recommenders for the quality assurance of both data and ML models to enable practitioners to mitigate technical debt. Emphasis will be given to communication issues that could arise in hybrid teams including data scientists and software developers. In this paper, we present the project outline, provide an executive summary of the research activities, and present the expected project results.
Writing and modifying source code are core activities in software development and evolution. The outcome of a coding task in terms of quality may depend on several aspects, such as the difficulty of the task or the complexity of the system. Besides, it is well known that individual characteristics of developers, like programming experience, play a leading role in this. Recent work started exploring the influence that cognitive human aspects have on the ability of developers to acquire information from the source code (e.g., finding security blind spots). However, it is still unknown to what extent such aspects influence their ability to complete coding tasks. In this paper, we theorize that two cognitive human aspects, attention and memory, play a role in predicting the outcome of a coding task. We conducted a controlled experiment involving 32 participants (18 bachelor students, 9 master students, 2 Ph.D. students, and 3 practitioners), in which we asked them to complete two bug-fixing and two feature implementation tasks. We measured, for each of them, three attention-related factors (i.e., alerting, orienting, and executive control) and two memory-related ones (i.e., working memory and immediate recall) through well-established psychometric tests. Finally, we investigated to what extent these factors can explain the correctness, the readability, and the time taken to complete a task. Our results show that all the attention- and memory-related factors achieved a very low correlation with correctness and time. Indeed, the number of years of programming experience is far more important than all the other variables we considered for explaining the correctness and the time required to complete a task. Moreover, we found a significant relationship between orienting (an attention-related factor) and code readability.
The game industry has been growing steadily in recent years. Every day, millions of people play video games, not only as a hobby, but also for professional competitions (e.g., e-sports or speed-running) or to make a business out of entertaining others (e.g., streamers). The latter produce, on a daily basis, a large amount of gameplay videos in which they also comment live on what they experience. But no software and, thus, no video game is perfect: Streamers may encounter several problems (such as bugs, glitches, or performance issues) while they play. Also, it is unlikely that they explicitly report such issues to developers. The identified problems may negatively impact the user’s gaming experience and, in turn, can harm the reputation of the game and of the producer. In this paper, we propose and empirically evaluate GELID, an approach for automatically extracting relevant information from gameplay videos by (i) identifying video segments in which streamers experienced anomalies; (ii) categorizing them based on their type (e.g., logic or presentation); clustering them based on (iii) the context in which they appear (e.g., level or game area) and (iv) on the specific issue type (e.g., game crashes). We manually defined a training set for step 2 of GELID (categorization) and a test set for validating in isolation the four components of GELID. In total, we manually segmented, labeled, and clustered 170 videos related to 3 video games, defining a dataset containing 604 segments. While in steps 1 (segmentation) and 4 (specific issue clustering) GELID achieves satisfactory results, it shows limitations on step 3 (game context clustering) and, above all, step 2 (categorization).
Automatically linking bug-fixing changes to bug-inducing ones (BICs) is one of the key data-extraction steps behind several empirical studies in software engineering. The SZZ algorithm is the de facto standard to achieve this goal, with several improvements proposed over time. Evaluating the performance of SZZ implementations is, however, far from trivial. In previous works, researchers (i) manually assessed whether the BICs identified by the SZZ implementation were correct or not, or (ii) defined oracles in which they manually determined BICs from bug-fixing commits. However, ideally, the original developers should be involved in defining a labeled dataset to evaluate SZZ implementations. We propose a methodology to define a “developer-informed” oracle for evaluating SZZ implementations, without requiring a manual inspection from the original developers. We use Natural Language Processing (NLP) to identify bug-fixing commits in which developers explicitly reference the commit(s) that introduced the fixed bug. We use the built oracle to extensively evaluate existing SZZ variants defined in the literature. We also introduce and evaluate two new variants aimed at addressing two weaknesses we observed in state-of-the-art implementations (i.e., processing added lines and handling of revert commits).
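The idea behind the oracle construction, scanning bug-fixing commit messages for explicit references to the commit that introduced the bug, can be pictured with a few regular expressions; these patterns and the function name are illustrative placeholders and are much simpler than the NLP used in the paper.

# Hypothetical sketch: extract explicitly referenced bug-inducing commit hashes
# from a bug-fixing commit message.
import re

REFERENCE_PATTERNS = [
    re.compile(r"introduced (?:in|by) ([0-9a-f]{7,40})", re.IGNORECASE),
    re.compile(r"regression (?:from|since) ([0-9a-f]{7,40})", re.IGNORECASE),
]

def referenced_bug_inducing_commits(commit_message: str):
    hits = []
    for pattern in REFERENCE_PATTERNS:
        hits.extend(pattern.findall(commit_message))
    return hits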
Deep learning (DL) techniques have been used to support several code-related tasks such as code summarization and bug-fixing. In particular, pre-trained transformer models are on the rise, also thanks to the excellent results they achieved in Natural Language Processing (NLP) tasks. The basic idea behind these models is to first pre-train them on a generic dataset using a self-supervised task (e.g., filling masked words in sentences). Then, these models are fine-tuned to support specific tasks of interest (e.g., language translation). A single model can be fine-tuned to support multiple tasks, possibly exploiting the benefits of transfer learning. This means that knowledge acquired to solve a specific task (e.g., language translation) can be useful to boost performance on another task (e.g., sentiment classification). While the benefits of transfer learning have been widely studied in NLP, limited empirical evidence is available when it comes to code-related tasks. In this paper, we assess the performance of the Text-To-Text Transfer Transformer (T5) model in supporting four different code-related tasks: (i) automatic bug-fixing, (ii) injection of code mutants, (iii) generation of assert statements, and (iv) code summarization. We pay particular attention to the role played by pre-training and multi-task fine-tuning on the model's performance. We show that (i) the T5 model can achieve better performance than state-of-the-art baselines, and (ii) while pre-training helps the model, not all tasks benefit from multi-task fine-tuning.
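For readers unfamiliar with the text-to-text formulation, the sketch below shows how a T5 checkpoint can be queried with a task-prefixed input using the Hugging Face transformers library; the checkpoint name, the task prefix, and the example method are placeholders and do not reflect the paper's training setup (a raw, non-fine-tuned checkpoint would not actually fix the bug).

# Minimal, assumed example of querying a T5 checkpoint with a task prefix.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")           # placeholder checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-small")

buggy_method = "public int sum(int a, int b) { return a - b; }"
inputs = tokenizer("fix bug: " + buggy_method, return_tensors="pt")  # "fix bug:" is an assumed prefix
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))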
The increasing diffusion of mobile devices and their integration with sophisticated hardware and software components has promoted the development of numerous applications in which developers find new ingenious ways to exploit the possibilities offered by access to resources such as cameras, biometric sensors, and GPS receivers. As a result, we are increasingly used to seeing applications that make extensive use of sensitive resources, which is potentially dangerous for our privacy. To address this problem, the latest approach to supporting user awareness in terms of privacy is represented by Privacy Indicators (PI), a software solution implemented by the operating system to provide a visual stimulus that informs users whenever a dangerous resource is exploited by an app. However, the effectiveness of this approach has not been assessed yet. In this article, we present the results of a study on the effectiveness of using PI to inform the user every time an app accesses the mobile device camera or microphone. We chose these two resources as PI are currently implemented only for a very limited number of permissions. The controlled experiment involved 122 Android users who were asked to complete a series of tasks on their smartphone through prototypes using the involved resources in an explicit and latent way. Although the PI mechanism is very similar between Android and iOS, we decided to focus on the former due to its greater diffusion. The results show no significant correlation between the use of PI and the detection of the resource being used by the app, suggesting that the effectiveness of PI, as currently implemented, in improving awareness of sensitive resource usage is still unsatisfactory. In order to understand whether the problem was due to the specific implementation of PI, we implemented an enhanced version and compared it with the standard one. The results confirmed that an implementation that makes the indicators more visible, and that is clearer in highlighting the fact that the app is accessing a resource, improves resource usage awareness.
Docker is a containerization technology that allows developers to ship software applications along with their dependencies in Docker images. Developers can extend existing images, using them as base images when writing Dockerfiles. However, many functionally equivalent alternative base images are available. Although many studies define and evaluate quality features that can be extracted from Docker artifacts, the criteria on which developers choose one base image over another remain unclear. In this article, we aim to fill this gap. First, we conduct a literature review through which we define a taxonomy of quality features, identifying two main groups: configuration-related features (i.e., mainly related to the Dockerfile and image build process) and externally observable features (i.e., what the Docker image users can observe). Second, we run an empirical study considering the developers’ preference for 2,441 Docker images in 1,911 open source software projects. We want to understand how the externally observable features influence the developers’ preferences, and how they are related to the configuration-related features. Our results pave the way to the definition of a reliable quality measure for Docker artifacts, along with tools that support developers in the quality-aware development of them.
Blockchain is a distributed computation platform that allows users to provide software for a huge range of next-generation decentralized applications without involving trusted third parties. Smart contracts (SCs) are an important component in blockchain applications: they are programmatic agreements between two or more parties that cannot be rescinded. Furthermore, SCs have an important characteristic: they allow users to implement reliable transactions without involving third parties. However, the advantages of SCs come at a price. Like any program, SCs can contain bugs, some of which may also constitute security threats. Writing correct and secure SCs can be extremely difficult because, once deployed, they cannot be modified. Although SCs have been introduced only recently, a large number of approaches have been proposed to find bugs and vulnerabilities in them. In this article, we present a systematic literature review on approaches for the automated detection of bugs and vulnerabilities in SCs. We survey 68 papers published between 2015 and 2020, and we annotate each paper according to our classification framework to provide quantitative results and find possible areas not explored yet. Finally, we identify the open problems in this research field to provide possible directions to future researchers.
Recent research highlights the significance of active aging, emphasizing its role in enhancing the wellbeing of the elderly population. To effectively promote active aging and address the evolving needs of our growing global aging demographic, a fusion of technological innovation and progressive policy strategies is imperative. In response to this challenge, we introduce the D3A (Digital Assistant for Active Aging) system – a pioneering platform aimed at elevating the quality of life for the elderly. D3A focuses on optimizing interconnected facets of physical and cognitive performance. Within this framework, healthcare professionals can tailor individualized training regimens encompassing both physical and cognitive exercises, which can be experienced in both traditional and virtual reality settings. D3A leverages artificial intelligence algorithms to assess the efficacy of these exercises, providing valuable support to healthcare providers. Moreover, the platform incorporates wearable devices to monitor users’ physiological parameters, serving as crucial indicators of their overall health. Also, D3A employs blockchain technology to ensure robust security and data immutability for access control, activity logs, and health records within the application. This holistic approach to active aging promises a brighter future for our aging population, driven by innovation and evidence-based care.
Logic bombs are a critical security threat in Android applications: they can be triggered by specific events or conditions, leading to serious consequences. In this work, we focus on apps that access mobile device resources to leak sensitive data. Such malicious behaviour can exploit the Android permission model by gaining access to sensitive resources in a legitimate context and later using them in a dangerous one, once the logic bomb is triggered. We propose a dynamic approach by extending RPCDroid, a tool that monitors the behavior of an Android application whenever it accesses specific device resources. To defuse the logic bomb, we force an explicit prompt to authorize access requests based on the usage context, preventing accesses unbeknownst to the user. We assessed the effectiveness of our proposal using TriggerZoo, a publicly available dataset of apps injected with logic bombs. Our results show that a context-aware permission model can effectively prevent uncontrolled access to privacy-related data when a logic bomb is triggered.
Over the years, there has been an explosion in the app market offering users a wide range of functionalities, especially since modern devices are equipped with many hardware resources such as cameras, GPS, and so on. Unfortunately, this is sometimes associated with indiscriminate access to sensitive data. This exposes users to security and privacy risks because, although resource usage requires explicit user authorization, once permission is granted, a mobile application is usually free to access the corresponding resource until the permission is expressly revoked or the app is uninstalled. In this work, we introduce RPCDroid, a dynamic analysis tool for run-time tracking of the behavior (UI events and used permissions) of Android mobile applications that use device resources requiring dangerous permissions. We assessed the effectiveness of the tool in identifying usage contexts, discriminating between different kinds of access to the same sensitive resource. We executed RPCDroid on a set of popular applications, obtaining evidence that, in many cases, mobile applications access the same resource through different user interactions.
Reading source code occupies most of a developer’s daily activities. Any maintenance and evolution task requires developers to read and understand the code they are going to modify. For this reason, previous research focused on the definition of techniques to automatically assess the readability of a given snippet. However, when many unreadable code sections are detected, developers might be required to manually modify them all to improve their readability. While existing approaches aim at solving specific readability-related issues, such as improving variable names or fixing styling issues, there is still no approach to automatically suggest which actions should be taken to improve code readability. In this paper, we define the first holistic readability-improving approach. As a first contribution, we introduce a methodology for automatically identifying readability-improving commits, and we use it to build a large dataset of 122k commits by mining the whole revision history of all the projects hosted on GitHub between 2015 and 2022. We show that such a methodology has ∼86% accuracy. As a second contribution, we train and test the T5 model to emulate what developers did to improve readability. We show that our model achieves a perfect prediction accuracy between 21% and 28%. The results of a manual evaluation we performed on 500 predictions show that, when the model applies changes without altering the behavior of the input (34% of the cases), it improves code readability in the large majority of the cases (79.4%).
Containerization allows developers to define the execution environment in which their software needs to be installed. Docker is the leading platform in this field, and developers that use it are required to write a Dockerfile for their software. Writing Dockerfiles is far from trivial, especially when the system has unusual requirements for its execution environment. Although several tools exist to support developers in writing Dockerfiles, none of them is able to generate entire Dockerfiles from scratch given a high-level specification of the requirements of the execution environment. In this paper, we present a study in which we aim at understanding to what extent Deep Learning (DL), which has been proven successful for other coding tasks, can be used for this specific coding task. We preliminarily defined a structured natural language specification for Dockerfile requirements and a methodology that we use to automatically infer the requirements from the largest dataset of Dockerfiles currently available. We used the obtained dataset, with 670,982 instances, to train and test a Text-to-Text Transfer Transformer (T5) model, following the current state-of-the-art procedure for coding tasks, to automatically generate Dockerfiles from the structured specifications. The results of our evaluation show that T5 performs similarly to the more trivial IR-based baselines we considered. We also report the open challenges associated with the application of deep learning in the context of Dockerfile generation.
Stroke is a serious medical condition that can result in permanent brain damage and other pathological issues. The conditions suffered by survivors range in severity from full recovery to significant movement disability. Even though some may recover quickly, many stroke survivors require long-term support to help them achieve as much independence as they can. Thanks to proper rehabilitation, patients who have experienced a stroke can work to regain skills that are suddenly lost when a section of their brain is injured. Due to the breakdown of neuronal networks in the motor cortex, abnormal gait patterns are a typical disability after a stroke. Therefore, gait analysis can be a powerful tool to support stroke patients during rehabilitation. In this work, we propose GIULYO, a Machine Learning-based tool that offers support in the assessment of video gait trials in stroke patients by providing an automatic analysis of the muscle activity of the assisted subject. GIULYO is a device-agnostic tool because it accepts motion tracking data in terms of 3D trajectories regardless of the type of instrumentation. GIULYO has been validated on the ARRA Stroke dataset, and the results showed an overall accuracy of 0.74, while on a subset of patients with a common clinical assessment of mobility impairments the accuracy increased to 0.92, demonstrating the feasibility of a ML-based approach for the rehabilitation support of post-stroke patients.
Software engineering research has always been concerned with the improvement of code completion approaches, which suggest the next tokens a developer will likely type while coding. The recent release of GitHub Copilot constitutes a big step forward, also because of its unprecedented ability to automatically generate even entire functions from their natural language description. While the usefulness of Copilot is evident, it is still unclear to what extent it is robust. Specifically, we do not know the extent to which semantic-preserving changes in the natural language description provided to the model have an effect on the generated code function. In this paper, we present an empirical study in which we aim at understanding whether different but semantically equivalent natural language descriptions result in the same recommended function. A negative answer would pose questions on the robustness of deep learning (DL)-based code generators, since it would imply that developers using different wordings to describe the same code would obtain different recommendations. We asked Copilot to automatically generate Java methods starting from their original Javadoc descriptions. Then, we generated different semantically equivalent descriptions for each method, both manually and automatically, and we analyzed the extent to which the predictions generated by Copilot changed. Our results show that modifying the description results in different code recommendations in ~46% of cases. Also, differences in the semantically equivalent descriptions might significantly impact the correctness of the generated code (±28%).
Knowing which parts of the source code will be defective can allow practitioners to better allocate testing resources. For this reason, many approaches have been proposed to achieve this goal. Most state-of-the-art predictive models rely on product and process metrics, i.e., they predict the defectiveness of a component by considering what developers did. However, there is still limited evidence of the benefits that can be achieved in this context by monitoring how developers complete a development task. In this paper, we present an empirical study in which we aim at understanding whether measuring human aspects of developers while they write code can help predict the introduction of defects. First, we introduce a new developer-based model which relies on behavioral, psycho-physical, and control factors that can be measured during the execution of development tasks. Then, we run a controlled experiment involving 20 software developers to understand if our developer-based model is able to predict the introduction of bugs. Our results show that a developer-based model is able to achieve a similar accuracy compared to a state-of-the-art code-based model, i.e., a model that uses only features measured from the source code. We also observed that by combining the models it is possible to obtain the best results (~84% accuracy).
The automatic generation of source code is one of the long-lasting dreams in software engineering research. Several techniques have been proposed to speed up the writing of new code. For example, code completion techniques can recommend to developers the next few tokens they are likely to type, while retrieval-based approaches can suggest code snippets relevant to the task at hand. Also, deep learning has been used to automatically generate code statements starting from a natural language description. While research in this field is very active, there is no study investigating what the users of code recommender systems (i.e., software practitioners) actually need from these tools. We present a study involving 80 software developers to investigate the characteristics of code recommender systems they consider important. The output of our study is a taxonomy of 70 “requirements” that should be considered when designing code recommender systems. For example, developers would like the recommended code to use the same coding style as the code under development. Also, code recommenders that are “aware” of the developers’ knowledge (e.g., the frameworks/libraries they have already used in the past) and able to customize the recommendations based on this knowledge would be appreciated by practitioners. The taxonomy resulting from our study points to a wide set of future research directions for code recommenders.
Refactoring aims at improving the maintainability of source code without modifying its external behavior. Previous works proposed approaches to recommend refactoring solutions to software developers. The generation of the recommended solutions is guided by metrics acting as proxy for maintainability (e.g., number of code smells removed by the recommended solution). These approaches ignore the impact of the recommended refactorings on other non-functional requirements, such as performance, energy consumption, and so forth. Little is known about the impact of refactoring operations on non-functional requirements other than maintainability. We aim to fill this gap by presenting the largest study to date to investigate the impact of refactoring on software performance, in terms of execution time. We mined the change history of 20 systems that defined performance benchmarks in their repositories, with the goal of identifying commits in which developers implemented refactoring operations impacting code components that are exercised by the performance benchmarks. Through a quantitative and qualitative analysis, we show that refactoring operations can significantly impact the execution time. Indeed, none of the investigated refactoring types can be considered “safe” in ensuring no performance regression. Refactoring types aimed at decomposing complex code entities (e.g., Extract Class/Interface, Extract Method) have higher chances of triggering performance degradation, suggesting their careful consideration when refactoring performance-critical code.
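As a purely illustrative sketch (not the analysis pipeline used in the study), one way to check whether a refactoring commit degrades a benchmark is to compare execution times measured before and after the commit with a non-parametric test; the test choice, the significance level, and the function name are assumptions.

# Hypothetical check for a performance regression across a refactoring commit.
from scipy.stats import mannwhitneyu

def performance_regression(times_before, times_after, alpha=0.05):
    """times_before/times_after: lists of benchmark execution times (seconds)."""
    stat, p_value = mannwhitneyu(times_before, times_after, alternative="two-sided")
    slower = (sum(times_after) / len(times_after)) > (sum(times_before) / len(times_before))
    # Flag a regression only if the difference is significant and in the slower direction.
    return p_value < alpha and slower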
Congestive heart failure (CHF) is a chronic heart disease that causes debilitating symptoms and leads to higher mortality and morbidity. In this paper, we present HARPER, a novel automatic detector of CHF episodes able to distinguish between Normal Sinus Rhythm (NSR), CHF, and no-CHF. The main advantages of HARPER are its reliability and its capability of providing an early diagnosis. Indeed, the method is based on evaluating real-time features and observing a brief segment of the ECG signal. HARPER is an independent tool, meaning that it does not need any ECG annotation or segmentation algorithm to perform detection. The approach underwent a complete experimentation involving both the intra- and inter-patient validation schemes. The results are comparable to state-of-the-art methods, highlighting the suitability of HARPER to be used in modern IoMT systems as a multi-class, fast, and highly accurate detector of CHF. We also provide guidelines for configuring a temporal window to be used in the automatic detection of CHF episodes.
Atrial fibrillation (AF) is a medical disorder that affects the atria of the heart. AF has emerged as a worldwide cardiovascular epidemic affecting more than 33 million people around the world. Several automated approaches based on the analysis of the ECG have been proposed to facilitate the manual identification of AF episodes. Specifically, such approaches analyze the heartbeat morphology (absence of the P-wave) or the heart rate (presence of arrhythmia). In this article, we present AMELIA (AutoMatic dEtection of atriaL fIbrillation for heAlthcare), an approach that simulates the doctor’s behavior by considering both sources of information in a combined way. AMELIA is composed of two components: one integrating a Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) and the other integrating a rhythm analyzer. When the RNN reveals a heartbeat with abnormal morphology, the rhythm analyzer is activated to verify whether or not there is a simultaneous arrhythmia. AMELIA has been evaluated using well-known datasets, namely Physionet-AF and NSR-DB. The achieved results provide evidence of the potential benefits of the approach, especially regarding sensitivity. AMELIA has high potential to be used in continuous monitoring applications, where the detection of AF episodes is a fundamental and crucial activity.
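A highly simplified sketch of the two-component idea is given below, under assumed input shapes and thresholds: an LSTM-based classifier flags beats with abnormal morphology, and a check on the variability of the RR intervals confirms the presence of arrhythmia. The layer sizes, the coefficient-of-variation rhythm check, and all names are illustrative and not AMELIA's actual architecture.

# Assumed two-component sketch: morphology model plus rhythm check.
import numpy as np
import tensorflow as tf

def build_morphology_model(beat_length=200):
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(beat_length, 1)),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # P(abnormal morphology)
    ])

def rhythm_is_irregular(rr_intervals, cv_threshold=0.15):
    rr = np.asarray(rr_intervals, dtype=float)
    return (rr.std() / rr.mean()) > cv_threshold  # coefficient-of-variation check (assumption)

def af_detected(model, beat, rr_intervals):
    abnormal = model.predict(beat[np.newaxis, :, np.newaxis], verbose=0)[0, 0] > 0.5
    return abnormal and rhythm_is_irregular(rr_intervals)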
Nowadays, Computerized Decision Support Systems (CDSS) play an important role in medical support and preventative care. In those scenarios, the monitoring of biomedical data, such as the ECG signal, is fundamental. The ECG signal may reveal a variety of abnormalities or pathological conditions, such as Ischemia and Myocardial Infarction (MI), which have a significant impact on the world’s population. Both these conditions can be diagnosed by observing changes in specific sections of the ECG, such as the ST segment and/or the T-wave of heartbeats. Much effort has been devoted by the scientific community to automatically identifying ST anomalies. The main drawback of such approaches is often a trade-off between the accuracy of the classification, the robustness to noise, and the real-time responsiveness. In this work, we present RAST, a robust approach for the Real-time Accurate screening of ST segment anomalies. RAST takes as input a sequence of 10 successive heartbeats extracted from an ECG recording and provides as output the classification of the ST segment trend. We evaluated two versions of RAST, namely RAST-BINARY and RAST-TERNARY: the first capable of distinguishing only between an ST anomaly and Normal Sinus Rhythm, and the second able to distinguish between ST elevation, ST depression, and normal rhythm. Moreover, we conducted an extensive study in which we also experimented with (i) the intra- and inter-patient validation strategies and (ii) the ideal number of successive heartbeats in which to observe an anomalous episode of change in the ST segment. As a result, both RAST-BINARY and RAST-TERNARY can achieve an F1 score of 0.94 with a window of 4 heartbeats in the inter-patient validation. For the intra-patient validation, both versions achieve an F1 score of 0.73 using a longer observation window.
Different from what happens for most types of software systems, testing video games has largely remained a manual activity performed by human testers. This is mostly due to the continuous and intelligent user interaction video games require. Recently, reinforcement learning (RL) has been exploited to partially automate functional testing. RL enables training smart agents that can even achieve super-human performance in playing games, thus being suitable to explore them looking for bugs. We investigate the possibility of using RL for load testing video games. Indeed, the goal of game testing is not only to identify functional bugs, but also to examine the game's performance, such as its ability to avoid lags and keep a minimum number of frames per second (FPS) when high-demanding 3D scenes are shown on screen. We define a methodology employing RL to train an agent able to play the game as a human while also trying to identify areas of the game resulting in a drop of FPS. We demonstrate the feasibility of our approach on three games. Two of them are used as proof-of-concept, by injecting artificial performance bugs. The third one is an open-source 3D game that we load test using the trained agent showing its potential to identify areas of the game resulting in lower FPS.
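One way to picture the methodology described above is through reward shaping: the agent earns reward for making progress in the game, plus an extra bonus when it reaches states where the frame rate drops below a target, steering exploration toward performance-heavy areas. The sketch below is an assumed, simplified reward function; the actual reward design used for the trained agents may differ.

# Assumed reward shaping for load testing: progress reward plus an FPS-drop bonus.
def load_testing_reward(progress_delta, fps, target_fps=30.0, fps_bonus=1.0):
    """progress_delta: in-game progress gained this step; fps: current frames per second."""
    reward = progress_delta
    if fps < target_fps:
        # Larger bonus for more severe FPS drops, pushing exploration toward heavy scenes.
        reward += fps_bonus * (target_fps - fps) / target_fps
    return reward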
Docker is the most widespread containerization technology adopted in the DevOps workflow. Docker allows shipping applications in Docker images, along with their dependencies and execution environment. A Docker image is created using a configuration file called a Dockerfile. The literature shows that quality issues, such as violations of best practices (i.e., Dockerfile smells), are diffused among Docker artifacts. Smells can negatively impact reliability, leading to build failures, poor performance, and security issues. In addition, it is unclear to what extent developers are aware of those quality issues and what quality aspects are correlated with the adoption of a Docker image. As evaluated in the literature, composing high-quality Dockerfiles and Docker images is not a trivial task. In this research, we aim to propose approaches and techniques to assess and improve the quality of Dockerfiles and Docker images. First, starting from the resolution of Dockerfile smells, we aim to improve the internal and then the related external quality aspects that also affect the developers’ preference and the perceived quality when they adopt a Docker image. Next, we want to employ that knowledge in the automated generation of high-quality Dockerfiles and Docker images.
Voice-based virtual assistants are becoming increasingly popular. Such systems provide frameworks to developers on which they can build their own apps. End-users can interact with such apps through a Voice User Interface (VUI), which allows them to use natural language commands to perform actions. Testing such apps is far from trivial: The same command can be expressed in different ways. To support developers in testing VUIs, Deep Learning (DL)-based tools have been integrated in development environments (e.g., the Alexa Developer Console, or ADC) to generate paraphrases for the commands (seed utterances) specified by the developers. Such tools, however, generate few paraphrases that do not always cover corner cases. In this paper, we introduce VUI-UPSET, a novel approach that aims at adapting chatbot-testing approaches to VUI-testing. Both kinds of systems, indeed, provide a similar natural-language-based interface to users. We conducted an empirical study to understand how VUI-UPSET compares to existing approaches in terms of (i) correctness of the generated paraphrases, and (ii) capability of revealing bugs. Multiple authors analyzed 5,872 generated paraphrases, with a total of 13,310 manual evaluations required for such a process. Our results show that, while the DL-based tool integrated in the ADC generates a higher percentage of meaningful paraphrases compared to VUI-UPSET, VUI-UPSET generates more bug-revealing paraphrases. This allows developers to test their apps more thoroughly at the cost of discarding a higher number of irrelevant paraphrases.
Code smells are poor implementation choices applied by developers during software evolution that often lead to critical flaws or failures. Much in the same way, community smells reflect the presence of organizational and socio-technical issues within a software community that may lead to additional project costs. Recent empirical studies provide evidence that community smells are often, if not always, connected to circumstances such as code smells. In this paper, we look deeper into this connection by conducting a mixed-methods empirical study of 117 releases from 9 open-source systems. The qualitative and quantitative sides of our mixed-methods study were run in parallel and are mutually confirmative. On the one hand, we survey 162 developers of the 9 considered systems to investigate whether developers perceive a relationship between community smells and the code smells found in those projects. On the other hand, we perform a fine-grained analysis of the 117 releases of our dataset to measure the extent to which community smells impact code smell intensity (i.e., criticality). We then propose a code smell intensity prediction model that relies on both technical and community-related aspects. The results of both sides of our mixed-methods study lead to one conclusion: community-related factors contribute to the intensity of code smells. This conclusion supports the joint use of community and code smell detection as a mechanism for the joint management of technical and social problems around software development communities.
Search-based techniques have been successfully used to automate test case generation. Such approaches allocate a fixed search budget to generate test cases aiming at maximizing code coverage. The search budget plays a crucial role: given the huge search space, the higher the assigned budget, the higher the expected coverage. Code components have different structural properties that may affect the ability of search-based techniques to achieve a high coverage level. Thus, allocating a fixed search budget for all the components is not recommended, and a component-specific search budget should be preferred. However, deciding the budget to assign to a given component is not a trivial task. In this article, we introduce Budget Optimization for Testing (BOT), an approach to adaptively allocate the search budget to the classes under test. BOT requires information about the branch coverage that will be achieved on each class with a given search budget. Therefore, we also introduce BRANCHOS, an approach that predicts coverage in a budget-aware way. The results of our experiments show that (i) BRANCHOS can approximate the branch coverage achieved over time with a low error, and (ii) BOT can significantly increase the coverage achieved by a test generation tool and the effectiveness of the generated tests.
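To give an intuition of budget allocation, the sketch below greedily assigns time slices to the classes whose predicted coverage curve (e.g., produced by a BRANCHOS-like predictor) promises the largest gain. The greedy strategy, the step size, and all names are illustrative assumptions, not the actual BOT algorithm.

# Hypothetical greedy allocation of a total search budget across classes.
def allocate_budget(predicted_coverage, total_budget, step=10):
    """predicted_coverage: dict class -> function(budget_seconds) -> expected coverage."""
    allocation = {cls: 0 for cls in predicted_coverage}
    remaining = total_budget
    while remaining >= step:
        best_cls, best_gain = None, 0.0
        for cls, curve in predicted_coverage.items():
            gain = curve(allocation[cls] + step) - curve(allocation[cls])
            if gain > best_gain:
                best_cls, best_gain = cls, gain
        if best_cls is None:          # no class is expected to improve anymore
            break
        allocation[best_cls] += step
        remaining -= step
    return allocation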
Background and objective:
Equipment generally used for entertainment, such as the Microsoft Kinect, has been widely
used for postural control as well. Such systems, compared to professional motion
tracking systems, allow non-invasive and low-cost tracking. This makes
them particularly suitable for the implementation of home rehabilitation systems.
Microsoft has recently released a new version of the Kinect, namely the Azure Kinect DK,
which is meant for developers rather than consumers and has been specifically designed
for professional applications. The hardware of this new version of the
Kinect has been substantially improved compared with previous versions.
However, the accuracy of the Azure Kinect DK has not yet been evaluated in
the context of the assessment of postural control, as was done for its predecessors.
Methods:
We present a study to compare the motion traces of the Azure Kinect DK with
those of a Vicon 3D system, typically considered the gold standard for high-accuracy
motion tracking. The study involved 26 subjects performing specific functional
reach and functional balance exercises.
Results:
The results clearly indicate that the Azure Kinect DK provides very accurate
tracking of the main joints of the body for all the recordings taken during the
lateral reach movement. The Root Mean Square Error (RMSE) between the two tracking
systems is approximately 0.2 for the lateral and forward exercises, while
for the balance exercise it is around 0.47, considering the average of the results
among all the joints. The angular Mean Absolute Error is approximately in the range
of 5–15 degrees for all the upper joints, independently of the exercise. The lower
body joints show a higher angular error between the two systems. Not surprisingly,
results are much better for slow movements.
Conclusions:
The results show that the Azure Kinect DK has high potential
to be used in home rehabilitation applications, where the assessment of
postural control is a fundamental and crucial activity.
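For clarity, the two error measures reported in the Results section can be computed as sketched below on time-aligned joint trajectories and joint angles from the two systems; temporal alignment and unit conversion are assumed to be handled beforehand, and the function names are illustrative.

# Sketch of the error measures: per-joint RMSE on 3D trajectories and angular MAE.
import numpy as np

def rmse(kinect_xyz, vicon_xyz):
    """Both inputs: arrays of shape (n_frames, 3) for one joint."""
    return np.sqrt(np.mean(np.sum((kinect_xyz - vicon_xyz) ** 2, axis=1)))

def angular_mae(kinect_angles_deg, vicon_angles_deg):
    """Both inputs: arrays of joint angles in degrees over time."""
    return np.mean(np.abs(np.asarray(kinect_angles_deg) - np.asarray(vicon_angles_deg)))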
This paper presents the design and testing of a novel undershirt embedding dry bio-potential electrodes for biosignal acquisition. The undershirt is intended for monitoring the vital signs of a human subject, such as multi-lead ECG and respiration wave signals. The obtained results demonstrate that the developed undershirt can satisfy the needs for bio-potential transfer to the data acquisition system in the case of wearable applications.
The SZZ algorithm for identifying bug-inducing changes has been widely used to evaluate defect prediction techniques and to empirically investigate when, how, and by whom bugs are introduced. Over the years, researchers have proposed several heuristics to improve the SZZ accuracy, providing various implementations of SZZ. However, fairly evaluating those implementations on a reliable oracle is an open problem: SZZ evaluations usually rely on (i) the manual analysis of the SZZ output to classify the identified bug-inducing commits as true or false positives; or (ii) a golden set linking bug-fixing and bug-inducing commits. In both cases, these manual evaluations are performed by researchers with limited knowledge of the studied subject systems. Ideally, there should be a golden set created by the original developers of the studied systems. We propose a methodology to build a "developer-informed" oracle for the evaluation of SZZ variants. We use Natural Language Processing (NLP) to identify bug-fixing commits in which developers explicitly reference the commit(s) that introduced a fixed bug. This was followed by a manual filtering step aimed at ensuring the quality and accuracy of the oracle. Once built, we used the oracle to evaluate several variants of the SZZ algorithm in terms of their accuracy. Our evaluation helped us to distill a set of lessons learned to further improve the SZZ algorithm.
Deep learning (DL) techniques are gaining more and more attention in the software engineering community. They have been used to support several code-related tasks, such as automatic bug fixing and code comments generation. Recent studies in the Natural Language Processing (NLP) field have shown that the Text-To-Text Transfer Transformer (T5) architecture can achieve state-of-the-art performance for a variety of NLP tasks. The basic idea behind T5 is to first pre-train a model on a large and generic dataset using a self-supervised task (e.g., filling masked words in sentences). Once the model is pre-trained, it is fine-tuned on smaller and specialized datasets, each one related to a specific task (e.g., language translation, sentence classification). In this paper, we empirically investigate how the T5 model performs when pre-trained and fine-tuned to support code-related tasks. We pre-train a T5 model on a dataset composed of natural language English text and source code. Then, we fine-tune such a model by reusing datasets used in four previous works that used DL techniques to: (i) fix bugs, (ii) inject code mutants, (iii) generate assert statements, and (iv) generate code comments. We compared the performance of this single model with the results reported in the four original papers proposing DL-based solutions for those four tasks. We show that our T5 model, exploiting additional data for the self-supervised pre-training phase, can achieve performance improvements over the four baselines.
The number of connected medical devices that are able to acquire, analyze, or transmit health data is continuously increasing. This has enabled the rise of the Internet of Medical Things (IoMT). IoMT systems often need to process a massive amount of data. On the one hand, the colossal amount of data available allows the adoption of machine learning techniques to provide automatic diagnosis. On the other hand, it represents a problem in terms of data storage, data transmission, computational cost, and power consumption. To mitigate such problems, modern IoMT systems are adopting machine learning techniques combined with compressed sensing methods. Following this line of research, we propose a novel heartbeat morphology classifier, called RENEE, that works on compressed ECG signals. The ECG signal compression is realized by means of 1-bit quantization. We used several machine learning techniques to classify the heartbeats from compressed ECG signals. The obtained results demonstrate that RENEE exhibits results comparable to state-of-the-art methods that achieve the same goal on uncompressed ECG signals.
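The 1-bit quantization step mentioned above can be pictured with a minimal sketch: each ECG sample is reduced to a single bit indicating whether it lies above a reference level. The zero reference and the function name are assumptions for illustration; feature extraction and the classifiers used by RENEE are not shown.

# Minimal sketch of 1-bit quantization of an ECG signal.
import numpy as np

def one_bit_quantize(ecg_samples, reference=0.0):
    """Returns a binary sequence: 1 if the sample is above the reference, else 0."""
    return (np.asarray(ecg_samples, dtype=float) > reference).astype(np.uint8)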
With the spread of Internet of Medical Things (IoMT) systems, the scientific community has dedicated a lot of effort to the definition of approaches for supporting specialized staff in the early diagnosis of pathological conditions and diseases. Several approaches have been defined for the identification of arrhythmia, a pathological condition that can be detected from an electrocardiogram (ECG) trace. There exist many types of arrhythmia, and some of them have a great impact on patients in terms of worsening physical conditions or even mortality. In this work, we present NEAPOLIS, a novel approach for the accurate detection of arrhythmia conditions. NEAPOLIS takes as input a heartbeat signal, extracted from an ECG trace, and provides as output a 5-class classification of the beat, namely normal sinus rhythm and four main types of arrhythmia conditions. NEAPOLIS is based on ECG characteristics that do not need a long-term observation of the ECG for the classification of the beat. This choice makes NEAPOLIS a (near) real-time detector of arrhythmia because it allows detection within a few seconds of ECG observation. The accuracy of NEAPOLIS has been compared to that of one of the best and most recent approaches from the literature. The achieved results show that NEAPOLIS provides a more accurate detection of arrhythmia conditions.
Code reading is one of the most frequent activities in software maintenance. Such an activity aims at acquiring information from the code and, thus, it is a prerequisite for program comprehension: developers need to read the source code they are going to modify before implementing changes. As the code changes, so does its readability; however, it is not clear yet how code readability changes during software evolution. To understand how code readability changes when software evolves, we studied the history of 25 open source systems. We modeled code readability evolution by defining four states in which a file can be at a certain point in time (non-existing, other-name, readable, and unreadable). We used the gathered data to infer the probability of transitioning from one state to another. In addition, we also manually checked a significant sample of transitions to compute the performance of the state-of-the-art readability prediction model we used to calculate the transition probabilities. With this manual analysis, we found that the tool correctly classifies all the transitions in the majority of the cases, even if there is a loss of accuracy compared to the single-version readability estimation. Our results show that most source code files are created readable. Moreover, we observed that only a minority of the commits change the readability state. Finally, we manually carried out a qualitative analysis to understand what makes code unreadable and what developers do to prevent this. Using our results, we propose some guidelines (i) to reduce the risk of code readability erosion and (ii) to promote best practices that make code readable.
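The transition probabilities mentioned above can be estimated by simple frequency counting over the per-file state sequences, as in the sketch below; the state labels mirror the four states described in the abstract, while the data structures and the function name are illustrative.

# Sketch: estimate state-transition probabilities from per-file readability histories.
from collections import Counter, defaultdict

STATES = ["non-existing", "other-name", "readable", "unreadable"]

def transition_probabilities(file_state_sequences):
    """file_state_sequences: list of per-file lists of states, ordered by commit."""
    counts = defaultdict(Counter)
    for sequence in file_state_sequences:
        for current, nxt in zip(sequence, sequence[1:]):
            counts[current][nxt] += 1
    probs = {}
    for state in STATES:
        total = sum(counts[state].values())
        if total > 0:
            probs[state] = {target: counts[state][target] / total for target in STATES}
    return probs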
Android fragmentation is a well-known issue referring to the adoption of different versions in the multitude of devices supporting such an operating system. Each Android version features a set of APIs provided to developers. These APIs are subject to changes and may cause compatibility issues. To support app developers, approaches have been proposed to automatically identify API compatibility issues. CID, the state-of-the-art approach, is a data-driven solution learning how to detect those issues by analyzing the change history of Android APIs (“API side” learning). In this paper (an extension of our MSR 2019 paper), we present an alternative data-driven approach, named ACRYL. ACRYL learns from changes implemented in apps in response to API changes (“client side” learning). When comparing these two solutions on 668 apps, for a total of 11,863 snapshots, we found that there is no clear winner, since the two techniques are highly complementary, and neither of them provides comprehensive support in detecting API compatibility issues: ACRYL achieves a precision of 7.0% (28.0% when considering only the severe warnings), while CID achieves a precision of 18.4%. This calls for more research in this field, and it led us to run a second empirical study in which we manually analyzed 500 pull requests likely related to the fixing of compatibility issues, documenting the root cause behind each fixed issue. The most common causes are related to changes in the Android APIs (∼87%), while about 13% of the issues are related to external causes, such as build and distribution, dependencies, and the app itself. The provided empirical knowledge can inform the building of better tools for the detection of API compatibility issues.
Refactoring aims at improving code non-functional attributes without modifying its external behavior. Previous studies investigated the motivations behind refactoring by surveying developers. With the aim of generalizing and complementing their findings, we present a large-scale study quantitatively and qualitatively investigating why developers perform refactoring in open source projects. First, we mine 287,813 refactoring operations performed in the history of 150 systems. Using this dataset, we investigate the interplay between refactoring operations and process (e.g., previous changes/fixes) and product (e.g., quality metrics) metrics. Then, we manually analyze 551 merged pull requests implementing refactoring operations and classify the motivations behind the implemented refactorings (e.g., removal of code duplication). Our results led to (i) quantitative evidence of the relationship existing between certain process/product metrics and refactoring operations and (ii) a detailed taxonomy, generalizing and complementing the ones existing in the literature, of motivations pushing developers to refactor source code.
Atrial fibrillation (AF) is the most common type of heart arrhythmia. AF is highly associated with other cardiovascular diseases, such as heart failure and coronary artery disease, and can lead to stroke. Unfortunately, in some cases people with atrial fibrillation have no explicit symptoms and are unaware of their condition until it is discovered during a physical examination. Thus, defining highly accurate automatic approaches to detect such a pathology in the context of massive screening is considered a priority. For this reason, in recent years several approaches have been defined to automatically detect AF. These approaches are often based on machine learning techniques and, most of them, analyze the heart rhythm to make a prediction. Even if AF can be diagnosed by analyzing the rhythm, the analysis of the morphology of a heartbeat is also important. Indeed, during an AF event the P wave could be absent and fibrillation waves may appear in its place. This means that the presence of arrhythmia alone might not be enough to detect an AF event. Based on the above considerations, we previously presented MORPHYTHM, an approach that uses machine learning to combine rhythm and morphological features to identify AF events. The results we achieved in an empirical evaluation seem promising. In this paper we present an extension of MORPHYTHM, called LOCAL MORPHYTHM, aiming at further improving the detection accuracy of AF events. An empirical evaluation of LOCAL MORPHYTHM has shown significantly better results in the classification process with respect to MORPHYTHM, particularly for what concerns true positives and false negatives.
Atrial Fibrillation (AF) is a common cardiac disease which can be diagnosed by analyzing a full electrocardiogram (ECG) layout. The main features that cardiologists observe in the process of AF diagnosis are (i) the morphology of heartbeats and (ii) a simultaneous arrhythmia. In the last decades, a lot of effort has been devoted to the definition of approaches aiming to automatically detect such a pathology. The majority of AF detection approaches focus on R-R Interval (RRI) analysis, neglecting the other side of the coin, i.e., the morphology of heartbeats. In this paper, we aim at bridging this gap. First, we present some novel features that can be extracted from an ECG. Then, we combine such features with other classical rhythmic and morphological features in a machine learning based approach to improve the detection accuracy of AF events. The proposed approach, namely MORPHYTHM, has been validated on the Physionet MIT-BIH AF Database. The results of our experiment show that MORPHYTHM improves the classification accuracy of AF events by correctly classifying about 4,400 additional instances compared to the best state-of-the-art approach.
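A minimal sketch of the general idea of combining rhythmic and morphological descriptors into a single feature vector for a machine learning classifier. The specific features, the synthetic data, and the use of scikit-learn are illustrative assumptions, not the actual MORPHYTHM feature set.
import numpy as np
from sklearn.linear_model import LogisticRegression

def rhythm_features(rr_intervals):
    # Simple rhythm descriptors computed from R-R intervals (in seconds).
    rr = np.asarray(rr_intervals)
    return [rr.mean(), rr.std(), np.abs(np.diff(rr)).mean()]

def morphology_features(beat_waveform):
    # Simple morphological descriptors of a single heartbeat waveform.
    w = np.asarray(beat_waveform)
    return [w.max(), w.min(), w.argmax() / len(w), np.abs(w).sum()]

rng = np.random.default_rng(1)
X, y = [], []
for label in (0, 1):                      # 0 = non-AF, 1 = AF (toy labels)
    for _ in range(200):
        rr = rng.normal(0.8, 0.05 + 0.15 * label, size=10)
        beat = rng.normal(0, 1 + label, size=200)
        X.append(rhythm_features(rr) + morphology_features(beat))
        y.append(label)

# Rhythm and morphology features are simply concatenated before training.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("training accuracy:", clf.score(X, y))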
In the last few years, wearable devices have become increasingly important. Their usefulness mainly lies in the continuous monitoring of vital parameters and signals, such as the electrocardiogram. However, such monitoring results in an enormous amount of data which cannot be precisely analyzed manually. This calls for approaches and tools for the automatic analysis of the acquired data. In this paper we present MIPHAS, a software system devised to meet this need in a well-defined context: the monitoring of athletes during sport activities. MIPHAS is composed of several components: a smart t-shirt, an electronic component, a web application, a mobile app, and an advanced decision support system based on machine learning techniques. The latter is the core component of MIPHAS, dedicated to the automatic detection of potential anomalies during the monitoring of vital parameters.
Heart Rate (HR) is one of the most commonly used electrocardiogram (ECG) features in automatic anomaly detectors. This paper presents a preliminary study on a novel approach which, through the combination of Machine Learning (ML) and Compressed Sensing (CS), aims at retrieving vital information from a digitally compressed single-lead ECG signal. As a potential key piece of information to estimate the heart rate, this study focuses on the identification of R-peak occurrences. The study has been conducted on two different types of signal, both obtained from the compressed samples provided by a CS algorithm already available in the literature. The results demonstrate that the use of CS in combination with an ML technique is highly competitive with a state-of-the-art method working on the uncompressed ECG signal.
At ICPC 2010 we presented an empirical study to statistically analyze the equivalence of several traceability recovery methods based on Information Retrieval (IR) techniques. We experimented with the Vector Space Model (VSM), Latent Semantic Indexing (LSI), the Jensen-Shannon (JS) method, and Latent Dirichlet Allocation (LDA). Unlike previous empirical studies, we did not compare the different IR-based traceability recovery methods using only the usual precision and recall metrics. We introduced some metrics to analyze the overlap of the sets of candidate links recovered by each method. We also based our analysis on Principal Component Analysis (PCA) to analyze the orthogonality of the experimented methods. The results showed that while the accuracy of LDA was lower than that of previously used methods, LDA was able to capture some information missed by the other exploited IR methods. Instead, JS, VSM, and LSI were almost equivalent. This paved the way for the possible integration of IR-based traceability recovery methods.
Our paper was one of the first to experiment with LDA for traceability recovery. Also, the overlap metrics and PCA have been used later to compare and possibly integrate different recommendation approaches not only for traceability recovery, but also for other reverse engineering and software maintenance tasks, such as code smell detection, design pattern detection, and bug prediction.
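A toy Python sketch of the kind of analysis described above: recover candidate links with a TF-IDF vector space model, and compare the top-ranked links of two methods with a simple overlap measure. The artifacts are invented, and the second "method" (a plain term-count model) is only a stand-in for a competing IR technique.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

use_cases = ["the user logs in with a password", "the user prints a report"]
classes = ["class LoginManager validates user password",
           "class ReportPrinter formats and prints reports",
           "class Cache stores temporary objects"]

def top_links(vectorizer, k=1):
    # For each use case, keep the indices of the k most similar classes.
    m = vectorizer.fit_transform(use_cases + classes)
    sims = cosine_similarity(m[: len(use_cases)], m[len(use_cases):])
    return {i: set(sims[i].argsort()[::-1][:k]) for i in range(len(use_cases))}

vsm = top_links(TfidfVectorizer())
alt = top_links(CountVectorizer())

# Overlap of the candidate links proposed by the two methods.
overlap = sum(len(vsm[i] & alt[i]) for i in vsm) / sum(len(vsm[i]) for i in vsm)
print("overlap of top-ranked links:", overlap)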
Software development video tutorials have seen a steep increase in popularity in recent years. Their main advantage is that they thoroughly illustrate how certain technologies, programming languages, etc. are to be used. However, they come with a caveat: there is currently little support for searching and browsing their content. This makes it difficult to quickly find the useful parts in a longer video, as the only options are watching the entire video, leading to wasted time, or fast-forwarding through it, leading to missed information. We present CodeTube, an approach to mine video tutorials found on the web and enable developers to query their contents as opposed to just their metadata. The video tutorials are processed and split into coherent fragments, such that only relevant fragments are returned in response to a query. Moreover, fragments are automatically classified according to their purpose, such as introducing theoretical concepts, explaining code implementation steps, or dealing with errors. This allows developers to set filters in their search to target a specific type of video fragment they are interested in. In addition, the video fragments in CodeTube are complemented with information from other sources, such as Stack Overflow discussions, giving more context and useful information for understanding the concepts.
Code smells are symptoms of poor design and implementation choices. Previous studies empirically assessed the impact of smells on code quality and clearly indicate their negative impact on maintainability, including a higher bug-proneness of components affected by code smells. In this paper, we capture previous findings on bug-proneness to build a specialized bug prediction model for smelly classes. Specifically, we evaluate the contribution of a measure of the severity of code smells (i.e., code smell intensity) by adding it to existing bug prediction models based on both product and process metrics, and comparing the results of the new model against the baseline models. Results indicate that the accuracy of a bug prediction model increases by adding the code smell intensity as predictor. We also compare the results achieved by the proposed model with the ones of an alternative technique which considers metrics about the history of code smells in files, finding that our model works generally better. However, we observed interesting complementarities between the set of buggy and smelly classes correctly classified by the two models. By evaluating the actual information gain provided by the intensity index with respect to the other metrics in the model, we found that the intensity index is a relevant feature for both product and process metrics-based models. At the same time, the metric counting the average number of code smells in previous versions of a class considered by the alternative model is also able to reduce the entropy of the model. On the basis of this result, we devise and evaluate a smell-aware combined bug prediction model that includes product, process, and smell-related features. We demonstrate how such a model classifies bug-prone code components with an F-Measure at least 13 percent higher than the existing state-of-the-art models.
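To make the study design concrete, here is a hypothetical sketch of how an intensity feature can be appended to a baseline metric set and the two models compared. The feature names, the synthetic data, the classifier, and the evaluation setup are assumptions for illustration only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n = 500
loc = rng.integers(50, 2000, size=n)               # product metric (lines of code)
churn = rng.integers(0, 300, size=n)               # process metric (code churn)
intensity = rng.random(size=n)                     # code smell intensity in [0, 1]
buggy = (intensity + churn / 300 + rng.normal(0, 0.3, n) > 1.0).astype(int)

baseline = np.column_stack([loc, churn])
extended = np.column_stack([loc, churn, intensity])

# Compare the baseline model against the model augmented with the intensity index.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
print("baseline F1:", cross_val_score(clf, baseline, buggy, scoring="f1").mean())
print("with intensity F1:", cross_val_score(clf, extended, buggy, scoring="f1").mean())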
The market for mobile apps is getting bigger and bigger, and it is expected to be worth over 100 billion dollars in 2020. To have a chance to succeed in such a competitive environment, developers need to build and maintain high-quality apps, continuously astonishing their users with the coolest new features. Mobile app marketplaces allow users to release reviews. Although reviews are mainly aimed at recommending apps among users, they also contain precious information for developers, reporting bugs and suggesting new features. To exploit such a source of information, developers are supposed to manually read user reviews, something not doable when hundreds of them are collected per day. To help developers deal with such a task, we developed CLAP (Crowd Listener for releAse Planning), a web application able to (i) categorize user reviews based on the information they carry, (ii) cluster together related reviews, and (iii) prioritize the clusters of reviews to be implemented when planning the subsequent app release. We evaluated all the steps behind CLAP, showing its high accuracy in categorizing and clustering reviews and the meaningfulness of the recommended prioritizations. Also, given the availability of CLAP as a working tool, we assessed its applicability in industrial environments.
Online code clones are code fragments that are copied from software projects or online sources to Stack Overflow as examples. Due to an absence of a checking mechanism after the code has been copied to Stack Overflow, they can become toxic code snippets, e.g., they suffer from being outdated or violating the original software license. We present a study of online code clones on Stack Overflow and their toxicity by incorporating two developer surveys and a large-scale code clone detection. A survey of 201 high-reputation Stack Overflow answerers (33 percent response rate) showed that 131 participants (65 percent) have ever been notified of outdated code and 26 of them (20 percent) rarely or never fix the code. 138 answerers (69 percent) never check for licensing conflicts between their copied code snippets and Stack Overflow's CC BY-SA 3.0. A survey of 87 Stack Overflow visitors shows that they experienced several issues from Stack Overflow answers: mismatched solutions, outdated solutions, incorrect solutions, and buggy code. 85 percent of them are not aware of CC BY-SA 3.0 license enforced by Stack Overflow, and 66 percent never check for license conflicts when reusing code snippets. Our clone detection found online clone pairs between 72,365 Java code snippets on Stack Overflow and 111 open source projects in the curated Qualitas corpus. We analysed 2,289 non-trivial online clone candidates. Our investigation revealed strong evidence that 153 clones have been copied from a Qualitas project to Stack Overflow. We found 100 of them (66 percent) to be outdated, of which 10 were buggy and harmful for reuse. Furthermore, we found 214 code snippets that could potentially violate the license of their original software and appear 7,112 times in 2,427 GitHub projects.
Understanding software is an inherent requirement for many maintenance and evolution tasks. Without a thorough understanding of the code, developers would not be able to fix bugs or add new features timely. Measuring code understandability might be useful to guide developers in writing better code, and could also help in estimating the effort required to modify code components. Unfortunately, there are no metrics designed to assess the understandability of code snippets. In this work, we perform an extensive evaluation of 121 existing as well as new code-related, documentation-related, and developer-related metrics. We try to (i) correlate each metric with understandability and (ii) build models combining metrics to assess understandability. To do this, we use 444 human evaluations from 63 developers and we obtained a bold negative result: none of the 121 experimented metrics is able to capture code understandability, not even the ones assumed to assess apparently related quality attributes, such as code readability and complexity. While we observed some improvements when combining metrics in models, their effectiveness is still far from making them suitable for practical applications. Finally, we conducted interviews with five professional developers to understand the factors that influence their ability to understand code snippets, aiming at identifying possible new metrics.
Stack Overflow is the most popular question and answer website on computer programming with more than 2.5M users, 16M questions, and a new answer posted, on average, every five seconds. This wide availability of data led researchers to develop techniques to mine Stack Overflow posts. The aim is to find and recommend posts with information useful to developers. However, and not surprisingly, not every Stack Overflow post is useful from a developer's perspective. We empirically investigate what the characteristics of "useful" Stack Overflow posts are. The underlying assumption of our study is that posts that were used (referenced in the source code) in the past by developers are likely to be useful. We refer to these posts as leveraged posts. We study the characteristics of leveraged posts as opposed to the non-leveraged ones, focusing on community aspects (e.g., the reputation of the user who authored the post), the quality of the included code snippets (e.g., complexity), and the quality of the post's textual content (e.g., readability). Then, we use these features to build a prediction model to automatically identify posts that are likely to be leveraged by developers. Results of the study indicate that post meta-data (e.g., the number of comments received by the answer) is particularly useful to predict whether it has been leveraged or not, whereas code readability appears to be less useful. A classifier can classify leveraged posts with a precision of 65% and recall of 49% and non-leveraged ones with a precision of 95% and recall of 97%. This paves the way towards an automatic identification of "high-quality content" in Stack Overflow.
Android apps are inextricably linked to the official Android APIs. Such a strong form of dependency implies that changes introduced in new versions of the Android APIs can severely impact the apps' code, for example because of deprecated or removed APIs. In reaction to those changes, mobile app developers are expected to adapt their code and avoid compatibility issues. To support developers, approaches have been proposed to automatically identify API compatibility issues in Android apps. The state-of-the-art approach, named CiD, is a data-driven solution learning how to detect those issues by analyzing the changes in the history of Android APIs ("API side" learning). While it can successfully identify compatibility issues, it cannot recommend coding solutions. We devised an alternative data-driven approach, named ACRYL. ACRYL learns from changes implemented in other apps in response to API changes ("client side" learning). This allows not only to detect compatibility issues, but also to suggest a fix. When empirically comparing the two tools, we found that there is no clear winner, since the two approaches are highly complementary, in that they identify almost disjoint sets of API compatibility issues. Our results point to the future possibility of combining the two approaches, trying to learn detection/fixing rules on both the API and the client side.
This paper describes an innovative Internet of Medical Things (IoMT) system for implementing personalized health services. The proposed IoMT system has the following advantages with respect to the state-of-the-art systems available in the literature: (i) it fuses the data provided by several sensors (inertial measurement unit, bio-impedance, and electrocardiogram), (ii) it uses Compressed Sensing (CS) of the data prior to transmission, and (iii) it adopts distributed artificial intelligence at the edge for anomaly detection. A description of the specific features and requirements of the wearable device that will be embedded in a smart T-shirt is reported. According to the delineated requirements, an architecture for the wearable device is proposed. Finally, the contribution of Artificial Intelligence in the proposed IoMT system is discussed, aiming at identifying anomalies and supporting the decision-making process in the early diagnosis of diseases both at the level of individuals (local knowledge) and of groups of individuals (global knowledge).
Context: Code smells are suboptimal design or implementation choices made by programmers during the development of a software system that possibly lead to low code maintainability and higher maintenance costs.
Objective: Previous research mainly studied the characteristics of code smell instances affecting a source code file, while only a few studies analyzed the magnitude and effects of smell co-occurrence, i.e., the co-occurrence of different types of smells on the same code component. This paper aims at studying this phenomenon in detail.
Method: We analyzed 13 code smell types detected in 395 releases of 30 software systems to first assess the extent to which code smells co-occur, and then we analyzed (i) which code smells co-occur together, and (ii) how and why they are introduced and removed by developers.
Results: 59% of smelly classes are affected by more than one smell, and in particular there are six pairs of smell types (e.g., Message Chains and Spaghetti Code) that frequently co-occur. Furthermore, we observed that method-level code smells may be the root cause for the introduction of class-level smells. Finally, code smell co-occurrences are generally removed together, either as a consequence of other maintenance activities causing the deletion of the affected code components (with a consequent removal of the code smell instances) or as the result of major restructuring or scheduled refactoring actions.
Conclusions: Based on our findings, we argue that more research aimed at designing co-occurrence-aware code smell detectors and refactoring approaches is needed.
In recent software development and distribution scenarios, app stores are playing a major role, especially for mobile apps. On one hand, app stores allow continuous releases of app updates. On the other hand, they have become the premier point of interaction between app providers and users. After installing/updating apps, users can post reviews and provide ratings, expressing their level of satisfaction with apps, and possibly pointing out bugs or desired features. In this paper we empirically investigate—by performing a study on the evolution of 100 open source Android apps and by surveying 73 developers—to what extent app developers take user reviews into account, and whether addressing them contributes to apps’ success in terms of ratings. In order to perform the study, as well as to provide a monitoring mechanism for developers and project managers, we devised an approach, named CRISTAL, for tracing informative crowd reviews onto source code changes, and for monitoring the extent to which developers accommodate crowd requests and follow-up user reactions as reflected in their ratings. The results of our study indicate that (i) on average, half of the informative reviews are addressed, and over 75% of the interviewed developers claimed to take them into account often or very often, and that (ii) developers implementing user reviews are rewarded in terms of significantly increased user ratings.
Unreadable code could compromise program comprehension, and it could cause the introduction of bugs. Code consists mostly of natural language text, both in identifiers and comments, and it is a particular form of text. Nevertheless, the models proposed to estimate code readability take into account only structural aspects and visual nuances of source code, such as line length and alignment of characters. In this paper, we extend our previous work in which we use textual features to improve code readability models. We introduce two new textual features, and we reassess the readability prediction power of readability models on more than 600 code snippets manually evaluated, in terms of readability, by 5K+ people. We also replicate a study by Buse and Weimer on the correlation between readability and FindBugs warnings, evaluating different models on 20 software systems, for a total of 3M lines of code. The results demonstrate that (1) textual features complement other features and (2) a model containing all the features achieves a significantly higher accuracy as compared with all the other state-of-the-art models. Also, readability estimation resulting from a more accurate model, i.e., the combined model, is able to more accurately predict FindBugs warnings.
Software engineering researchers have used sentiment analysis for various purposes, such as analyzing app reviews and detecting developers' emotions. However, most existing sentiment analysis tools do not achieve satisfactory performance when used in software-related contexts, and there are not many ready-to-use datasets in this domain. To facilitate the emergence of better tools and sufficient validation of sentiment analysis techniques, we present two datasets with labeled sentiments, which are extracted from mobile app reviews and Stack Overflow discussions, respectively. The web app we created to support the labeling of the Stack Overflow dataset is also provided.
Automatically generating test cases plays an important role in reducing the time spent by developers during the testing phase. In recent years, several approaches have been proposed to tackle such a problem: amongst others, search-based techniques have been shown to be particularly promising. In this paper we describe Ocelot, a search-based tool for the automatic generation of test cases in C. Ocelot allows practitioners to write skeletons of test cases for their programs and researchers to easily implement and experiment with new approaches for automatic test-data generation. We show that Ocelot achieves higher coverage than a competitive tool in 81% of the cases. Ocelot is publicly available to support both researchers and practitioners.
Software testing is one of the most crucial tasks in the typical development process. Developers are usually required to write unit test cases for the code they implement. Since this is a time-consuming task, in recent years many approaches and tools for automatic test case generation, such as EvoSuite, have been introduced. Nevertheless, developers have to maintain and evolve tests to sustain the changes in the source code; therefore, having readable test cases is important to ease such a process. However, it is still not clear whether developers make an effort in writing readable unit tests. Therefore, in this paper, we conduct an explorative study comparing the readability of manually written test cases with the classes they test. Moreover, we deepen such analysis by looking at the readability of automatically generated test cases. Our results suggest that developers tend to neglect the readability of test cases and that automatically generated test cases are generally even less readable than manually written ones.
Sentiment analysis has been applied to various software engineering (SE) tasks, such as evaluating app reviews or analyzing developers' emotions in commit messages. Studies indicate that sentiment analysis tools provide unreliable results when used out-of-the-box, since they are not designed to process SE datasets. The silver bullet for a successful application of sentiment analysis tools to SE datasets might be their customization to the specific usage context. We describe our experience in building a software library recommender exploiting developers' opinions mined from Stack Overflow. To reach our goal, we retrained, on a set of 40k manually labeled sentences/words extracted from Stack Overflow, a state-of-the-art sentiment analysis tool exploiting deep learning. Despite such an effort- and time-consuming training process, the results were negative. We changed our focus and performed a thorough investigation of the accuracy of commonly used tools to identify the sentiment of SE-related texts. Meanwhile, we also studied the impact of different datasets on tool performance. Our results should warn the research community about the strong limitations of current sentiment analysis tools.
In recent years, social media have become increasingly important in the context of digital freelancing, by letting companies offer projects to external professionals through the Web. However, the available platforms for digital freelancing are still far from supporting the ecosystem of companies and professionals during the implementation of complex projects through an accurate definition of the required skills, roles, interdependencies, and responsibilities. This chapter presents a roundup of the available systems supporting freelancers. The goal is to identify the essential features and structure of a comprehensive social media platform able to effectively manage and support the potential of an “economy 4.0” characterized by a free and flexible circulation of highly skilled professionals who can be aggregated to support the needs of organizations.
This paper analyzes developer-related factors that could influence the likelihood that a commit induces a fix. Specifically, we focus on factors that could potentially hinder developers' ability to correctly understand the code components involved in the change to be committed: (i) the coherence of the commit (i.e., how much it is focused on a specific topic); (ii) the experience level of the developer on the files involved in the commit; and (iii) the interfering changes performed by other developers on the files involved in past commits. The results of our study indicate that "fix-inducing" commits (i.e., commits that induced a fix) are significantly less coherent than "clean" commits (i.e., commits that did not induce a fix). Surprisingly, "fix-inducing" commits are performed by more experienced developers; yet, those are the developers performing more complex changes in the system. Finally, "fix-inducing" commits have a higher number of past interfering changes as compared with "clean" commits. Our empirical study sheds light on previously unexplored factors and presents significant results that can be used to improve approaches for defect prediction.
A broken snapshot is a snapshot from a project's change history that cannot be compiled. Broken snapshots can have significant implications for researchers, as they could hinder any analysis of the past project history that requires code to be compiled. Notably, while some broken snapshots may be observable in change history repositories (e.g., due to no longer available dependencies), some of them may not necessarily have occurred during the actual development. In this paper, we systematically study the compilability of broken snapshots in 219,395 snapshots belonging to 100 Java projects from the Apache Software Foundation, all relying on Maven as an automated build tool. We investigated broken snapshots from two different perspectives: (1) how frequently they happen and (2) the likely causes behind them. The empirical results indicate that broken snapshots occur in most (96%) of the projects we studied and that they are mainly due to problems related to the resolution of dependencies. On average, only 38% of the change history of the analyzed systems is currently successfully compilable.
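A hedged sketch of how such a compilability check could be automated for a Maven project, by checking out each commit and running `mvn compile`. The repository path is a hypothetical placeholder, and the actual study's tooling, commit selection, and build options may well differ.
import subprocess

REPO = "/path/to/maven/project"          # hypothetical local clone

def run(*cmd):
    # Run a command inside the repository and capture its output.
    return subprocess.run(cmd, cwd=REPO, capture_output=True, text=True)

# List commits from oldest to newest on the current branch.
commits = run("git", "rev-list", "--reverse", "HEAD").stdout.split()

results = {}
for sha in commits:
    run("git", "checkout", "--force", sha)
    build = run("mvn", "-q", "compile")
    results[sha] = (build.returncode == 0)   # snapshot compiles iff Maven exits with 0

compilable = sum(results.values())
print(f"{compilable}/{len(results)} snapshots compile successfully")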
In the last decades, the research community has devoted a lot of effort to the definition of approaches able to predict the defect proneness of source code files. Such approaches exploit several predictors (e.g., product or process metrics) and use machine learning classifiers to classify classes as buggy or not buggy, or to provide the likelihood that a class will exhibit a fault in the near future. The empirical evaluation of all these approaches indicated that there is no machine learning classifier providing the best accuracy in every context, highlighting interesting complementarities among them. For these reasons, ensemble methods have been proposed to estimate the bug-proneness of a class by combining the predictions of different classifiers. Following this line of research, in this paper we propose an adaptive method, named ASCI (Adaptive Selection of Classifiers in bug prediction), able to dynamically select, among a set of machine learning classifiers, the one which best predicts the bug-proneness of a class based on its characteristics. An empirical study conducted on 30 software systems indicates that ASCI exhibits higher performance than five different classifiers used independently and combined with the majority voting ensemble method.
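A minimal sketch of the adaptive idea: train several base classifiers, then train a meta-model that, given the characteristics of a class, predicts which base classifier to trust for that class. The data, the base classifiers, and the meta-model are illustrative assumptions, not the exact ASCI configuration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(600, 5))                       # toy class-level metrics
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)        # toy bug labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
bases = [GaussianNB().fit(X_tr, y_tr),
         DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)]

# Meta-label: index of a base classifier that predicts the training instance correctly.
meta_y = np.array([next((i for i, b in enumerate(bases)
                         if b.predict(x.reshape(1, -1))[0] == t), 0)
                   for x, t in zip(X_tr, y_tr)])
meta = LogisticRegression(max_iter=1000).fit(X_tr, meta_y)

# Prediction: delegate each instance to the classifier chosen by the meta-model.
chosen = meta.predict(X_te)
pred = np.array([bases[c].predict(x.reshape(1, -1))[0] for c, x in zip(chosen, X_te)])
print("adaptive accuracy:", (pred == y_te).mean())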
Context: Since the mid-2000s, numerous recommendation systems based on text retrieval (TR) have been proposed to support software engineering (SE) tasks such as concept location, traceability link recovery, code reuse, impact analysis, and so on. The success of TR-based solutions highly depends on the query submitted, which is either formulated by the developer or automatically extracted from software artifacts.
Aim: We aim at predicting the quality of queries submitted to TR-based approaches in SE. This can lead to benefits for developers and for the quality of software systems alike. For example, knowing when a query is poorly formulated can save developers the time and frustration of analyzing irrelevant search results. Instead, they could focus on reformulating the query. Also, knowing if an artifact used as a query leads to irrelevant search results may uncover underlying problems in the query artifact itself.
Method: We introduce an automatic query quality prediction approach for software artifact retrieval by adapting NL-inspired solutions to their use on software data. We present two applications and evaluations of the approach in the context of concept location and traceability link recovery, where TR has been applied most often in SE. For concept location, we use the approach to determine if the list of retrieved code elements is likely to contain code relevant to a particular change request or not, in which case, the queries are good candidates for reformulation. For traceability link recovery, the queries represent software artifacts. In this case, we use the query quality prediction approach to identify artifacts that are hard to trace to other artifacts and may therefore have a low intrinsic quality for TR-based traceability link recovery.
Results: For concept location, the evaluation shows that our approach is able to correctly predict the quality of queries in 82% of the cases, on average, using very little training data. In the case of traceability recovery, the proposed approach is able to detect hard to trace artifacts in 74% of the cases, on average.
Conclusions: The results of our evaluation on applications for concept location and traceability link recovery indicate that our approach can be used to predict the results of a TR-based approach by assessing the quality of the text query. This can lead to saved effort and time, as well as the identification of software artifacts that may be difficult to trace using TR.
Release notes document corrections, enhancements, and, in general, changes that were implemented in a new release of a software project. They are usually created manually and may include hundreds of different items, such as descriptions of new features, bug fixes, structural changes, new or deprecated APIs, and changes to software licenses. Thus, producing them can be a time-consuming and daunting task. This paper describes ARENA (Automatic RElease Notes generAtor), an approach for the automatic generation of release notes. ARENA extracts changes from the source code, summarizes them, and integrates them with information from versioning systems and issue trackers. ARENA was designed based on the manual analysis of 990 existing release notes. In order to evaluate the quality of the release notes automatically generated by ARENA, we performed four empirical studies involving a total of 56 participants (48 professional developers and eight students). The obtained results indicate that the generated release notes are very good approximations of the ones manually produced by developers and often include important information that is missing in the manually created release notes.
Program understanding plays a pivotal role in software maintenance and evolution: a deep understanding of code is the stepping stone for most software-related activities, such as bug fixing or testing. Being able to measure the understandability of a piece of code might help in estimating the effort required for a maintenance activity, in comparing the quality of alternative implementations, or even in predicting bugs. Unfortunately, there are no existing metrics specifically designed to assess the understandability of a given code snippet. In this paper we perform a first step in this direction, by studying the extent to which several types of metrics computed on code, documentation and developers correlate with code understandability. To perform such an investigation we ran a study with 46 participants who were asked to understand eight code snippets each. We collected a total of 324 evaluations aiming at assessing the perceived understandability, the actual level of understanding and the time needed to understand a code snippet. Our results demonstrate that none of the (existing and new) metrics we considered is able to capture code understandability, not even the ones assumed to assess quality attributes strongly related with it, such as code readability and complexity.
Meaningless identifiers as well as inconsistent use of identifiers in the source code might hinder code readability and result in increased software maintenance efforts. Over the past years, effort has been devoted to promoting a consistent usage of identifiers across different parts of a system through approaches exploiting static code analysis and Natural Language Processing (NLP). These techniques have been evaluated in small-scale studies, but it is unclear how they compare to each other and how they complement each other. Furthermore, a full-fledged larger empirical evaluation is still missing. We aim at bridging this gap. We asked the developers of five Java projects to assess the meaningfulness of the recommendations generated by three different techniques, two already existing in the literature (one exploiting static analysis, one using NLP) and a novel one proposed in this paper. With a total of 922 rename refactorings evaluated, this is, to the best of our knowledge, the largest empirical study conducted to assess and compare rename refactoring tools aiming at promoting a consistent use of identifiers. Our study sheds light on the current state-of-the-art in rename refactoring recommenders, and indicates directions for future work.
This tool demonstration describes GEMMA, a tool aimed at optimizing the colors used by Android apps, with the goal of reducing the energy consumption on (AM)OLED displays while keeping the user interface visually attractive for end-users. GEMMA has been developed as a distributed architecture to ensure scalability. It is composed of a Web-based client and processing nodes that are capable of analyzing multiple requests (apps) concurrently. The underlying approach makes use of power models, color theory, and multi-objective genetic algorithms. The empirical evaluation of GEMMA indicated its ability to reduce energy consumption while producing color combinations pleasant enough for the users. Also, a qualitative analysis conducted with app developers highlighted the potential applicability of the tool in an industrial context.
The promise of recommender systems is to provide intelligent support to developers during their programming tasks. Such support ranges from suggesting program entities to taking into account pertinent Q&A pages. However, current recommender systems limit the context analysis to change history and developers' activities in the IDE, without considering what a developer has already consulted or perused, e.g., by performing searches from the Web browser. Given the faceted nature of many programming tasks, and the incompleteness of the information provided by a single artifact, several heterogeneous resources are required to obtain the broader picture needed by a developer to accomplish a task. We present Libra, a holistic recommender system. It supports the process of searching and navigating the information needed by constructing a holistic meta-information model of the resources perused by a developer, analyzing their semantic relationships, and augmenting the web browser with a dedicated interactive navigation chart. The quantitative and qualitative evaluation of Libra provides evidence that a holistic analysis of a developer's information context can indeed offer comprehensive and contextualized support to information navigation and retrieval during software development.
Refactoring aims at improving the internal structure of a software system without changing its external behavior. Previous studies empirically assessed, on the one hand, the benefits of refactoring in terms of code quality and developers' productivity, and on the other hand, the underlying reasons that push programmers to apply refactoring. Results achieved in the latter investigations indicate that, besides personal motivation such as the responsibility concerned with code authorship, refactoring is mainly performed as a consequence of changes in the requirements rather than driven by software quality. However, these findings were derived by surveying developers, and no software repository study has been carried out to corroborate them. To bridge this gap, we provide a quantitative investigation of the relationship between different types of code changes (i.e., Fault Repairing Modification, Feature Introduction Modification, and General Maintenance Modification) and 28 different refactoring types coming from 3 open source projects. Results showed that developers tend to apply a higher number of refactoring operations aimed at improving maintainability and comprehensibility of the source code when fixing bugs. Instead, when new features are implemented, more complex refactoring operations are performed to improve code cohesion. Most of the time, the underlying reasons behind the application of such refactoring operations are represented by the presence of duplicate code or previously introduced self-admitted technical debt.
Static analysis tools are often used by software developers to enable early detection of potential faults, vulnerabilities, and code smells, or to assess the source code adherence to coding standards and guidelines. Also, their adoption within Continuous Integration (CI) pipelines has been advocated by researchers and practitioners. This paper studies the usage of static analysis tools in 20 Java open source projects hosted on GitHub and using Travis CI as continuous integration infrastructure. Specifically, we investigate (i) which tools are being used and how they are configured for the CI, (ii) what types of issues make the build fail or raise warnings, and (iii) whether, how, and after how long broken builds and warnings are resolved. Results indicate that in the analyzed projects build breakages due to static analysis tools are mainly related to adherence to coding standards, and there is also some attention to missing licenses. Build failures related to tools identifying potential bugs or vulnerabilities occur less frequently, and in some cases such tools are activated in a "softer" mode, without making the build fail. Also, the study reveals that build breakages due to static analysis tools are quickly fixed by actually solving the problem, rather than by disabling the warning, and are often properly documented.
Previous research demonstrated how code smells (i.e., symptoms of the presence of poor design or implementation choices) threaten software maintainability. Moreover, some studies showed that their interaction has a stronger negative impact on the ability of developers to comprehend and enhance the source code when compared to cases in which a single code smell instance affects a code element (i.e., a class or a method). While such studies analyzed the effect of the co-presence of multiple smells from the developers' perspective, little knowledge regarding which code smell types tend to co-occur in the source code is currently available. Indeed, previous papers on smell co-occurrence have been conducted on a small number of code smell types or on small datasets, thus possibly missing important relationships. To corroborate and possibly enlarge the knowledge on the phenomenon, in this paper we provide a large-scale replication of previous studies, taking into account 13 code smell types on a dataset composed of 395 releases of 30 software systems. Code smell co-occurrences have been captured by using association rule mining, an unsupervised learning technique able to discover frequent relationships in a dataset. The results highlighted some expected relationships, but also shed light on co-occurrences missed by previous research in the field.
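A small sketch of the co-occurrence analysis under simplifying assumptions: given, for each class, the set of smells it exhibits, compute the support and confidence of pairwise rules, which is the essence of association rule mining restricted to pairs. The example smell sets are invented.
from itertools import permutations

# Hypothetical smell sets detected per class.
smelly_classes = [
    {"Message Chains", "Spaghetti Code"},
    {"Message Chains", "Spaghetti Code", "Long Method"},
    {"Long Method"},
    {"Message Chains"},
]

n = len(smelly_classes)
smells = sorted(set().union(*smelly_classes))

for a, b in permutations(smells, 2):
    both = sum(1 for s in smelly_classes if a in s and b in s)
    only_a = sum(1 for s in smelly_classes if a in s)
    if both == 0:
        continue
    support = both / n            # fraction of classes exhibiting both smells
    confidence = both / only_a    # fraction of classes with smell a that also have b
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")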
Developers often require knowledge beyond the one they possess, which boils down to asking co-workers for help or consulting additional sources of information, such as Application Programming Interfaces (API) documentation, forums, and Q&A websites. However, it requires time and energy to formulate one’s problem, peruse and process the results. We propose a novel approach that, given a context in the Integrated Development Environment (IDE), automatically retrieves pertinent discussions from Stack Overflow, evaluates their relevance using a multi-faceted ranking model, and, if a given confidence threshold is surpassed, notifies the developer. We have implemented our approach in PROMPTER, an Eclipse plug-in. PROMPTER was evaluated in two empirical studies. The first study was aimed at evaluating PROMPTER’s ranking model and involved 33 participants. The second study was conducted with 12 participants and aimed at evaluating PROMPTER’s usefulness when supporting developers during development and maintenance tasks. Since PROMPTER uses “volatile information” crawled from the web, we also replicated Study I after one year to assess the impact of such a “volatility” on recommenders like PROMPTER. Our results indicate that (i) PROMPTER recommendations were positively evaluated in 74% of the cases on average, (ii) PROMPTER significantly helps developers to improve the correctness of their tasks by 24% on average, but also (iii) 78% of the provided recommendations are “volatile” and can change at one year of distance. While PROMPTER revealed to be effective, our studies also point out issues when building recommenders based on information available on online forums.
Refactoring, and in particular, remodularization operations can be performed to repair the design of a software system and remove the erosion caused by software evolution. Various approaches have been proposed to support developers during the remodularization of a software system. Most of these approaches are based on the underlying assumption that developers pursue an optimal balance between quality metrics—such as cohesion and coupling—when modularizing the classes of their systems. Thus, a remodularization recommender proposes a solution that implicitly provides a (near) optimal balance between such quality metrics. However, there is still a lack of empirical evidence that such a balance is the desideratum by developers. This paper aims at bridging this gap by analyzing both objectively and subjectively the aforementioned phenomenon. Specifically, we present the results of (i) a large study analyzing the modularization quality, in terms of package cohesion and coupling, of 100 open source systems, and (ii) a survey conducted with 34 developers aimed at understanding the driving factors they consider when performing modularization tasks. The results achieved have been used to distill a set of lessons learned that might be considered to design more effective remodularization recommenders.
Developers often require knowledge beyond the one they possess, which boils down to asking co-workers for help or consulting additional sources of information, such as Application Programming Interfaces (API) documentation, forums, and Q&A websites. However, it requires time and energy to formulate one's problem, peruse and process the results. We propose a novel approach that, given a context in the Integrated Development Environment (IDE), automatically retrieves pertinent discussions from StackOverflow, evaluates their relevance using a multi-faceted ranking model, and, if a given confidence threshold is surpassed, notifies the developer. We have implemented our approach in Prompter, an Eclipse plug-in. Prompter was evaluated in two empirical studies. The first study was aimed at evaluating Prompter's ranking model and involved 33 participants. The second study was conducted with 12 participants and aimed at evaluating Prompter's usefulness when supporting developers during development and maintenance tasks. Since Prompter uses "volatile information" crawled from the web, we also replicated Study I after one year to assess the impact of such a "volatility" on recommenders like Prompter. Our results indicate that (i) Prompter recommendations were positively evaluated in 74% of the cases on average, (ii) Prompter significantly helps developers to improve the correctness of their tasks by 24% on average, but also (iii) 78% of the provided recommendations are "volatile" and can change at one year of distance. While Prompter revealed to be effective, our studies also point out issues when building recommenders based on information available on online forums.
Information Retrieval (IR) approaches are nowadays used to support various software engineering tasks, such as feature location, traceability link recovery, clone detection, or refactoring. However, previous studies showed that an inadequate instantiation of an IR technique and of the underlying process could significantly affect the performance of such approaches in terms of precision and recall. This paper proposes the use of Genetic Algorithms (GAs) to automatically configure and assemble an IR process for software engineering tasks. The approach (named GA-IR) determines the (near) optimal solution to be used for each stage of the IR process, i.e., term extraction, stop word removal, stemming, indexing, and the calibration of an algebraic IR method. We applied GA-IR to two different software engineering tasks, namely traceability link recovery and identification of duplicate bug reports. The results of the study indicate that GA-IR outperforms approaches previously published in the literature, and that it does not significantly differ from an ideal upper bound that could be achieved by a supervised and combinatorial approach.
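A hedged sketch of the underlying idea: represent an IR configuration as a tuple of choices and let a simple genetic algorithm search for a configuration maximizing a fitness function. The configuration space, the placeholder fitness, and the GA parameters below are invented for illustration and do not reproduce GA-IR's actual internals.
import random

random.seed(0)

# Hypothetical configuration space for an IR pipeline.
SPACE = {
    "stopwords": [True, False],
    "stemming":  ["none", "porter", "snowball"],
    "weighting": ["tf", "tf-idf", "boolean"],
    "k":         [50, 100, 200, 300],        # e.g., number of LSI dimensions
}
KEYS = list(SPACE)

def random_config():
    return {k: random.choice(v) for k, v in SPACE.items()}

def fitness(cfg):
    # Placeholder standing in for whatever internal quality measure the approach optimizes.
    return (cfg["stopwords"] + (cfg["stemming"] != "none")
            + (cfg["weighting"] == "tf-idf") + cfg["k"] / 300)

def crossover(a, b):
    return {k: random.choice([a[k], b[k]]) for k in KEYS}

def mutate(cfg, rate=0.2):
    return {k: (random.choice(SPACE[k]) if random.random() < rate else v)
            for k, v in cfg.items()}

population = [random_config() for _ in range(20)]
for _ in range(30):                                   # generations
    population.sort(key=fitness, reverse=True)
    parents = population[:10]
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(10)]
    population = parents + children

print("best configuration:", max(population, key=fitness))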
In the context of testing of Object-Oriented (OO) software systems, researchers have recently proposed search based approaches to automatically generate whole test suites by considering simultaneously all targets (e.g., branches) defined by the coverage criterion (multi-target approach). The goal of whole suite approaches is to overcome the problem of wasting search budget that iterative single-target approaches (which iteratively generate test cases for each target) can encounter in case of infeasible targets. However, whole suite approaches have not been implemented and experimented in the context of procedural programs. In this paper we present OCELOT (Optimal Coverage sEarch-based tooL for sOftware Testing), a test data generation tool for C programs which implements both a state-of-the-art whole suite approach and an iterative single-target approach designed for a parsimonious use of the search budget. We also present an empirical study conducted on 35 open-source C programs to compare the two approaches implemented in OCELOT. The results indicate that the iterative single-target approach provides a higher efficiency while achieving the same or an even higher level of coverage than the whole suite approach.
Test smells have been defined as poorly designed tests and, as reported by recent empirical studies, their presence may negatively affect comprehension and maintenance of test suites. Despite this, there are no available automated tools to support identification and repair of test smells. In this paper, we firstly investigate developers' perception of test smells in a study with 19 participants. The results show that developers generally do not recognize (potentially harmful) test smells, highlighting that automated tools for identifying such smells are much needed. However, to build effective tools, deeper insights into the test smells phenomenon are required. To this aim, we conducted a large-scale empirical investigation aimed at analyzing (i) when test smells occur in source code, (ii) what their survivability is, and (iii) whether their presence is associated with the presence of design problems in production code (code smells). The results indicate that test smells are usually introduced when the corresponding test code is committed in the repository for the first time, and they tend to remain in a system for a long time. Moreover, we found various unexpected relationships between test and code smells. Finally, we show how the results of this study can be used to build effective automated tools for test smell detection and refactoring.
Code smells are symptoms of poor design and implementation choices. Previous studies empirically assessed the impact of smells on code quality and clearly indicate their negative impact on maintainability, including a higher bug-proneness of components affected by code smells. In this paper, we capture previous findings on bug-proneness to build a specialized bug prediction model for smelly classes. Specifically, we evaluate the contribution of a measure of the severity of code smells (i.e., code smell intensity) by adding it to existing bug prediction models and comparing the results of the new model against the baseline model. Results indicate that the accuracy of a bug prediction model increases by adding the code smell intensity as predictor. We also evaluate the actual gain provided by the intensity index with respect to the other metrics in the model, including the ones used to compute the code smell intensity. We observe that the intensity index is much more important as compared to other metrics used for predicting the bugginess of smelly classes.
Test case generation tools that optimize code coverage have been extensively investigated. Recently, researchers have suggested adding other non-coverage criteria, such as memory consumption or readability, to increase the practical usefulness of generated tests. In this paper, we observe that test code quality metrics, and test cohesion and coupling in particular, are valuable candidates as additional criteria. Indeed, tests with low cohesion and/or high coupling have been shown to have a negative impact on future maintenance activities. In an exploratory investigation we show that most generated tests are indeed affected by poor test code quality. For this reason, we incorporate cohesion and coupling metrics into the main loop of a search-based algorithm for test case generation. Through an empirical study we show that our approach is not only able to generate tests that are more cohesive and less coupled, but can also (i) increase branch coverage up to 10% when enough time is given to the search and (ii) result in statistically shorter tests.
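A sketch, under invented metric definitions, of how secondary quality criteria could be folded into a test-generation fitness function alongside branch coverage. The weights, the toy representation of a test suite, and the cohesion/coupling formulas are assumptions, not the metrics used in the paper.
def branch_coverage(suite, total_branches):
    covered = set().union(*(t["covered_branches"] for t in suite))
    return len(covered) / total_branches

def avg_cohesion(suite):
    # Toy cohesion: fraction of a test's statements touching its focal class.
    return sum(t["stmts_on_focal_class"] / t["stmts"] for t in suite) / len(suite)

def avg_coupling(suite):
    # Toy coupling: number of distinct production classes used per test.
    return sum(len(t["classes_used"]) for t in suite) / len(suite)

def fitness(suite, total_branches, w_cov=1.0, w_coh=0.2, w_coup=0.2):
    # Coverage remains the primary objective; quality metrics act as secondary terms.
    return (w_cov * branch_coverage(suite, total_branches)
            + w_coh * avg_cohesion(suite)
            - w_coup * avg_coupling(suite) / 10)

suite = [
    {"covered_branches": {1, 2, 3}, "stmts": 10, "stmts_on_focal_class": 8,
     "classes_used": {"Account"}},
    {"covered_branches": {3, 4}, "stmts": 12, "stmts_on_focal_class": 6,
     "classes_used": {"Account", "Bank", "Logger"}},
]
print("fitness:", fitness(suite, total_branches=10))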
Code reading is one of the most frequent activities in software maintenance; before implementing changes, it is necessary to fully understand source code often written by other developers. Thus, readability is a crucial aspect of source code that might significantly influence program comprehension effort. In general, models used to estimate software readability take into account only structural aspects of source code, e.g., line length and number of comments. However, code is a particular form of text; therefore, a code readability model should not ignore the textual aspects of source code encapsulated in identifiers and comments. In this paper, we propose a set of textual features that could be used to measure code readability. We evaluated the proposed textual features on 600 code snippets manually evaluated (in terms of readability) by 5K+ people. The results show that the proposed features complement classic structural features when predicting readability judgments. Consequently, a code readability model based on a richer set of features, including the ones proposed in this paper, achieves significantly better accuracy than all the state-of-the-art readability models.
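As an illustration of the kind of textual cue such a model can exploit, here is a minimal sketch of one plausible feature: the lexical overlap between comment terms and identifier terms in a snippet. This is an illustrative feature of my own construction, not the exact feature set proposed in the paper.

```python
import re

def comment_identifier_consistency(snippet: str) -> float:
    """Illustrative textual feature: overlap between comment terms and
    identifier terms (split on camelCase), a proxy for how well the
    comments describe the code they accompany."""
    line_comments = re.findall(r"//([^\n]*)", snippet)
    block_comments = re.findall(r"/\*(.*?)\*/", snippet, re.S)
    comment_text = " ".join(line_comments + block_comments)
    code_text = re.sub(r"/\*.*?\*/|//[^\n]*", " ", snippet, flags=re.S)
    identifiers = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code_text)
    id_terms = {t.lower() for ident in identifiers
                for t in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", ident)}
    comment_terms = {w.lower() for w in re.findall(r"[A-Za-z]+", comment_text)}
    if not comment_terms:
        return 0.0
    return len(comment_terms & id_terms) / len(comment_terms)

print(comment_identifier_consistency(
    "// parse the user name\nString userName = tokenizer.nextToken();"))  # -> 0.5
```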
In this paper, we present TACO (Textual Analysis for Code Smell Detection), a technique that exploits textual analysis to detect a family of smells of different nature and different levels of granularity. We run TACO on 10 open source projects, comparing its performance with existing smell detectors purely based on structural information extracted from code components. The analysis of the results indicates that TACO’s precision ranges between 67% and 77%, while its recall ranges between 72% and 84%. Also, TACO often outperforms alternative structural approaches confirming, once again, the usefulness of information that can be derived from the textual part of code components.
The role of software testing in the software development process is widely recognized as a key activity for successful projects. This is the reason why, in the last decade, several automatic unit test generation tools have been proposed, focusing particularly on high code coverage. Despite the effort spent by the research community, there is still a lack of empirical investigation aimed at analyzing the characteristics of the produced test code. Indeed, while some studies inspected the effectiveness and the usability of these tools in practice, it is still unknown whether the generated test code is maintainable. In this paper, we conducted a large-scale empirical study in order to analyze the diffusion of bad design solutions, namely test smells, in automatically generated unit test classes. Results of the study show a high diffusion of test smells as well as the frequent co-occurrence of different types of design problems. Finally, we found that all test smells have a strong positive correlation with structural characteristics of the systems, such as size or number of classes.
When facing difficulties solving a task at hand, and knowledgeable colleagues are not available, developers resort to offline and online resources, e.g., official documentation, third-party tutorials, mailing lists, and Q&A websites. These, however, need to be found, read, and understood, which takes its toll in terms of time and mental energy. A more immediate and accessible resource is represented by video tutorials found on the web, which in recent years have seen a steep increase in popularity. Nonetheless, videos are an intrinsically noisy data source, and finding the right piece of information might be even more cumbersome than using the previously mentioned resources. We present CodeTube, an approach which mines video tutorials found on the web and enables developers to query their contents. The video tutorials are processed and split into coherent fragments, so that only fragments related to the query are returned. As an added benefit, the relevant video fragments are complemented with information from additional sources, such as Stack Overflow discussions. The results of two studies to assess CodeTube indicate that video tutorials, if appropriately processed, represent a useful, yet still under-utilized source of information for software development.
Developers have to constantly improve their apps by fixing critical bugs and implementing the most desired features in order to gain shares in the continuously growing and competitive market of mobile apps. A precious source of information to plan such activities is represented by the reviews left by users on the app store. However, in order to exploit such information, developers need to manually analyze these reviews. This is hardly doable when, as frequently happens, the app receives hundreds of reviews per day. In this paper we introduce CLAP (Crowd Listener for releAse Planning), a thorough solution to (i) categorize user reviews based on the information they carry (e.g., bug reporting), (ii) cluster together related reviews (e.g., all reviews reporting the same bug), and (iii) automatically prioritize the clusters of reviews to be implemented when planning the subsequent app release. We evaluated all the steps behind CLAP, showing its high accuracy in categorizing and clustering reviews and the meaningfulness of the recommended prioritizations. Also, given the availability of CLAP as a working tool, we assessed its practical applicability in industrial environments.
Nowadays developers heavily rely on sources of informal documentation, including Q&A forums, slides, or video tutorials, the latter being particularly useful to provide introductory notions for a piece of technology. The current practice is that developers have to browse sources individually, which in the case of video tutorials is cumbersome, as they are lengthy and cannot be searched based on their contents. We present CodeTube, a Web-based recommender system that analyzes the contents of video tutorials and is able to provide, given a query, cohesive and self-contained video fragments, along with links to relevant Stack Overflow discussions. CodeTube relies on a combination of textual analysis and image processing applied on video tutorial frames and speech transcripts to split videos into cohesive fragments, index them and identify related Stack Overflow discussions.
Previous studies have investigated the reasons behind refactoring operations performed by developers, and proposed methods and tools to recommend refactorings based on quality metric profiles, or on the presence of poor design and implementation choices, i.e., code smells. Nevertheless, the existing literature lacks observations about the relations between metrics/code smells and the refactoring operations performed by developers. In other words, the characteristics of code components pushing developers to refactor them are still unknown. This paper aims at bridging this gap by analyzing which code characteristics trigger developers' refactoring attention. Specifically, we mined the evolution history of three Java open source projects to investigate whether developers' refactoring activities occur on code components for which certain indicators, such as quality metrics or the presence of smells as detected by tools, suggest there might be need for refactoring operations. Results indicate that, more often than not, quality metrics do not show a clear relationship with refactoring. In other words, refactoring operations performed by developers generally focus on code components for which quality metrics do not suggest there might be need for refactoring. We also observed that code components having a high change-proneness attract more refactoring operations aimed at improving code readability. Finally, 42% of refactoring operations are performed by developers on code smells. However, the effectiveness of such operations is quite low; only 7% of the performed operations actually remove the code smells from the affected class.
In this paper we formalize the defect prediction problem as a multi-objective optimization problem. Specifically, we propose an approach, coined as MODEP (Multi-Objective DEfect Predictor), based on multi-objective forms of machine learning techniques (specifically, logistic regression and decision trees) trained using a genetic algorithm. The multi-objective approach allows software engineers to choose predictors achieving a specific compromise between the number of likely defect-prone classes, or the number of defects that the analysis would likely discover (effectiveness), and the LOC to be analyzed/tested (which can be considered a proxy of the cost of code inspection). Results of an empirical evaluation on 10 datasets from the PROMISE repository indicate the quantitative superiority of MODEP with respect to single-objective predictors, and with respect to a trivial baseline ranking classes by size in ascending or descending order. Also, MODEP outperforms an alternative approach for cross-project prediction, based on local prediction upon clusters of similar classes.
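A minimal sketch of the multi-objective view described above: each candidate predictor (or decision threshold) is scored by inspection cost (LOC to analyze) and effectiveness, and only the non-dominated points are kept for the engineer to choose from. This is purely illustrative and is not MODEP's genetic algorithm; the example numbers are invented.

```python
def pareto_front(solutions):
    """solutions: list of (cost_loc, effectiveness) pairs.
    Keep the non-dominated ones: no other solution is both at most as costly
    and at least as effective, with at least one strict improvement."""
    front = []
    for c, e in solutions:
        dominated = any((c2 <= c and e2 >= e) and (c2 < c or e2 > e)
                        for c2, e2 in solutions)
        if not dominated:
            front.append((c, e))
    return sorted(front)

# Example: three thresholds of a hypothetical predictor.
print(pareto_front([(10_000, 30), (25_000, 55), (26_000, 50)]))
# (26_000, 50) is dominated by (25_000, 55) and is discarded.
```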
Software ecosystems consist of multiple software projects, often interrelated by means of dependency relations. When one project undergoes changes, other projects may decide to upgrade their dependency. For example, a project could use a new version of a component from another project because the latter has been enhanced or subject to some bug-fixing activities. In this paper we study the evolution of dependencies between projects in the Java subset of the Apache ecosystem, consisting of 147 projects, for a period of 14 years, resulting in 1,964 releases. Specifically, we investigate (i) how dependencies between projects evolve over time when the ecosystem grows, (ii) what product and process factors can likely trigger dependency upgrades, (iii) how developers discuss the needs and risks of such upgrades, and (iv) what the likely impact of upgrades on client projects is. The study results, qualitatively confirmed by observations made by analyzing the developers' discussions, indicate that when a new release of a project is issued, it triggers an upgrade when the new release includes major changes (e.g., new features/services) as well as a large number of bug fixes. Instead, developers are reluctant to perform an upgrade when some APIs are removed. The impact of upgrades is generally low, unless it is related to frameworks/libraries used in crosscutting concerns. Results of this study can support the understanding of the library/component upgrade phenomenon, and provide the basis for a new family of recommenders aimed at supporting developers in the complex (and risky) activity of managing library/component upgrades within their software projects.
Bad code smells have been defined as indicators of potential problems in source code. Techniques to identify and mitigate bad code smells have been proposed and studied. Recently, bad test code smells (test smells for short) have been put forward as a kind of bad code smell specific to tests, such as unit tests. What has been missing is an empirical investigation into the prevalence and impact of bad test code smells. Two studies aimed at providing this missing empirical data are presented. The first study finds that there is a high diffusion of test smells in both open source and industrial software systems, with 86% of JUnit tests exhibiting at least one test smell and six tests having six distinct test smells. The second study provides evidence that test smells have a strong negative impact on program comprehension and maintenance. Highlights from this second study include the finding that comprehension is 30% better in the absence of test smells.
Code smells are symptoms of poor design and implementation choices that may hinder code comprehension, and possibly increase change- and fault-proneness. While most of the detection techniques just rely on structural information, many code smells are intrinsically characterized by how code elements change over time. In this paper, we propose HIST (Historical Information for Smell deTection), an approach exploiting change history information to detect instances of five different code smells, namely Divergent Change, Shotgun Surgery, Parallel Inheritance, Blob, and Feature Envy. We evaluate HIST in two empirical studies. The first, conducted on twenty open source projects, aimed at assessing the accuracy of HIST in detecting instances of the code smells mentioned above. The results indicate that the precision of HIST ranges between 72% and 86%, and its recall ranges between 58% and 100%. Also, results of the first study indicate that HIST is able to identify code smells that cannot be identified by competitive approaches solely based on code analysis of a single system's snapshot. Then, we conducted a second study aimed at investigating to what extent the code smells detected by HIST (and by competitive code analysis techniques) reflect developers' perception of poor design and implementation choices. We involved twelve developers of four open source projects, who recognized more than 75% of the code smell instances identified by HIST as actual design/implementation problems.
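To illustrate the kind of change-history heuristic such an approach builds on, below is a minimal sketch of a Shotgun Surgery style check: a class is flagged when one of its methods repeatedly co-changes with methods spread across many other classes. The commit representation and the thresholds are assumptions for illustration; HIST's actual detection rules are more refined.

```python
from collections import defaultdict

def shotgun_surgery_candidates(commits, min_other_classes=3, min_cochanges=5):
    """commits: list of sets of (class_name, method_name) changed together.
    Flag classes having a method that frequently co-changes with methods
    belonging to many distinct other classes (illustrative thresholds)."""
    cochange = defaultdict(lambda: defaultdict(int))  # (class, method) -> other class -> count
    for changed in commits:
        for cls, meth in changed:
            for other_cls, _ in changed:
                if other_cls != cls:
                    cochange[(cls, meth)][other_cls] += 1
    flagged = set()
    for (cls, _), partners in cochange.items():
        strong_partners = [c for c, n in partners.items() if n >= min_cochanges]
        if len(strong_partners) >= min_other_classes:
            flagged.add(cls)
    return flagged
```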
The mobile apps market is one of the fastest growing areas in information technology. To gain market share, developers must pay attention to building robust and reliable apps. In fact, users easily get frustrated by repeated failures, crashes, and other bugs; hence, they abandon some apps in favor of their competition. In this paper we investigate how the fault- and change-proneness of APIs used by Android apps relates to their success, estimated as the average rating provided by the users to those apps. First, in a study conducted on 5,848 (free) apps, we analyzed how the ratings that an app had received correlated with the fault- and change-proneness of the APIs such app relied upon. After that, we surveyed 45 professional Android developers to assess (i) to what extent developers experienced problems when using APIs, and (ii) how much they felt these problems could be the cause of unfavorable user ratings. The results of our studies indicate that apps having high user ratings use APIs that are less fault- and change-prone than the APIs used by low-rated apps. Also, most of the interviewed Android developers observed, in their development experience, a direct relationship between problems experienced with the adopted APIs and the users' ratings that their apps received.
A way to reduce the cost of regression testing consists of selecting or prioritizing subsets of test cases from a test suite according to some criteria. Besides greedy algorithms, cost-cognizant additional greedy algorithms, multi-objective optimization algorithms, and Multi-Objective Genetic Algorithms (MOGAs) have also been proposed to tackle this problem. However, previous studies have shown that there is no clear winner between greedy algorithms and MOGAs, and that their combination does not necessarily produce better results. In this paper we show that the optimality of MOGAs can be significantly improved by diversifying the solutions (subsets of the test suite) generated during the search process. Specifically, we introduce a new MOGA, coined as DIV-GA (DIversity based Genetic Algorithm), based on the mechanisms of orthogonal design and orthogonal evolution, which increase diversity by injecting new orthogonal individuals during the search process. Results of an empirical study conducted on eleven programs show that DIV-GA outperforms both greedy algorithms and traditional MOGAs from the optimality point of view. Moreover, the solutions (subsets of the test suite) provided by DIV-GA are able to detect more faults than those of the other algorithms, while keeping the same test execution cost.
This paper presents the results of an empirical study aimed at comparing the support provided by ER and UML class diagrams during the maintenance of data models. We performed one controlled experiment and two replications focusing on comprehension activities (the first activity in the maintenance process), and another controlled experiment on modification activities related to the implementation of given change requests. The results achieved were analyzed at a fine-grained level, aiming at comparing the support given by each single building block of the two notations. Such an analysis is used to identify weaknesses (i.e., building blocks that are not easy to comprehend) in a notation and/or can justify the need for preferring ER or UML for data modeling. The analysis revealed that UML class diagrams generally provided better support for both comprehension and modification activities performed on data models as compared to ER diagrams. Nevertheless, the former have some weaknesses related to three building blocks, i.e., multi-value attribute, composite attribute, and weak entity. These findings suggest that an extension of UML class diagrams should be considered to overcome these weaknesses and improve the support provided by UML class diagrams during the maintenance of data models.
The importance of human-related factors in the introduction of bugs has recently been the subject of a number of empirical studies. However, these observations have not yet been captured in bug prediction models, which simply exploit product metrics or process metrics based on the number and type of changes or on the number of developers working on a software component. Previous studies have demonstrated that focused developers are less prone to introducing defects than non-focused developers. According to this observation, software components changed by focused developers should also be less error-prone than software components changed by less focused developers. In this paper we capture this observation by measuring the structural and semantic scattering of the changes performed by the developers working on a software component, and use these two measures to build a bug prediction model. Such a model has been evaluated on five open source systems and compared with two competitive prediction models: the first exploits the number of developers working on a code component in a given time period as a predictor, while the second is based on the concept of code change entropy. The achieved results show the superiority of our model with respect to the two competitive approaches, and the complementarity of the defined scattering measures with respect to standard predictors commonly used in the literature.
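A minimal sketch of the intuition behind a structural scattering measure: the farther apart, in the package tree, the files a developer touches in a time window, the more scattered her changes. The distance definition and the aggregation below are illustrative assumptions, not the exact formulas from the paper.

```python
from itertools import combinations
from statistics import mean

def package_distance(path_a: str, path_b: str) -> int:
    """Number of package steps separating two files, based on their paths."""
    a, b = path_a.split("/")[:-1], path_b.split("/")[:-1]
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return (len(a) - shared) + (len(b) - shared)

def structural_scattering(changed_files):
    """Average pairwise package distance of the files changed by a developer."""
    if len(changed_files) < 2:
        return 0.0
    return mean(package_distance(a, b) for a, b in combinations(changed_files, 2))

print(structural_scattering([
    "org/app/ui/LoginView.java",
    "org/app/ui/HomeView.java",
    "org/app/net/HttpClient.java",
]))
```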
Nowadays software applications, and especially mobile apps, undergo frequent release updates through app stores. After installing/updating apps, users can post reviews and provide ratings, expressing their level of satisfaction with the apps, and possibly pointing out bugs or desired features. In this paper we show, by performing a study on 100 Android apps, how applications addressing user reviews increase their success in terms of rating. Specifically, we devise an approach, named CRISTAL, for tracing informative crowd reviews onto source code changes, and for monitoring the extent to which developers accommodate crowd requests and follow-up user reactions as reflected in their ratings. The results indicate that developers implementing user reviews are rewarded in terms of ratings. This poses the need for specialized recommendation systems aimed at analyzing informative crowd reviews and prioritizing the feedback to be satisfied in order to increase the apps' success.
The wide diffusion of mobile devices has motivated research towards optimizing the energy consumption of software systems, including apps, targeting such devices. Besides efforts aimed at dealing with various kinds of energy bugs, the adoption of Organic Light-Emitting Diode (OLED) screens has motivated research towards reducing energy consumption by choosing an appropriate color palette. Whilst past research in this area aimed at optimizing energy while keeping an acceptable level of contrast, this paper proposes an approach, named GEMMA (Gui Energy Multi-objective optiMization for Android apps), for generating color palettes using a multi-objective optimization technique, which produces color solutions optimizing energy consumption and contrast while using colors consistent with the original color palette. An empirical evaluation that we performed on 25 Android apps not only demonstrated significant improvements in terms of the three different objectives, but also confirmed that in most cases users still perceived the chosen colors as attractive. Finally, for several apps we interviewed the original developers, who in some cases expressed the intent to adopt the proposed color palette, whereas in other cases they pointed out directions for future improvements.
Text Retrieval (TR) approaches have been used to leverage the textual information contained in software artifacts to address a multitude of software engineering tasks. However, TR approaches need to be configured properly in order to lead to good results. Current approaches for automatic TR configuration in SE configure a single TR approach and then use it for all possible queries that can be formulated. In this paper, we show that such a configuration strategy leads to suboptimal results and propose QUEST, the first approach bringing TR configuration selection to the query level. QUEST recommends the best TR configuration for a given query, based on a supervised learning approach which determines the TR configuration that performs the best for each query based on its properties. We evaluated QUEST in the context of feature and bug localization, using a dataset with more than 1,000 queries. We found that QUEST is able to recommend one of the top three TR configurations for a query with a 69% accuracy, on average. We compared the results obtained with the configurations recommended by QUEST for every query with those obtained using a single TR configuration for all queries in a system and in the entire dataset. We found that using QUEST we obtain better results than with any of the considered TR configurations.
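A minimal sketch of the query-level configuration idea: extract simple properties from each query and train a classifier that predicts which TR configuration performed best for similar queries. The features, the labels, and the learner below are hypothetical illustrations, not necessarily those used by QUEST.

```python
from sklearn.ensemble import RandomForestClassifier

def query_features(query: str):
    terms = query.split()
    return [len(terms),                                       # query length
            sum(len(t) for t in terms) / max(len(terms), 1),  # average term length
            sum(t.isupper() for t in terms)]                  # acronym-like terms

# Hypothetical training data: (query, best configuration) pairs obtained offline.
train_queries = ["NullPointerException on login", "update user profile picture",
                 "LDAP auth fails", "refactor export to CSV"]
train_best_cfg = ["VSM-tfidf", "LSI-200", "VSM-tfidf", "LSI-200"]

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit([query_features(q) for q in train_queries], train_best_cfg)

print(model.predict([query_features("ArrayIndexOutOfBoundsException in parser")]))
```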
Code smells are symptoms of poor design and implementation choices that may hinder code comprehension and possibly increase the change- and fault-proneness of source code. Several techniques have been proposed in the literature for detecting code smells. These techniques are generally evaluated by comparing their accuracy on a set of detected candidate code smells against a manually-produced oracle. Unfortunately, such comprehensive sets of annotated code smells are not available in the literature, with only a few exceptions. In this paper we contribute (i) a dataset of 243 instances of five types of code smells identified from 20 open source software projects, (ii) a systematic procedure for validating code smell datasets, (iii) LANDFILL, a Web-based platform for sharing code smell datasets, and (iv) a set of APIs for programmatically accessing LANDFILL's contents. Anyone can contribute to LANDFILL by (i) improving existing datasets (e.g., adding missing instances of code smells, flagging possibly incorrectly classified instances), and (ii) sharing and posting new datasets. LANDFILL is available at www.sesa.unisa.it/landfill/, while the video demonstrating its features in action is available at http://www.sesa.unisa.it/tools/landfill.jsp.
Software evolution often leads to the degradation of software design quality. In Object-Oriented (OO) systems, this often results in packages that are hard to understand and maintain, as they group together heterogeneous classes with unrelated responsibilities. In such cases, state-of-the-art re-modularization tools solve the problem by proposing a new organization of the existing classes into packages. However, as indicated by recent empirical studies, such approaches require changing thousands of lines of code to implement the new recommended modularization. In this demo, we present the implementation of an Extract Package refactoring approach in ARIES (Automated Refactoring In EclipSe), a tool supporting refactoring operations in Eclipse. Unlike state-of-the-art approaches, ARIES automatically identifies and removes single low-cohesion packages from software systems, which represent localized design flaws in the package organization, with the aim of incrementally improving the overall quality of the software modularization.
In past and recent years, the issues related to managing technical debt have received significant attention from researchers in both industry and academia. Several factors contribute to technical debt. One of these is represented by code bad smells, i.e., symptoms of poor design and implementation choices. While the repercussions of smells on code quality have been empirically assessed, there is still only anecdotal evidence on when and why bad smells are introduced. To fill this gap, we conducted a large empirical study over the change history of 200 open source projects from different software ecosystems and investigated when bad smells are introduced by developers, and the circumstances and reasons behind their introduction. Our study required the development of a strategy to identify smell-introducing commits, the mining of over 0.5M commits, and the manual analysis of 9,164 of them (i.e., those identified as smell-introducing). Our findings mostly contradict the common wisdom that smells are introduced during evolutionary tasks. In the light of our results, we also call for the need to develop a new generation of recommendation systems aimed at properly planning smell refactoring activities.
Code examples are small source code fragments whose purpose is to illustrate how a programming language construct, an API, or a specific function/method works. Since code examples are not always available in the software documentation, researchers have proposed techniques to automatically extract them from existing software or to mine them from developer discussions. In this paper we propose MUSE (Method USage Examples), an approach for mining and ranking actual code examples that show how to use a specific method. MUSE combines static slicing (to simplify examples) with clone detection (to group similar examples), and uses heuristics to select and rank the best examples in terms of reusability, understandability, and popularity. MUSE has been empirically evaluated using examples mined from six libraries, by performing three studies involving a total of 140 developers to: (i) evaluate the selection and ranking heuristics, (ii) gather their perception of the usefulness of the selected examples, and (iii) perform specific programming tasks using the MUSE examples. The results indicate that MUSE selects and ranks examples close to how humans do, that most of the code examples (82%) are perceived as useful, and that they actually help when performing programming tasks.
Anti-patterns are poor solutions to recurring design problems. They occur in object-oriented systems when developers unwillingly introduce them while designing and implementing the classes of their systems. Several empirical studies have highlighted that anti-patterns have a negative impact on the comprehension and maintainability of software systems. Consequently, their identification has recently received more attention from both researchers and practitioners, who have proposed various approaches to detect them. This chapter discusses the approaches proposed in the literature. In addition, from the analysis of the state of the art, we will (i) derive a set of guidelines for building and evaluating recommendation systems supporting the detection of anti-patterns, and (ii) discuss some problems that are still open, to trace future research directions in the field. For this reason, the chapter provides support both to researchers, who are interested in comprehending the results achieved so far in the identification of anti-patterns, and to practitioners, who are interested in adopting a tool to identify anti-patterns in their software systems.
During software evolution the internal structure of the system undergoes continuous modifications. These continuous changes push away the source code from its original design, often reducing its quality, including class cohesion. In this paper we propose a method for automating the Extract Class refactoring. The proposed approach analyzes (structural and semantic) relationships between the methods in a class to identify chains of strongly related methods. The identified method chains are used to define new classes with higher cohesion than the original class, while preserving the overall coupling between the new classes and the classes interacting with the original class. The proposed approach has been first assessed in an artificial scenario in order to calibrate the parameters of the approach. The data was also used to compare the new approach with previous work. Then it has been empirically evaluated on real Blobs from existing open source systems in order to assess how good and useful the proposed refactoring solutions are considered by software engineers and how well the proposed refactorings approximate refactorings done by the original developers. We found that the new approach outperforms a previously proposed approach and that developers find the proposed solutions useful in guiding refactorings.
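To make the chain-building step concrete, here is a minimal sketch: methods are nodes, edges connect pairs whose combined structural/semantic similarity exceeds a threshold, and each connected component becomes a candidate new class. The `similarity` callable and the threshold are placeholders for the combined measure and calibration used by the approach.

```python
from collections import defaultdict

def method_chains(methods, similarity, threshold=0.5):
    """methods: list of method identifiers; similarity: placeholder callable
    returning a combined structural/semantic score in [0, 1].
    Returns the connected components of the 'strongly related' graph, i.e.
    the chains that would seed the extracted classes."""
    adjacency = defaultdict(set)
    for i, m1 in enumerate(methods):
        for m2 in methods[i + 1:]:
            if similarity(m1, m2) >= threshold:
                adjacency[m1].add(m2)
                adjacency[m2].add(m1)
    chains, seen = [], set()
    for m in methods:
        if m in seen:
            continue
        stack, component = [m], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(adjacency[current] - component)
        seen |= component
        chains.append(sorted(component))
    return chains
```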
To support program comprehension, software artifacts can be labeled, for example within software visualization tools, with a set of representative words, hereby referred to as labels. Such labels can be obtained using various approaches, including Information Retrieval (IR) methods or other simple heuristics. They provide a bird's-eye view of the source code, allowing developers to quickly look over software components and make more informed decisions on which parts of the source code they need to analyze in detail. However, few empirical studies have been conducted to verify whether the extracted labels make sense to software developers. This paper investigates (i) to what extent various IR techniques and other simple heuristics overlap with (and differ from) labeling performed by humans; (ii) what kinds of source code terms humans use when labeling software artifacts; and (iii) what factors, in particular what characteristics of the artifacts to be labeled, influence the performance of automatic labeling techniques. We conducted two experiments in which we asked a group of students (38 in total) to label 20 classes from two Java software systems, JHotDraw and eXVantage. Then, we analyzed to what extent the words identified with automated techniques, including Vector Space Models, Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), as well as customized heuristics extracting words from specific source code elements, overlap with those identified by humans. Results indicate that, in most cases, simpler automatic labeling techniques, based on the use of words extracted from class and method names as well as from class comments, better reflect human-based labeling. Clustering-based approaches (LSI and LDA) are instead more worthwhile for source code artifacts having a high verbosity, as well as for artifacts requiring more effort to be manually labeled. The obtained results help to define guidelines on how to build effective automatic labeling techniques, and provide some insights on the actual usefulness of automatic labeling techniques during program comprehension tasks.
During software maintenance and evolution the internal structure of the software system undergoes continuous changes. These modifications drift the source code away from its original design, thus deteriorating its quality, including cohesion and coupling of classes. Several refactoring methods have been proposed to overcome this problem. In this paper we propose a novel technique to identify Move Method refactoring opportunities and remove the Feature Envy bad smell from source code. Our approach, coined as Methodbook, is based on relational topic models (RTM), a probabilistic technique for representing and modeling topics, documents (in our case methods) and known relationships among these. Methodbook uses RTM to analyze both structural and textual information gleaned from software to better support move method refactoring. We evaluated Methodbook in two case studies. The first study has been executed on six software systems to analyze if the move method operations suggested by Methodbook help to improve the design quality of the systems as captured by quality metrics. The second study has been conducted with eighty developers that evaluated the refactoring recommendations produced by Methodbook. The achieved results indicate that Methodbook provides accurate and meaningful recommendations for move method refactoring operations.
Source code lexicon plays a paramount role in software quality: poor lexicon can lead to poor comprehensibility and even increase software fault-proneness. For this reason, renaming a program entity, i.e., altering the entity identifier, is an important activity during software evolution. Developers rename when they feel that the name of an entity is not (anymore) consistent with its functionality, or when such a name may be misleading. A survey that we performed with 71 developers suggests that 39 percent perform renaming from a few times per week to almost every day and that 92 percent of the participants consider that renaming is not straightforward. However, despite the cost that is associated with renaming, renamings are seldom if ever documented; for example, less than 1 percent of the renamings in the five programs that we studied were documented. This explains why participants largely agree on the usefulness of automatically documenting renamings. In this paper we propose REnaming Program ENTities (REPENT), an approach to automatically document, i.e., detect and classify, identifier renamings in source code. REPENT detects renamings based on a combination of source code differencing and data flow analyses. Using a set of natural language tools, REPENT classifies renamings into the different dimensions of a taxonomy that we defined. Using the documented renamings, developers will be able to, for example, look up methods that are part of the public API (as they impact client applications), or look for inconsistencies between the name and the implementation of an entity that underwent a high-risk renaming (e.g., towards the opposite meaning). We evaluate the accuracy and completeness of REPENT on the evolution history of five open-source Java programs. The study indicates a precision of 88 percent and a recall of 92 percent. In addition, we report an exploratory study investigating and discussing how identifiers are renamed in the five programs, according to our taxonomy.
Oftentimes, during software maintenance the original program modularization decays, thus reducing its quality. One of the main reasons for such architectural erosion is suboptimal placement of source-code classes in software packages. To alleviate this issue, we propose an automated approach to help developers improve the quality of software modularization. Our approach analyzes underlying latent topics in source code as well as structural dependencies to recommend (and explain) refactoring operations aiming at moving a class to a more suitable package. The topics are acquired via Relational Topic Models (RTM), a probabilistic topic modeling technique. The resulting tool, coined as R3 (Rational Refactoring via RTM), has been evaluated in two empirical studies. The results of the first study conducted on nine software systems indicate that R3 provides a coupling reduction from 10% to 30% among the software modules. The second study with 62 developers confirms that R3 is able to provide meaningful recommendations (and explanations) for move class refactoring. Specifically, more than 70% of the recommendations were considered meaningful from a functional point of view.
Test suites are a valuable source of up-to-date documentation, as developers continuously modify them to reflect changes in the production code and preserve an effective regression suite. While maintaining traceability links between unit tests and the classes under test can be useful to selectively retest code after a change, the value of having traceability links goes far beyond this potential saving. One key use is to help developers better comprehend the dependencies between tests and classes and maintain consistency during refactoring. Despite its importance, test-to-code traceability is not common in software development and, when needed, traceability information has to be recovered during software development and evolution. We propose an advanced approach, named SCOTCH+ (Source code and COncept based Test to Code traceability Hunter), to support the developer during the identification of links between unit tests and tested classes. Given a test class, represented by a JUnit class, the approach first exploits dynamic slicing to identify a set of candidate tested classes. Then, external and internal textual information associated with the classes retrieved by slicing is analyzed to refine this set of classes and identify the final set of candidate tested classes. The external information is derived from the analysis of the class name, while the internal information is derived from identifiers and comments. The approach is evaluated on five software systems. The results indicate that the accuracy of the proposed approach far exceeds that of the leading techniques found in the literature.
Context: The intensive human effort needed to manually manage traceability information has increased the interest in using semi-automated traceability recovery techniques. In particular, Information Retrieval (IR) techniques have been largely employed in the last ten years to partially automate the traceability recovery process. Aim: Previous studies mainly focused on the analysis of the performances of IR-based traceability recovery methods and several enhancing strategies have been proposed to improve their accuracy. Very few papers investigate how developers (i) use IR-based traceability recovery tools and (ii) analyse the list of suggested links to validate correct links or discard false positives. We focus on this issue and suggest exploiting link count information in IR-based traceability recovery tools to improve the performances of the developers during a traceability recovery process. Method: Two empirical studies have been conducted to evaluate the usefulness of link count information. The two studies involved 135 University students that had to perform (with and without link count information) traceability recovery tasks on two software project repositories. Then, we evaluated the quality of the recovered traceability links in terms of links correctly and erroneously traced by the students. Results: The results achieved indicate that the use of link count information significantly increases the number of correct links identified by the participants. Conclusions: The results can be used to derive guidelines on how to effectively use traceability recovery approaches and tools proposed in the literature.
This paper introduces ARENA (Automatic RElease Notes generAtor), an approach for the automatic generation of release notes. ARENA extracts changes from the source code, summarizes them, and integrates them with information from versioning systems and issue trackers. It was designed based on the manual analysis of 1,000 existing release notes. To evaluate the quality of the ARENA release notes, we performed three empirical studies involving a total of 53 participants (45 professional developers and 8 students). The results indicate that the ARENA release notes are very good approximations of those produced by the developers and often include important information that is missing in the manually produced release notes.
Refactoring aims at restructuring existing source code when undisciplined development activities have deteriorated its comprehensibility and maintainability. There exist various approaches for suggesting refactoring opportunities, based on different sources of information, e.g., structural, semantic, and historical. In this paper we claim that an additional source of information for identifying refactoring opportunities, sometimes orthogonal to the ones mentioned above, is team development activity. When the activity of a team working on common modules is not aligned with the current design structure of a system, it would be possible to recommend appropriate refactoring operations - e.g., extract class/method/package - to adjust the design according to the teams' activity patterns. Results of a preliminary study - conducted in the context of extract class refactoring - show the feasibility of the approach, and also suggest that this new refactoring dimension can be complemented with others to build better refactoring recommendation tools.
Developers often consult different sources of information, like Application Programming Interface (API) documentation, forums, Q&A websites, etc., with the aim of gathering additional knowledge for the programming task at hand. The process of searching and identifying valuable pieces of information requires developers to spend time and energy in formulating the right queries, assessing the returned results, and integrating the obtained knowledge into the code base. All of this is often done manually. We present Prompter, a plug-in for the Eclipse IDE which automatically searches and identifies relevant Stack Overflow discussions, evaluates their relevance given the code context in the IDE, and notifies the developer if and only if a user-defined confidence threshold is surpassed.
In the last decade several catalogues have been defined to characterize bad code smells, i.e., symptoms of poor design and implementation choices. On top of such catalogues, researchers have defined methods and tools to automatically detect and/or remove bad smells. Nevertheless, there is an ongoing debate regarding the extent to which developers perceive bad smells as serious design problems. Indeed, there seems to be a gap between theory and practice, i.e., between what is believed to be a problem (theory) and what is actually a problem (practice). This paper presents a study aimed at providing empirical evidence on how developers perceive bad smells. In this study, we showed developers code entities, belonging to three systems, that were affected or not affected by bad smells, and we asked them to indicate whether the code contains a potential design problem and, if so, the nature and severity of the problem. The study involved both original developers from the three projects and outsiders, namely industrial developers and Master's students. The results provide insights into characteristics of bad smells not yet sufficiently explored. Also, our findings could guide future research on approaches for the detection and removal of bad smells.
Developers contributing to open source projects spontaneously group into "emerging" teams, reflected by the messages exchanged over mailing lists, issue trackers, and other communication means. Previous studies suggested that such teams somewhat mirror the software modularity. This paper empirically investigates how, when a project evolves, emerging teams re-organize themselves, e.g., by splitting or merging. We relate the evolution of teams to the files they change, to investigate whether teams split to work on cohesive groups of files. Results of this study, conducted on the evolution history of four open source projects, namely Apache httpd, Eclipse JDT, Netbeans, and Samba, provide indications of what happens in the project when teams reorganize. Specifically, we found that emerging team splits imply working on more cohesive groups of files, and emerging team merges imply working on groups of files that are cohesive from a structural perspective. Such indications serve to better understand the evolution of software projects. More importantly, the observation of how emerging teams change can serve to suggest software remodularization actions.
The growing number of questions related to mobile development in StackOverflow highlights an increasing interest of software developers in mobile programming. For the Android platform, 213,836 questions were tagged with Android-related labels in StackOverflow between July 2008 and August 2012. This paper aims at investigating how changes occurring to Android APIs trigger questions and activity in StackOverflow, and whether this is particularly true for certain kinds of changes. Our findings suggest that Android developers usually have more questions when the behavior of APIs is modified. In addition, deleting public methods from APIs is a trigger for questions that are (i) more discussed and of major interest for the community, and (ii) posted by more experienced developers. In general, results of this paper provide important insights about the use of social media to learn about changes in software ecosystems, and establish solid foundations for building new recommenders for notifying developers/managers about important changes and recommending them relevant crowdsourced solutions.
Energy consumption of mobile applications is nowadays a hot topic, given the widespread use of mobile devices. The high demand for features and improved user experience, given the available powerful hardware, tends to increase the apps' energy consumption. However, excessive energy consumption in mobile apps could also be a consequence of energy-greedy hardware, bad programming practices, or particular API usage patterns. We present the largest to date quantitative and qualitative empirical investigation into the categories of API calls and usage patterns that, in the context of the Android development framework, exhibit particularly high energy consumption profiles. By using a hardware power monitor, we measure the energy consumption of method calls when executing typical usage scenarios in 55 mobile apps from different domains. Based on the collected data, we mine and analyze energy-greedy APIs and usage patterns. We zoom in and discuss the cases where either the anomalous energy consumption is unavoidable or where it is due to suboptimal usage or choice of APIs. Finally, we synthesize our findings into actionable knowledge and recipes for developers on how to reduce energy consumption while using certain categories of Android APIs and patterns.
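A minimal sketch of the aggregation step implied above: given a power trace sampled by the monitor and a method-call trace with timestamps, energy is summed over each call interval and attributed to the corresponding API. The trace formats, field names, and the constant-sampling assumption are illustrative, not the paper's actual pipeline.

```python
from collections import defaultdict

def energy_per_api(power_samples, call_trace):
    """power_samples: list of (timestamp_s, power_watts), sorted by time and
    assumed to be roughly equally spaced. call_trace: list of
    (api_name, start_s, end_s). Returns joules attributed to each API."""
    if len(power_samples) < 2:
        return {}
    dt = power_samples[1][0] - power_samples[0][0]  # sampling interval
    totals = defaultdict(float)
    for api, start, end in call_trace:
        energy = sum(p * dt for t, p in power_samples if start <= t < end)
        totals[api] += energy
    return dict(totals)

print(energy_per_api(
    power_samples=[(0.000, 1.2), (0.001, 1.5), (0.002, 2.8), (0.003, 2.9)],
    call_trace=[("Camera.open", 0.0015, 0.0035), ("SharedPreferences.edit", 0.0, 0.001)],
))
```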
Developers often require knowledge beyond the one they possess, which often boils down to consulting sources of information like Application Programming Interface (API) documentation, forums, Q&A websites, etc. Knowing what to search for and how is non-trivial, and developers spend time and energy to formulate their problems as queries and to peruse and process the results. We propose a novel approach that, given a context in the IDE, automatically retrieves pertinent discussions from Stack Overflow, evaluates their relevance, and, if a given confidence threshold is surpassed, notifies the developer about the available help. We have implemented our approach in Prompter, an Eclipse plug-in. Prompter has been evaluated through two studies. The first was aimed at evaluating the devised ranking model, while the second was conducted to evaluate the usefulness of Prompter.
Extract Class refactoring (ECR) is used to divide large classes with low cohesion into smaller, more cohesive classes. However, splitting a class might result in increased coupling in the system due to new dependencies between the extracted classes. Thus, ECR requires that a software engineer identifies a trade-off between cohesion and coupling. Such a trade-off may be difficult to identify manually because of the high complexity of the class to be refactored. In this paper, we present an approach based on game theory to identify refactoring solutions that provide a compromise between the desired increment in cohesion and the undesired increment in coupling. The results of an empirical evaluation indicate that the approach identifies meaningful ECRs from a developer's point of view.
Existing defect prediction models use product or process metrics and machine learning methods to identify defect-prone source code entities. Different classifiers (e.g., linear regression, logistic regression, or classification trees) have been investigated in the last decade. The results achieved so far are sometimes conflicting and do not show a clear winner. In this paper we present an empirical study aimed at statistically analyzing the equivalence of different defect predictors. We also propose a combined approach, coined as CODEP (COmbined DEfect Predictor), that employs the classifications provided by different machine learning techniques to improve the detection of defect-prone entities. The study was conducted on 10 open source software systems and in the context of cross-project defect prediction, which represents one of the main challenges in the defect prediction field. The statistical analysis of the results indicates that the investigated classifiers are not equivalent and that they can complement each other. This is also confirmed by the superior prediction accuracy achieved by CODEP when compared to stand-alone defect predictors.
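A minimal sketch of the combination idea: the (cross-validated) outputs of several base classifiers become the features of a second-level model. The base learners and the meta-learner below are illustrative choices, not necessarily the exact configuration used by CODEP.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_predict

def combined_defect_predictor(X_train, y_train, X_test):
    base_models = [LogisticRegression(max_iter=1000),
                   DecisionTreeClassifier(max_depth=3, random_state=0),
                   GaussianNB()]
    # Out-of-fold probabilities on the training data avoid leaking labels.
    meta_train = np.column_stack([
        cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")[:, 1]
        for m in base_models])
    meta_test = np.column_stack([
        m.fit(X_train, y_train).predict_proba(X_test)[:, 1] for m in base_models])
    # A second-level classifier combines the base predictions.
    meta = LogisticRegression(max_iter=1000).fit(meta_train, y_train)
    return meta.predict(meta_test)
```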
During its lifecycle, the internal structure of a software system undergoes continuous modifications. These changes push the source code away from its original design, often reducing its quality. In such cases, refactoring techniques can be applied to improve the readability and reduce the complexity of source code, as well as to improve the architecture and provide better software extensibility. Despite its advantages, performing refactoring in large and nontrivial software systems might be very challenging. Thus, a lot of effort has been devoted to the definition of automatic or semi-automatic approaches to support developers during software refactoring. Many of the proposed techniques aim at recommending refactoring operations. In this chapter, we present guidelines on how to build such recommendation systems and how to evaluate them. We also highlight some of the challenges that exist in the field, pointing toward future research directions.
Software evolution is an effort-prone activity, and requires developers to make complex and difficult decisions. This entails the development of automated approaches to support various software evolution-related tasks, for example aimed at suggesting refactoring or remodularization actions. Finding a solution to these problems is intrinsically NP-hard, and exhaustive approaches are not viable due to the size and complexity of many software projects. Therefore, during recent years, several software-evolution problems have been formulated as optimization problems, and resolved with meta-heuristics. This chapter overviews how search-based optimization techniques can support software engineers in a number of software evolution tasks. For each task, we illustrate how the problem can be encoded as a search-based optimization problem, and how meta-heuristics can be used to solve it. Where possible, we refer to some tools that can be used to deal with such tasks.
Recently, different methods and tools have been proposed to automate or semi-automate test-to-code traceability recovery. Among these, the Slicing and Coupling based Test to Code trace Hunter (SCOTCH) exploits slicing and conceptual coupling to identify the classes tested by a JUnit test. However, until now the evaluation of test-to-code traceability recovery methods has been limited to experiments assessing their tracing accuracy, rather than the actual support these methods provide to a software engineer during traceability recovery tasks. Indeed, a research method or tool has a better chance of being transferred to practitioners if it is supported by empirical evidence. In this paper, we present the results of two controlled experiments carried out to evaluate the support given by SCOTCH during traceability recovery, when compared with other traceability recovery methods. The results show that SCOTCH is able to suggest a higher number of correct links with higher accuracy, thus considerably improving the performance of software engineers during test-to-code traceability recovery tasks.
Changes during software evolution and poor design decisions often lead to packages that are hard to understand and maintain, because they usually group together classes with unrelated responsibilities. One way to improve such packages is to decompose them into smaller, more cohesive packages. The difficulty lies in the fact that most definitions and interpretations of cohesion are rather vague and the multitude of measures proposed by researchers usually capture only one aspect of cohesion. We propose a new technique for automatic re-modularization of packages, which uses structural and semantic measures to decompose a package into smaller, more cohesive ones. The paper presents the new approach as well as an empirical study, which evaluates the decompositions proposed by the new technique. The results of the evaluation indicate that the decomposed packages have better cohesion without a deterioration of coupling and the re-modularizations proposed by the tool are also meaningful from a functional point of view.
One of the most successful applications of textual analysis in software engineering is the use of Information Retrieval (IR) methods to reconstruct traceability links between software artifacts. Unfortunately, because of the limitations of both the humans developing the artifacts and the IR techniques, any IR-based traceability recovery method fails to retrieve some of the correct links, while it also retrieves links that are not correct. This limitation has posed challenges for researchers, who have proposed several methods to improve the accuracy of IR-based traceability recovery by removing the "noise" in the textual content of software artifacts (e.g., by removing common words or increasing the importance of critical terms). In this paper, we propose a heuristic to remove such "noise" by taking into account the linguistic nature of the words in the software artifacts. In particular, the language used in software documents can be classified as a technical language, where the words that provide more indication of the semantics of a document are the nouns. The results of a case study conducted on five software artifact repositories indicate that characterizing the context of software artifacts by considering only nouns significantly improves the accuracy of IR-based traceability recovery methods.
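A minimal sketch of the nouns-only heuristic using an off-the-shelf part-of-speech tagger (NLTK is used here only as an example toolkit and requires its tokenizer and tagger resources to be downloaded); a real pipeline would also handle identifier splitting and software-specific vocabulary.

```python
import nltk  # assumes the NLTK tokenizer and POS-tagger resources are installed

def keep_only_nouns(text: str) -> str:
    """Reduce an artifact's text to its nouns before indexing it for
    IR-based traceability recovery."""
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)
    nouns = [word for word, tag in tagged if tag.startswith("NN")]
    return " ".join(nouns)

print(keep_only_nouns("The controller validates the user credentials and opens a session"))
# e.g. -> "controller user credentials session"
```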
Context: Traceability relations among software artifacts often tend to be missing, outdated, or lost. For this reason, various traceability recovery approaches—based on Information Retrieval (IR) techniques—have been proposed. The performances of such approaches are often influenced by "noise" contained in software artifacts (e.g., recurring words in document templates or other words that do not contribute to the retrieval itself). Aim: As a complement and alternative to stop word removal approaches, this paper proposes the use of a smoothing filter to remove "noise" from the textual corpus of artifacts to be traced. Method: We evaluate the effect of a smoothing filter in traceability recovery tasks involving different kinds of artifacts from five software projects, and applying three different IR methods, namely Vector Space Models, Latent Semantic Indexing, and Jensen–Shannon similarity model. Results: Our study indicates that, with the exception of some specific kinds of artifacts (i.e., tracing test cases to source code) the proposed approach is able to significantly improve the performances of traceability recovery, and to remove "noise" that simple stop word filters cannot remove. Conclusions: The obtained results not only help to develop traceability recovery approaches able to work in presence of noisy artifacts, but also suggest that smoothing filters can be used to improve performances of other software engineering approaches based on textual analysis.
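The smoothing filter can be sketched as subtracting the corpus-wide "average document" from each artifact's term vector, so that terms shared by most artifacts (e.g., template boilerplate) stop driving the similarity ranking. The implementation below is a simplified illustration of that idea, with an invented toy corpus, rather than the exact filter and IR pipeline evaluated in the paper.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def smoothed_similarities(artifacts):
    """artifacts: list of artifact texts (e.g., use cases and class identifiers).
    Subtract the corpus-wide mean term vector before computing cosine
    similarities, damping terms that occur in most artifacts."""
    vectors = CountVectorizer().fit_transform(artifacts).toarray().astype(float)
    smoothed = vectors - vectors.mean(axis=0)
    return cosine_similarity(smoothed)

docs = ["use case template login user password",
        "use case template export report pdf",
        "class LoginController user password session"]
print(smoothed_similarities(docs).round(2))
```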
Feature location is a fundamental step in software evolution tasks such as debugging, understanding, and reuse. Numerous automated and semi-automated feature location techniques (FLTs) have been proposed, but the question remains: how do we objectively determine which FLT is most effective? Existing evaluations frequently use bug fix data, which includes the location of the fix, but not what other code needs to be understood to make the fix. Existing evaluation measures such as precision, recall, effectiveness, mean average precision (MAP), and mean reciprocal rank (MRR) do not differentiate between an FLT that ranks these related elements higher and one that ranks completely irrelevant elements higher. We propose an alternative measure of relevance based on the likelihood of a developer finding the bug fix locations from a ranked list of results. Our initial evaluation shows that, by modeling user behavior, our proposed evaluation methodology can compare and evaluate FLTs fairly.
Code smells represent symptoms of poor implementation choices. Previous studies found that these smells make source code more difficult to maintain, possibly also increasing its fault-proneness. There are several approaches that identify smells based on code analysis techniques. However, we observe that many code smells are intrinsically characterized by how code elements change over time. Thus, relying solely on structural information may not be sufficient to detect all the smells accurately. We propose an approach to detect five different code smells, namely Divergent Change, Shotgun Surgery, Parallel Inheritance, Blob, and Feature Envy, by exploiting change history information mined from versioning systems. We applied the approach, coined HIST (Historical Information for Smell deTection), to eight software projects written in Java and, wherever possible, compared it with existing state-of-the-art smell detectors based on source code analysis. The results indicate that HIST's precision ranges between 61% and 80%, and its recall ranges between 61% and 100%. More importantly, the results confirm that HIST is able to identify code smells that cannot be identified through approaches solely based on code analysis.
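A small sketch of the change-history intuition behind this line of work: classes that repeatedly change together across commits are candidate participants in history-based smells such as Divergent Change or Shotgun Surgery. The commit data and threshold below are hypothetical, and HIST's actual detection rules are richer than this co-change count:

```python
# Count how often pairs of classes change in the same commit and flag frequent pairs.
from collections import Counter
from itertools import combinations

commits = [  # each commit lists the classes it touched (invented data)
    {"Order", "Invoice", "Shipping"},
    {"Order", "Invoice"},
    {"Order", "Invoice", "Customer"},
    {"Report"},
]

co_changes = Counter()
for changed in commits:
    for a, b in combinations(sorted(changed), 2):
        co_changes[(a, b)] += 1

MIN_COCHANGES = 2  # assumed threshold
candidates = [pair for pair, n in co_changes.items() if n >= MIN_COCHANGES]
print(candidates)  # e.g. [('Invoice', 'Order')]
```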
In recent years, the market of mobile software applications (apps) has maintained an impressive upward trajectory. Many small and large software development companies invest considerable resources to target the available opportunities. As of today, the markets for such devices feature over 850K apps for Android and 900K for iOS. Availability, cost, functionality, and usability are just some factors that determine the success or lack of success of a given app. Among these factors, reliability is an important criterion: users easily get frustrated by repeated failures, crashes, and other bugs, hence abandoning some apps in favor of others. This paper reports a study analyzing how the fault- and change-proneness of APIs used by 7,097 (free) Android apps relates to the applications' lack of success, estimated from user ratings. Results of this study provide important insights into a crucial issue: making heavy use of fault- and change-prone APIs can negatively impact the success of these apps.
Software ecosystems consist of multiple software projects, often interrelated by means of dependency relations. When one project undergoes changes, other projects may decide to upgrade the dependency. For example, a project could use a new version of another project because the latter has been enhanced or subjected to bug-fixing activities. This paper reports an exploratory study aimed at observing the evolution of the Java subset of the Apache ecosystem, consisting of 147 projects, for a period of 14 years, and resulting in 1,964 releases. Specifically, we analyze (i) how dependencies change over time, (ii) whether a dependency upgrade is due to different kinds of factors, such as different kinds of API changes or licensing issues, and (iii) how an upgrade impacts a related project. Results of this study help to comprehend the phenomenon of library/component upgrade, and provide the basis for a new family of recommenders aimed at supporting developers in the complex (and risky) activity of managing library/component upgrades within their software projects.
When developers perform a software maintenance task, they need to identify artifacts - e.g., classes or, more specifically, methods - that need to be modified. To this aim, they can browse various kinds of artifacts, for example use case descriptions, UML diagrams, or source code. This paper reports the results of a study - conducted with 33 participants - aimed at investigating (i) to what extent developers use different kinds of documentation when identifying artifacts to be changed, and (ii) whether they follow specific navigation patterns among different kinds of artifacts. Results indicate that, although participants spent a conspicuous proportion of the available time focusing on source code, they browse back and forth between source code and either static (class) or dynamic (sequence) diagrams. Less frequently, participants - especially more experienced ones - follow an "integrated" approach by using different kinds of artifacts.
The effectiveness of evolutionary test case generation based on Genetic Algorithms (GAs) can be seriously impacted by genetic drift, a phenomenon that inhibits the ability of such algorithms to effectively diversify the search and look for alternative potential solutions. In such cases, the search becomes dominated by a small set of similar individuals that lead GAs to converge to a sub-optimal solution and to stagnate, without reaching the desired objective. This problem is particularly common for hard-to-cover program branches, associated with an extremely large solution space. In this paper, we propose an approach to solve this problem by integrating a mechanism for orthogonal exploration of the search space into a standard GA. The diversity in the population is enriched by adding individuals in orthogonal directions, hence providing a more effective exploration of the solution space. To the best of our knowledge, no prior work has explicitly addressed the issue of evolution-direction-based diversification in the context of evolutionary testing. Results achieved on 17 Java classes indicate that the proposed enhancements make the GA much more effective and efficient in automating the testing process. In particular, effectiveness (coverage) was significantly improved in 47% of the subjects and efficiency (search budget consumed) was improved in 85% of the subjects on which effectiveness remains the same.
Cross-project defect prediction is very appealing because (i) it allows predicting defects in projects for which the availability of data is limited, and (ii) it allows producing generalizable prediction models. However, existing research suggests that cross-project prediction is particularly challenging and, due to heterogeneity of projects, prediction accuracy is not always very good. This paper proposes a novel, multi-objective approach for cross-project defect prediction, based on a multi-objective logistic regression model built using a genetic algorithm. Instead of providing the software engineer with a single predictive model, the multi-objective approach allows software engineers to choose predictors achieving a compromise between number of likely defect-prone artifacts (effectiveness) and LOC to be analyzed/tested (which can be considered as a proxy of the cost of code inspection). Results of an empirical evaluation on 10 datasets from the Promise repository indicate the superiority and the usefulness of the multi-objective approach with respect to single-objective predictors. Also, the proposed approach outperforms an alternative approach for cross-project prediction, based on local prediction upon clusters of similar classes.
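A minimal sketch of the multi-objective selection step described above: instead of a single predictor, the Pareto front of models trading off defect-prone artifacts found (to maximize) against LOC to inspect (to minimize) is kept for the engineer to choose from. In the paper a genetic algorithm evolves the underlying logistic regression models; here only the final dominance filter is shown, on hypothetical model scores:

```python
# Keep the Pareto-optimal models over (defects found, LOC to inspect).
def pareto_front(models):
    """models: list of (name, defects_found, loc_to_inspect)."""
    front = []
    for m in models:
        dominated = any((o[1] >= m[1] and o[2] <= m[2]) and
                        (o[1] > m[1] or o[2] < m[2]) for o in models)
        if not dominated:
            front.append(m)
    return front

models = [("A", 40, 12000), ("B", 55, 30000), ("C", 38, 15000), ("D", 60, 60000)]
print(pareto_front(models))  # A, B, D survive; C is dominated by A
```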
Information Retrieval (IR) techniques have gained wide-spread acceptance as a method for automating traceability recovery. These techniques recover links between software artifacts based on their textual similarity, i.e., the higher the similarity, the higher the likelihood that there is a link between the two artifacts. A common problem with all IR-based techniques is filtering out noise from the list of candidate links, in order to improve the recovery accuracy. Indeed, software artifacts may be related in many ways and the textual information captures only one aspect of their relationships. In this paper we propose to leverage code ownership information to capture relationships between source code artifacts for improving the recovery of traceability links between documentation and source code. Specifically, we extract the author of each source code component and for each author we identify the “context” she worked on. Thus, for a given query from the external documentation we compute the similarity between it and the context of the authors. When retrieving classes that relate to a specific query using a standard IR-based approach we reward all the classes developed by the authors having their context most similar to the query, by boosting their similarity to the query. The proposed approach, named TYRION (TraceabilitY link Recovery using Information retrieval and code OwNership), has been instantiated for the recovery of traceability links between use cases and Java classes of two software systems. The results indicate that code ownership information can be used to improve the accuracy of an IR-based traceability link recovery technique.
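An illustrative sketch of the ownership-based reranking idea: a class's IR score is boosted when the textual "context" of its author is similar to the query. The data, the boosting factor, and the exact combination rule are assumptions made for illustration; TYRION's actual formulation may differ:

```python
# Boost IR similarity of classes whose author's context matches the query.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

query = "the customer pays the invoice online"
classes = {"PaymentService": "pay invoice customer credit card",
           "PdfExporter": "export document pdf layout"}
authors = {"PaymentService": "alice", "PdfExporter": "bob"}
author_context = {"alice": "payment invoice checkout customer billing",
                  "bob": "rendering export pdf fonts layout"}

corpus = [query] + list(classes.values()) + list(author_context.values())
tfidf = TfidfVectorizer().fit_transform(corpus)
sims = cosine_similarity(tfidf[:1], tfidf[1:]).ravel()

n = len(classes)
class_sim = dict(zip(classes, sims[:n]))
ctx_sim = dict(zip(author_context, sims[n:]))

BOOST = 0.5  # assumed boosting factor
ranked = sorted(classes, key=lambda c: class_sim[c] + BOOST * ctx_sim[authors[c]],
                reverse=True)
print(ranked)  # classes reranked using code ownership information
```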
A number of approaches in traceability link recovery and other software engineering tasks incorporate topic models, such as Latent Dirichlet Allocation (LDA). Although in theory these topic models can produce very good results if they are configured properly, in reality their potential may be undermined by improper calibration of their parameters (e.g., number of topics, hyper-parameters), which could potentially lead to sub-optimal results. In our previous work we addressed this issue and proposed LDA-GA, an approach that uses Genetic Algorithms (GA) to find a near-optimal configuration of parameters for LDA, which was shown to produce superior results for traceability link recovery and other tasks than reported ad-hoc configurations. LDA-GA works by optimizing the coherence of topics produced by LDA for a given dataset. In this paper, we instantiate LDA-GA as a TraceLab experiment, making publicly available all the implemented components, the datasets and the results from our previous work. In addition, we provide guidelines on how to extend our LDA-GA approach to other IR techniques and other software engineering tasks using existing TraceLab components.
Latent Semantic Indexing (LSI) is an advanced method widely and successfully employed in Information Retrieval (IR). It is an extension of the Vector Space Model (VSM) and it is able to outperform VSM in canonical IR scenarios, where it is used on very large document repositories. LSI has also been used to semi-automatically generate traceability links between software artefacts. However, in such a scenario LSI does not outperform VSM. This contradictory result is probably due to the different characteristics of software artefact repositories as compared to document repositories. In this paper we present a preliminary empirical study to analyze how the size of the repository (in terms of number of documents) and its vocabulary (in terms of number of terms) affect the retrieval accuracy. Even if replications are needed to generalize our findings, the study presented in this paper provides some insights that might be used as guidelines for selecting the most adequate method for traceability recovery depending on the particular application context.
Information Retrieval (IR) has been widely accepted as a method for automated traceability recovery based on the textual similarity among the software artifacts. However, a notorious difficulty for IR-based methods is that artifacts may be related even if they are not textually similar. A growing body of work addresses this challenge by combining IR-based methods with structural information from source code. Unfortunately, the accuracy of such methods is highly dependent on the IR methods. If the IR methods perform poorly, the combined approaches may perform even worse. In this paper, we propose to use the feedback provided by the software engineer when classifying candidate links to regulate the effect of using structural information. Specifically, our approach only considers structural information when the traceability links from the IR methods are verified by the software engineer and classified as correct links. An empirical evaluation conducted on three systems suggests that our approach outperforms both a pure IR-based method and a simple approach for combining textual and structural information.
Developers search source code frequently during their daily tasks, to find pieces of code to reuse, to find where to implement changes, etc. Code search based on text retrieval (TR) techniques has been widely used in the software engineering community during the past decade. The accuracy of the TR-based search results depends largely on the quality of the query used. We introduce Refoqus, an Eclipse plugin which is able to automatically detect the quality of a text retrieval query and to propose reformulations for it, when needed, in order to improve the results of TR-based code search.
Mentoring project newcomers is a crucial activity in software projects, and requires identifying people with good communication and teaching skills, in addition to high expertise on specific technical topics. In this demo we present Yoda (Young and newcOmer Developer Assistant), an Eclipse plugin that identifies and recommends mentors for newcomers joining a software project. Yoda mines developers' communication (e.g., mailing lists) and project versioning systems to identify mentors, using an approach inspired by what ArnetMiner does when mining advisor/student relations. Then, it recommends appropriate mentors based on the specific expertise required by the newcomer. The demo shows Yoda in action, illustrating how the tool is able to identify and visualize mentoring relations in a project, and to suggest appropriate mentors for a developer who is going to work on certain source code files, or on a given topic.
Coupling is a fundamental property of software systems, and numerous coupling measures have been proposed to support various development and maintenance activities. However, little is known about how developers actually perceive coupling, what mechanisms constitute coupling, and if existing measures align with this perception. In this paper we bridge this gap, by empirically investigating how class coupling - as captured by structural, dynamic, semantic, and logical coupling measures - aligns with developers' perception of coupling. The study has been conducted on three Java open-source systems - namely ArgoUML, JHotDraw and jEdit - and involved 64 students, academics, and industrial practitioners from around the world, as well as 12 active developers of these three systems. We asked participants to assess the coupling between the given pairs of classes and provide their ratings and some rationale. The results indicate that the peculiarity of the semantic coupling measure allows it to better estimate the mental model of developers than the other coupling measures. This is because, in several cases, the interactions between classes are encapsulated in the source code vocabulary, and cannot be easily derived by only looking at structural relationships, such as method calls.
There are more than twenty distinct software engineering tasks addressed with text retrieval (TR) techniques, such as traceability link recovery, feature location, refactoring, and reuse. A common issue with all TR applications is that the results of the retrieval depend largely on the quality of the query. When a query performs poorly, it has to be reformulated, and this is a difficult task for someone who had trouble writing a good query in the first place. We propose a recommender (called Refoqus) based on machine learning, which is trained with a sample of queries and relevant results. Then, for a given query, it automatically recommends a reformulation strategy that should improve its performance, based on the properties of the query. We evaluated Refoqus empirically against four baseline approaches that are used in natural language document retrieval. The data used for the evaluation corresponds to changes from five open source systems in Java and C++ and is used in the context of TR-based concept location in source code. Refoqus outperformed the baselines and its recommendations led to query performance improvement or preservation in 84% of the cases (on average).
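A rough sketch of the recommender idea: each query is represented by a few measurable properties (here just average IDF and query length, standing in for the full set of measures) and a classifier maps them to a reformulation strategy. The features, labels, and training data below are invented for illustration and do not reproduce Refoqus' actual feature set:

```python
# Train a classifier that maps query properties to a reformulation strategy.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# features: [average IDF of query terms, number of query terms] (hypothetical)
X_train = np.array([[2.1, 3], [0.4, 2], [3.0, 6], [0.8, 9], [2.7, 4], [0.5, 8]])
y_train = ["keep", "expand", "keep", "reduce", "keep", "reduce"]

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

new_query_features = np.array([[0.6, 7]])   # a long, low-specificity query
print(model.predict(new_query_features))    # recommended reformulation strategy
```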
Information Retrieval (IR) methods, and in particular topic models, have recently been used to support essential software engineering (SE) tasks, by enabling software textual retrieval and analysis. In all these approaches, topic models have been used on software artifacts in a similar manner as they were used on natural language documents (e.g., using the same settings and parameters) because the underlying assumption was that source code and natural language documents are similar. However, applying topic models on software data using the same settings as for natural language text did not always produce the expected results. Recent research investigated this assumption and showed that source code is much more repetitive and predictable than natural language text. Our paper builds on this new fundamental finding and proposes a novel solution to adapt, configure and effectively use a topic modeling technique, namely Latent Dirichlet Allocation (LDA), to achieve better (acceptable) performance across various SE tasks. Our paper introduces a novel solution called LDA-GA, which uses Genetic Algorithms (GA) to determine a near-optimal configuration for LDA in the context of three different SE tasks: (1) traceability link recovery, (2) feature location, and (3) software artifact labeling. The results of our empirical studies demonstrate that LDA-GA is able to identify robust LDA configurations, which lead to a higher accuracy on all the datasets for these SE tasks as compared to previously published results, heuristics, and the results of a combinatorial search.
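A rough sketch of the LDA-GA idea on a toy corpus: candidate LDA configurations are scored by how internally coherent the induced clustering of documents is (silhouette over dominant topics), and the best-scoring configuration is kept. A simple random search stands in for the genetic algorithm, and scikit-learn's LDA implementation is an assumption; the published approach differs in scale and in its genetic operators:

```python
# Search LDA configurations and score each by the cohesion of the induced clustering.
import random
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics import silhouette_score

docs = ["parser token grammar syntax tree", "token lexer scanner input stream",
        "database connection query transaction", "sql query index table schema",
        "button window click event handler", "widget layout window rendering"]
X = CountVectorizer().fit_transform(docs)

def fitness(n_topics, doc_topic_prior, topic_word_prior):
    lda = LatentDirichletAllocation(n_components=n_topics,
                                    doc_topic_prior=doc_topic_prior,
                                    topic_word_prior=topic_word_prior,
                                    random_state=0).fit(X)
    theta = lda.transform(X)                 # document-topic distributions
    labels = theta.argmax(axis=1)            # dominant topic per document
    if len(set(labels)) < 2:                 # degenerate clustering gets worst fitness
        return -1.0
    return silhouette_score(theta, labels)

random.seed(0)
best = max(((random.randint(2, 5), random.uniform(0.01, 1), random.uniform(0.01, 1))
            for _ in range(20)), key=lambda cfg: fitness(*cfg))
print("near-optimal configuration:", best)
```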
When newcomers join a software project, they need to be properly trained to understand the technical and organizational aspects of the project. Inadequate training could likely lead to project delay or failure. In this paper we propose an approach, named Yoda (Young and newcOmer Developer Assistant) aimed at identifying and recommending mentors in software projects by mining data from mailing lists and versioning systems. Candidate mentors are identified among experienced developers who actively interact with newcomers. Then, when a newcomer joins the project, Yoda recommends her a mentor that, among the available ones, has already discussed topics relevant for the newcomer. Yoda has been evaluated on software repositories of five open source projects. We have also surveyed some developers of these projects to understand whether mentoring was actually performed in their projects, and asked them to evaluate the mentoring relations Yoda identified. Results indicate that top committers are not always the most appropriate mentors, and show the potential usefulness of Yoda as a recommendation system to aid project managers in supporting newcomers joining a software project.
Text-based search and retrieval is used by developers in the context of many SE tasks, such as concept location, traceability link retrieval, reuse, and impact analysis. Solutions for software text search range from regular expression matching to complex techniques using text retrieval. In all cases, the results of a search depend on the query formulated by the developer. A developer needs to run a query and look at the results before realizing that it needs reformulating. Our aim is to automatically assess the performance of a query before it is executed. We introduce an automatic query performance assessment approach for software artifact retrieval, which uses 21 measures from the field of text retrieval. We evaluate the approach in the context of concept location in source code. The evaluation shows that our approach is able to predict the performance of queries with 79% accuracy, using very little training data.
Refactorings are - as defined by Fowler - behavior-preserving source code transformations. Their main purpose is to improve maintainability or comprehensibility, or to reduce the code footprint if needed. In principle, refactorings are defined as simple operations that are "unlikely to go wrong" and introduce faults. In practice, refactoring activities carry their own risks, as any other change does. This paper reports an empirical study carried out on three Java software systems, namely Apache Ant, Xerces, and ArgoUML, aimed at investigating to what extent refactoring activities induce faults. Specifically, we automatically detect (and then manually validate) 15,008 refactoring operations (of 52 different kinds) using an existing tool (Ref-Finder). Then, we use the SZZ algorithm to determine whether it is likely that refactorings induced a fault. Results indicate that, while some kinds of refactorings are unlikely to be harmful, others, such as refactorings involving hierarchies (e.g., pull up method), tend to induce faults very frequently. This suggests performing more accurate code inspection or testing activities when such specific refactorings are applied.
This paper proposes the use of Interactive Genetic Algorithms (IGAs) to integrate the developer's knowledge in a re-modularization task. Specifically, the proposed algorithm uses a fitness function composed of automatically-evaluated factors - accounting for the modularization quality achieved by the solution - and a human-evaluated factor, penalizing cases where the way the re-modularization places components into modules is considered meaningless by the developer. The proposed approach has been evaluated to re-modularize two software systems, SMOS and GESA. The obtained results indicate that IGA is able to produce solutions that, from a developer's perspective, are more meaningful than those generated using the fully automated GA. While taking feedback into account, the approach does not sacrifice the modularization quality, and it requires only a very limited amount of feedback, thus allowing its application also to large systems without requiring a substantial human effort.
In this demo we present TraceME (Traceability Management in Eclipse), an Eclipse plug-in that supports the software engineer in capturing and maintaining traceability links between different types of artifacts. A comparative analysis of the functionalities of the tools supporting traceability recovery highlights that TraceME is the most comprehensive tool for supporting such a critical activity during software development.
Unit testing represents a key activity in software development and maintenance. Test suites with high internal quality facilitate maintenance activities, such as code comprehension and regression testing. Several guidelines have been proposed to help developers write good test suites. Unfortunately, such rules are not always followed resulting in the presence of bad test code smells (or simply test smells). Test smells have been defined as poorly designed tests and their presence may negatively affect the maintainability of test suites and production code. Despite the many studies that address code smells in general, until now there has been no empirical evidence regarding test smells (i) distribution in software systems nor (ii) their impact on the maintainability of software systems. This paper fills this gap by presenting two empirical studies. The first study is an exploratory analysis of 18 software systems (two industrial and 16 open source) aimed at analyzing the distribution of test smells in source code. The second study, a controlled experiment involving twenty master students, is aimed at analyzing whether the presence of test smells affects the comprehension of source code during software maintenance. The results show that (i) test smells are widely spread throughout the software systems studied and (ii) most of the test smells have a strong negative impact on the comprehensibility of test suites and production code.
Information Retrieval (IR) techniques have been used for various software engineering tasks, including the labeling of software artifacts by extracting “keywords” from them. Such techniques include Vector Space Models, Latent Semantic Indexing, Latent Dirichlet Allocation, as well as customized heuristics extracting words from specific source code elements. This paper investigates how source code artifact labeling performed by IR techniques would overlap (and differ) from labeling performed by humans. This has been done by asking a group of subjects to label 20 classes from two Java software systems, JHotDraw and eXVantage. Results indicate that, in most cases, automatic labeling would be more similar to human-based labeling if using simpler techniques - e.g., using words from class and method names - that better reflect how humans behave. Instead, clustering-based approaches (LSI and LDA) are much more worthwhile to be used on source code artifacts having a high verbosity, as well as for artifacts requiring more effort to be manually labeled.
Test case selection has recently been formulated as a multi-objective optimization problem trying to satisfy conflicting goals, such as code coverage and computational cost. This paper introduces the concept of asymmetric distance preserving, useful to improve the diversity of non-dominated solutions produced by multi-objective Pareto efficient genetic algorithms, and proposes two techniques to achieve this objective. Results of an empirical study conducted over four programs from the SIR benchmark show how the proposed techniques (i) obtain non-dominated solutions having a higher diversity than the previously proposed multi-objective Pareto genetic algorithms; and (ii) improve the convergence speed of the genetic algorithms.
Meta-heuristics have been successfully used to solve a wide variety of problems. However, one issue many techniques have is their risk of being trapped into local optima, or of creating a limited variety of solutions (a problem known as "population drift"). Over the years, different kinds of techniques have been proposed to deal with population drift, for example hybridizing genetic algorithms with local search techniques or using niche techniques. This paper proposes a technique, based on Singular Value Decomposition (SVD), to enhance the population diversity of Genetic Algorithms (GAs). SVD helps to estimate the evolution direction and drive next generations towards orthogonal dimensions. The proposed SVD-based GA has been evaluated on 11 benchmark problems and compared with a simple GA and a GA with a distance-crowding scheme. Results indicate that the SVD-based GA achieves significantly better solutions and exhibits quicker convergence than the alternative techniques.
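A numerical sketch of the SVD intuition: decompose the current population matrix, find the direction the population barely explores (smallest singular value), and inject new individuals along it. The population encoding and the step size are assumptions for illustration; the actual operators in the SVD-based GA are more elaborate:

```python
# Inject individuals along the least-explored singular direction of the population.
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(size=(20, 5))          # 20 individuals, 5 real-valued genes
population[:, 3] *= 0.01                       # the 4th gene dimension is barely explored

U, s, Vt = np.linalg.svd(population - population.mean(axis=0), full_matrices=False)
weakest_direction = Vt[np.argmin(s)]           # least-represented evolution direction

step = 1.0                                     # assumed exploration step
new_individuals = population.mean(axis=0) + step * np.vstack([weakest_direction,
                                                              -weakest_direction])
population = np.vstack([population, new_individuals])  # enrich population diversity
print(population.shape)                        # (22, 5)
```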
During software evolution changes are inevitable. These changes may lead to design erosion and the introduction of inadequate design solutions, such as design antipatterns. Several empirical studies provide evidence that the presence of antipatterns is generally associated with lower productivity, greater rework, and more significant design efforts for developers. In order to improve the quality and remove antipatterns, refactoring operations are needed. In this demo, we present the Extract Class feature of ARIES (Automated Refactoring In EclipSe), an Eclipse plug-in that supports the software engineer in removing the "Blob" antipattern.
Text retrieval approaches have been used to address many software engineering tasks. In most cases, their use involves issuing a textual query to retrieve a set of relevant software artifacts from the system. The performance of all these approaches depends on the quality of the given query (i.e., its ability to describe the information need in such a way that the relevant software artifacts are retrieved during the search). Currently, the only way to tell that a query failed to lead to the expected software artifacts is by investing time and effort in analyzing the search results. In addition, it is often very difficult to ascertain what part of the query leads to poor results. We propose a novel pre-retrieval metric, which reflects the quality of a query by measuring the specificity of its terms. We exemplify the use of the new specificity metric on the task of concept location in source code. A preliminary empirical study shows that our metric is a good effort predictor for text retrieval-based concept location, outperforming existing techniques from the field of natural language document retrieval.
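A small illustration of a specificity-style pre-retrieval measure: a query whose terms are rare in the corpus (high inverse document frequency) is considered more specific, and hence more likely to retrieve the relevant artifacts. The averaged IDF below is a simplified stand-in for the metric in the paper, and the corpus is invented:

```python
# Score a query by the average IDF of its terms over a (toy) artifact corpus.
import math
from collections import Counter

corpus = [set("parse the input token stream".split()),
          set("render the main window layout".split()),
          set("parse configuration file options".split())]

df = Counter(term for doc in corpus for term in doc)   # document frequency per term
N = len(corpus)

def avg_idf(query):
    terms = query.split()
    return sum(math.log(N / df.get(t, 0.5)) for t in terms) / len(terms)

print(avg_idf("parse the input"))   # mixes common and specific terms: lower score
print(avg_idf("token stream"))      # more specific query: higher score
```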
We present a practical approach for teaching two different courses of Software Engineering (SE) and Software Project Management (SPM) in an integrated way. The two courses are taught in the same semester, thus allowing us to build mixed project teams composed of five to eight Bachelor's students (with development roles) and one or two Master's students (with management roles). The main goal of our approach is to simulate a real-life development scenario, giving students the opportunity to deal with issues arising from typical project situations, such as working in a team, organising the division of work, and coping with time pressure and strict deadlines.
The potential benefits of traceability are well known and documented, as well as the impracticability of recovering and maintaining traceability links manually. Indeed, the manual management of traceability information is an error prone and time consuming task. Consequently, despite the advantages that can be gained, explicit traceability is rarely established unless there is a regulatory reason for doing so. Extensive efforts have been brought forth to improve the explicit connection of software artifacts in the software engineering community (both research and commercial). Promising results have been achieved using Information Retrieval (IR) techniques for traceability recovery. IR-based traceability recovery methods propose a list of candidate traceability links based on the similarity between the text contained in the software artifacts. Software artifacts have different structures and the common element among many of them is the textual data, which most often captures the informal semantics of artifacts. For example, source code includes a large volume of textual data in the form of comments and identifiers. As a consequence, IR-based approaches are very well suited to address the traceability recovery problem. The conjecture is that artifacts with high textual similarity are good candidates to be traced to each other since they share several concepts. In this chapter we overview a general process of using IR-based methods for traceability link recovery and describe some of them in greater detail: probabilistic, vector space, and Latent Semantic Indexing models. Finally, we discuss common approaches to measuring the performance of IR-based traceability recovery methods and the latest advances in techniques for the analysis of candidate links.
Approaches for improving class cohesion identify refactoring opportunities using metrics that capture structural relationships between the methods of a class, e.g., attribute references. Semantic metrics, e.g., C3 metric, have also been proposed to measure class cohesion, as they seem to complement structural metrics. However, until now semantic relationships between methods have not been used to identify refactoring opportunities. In this paper we propose an Extract Class refactoring method based on graph theory that exploits structural and semantic relationships between methods. The empirical evaluation of the proposed approach highlighted the benefits provided by the combination of semantic and structural measures and the potential usefulness of the proposed method as a feature for software development environments.
The paper presents an approach helping developers to maintain source code identifiers and comments consistent with high-level artifacts. Specifically, the approach computes and shows the textual similarity between source code and related high-level artifacts. Our conjecture is that developers are induced to improve the source code lexicon, i.e., terms used in identifiers or comments, if the software development environment provides information about the textual similarity between the source code under development and the related high-level artifacts. The proposed approach also recommends candidate identifiers built from high-level artifacts related to the source code under development and has been implemented as an Eclipse plug-in, called COde Comprehension Nurturant Using Traceability (COCONUT). The paper also reports on two controlled experiments performed with master's and bachelor's students. The goal of the experiments is to evaluate the quality of identifiers and comments (in terms of their consistency with high-level artifacts) in the source code produced when using or not using COCONUT. The achieved results confirm our conjecture that providing the developers with similarity between code and high-level artifacts helps to improve the quality of source code lexicon. This indicates the potential usefulness of COCONUT as a feature for software development environments.
Maintaining traceability links between unit tests and tested classes is an important factor for effectively managing the development and evolution of software systems. Exploiting traceability links helps in program comprehension and maintenance by ensuring consistency between unit tests and tested classes during maintenance activities. Unfortunately, it is often the case that such links are not explicitly maintained and thus they have to be recovered manually during software evolution. A novel automated solution to this problem, based on dynamic slicing and conceptual coupling, is presented. The resulting tool, SCOTCH (Slicing and Coupling based Test to Code trace Hunter), is empirically evaluated on three systems: an open source system and two industrial systems. The results indicate that SCOTCH identifies traceability links between unit test classes and tested classes with a high accuracy and greater stability than existing techniques, highlighting its potential usefulness as a feature within a software development environment.
In this paper we present an experiment and two replications aimed at comparing the support provided by ER and UML class diagrams during comprehension activities, by focusing on the single building blocks of the two notations. This kind of analysis can be used to identify weaknesses in a notation and/or justify the need to prefer ER or UML for data modeling. The results reveal that UML class diagrams are generally more comprehensible than ER diagrams, even if the former show some weaknesses related to three building blocks, i.e., multi-value attribute, composite attribute, and weak entity. These findings suggest that a UML class diagram extension should be considered to overcome these weaknesses and improve the comprehensibility of the notation.
Different Information Retrieval (IR) methods have been proposed to recover traceability links among software artifacts. Until now there is no single method that sensibly outperforms the others, however, it has been empirically shown that some methods recover different, yet complementary traceability links. In this paper, we exploit this empirical finding and propose an integrated approach to combine orthogonal IR techniques, which have been statistically shown to produce dissimilar results. Our approach combines the following IR-based methods: Vector Space Model (VSM), probabilistic Jensen and Shannon (JS) model, and Relational Topic Modeling (RTM), which has not been used in the context of traceability link recovery before. The empirical case study conducted on six software systems indicates that the integrated method outperforms stand-alone IR methods as well as any other combination of non-orthogonal methods with a statistically significant margin.
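A toy sketch of combining scores from different IR methods after normalization; VSM cosine similarity and a Jensen-Shannon-based similarity are computed below, while RTM is omitted for brevity. The equal weighting, the normalization, and the example artifacts are assumptions, not the paper's exact combination scheme:

```python
# Combine normalized VSM and Jensen-Shannon similarity scores for candidate links.
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

source = ["the user prints a monthly report"]
targets = ["class ReportPrinter builds the monthly report",
           "class LoginManager checks user credentials"]

# VSM scores
tfidf = TfidfVectorizer().fit(source + targets)
vsm = cosine_similarity(tfidf.transform(source), tfidf.transform(targets)).ravel()

# JS similarity (1 - JS distance between normalized term distributions)
counts = CountVectorizer().fit(source + targets)
p = counts.transform(source).toarray()[0].astype(float)
p /= p.sum()
js = []
for t in targets:
    q = counts.transform([t]).toarray()[0].astype(float)
    q /= q.sum()
    js.append(1 - jensenshannon(p, q))
js = np.array(js)

def minmax(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-12)

combined = (minmax(vsm) + minmax(js)) / 2   # equal-weight combination (assumed)
print(combined)
```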
Information Retrieval methods have been largely adopted to identify traceability links based on the textual similarity of software artifacts. However, noise due to word usage in software artifacts might negatively affect the recovery accuracy. We propose the use of smoothing filters to reduce the effect of noise in software artifacts and improve the performance of traceability recovery methods. An empirical evaluation performed on two repositories indicates that the usage of a smoothing filter is able to significantly improve the performance of Vector Space Model and Latent Semantic Indexing. Such a result suggests that, besides being used for traceability recovery, the proposed filter can be used to improve the performance of various other software engineering approaches based on textual analysis.
Identifiers play an important role in source code understandability, maintainability, and fault-proneness. This paper reports a study of identifier renamings in software systems, studying how terms (identifier atomic components) change in source code identifiers. Specifically, the paper (i) proposes a term renaming taxonomy, (ii) presents an approximate lightweight code analysis approach to detect and classify term renamings automatically into the taxonomy dimensions, and (iii) reports an exploratory study of term renamings in two open-source systems, Eclipse-JDT and Tomcat. We thus report evidence that not only synonyms are involved in renamings but also (in a small fraction) more unexpected changes occur: surprisingly, we detected hypernym (a more abstract term, e.g., size vs. length) and hyponym (a more concrete term, e.g., restriction vs. rule) renamings, and antonym renamings (a term replaced with one having the opposite meaning, e.g., closing vs. opening). Despite being only a fraction of all renamings, synonym, hyponym, hypernym, and antonym renamings may hint at some program understanding issues and, thus, could be used in a renaming-recommendation system to improve code quality.
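A sketch of how a renaming (old term to new term) could be classified into the taxonomy dimensions using WordNet lexical relations. This is purely illustrative of the relations involved (synonymy, antonymy, hypernymy, hyponymy); the paper's detection is an approximate lightweight code analysis, and WordNet's coverage of program identifiers is limited:

```python
# Classify a term renaming using WordNet synonym/antonym/hypernym/hyponym relations.
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def classify_renaming(old, new):
    old_syns, new_syns = wn.synsets(old), wn.synsets(new)
    for s in old_syns:
        for lemma in s.lemmas():
            if new in lemma.name().split("_"):
                return "synonym"
            if any(new == a.name() for a in lemma.antonyms()):
                return "antonym"
    old_hypernyms = {h for s in old_syns for h in s.closure(lambda x: x.hypernyms())}
    if any(s in old_hypernyms for s in new_syns):
        return "hypernym"     # the new term is more abstract than the old one
    new_hypernyms = {h for s in new_syns for h in s.closure(lambda x: x.hypernyms())}
    if any(s in new_hypernyms for s in old_syns):
        return "hyponym"      # the new term is more concrete than the old one
    return "other"

print(classify_renaming("length", "size"))
print(classify_renaming("closing", "opening"))
```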
We propose a novel approach to identify Move Method refactoring opportunities and remove the Feature Envy bad smell from source code. The proposed approach analyzes both structural and conceptual relationships between methods and uses Relational Topic Models to identify sets of methods that share several responsibilities, i.e., 'friend methods'. The analysis of the method friendships of a given method can be used to pinpoint the target class (envied class) to which the method should be moved. The results of a preliminary empirical evaluation indicate that the proposed approach provides meaningful refactoring opportunities.
Recent studies indicated that showing the similarity between the source code being developed and related high-level artifacts (HLAs), such as requirements, helps developers improve the quality of source code identifiers. In this paper, we present CodeTopics, an Eclipse plug-in that, in addition to showing the similarity between source code and HLAs, also highlights to what extent the code under development covers the topics described in the HLAs. Such views complement the information derived by showing only the similarity between source code and HLAs, helping (i) developers to identify functionalities that are not yet implemented and (ii) newcomers to comprehend source code artifacts by showing them the topics these artifacts relate to.
Some studies have recently been carried out to investigate the use of search-based techniques in estimating software development effort, and the results reported seem to be promising. Tabu Search is a meta-heuristic approach successfully used to address several optimization problems. In this paper, we report on an empirical analysis carried out exploiting Tabu Search on two publicly available datasets, i.e., Desharnais and NASA. On these datasets, the exploited Tabu Search settings provided estimates comparable with those achieved with some widely used estimation techniques, thus suggesting further investigation of this topic.
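A minimal Tabu Search sketch in the spirit of this line of work: the search explores the coefficients of a simple linear effort model and keeps a short-term memory (tabu list) of recently visited regions to escape local optima. The dataset, the neighborhood, the objective (mean magnitude of relative error), and all parameters are assumptions for illustration, not the settings used in the paper:

```python
# Tabu Search over the coefficients of a toy effort model (effort = a * size + b).
import random

random.seed(1)
data = [(10, 25), (15, 40), (30, 75), (45, 120), (60, 150)]  # (size, effort) points

def mmre(a, b):
    """Mean magnitude of relative error of the model on the toy dataset."""
    return sum(abs(e - (a * s + b)) / e for s, e in data) / len(data)

def key(sol):
    """Discretize a solution so the tabu list covers a small region, not a single point."""
    return (round(sol[0], 1), round(sol[1], 0))

current = best = (random.uniform(0, 5), random.uniform(0, 20))
tabu, TABU_SIZE = [], 10

for _ in range(500):
    # neighborhood: small random perturbations of the current solution
    neighbors = [(current[0] + random.uniform(-0.2, 0.2),
                  current[1] + random.uniform(-2, 2)) for _ in range(20)]
    neighbors = [n for n in neighbors if key(n) not in tabu] or neighbors
    current = min(neighbors, key=lambda n: mmre(*n))   # best non-tabu move, even if worse
    tabu = (tabu + [key(current)])[-TABU_SIZE:]
    if mmre(*current) < mmre(*best):
        best = current

print("best model:", best, "MMRE:", round(mmre(*best), 3))
```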
We present ADvanced Artefact Management System (ADAMS), a web-based system that integrates project management features, such as work-breakdown structure definition, resource allocation, and schedule management as well as artefact management features, such as artefact versioning, traceability management, and artefact quality management. In this article we focus on the fine-grained artefact management approach adopted in ADAMS, which is a valuable support to high-level documentation and traceability management. In particular, the traceability layer in ADAMS is used to propagate events concerning changes to an artefact to the dependent artefacts, thus also increasing the context-awareness in the project. We also present the results of experimenting with the system in software projects developed at the University of Salerno.
We present the results of three sets of controlled experiments aimed at analysing whether UML class diagrams are more comprehensible than ER diagrams during data model maintenance. In particular, we considered the support given by the two notations in the comprehension and interpretation of data models, the comprehension of the change to perform to meet a change request, and the detection of defects contained in a data model. The experiments involved university students with different levels of ability and experience. The results demonstrate that using UML class diagrams subjects achieved better comprehension levels. With regard to the support given by the two notations during maintenance activities, the results demonstrate that the two notations give the same support, while in general UML class diagrams provide better support than ER diagrams during verification activities.
The structure of a software system has a major impact on its maintainability. To improve maintainability, software systems are usually organized into subsystems using the constructs of packages or modules. However, during software evolution the structure of the system undergoes continuous modifications, drifting away from its original design, often reducing its quality. In this paper we propose an approach for helping maintainers to improve the quality of software modularization. The proposed approach analyzes the (structural and semantic) relationships between classes in a package, identifying chains of strongly related classes. The identified chains are used to define new packages with higher cohesion than the original package. The proposed approach has been empirically evaluated through a case study. The context of the study is represented by an open source system, JHotDraw, and two software systems developed by teams of students at the University of Salerno. The analysis of the results reveals that the proposed approach generates meaningful re-modularizations of the studied systems, which can lead to higher quality.
We propose a novel approach supporting the Extract Class refactoring. The proposed approach analyzes the (structural and semantic) similarity of the methods in a class in order to identify chains of strongly related methods. The identified method chains are used to define new classes with higher cohesion than the original class. A preliminary evaluation reveals that the approach is able to identify meaningful refactoring operations.
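A small sketch of the chain-based Extract Class idea: methods become nodes of a graph, edges are weighted by a combination of structural similarity (shared attribute accesses) and semantic similarity (shared identifier terms), weak edges are dropped, and each remaining connected component becomes a candidate extracted class. The weights, threshold, and method data are assumptions for illustration:

```python
# Cluster methods of a class via combined structural/semantic similarity.
import networkx as nx

methods = {  # hypothetical methods with accessed attributes and identifier terms
    "getTotal":   {"attrs": {"items", "discount"}, "terms": {"total", "price", "item"}},
    "addItem":    {"attrs": {"items"},             "terms": {"item", "add", "cart"}},
    "printLabel": {"attrs": {"address"},           "terms": {"label", "print", "address"}},
    "getAddress": {"attrs": {"address"},           "terms": {"address", "customer"}},
}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

G = nx.Graph()
G.add_nodes_from(methods)
for m1 in methods:
    for m2 in methods:
        if m1 < m2:
            w = 0.5 * jaccard(methods[m1]["attrs"], methods[m2]["attrs"]) \
              + 0.5 * jaccard(methods[m1]["terms"], methods[m2]["terms"])
            if w >= 0.2:                      # assumed threshold for "strongly related"
                G.add_edge(m1, m2, weight=w)

print(list(nx.connected_components(G)))       # each component is a candidate new class
```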
Unit tests are valuable as a source of up-to-date documentation, as developers continuously change them to reflect changes in the production code and keep an effective regression suite. Maintaining traceability links between unit tests and classes under test can help developers to comprehend parts of a system. In particular, unit tests show how parts of a system are executed and, as such, how they are supposed to be used. Moreover, the dependencies between unit tests and classes can be exploited to maintain consistency during refactoring. Generally, such dependencies are not explicitly maintained and they have to be recovered during software development. Some guidelines and naming conventions have been defined to describe the testing environment in order to easily identify related tests for a programming task. However, very often these guidelines are not followed, making the identification of links between unit tests and classes a time-consuming task. Thus, automatic approaches to recover such links are needed. In this paper a traceability recovery approach based on Data Flow Analysis (DFA) is presented. In particular, the approach retrieves as tested classes all the classes that affect the result of the last assert statement in each method of the unit test class. The accuracy of the proposed method has been empirically evaluated on two systems, an open source system and an industrial system. As a benchmark, we compare the accuracy of the DFA-based approach with the accuracy of the previously used traceability recovery approaches, namely Naming Convention (NC) and Last Call Before Assert (LCBA), which seem to provide the most accurate results. The results show that the proposed approach is the most accurate method, demonstrating the effectiveness of DFA. However, the case study also highlights the limitations of the experimented traceability recovery approaches, showing that detecting the class under test cannot be fully automated and some issues are still under study.
Poorly-chosen identifiers have been reported in the literature as misleading and increasing the program comprehension effort. Identifiers are composed of terms, which can be dictionary words, acronyms, contractions, or simple strings. We conjecture that the use of identical terms in different contexts may increase the risk of faults. We investigate our conjecture using a measure combining term entropy and term context coverage to study whether certain terms increase the odds ratios of methods to be fault-prone. Entropy measures the physical dispersion of terms in a program: the higher the entropy, the more scattered across the program the terms. Context coverage measures the conceptual dispersion of terms: the higher their context coverage, the more unrelated the methods using them. We compute term entropy and context coverage of terms extracted from identifiers in Rhino 1.4R3 and ArgoUML 0.16. We show statistically that methods containing terms with high entropy and context coverage are more fault-prone than others.
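A toy sketch of the two measures on an invented term-by-method occurrence table: entropy follows the standard Shannon definition and captures how physically scattered a term is across methods, while context coverage is sketched here as one minus the average Jaccard similarity of the methods using the term (a plausible reading of "conceptual dispersion", not necessarily the paper's exact formula):

```python
# Term entropy and a context-coverage sketch over a toy term/method table.
import math

occurrences = {  # how many times each term occurs in each method (invented)
    "get":     {"m1": 4, "m2": 3, "m3": 5, "m4": 4},   # spread everywhere
    "invoice": {"m2": 6},                              # localized term
}
method_terms = {"m1": {"get", "user", "name"}, "m2": {"get", "invoice", "total"},
                "m3": {"get", "parse", "token"}, "m4": {"get", "render", "widget"}}

def entropy(term):
    counts = occurrences[term].values()
    total = sum(counts)
    return -sum((c / total) * math.log(c / total, 2) for c in counts)

def context_coverage(term):
    ms = list(occurrences[term])
    if len(ms) < 2:
        return 0.0
    sims = [len(method_terms[a] & method_terms[b]) / len(method_terms[a] | method_terms[b])
            for i, a in enumerate(ms) for b in ms[i + 1:]]
    return 1 - sum(sims) / len(sims)   # higher = methods using the term are more unrelated

for term in occurrences:
    print(term, round(entropy(term), 2), round(context_coverage(term), 2))
```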
In software engineering, developers must often find solutions to problems balancing competing goals, e.g., quality versus cost, time to market versus resources, or cohesion versus coupling. Finding a suitable balance between contrasting goals is often complex, and recommendation systems are useful to support developers and managers in performing such a complex task. We believe that contrasting goals can often be dealt with using game theory techniques. Indeed, game theory is successfully used in other fields, especially in economics, to mathematically propose solutions to strategic situations, in which an individual's success in making choices depends on the choices of others. To demonstrate the applicability of game theory to software engineering and to understand its pros and cons, we propose an approach based on game theory that recommends extract-class refactoring opportunities. A preliminary evaluation inspired by mutation testing demonstrates the applicability and the benefits of the proposed approach.
We present an empirical study to statistically analyze the equivalence of several traceability recovery methods based on Information Retrieval (IR) techniques. The analysis is based on Principal Component Analysis and on the analysis of the overlap of the set of candidate links provided by each method. The studied techniques are the Jensen-Shannon (JS) method, Vector Space Model (VSM), Latent Semantic Indexing (LSI), and Latent Dirichlet Allocation (LDA). The results show that while JS, VSM, and LSI are almost equivalent, LDA is able to capture a dimension unique to the set of techniques which we considered.
Tabu Search is a meta-heuristic approach successfully used to address optimization problems in several contexts. This paper reports the results of an empirical study carried out to investigate the effectiveness of Tabu Search in estimating Web application development effort. The dataset employed in this investigation is part of the Tukutuku database. This database has been used in several studies to assess the effectiveness of various effort estimation techniques, such as Linear Regression and Case-Based Reasoning. Our results are encouraging given that Tabu Search outperformed all the other estimation techniques against which it has been compared.
Context: The use of search-based methods has been recently proposed for software development effort estimation, and some case studies have been carried out to assess the effectiveness of Genetic Programming (GP). The results reported in the literature showed that GP can provide an estimation accuracy comparable to or slightly better than some widely used techniques, and encouraged further research to investigate whether the estimation accuracy can be improved by varying the fitness function. Aim: Starting from these considerations, in this paper we report on a case study aiming to analyse the role played by some fitness functions in the accuracy of the estimates. Method: We performed a case study based on a publicly available dataset, i.e., Desharnais, by applying a 3-fold cross validation and employing summary measures and statistical tests for the analysis of the results. Moreover, we compared the accuracy of the obtained estimates with those achieved using some widely used estimation methods, namely Case-Based Reasoning (CBR) and Manual Step Wise Regression (MSWR). Results: The obtained results highlight that the fitness function choice significantly affected the estimation accuracy. The results also revealed that GP provided significantly better estimates than CBR and estimates comparable with those of MSWR for the considered dataset.
Antipatterns are poor object-oriented solutions to recurring design problems. The identification of occurrences of antipatterns in systems has received recently some attention but current approaches have two main limitations: either (1) they classify classes strictly as being or not antipatterns, and thus cannot report accurate information for borderline classes, or (2) they return the probabilities of classes to be antipatterns but they require an expensive tuning by experts to have acceptable accuracy. To mitigate such limitations, we introduce a new identification approach, ABS (Antipattern identification using B-Splines), based on a numerical analysis technique. The results of a preliminary study show that ABS generally outperforms previous approaches in terms of accuracy when used to identify Blobs.
Software development effort estimation is a critical activity for the competitiveness of a software company; it is crucial for planning and monitoring project development and for delivering the product on time and within budget. In the last years, some attempts have been made to apply search-based approaches to estimate software development effort. In particular, some genetic algorithms have been defined and some empirical studies have been performed with the aim of assessing the effectiveness of the proposed approaches for estimating software development effort. The results reported in those studies seem to be promising. The objective of this chapter is to present a state of the art in the field by reporting on the most significant empirical studies undertaken so far. Furthermore, some suggestions for future research directions are also provided.
We present the project management support in ADAMS (ADvanced Artefact Management System), a web-based system that integrates project management features, such as resource allocation and process control, and artefact management features, such as coordination of cooperative workers and artefact versioning, as well as context-awareness and artefact traceability features. We also present an evaluation of the usefulness and adequacy of the functionalities that ADAMS provides to the project manager. This evaluation has been conducted during the last 3 years with Master's students attending a course on Software Project Management at the University of Salerno.
We report the results of a controlled experiment and a replication performed with different subjects, in which we assessed the usefulness of an Information Retrieval-based traceability recovery tool during the traceability link identification process. The main result achieved in the two experiments is that the use of a traceability recovery tool significantly reduces the time spent by the software engineer with respect to manual tracing. Replication with different subjects allowed us to investigate if subjects’ experience and ability play any role in the traceability link identification process. In particular, we made some observations concerning the retrieval accuracy achieved by the software engineers with and without the tool support and with different levels of experience and ability.
The paper proposes a novel information retrieval technique based on numerical analysis for recovering traceability links between code and software documentation. The results of a reported case study demonstrate that the proposed approach significantly outperforms two vector-based IR models, i.e., the vector space model and latent semantic indexing, and it is comparable to and sometimes better than a probabilistic model, i.e., the Jensen-Shannon method. The paper also discusses how the specific artifact type and artifact language considered influence the performance of each method.
The use of optimization techniques has recently been proposed to build models for software development effort estimation. In particular, some studies have been carried out using search-based techniques, such as genetic programming, and the results reported seem to be promising. To the best of our knowledge, nobody has analyzed the effectiveness of Tabu Search for development effort estimation. Tabu Search is a meta-heuristic approach successfully used to address several optimization problems. In this paper we report on an empirical analysis carried out exploiting Tabu Search on a publicly available dataset, i.e., the Desharnais dataset. The achieved results show that Tabu Search provides estimates comparable with those achieved with some widely used estimation techniques.
This paper presents a two-step process aiming at improving the tracing performance of the software engineer when using an IR-based traceability recovery tool. In the first step the software engineer performs an incremental coarse-grained traceability recovery between a set of source artefacts and a set of target artefacts. During this step he/she traces as many links as possible while keeping the effort needed to discard false positives low. In the second step he/she uses a coverage analysis of the traced links to identify poorly traced source artefacts and to guide focused fine-grained traceability recovery sessions that recover the links missed in the first step. The results achieved in a reported controlled experiment demonstrate that the proposed approach significantly increases the amount of correct links traced by the software engineer with respect to a traditional process.
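A minimal sketch of the coverage analysis used in the second step: source artefacts that collected few traced links after the coarse-grained pass are flagged for a focused fine-grained recovery session. The coverage threshold (min_links) and the data structures are illustrative assumptions.

    from collections import Counter

    def poorly_traced(source_artefacts, traced_links, min_links=1):
        # traced_links: (source, target) pairs accepted in the coarse-grained step.
        counts = Counter(source for source, _target in traced_links)
        return [source for source in source_artefacts if counts[source] <= min_links]

    sources = ["UC1", "UC2", "UC3"]
    links = [("UC1", "Account.java"), ("UC1", "Bank.java"), ("UC2", "Logger.java")]
    print(poorly_traced(sources, links))   # UC2 and UC3 deserve a fine-grained session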
The intensive human effort needed to manually manage traceability information has increased the interest in semi-automated traceability recovery techniques. This paper presents a simple way to improve the accuracy of traceability recovery methods based on Information Retrieval techniques. The proposed method acts on artefact indexing by considering only the nouns contained in the artefact content to define the semantics of an artefact. The rationale behind such a choice is that the language used in software documents can be classified as a sectorial language, where the terms that provide the strongest indication of the semantics of a document are the nouns. The results of a reported case study demonstrate that the proposed artefact indexing significantly improves the accuracy of traceability recovery methods based on the probabilistic or vector space based IR models.
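A minimal sketch of noun-only indexing, assuming NLTK's part-of-speech tagger and the Penn Treebank 'NN*' tags; the paper may rely on a different tagger and term-weighting pipeline.

    import nltk
    # One-off setup (resource names may vary across NLTK versions):
    # nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

    def noun_index(artefact_text):
        # Keep only the tokens tagged as nouns to build the artefact's term vector.
        tagged = nltk.pos_tag(nltk.word_tokenize(artefact_text))
        return [word.lower() for word, tag in tagged if tag.startswith("NN")]

    print(noun_index("The system shall store the customer order in the database"))
    # e.g. ['system', 'customer', 'order', 'database']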
Software change impact analysis is the activity of the software maintenance process that determines the possible effects of proposed software changes. This activity is necessary to become aware of the ripple effects caused by a change and to record them so that nothing is overlooked. A change has an impact not only on the source code, but also on other related software artefacts, such as requirements, design, and test documents. For this reason, impact analysis can be efficiently supported through traceability information. In this paper we review traceability management in the context of impact analysis and discuss the main challenges and research directions.
Several refactoring methods have been proposed in the literature to improve the cohesion of classes. Very often, refactoring operations are guided by cohesion metrics based on structural information of the source code, such as attribute references in methods. In this paper we present a novel approach to guide the extract class refactoring (M. Fowler, 1999), taking into account both structural and semantic cohesion metrics. The proposed approach has been evaluated in a case study conducted on JHotDraw, an open source software system. The achieved results reveal that the proposed approach significantly outperforms methods considering only structural or only semantic information. The proposed approach has also been integrated in the Eclipse platform.
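The combination of the two sources of information can be sketched as a method-by-method similarity that mixes shared attribute accesses (structural, Jaccard) with the cosine similarity of the methods' textual content (semantic); the 0.5/0.5 weighting and the specific measures are illustrative assumptions, not the exact metrics used in the paper, and clustering this matrix would suggest how to split the class.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    methods = {   # toy class: method -> (accessed attributes, identifiers/comments)
        "printInvoice": ({"invoiceId", "total"}, "print invoice total amount"),
        "computeTotal": ({"total", "items"},     "compute total price of items"),
        "exportLog":    ({"logFile"},            "export log file to disk"),
    }

    names = list(methods)
    semantic = cosine_similarity(
        TfidfVectorizer().fit_transform([text for _attrs, text in methods.values()]))

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    for i, mi in enumerate(names):
        for j, mj in enumerate(names):
            if i < j:
                structural = jaccard(methods[mi][0], methods[mj][0])
                print(f"{mi} ~ {mj}: {0.5 * structural + 0.5 * semantic[i, j]:.2f}")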
We present the results of a controlled experiment aiming at analysing the role played by the approach adopted during an IR-based traceability recovery process. In particular, we compare the tracing performance achieved by subjects using the "one-shot" process, where the full ranked list of candidate links is proposed, and the incremental process, where a similarity threshold is used to cut the ranked list and the links are classified step-by-step. The analysis of the achieved results shows that, in general, the incremental process improves the tracing accuracy and reduces the effort needed to analyse the proposed links.
We present the results of two controlled experiments conducted to compare ER and UML class diagrams, in order to find out which of the two models provides better support during the comprehension of data models. The experiments involved Master's and Bachelor's students performing comprehension tasks on data models represented by ER or UML class diagrams. The achieved results show that UML class diagrams significantly improve the comprehension level achieved by subjects. Moreover, having subjects with different levels of ability and experience allowed us to also make some considerations on the influence of such factors on comprehension performance.
In this demonstration we present the traceability recovery tool developed in ADAMS, a fine-grained artefact management system. The tool is based on an information retrieval technique, namely latent semantic indexing, and aims at supporting the software engineer in the identification of traceability links between artefacts of different types. The tool has also been integrated in the Eclipse-based client of ADAMS.
The potential benefits of traceability are well known, as is the impracticability of maintaining traceability links manually. Recently, Information Retrieval (IR) techniques have been proposed to support the software engineer during the traceability link identification process. Clearly, a research method/tool has a better chance of being transferred to practitioners if its usefulness is investigated through empirical user studies and it can be integrated within a commercial and widely used CASE tool. In this paper we move towards this result by showing how IBM Requisite Pro can be enriched with IR-based traceability recovery features.
We present the results of two controlled experiments carried out to compare the support given by ER and UML class diagrams during the maintenance of data models. The experiments involved Master's and Bachelor's students performing maintenance tasks on data models represented by ER and UML class diagrams. The results reveal that the two notations give, in general, the same support. In particular, the correctness level achieved by a subject performing the task on a data model represented by an ER diagram is comparable with the correctness level achieved by the same subject performing the task on a different data model represented by a UML class diagram. Moreover, by discriminating the levels of ability (high vs. low) and experience (graduate vs. undergraduate) of the subjects, we also provide some considerations about the influence of such factors on the correctness level achieved. In particular, we observe that UML class diagrams support subjects with high ability better than ER diagrams, while no difference can be observed for subjects with low ability. Regarding the experience factor, the results reveal no difference in the correctness level achieved by graduate and undergraduate students.
This research abstract analyses the strengths and limitations of the application of information retrieval (IR) methods for traceability link recovery between software artefacts. This work also shows how the ideas behind an IR-based traceability recovery process combined with traceability information can be used to improve and monitor software artefact quality during software development.
The main drawback of existing software artifact management systems is the lack of automatic or semi-automatic traceability link generation and maintenance. We have improved an artifact management system with a traceability recovery tool based on Latent Semantic Indexing (LSI), an information retrieval technique. We have assessed LSI to identify the strengths and limitations of using information retrieval techniques for traceability recovery and identified the need for an incremental approach. The method and the tool have been evaluated during the development of seventeen software projects involving about 150 students. We observed that although tools based on information retrieval provide useful support for the identification of traceability links during software development, they are still far from supporting a complete semi-automatic recovery of all links. The results of our experience also show that such tools can help to identify quality problems in the textual description of traced artifacts.
Computer aided assessment (CAA) tools are increasingly adopted in academic environments in combination with other assessment means. In this article, we present a CAA Web application, named eWorkbook, which can be used to evaluate learners' knowledge by letting tutors create, and learners take, on-line tests based on multiple choice, multiple response and true/false question types. Its use is suitable within the academic environment in a blended learning approach, providing tutors with an additional assessment tool and learners with a means for distance self-assessment. In the article, the main characteristics of the tool are presented, together with the rationale behind them and an outline of the architectural design of the system.
In this paper, we present an extension of the Subversion command line to support fine-grained versioning of Java code. To this aim, for each Java file under versioning, an XML-based file representing the logical structure of the original file is automatically built by parsing the code. An additional XML-based file is also built to model collaboration constraints. This information is useful to enrich context awareness by providing developers with information about changes made by others to the same logical unit (i.e., class, method, or attribute) of the Java file. Finally, we present an extension of Subclipse, a Subversion front-end implemented as an Eclipse plug-in, aiming at supporting fine-grained versioning in Subversion.
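A minimal sketch of the XML structure file, assuming the classes, methods and attributes have already been extracted by a Java parser; the element and attribute names are illustrative, not the actual schema used by the extension.

    import xml.etree.ElementTree as ET

    def structure_xml(java_file, classes):
        # classes: {class name: {"attributes": [...], "methods": [...]}}
        root = ET.Element("javaFile", name=java_file)
        for cls, members in classes.items():
            c = ET.SubElement(root, "class", name=cls)
            for attr in members.get("attributes", []):
                ET.SubElement(c, "attribute", name=attr)
            for method in members.get("methods", []):
                ET.SubElement(c, "method", name=method)
        return ET.tostring(root, encoding="unicode")

    print(structure_xml("Account.java",
                        {"Account": {"attributes": ["balance"],
                                     "methods": ["deposit", "withdraw"]}}))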
Several authors apply information retrieval (IR) techniques to recover traceability links between software artefacts. The use of user feedback (in terms of the classification of retrieved links as correct links or false positives) has been proposed to improve the retrieval performance of these techniques. In this paper we present a critical analysis of the use of feedback within an incremental traceability recovery process. In particular, we analyse the trade-off between the improvement of the performance and the link classification effort required to train the IR-based traceability recovery tool. We also present the results achieved in case studies and show that, even though the retrieval performance generally improves with the use of feedback, IR-based approaches are still far from solving the problem of recovering all correct links with a low classification effort.
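A standard way to exploit such feedback is Rocchio-style reformulation of the source artefact's vector, sketched below under the assumption of TF-IDF-like artefact vectors; Rocchio and the alpha/beta/gamma weights are a common choice used here for illustration, not necessarily the scheme adopted by the analysed tools.

    import numpy as np

    def rocchio(query_vec, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
        # Move the vector towards links confirmed as correct and away from
        # discarded false positives; keep the resulting weights non-negative.
        q = alpha * query_vec
        if len(relevant):
            q = q + beta * np.mean(relevant, axis=0)
        if len(non_relevant):
            q = q - gamma * np.mean(non_relevant, axis=0)
        return np.clip(q, 0, None)

    source = np.array([1.0, 0.0, 0.5])            # toy term weights of a source artefact
    correct_links = np.array([[0.9, 0.1, 0.6]])   # vectors of links classified as correct
    false_positives = np.array([[0.0, 1.0, 0.2]])
    print(rocchio(source, correct_links, false_positives))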
In this paper we present an Eclipse plug-in, called COCONUT (COde COmprehension Nurturant Using Traceability), that shows the similarity level between the source code under development and the high-level artefacts the source code should be traced onto. The plug-in also suggests candidate source code identifiers based on the domain terms contained in the corresponding high-level artefacts. Experiments showed that the plug-in helps developers produce source code that is easier to understand.
Applying information retrieval (IR) techniques to retrieve all correct links between software artefacts is in general impractical, as it usually requires a high effort to discard too many false positives. We show that the only way to recover traceability links using IR methods is to identify an "optimal" threshold that achieves an acceptable balance between traced links and false positives. Unfortunately, such a threshold is not known a priori. For this reason we have identified the need for an incremental traceability recovery approach that gradually identifies the threshold at which it is more convenient to stop the traceability recovery process, and we provide evidence of this in a case study. We also report the experience of using incremental traceability recovery during the development of software projects.
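A minimal sketch of the incremental cut of the ranked list: the threshold is lowered step by step, the engineer classifies only the newly proposed links, and the process stops when a step yields mostly false positives. The step size, the stopping rule and the classify callback are illustrative assumptions.

    def incremental_recovery(ranked_links, classify, start=0.9, step=0.1, tolerance=0.3):
        # ranked_links: (source, target, similarity) triples sorted by similarity, descending.
        traced, shown, threshold = [], 0, start
        while threshold >= 0:
            batch = [link for link in ranked_links[shown:] if link[2] >= threshold]
            shown += len(batch)
            correct = [link for link in batch if classify(link)]  # engineer's judgement
            traced.extend(correct)
            if batch and len(correct) / len(batch) < tolerance:
                break                              # mostly false positives: stop here
            threshold -= step
        return traced

    links = [("UC1", "Account.java", 0.92), ("UC1", "Logger.java", 0.41)]
    print(incremental_recovery(links, classify=lambda link: link[2] > 0.5))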
The presence of traceability links between software artefacts is very important to achieve high comprehensibility and maintainability. This is confirmed by several research efforts and tools aiming at supporting traceability link maintenance and recovery. We propose to use traceability information combined with Information Retrieval techniques within an Eclipse plug-in to show the software engineer the similarity between the source code components being developed and the high-level artefacts they should be traced onto. Such similarity information suggests actions aimed at improving the correct usage of identifiers and comments in source code and, as a consequence, its traceability and comprehensibility. The approach and the tool have been assessed in a controlled experiment performed with Master's students.
In this paper, we present ADAMS (ADvanced Artefact Management System), a Web-based system that integrates project management and artefact management features, as well as context-awareness and artefact traceability features. In particular, we focus on two features of the tool, namely hierarchical versioning and traceability support.
Maintaining traceability links (dependencies) between artefacts enables the management of changes during incremental and iterative software development in a flexible way. In this paper we present the traceability environment offered by ADAMS (ADvanced Artefact Management System). Basically, the traceability layer is used to propagate events concerning changes to an artefact to the dependent artefacts, thus also increasing the context awareness within the project. The proliferation of the messages generated by the system could slow down the system and cause developers to ignore notifications. Therefore, ADAMS also includes a visualisation tool that enables the software engineer to browse the dependencies concerning a given artefact and selectively subscribe to the events he/she is interested in.
We present the traceability recovery tool developed in the ADAMS artefact management system. The tool is based on an Information Retrieval technique, namely Latent Semantic Indexing, and aims at supporting the software engineer in the identification of traceability links between artefacts of different types. We also present a case study involving seven student projects, which represented an ideal workbench for the tool. The results emphasise the benefits provided by the tool in terms of new traceability links discovered, in addition to the links manually traced by the software engineer. The tool was also helpful in identifying cases of lack of similarity between artefacts manually traced by the software engineer, thus revealing inconsistencies in the usage of domain terms in these artefacts. This information is valuable for assessing the quality of the produced artefacts.
We present a traceability recovery method and tool based on Latent Semantic Indexing (LSI) in the context of an artefact management system. The tool highlights both the candidate links not yet identified by the software engineer and the links traced by the software engineer but missed by the tool, probably due to inconsistencies in the usage of domain terms in the traced software artefacts. We also present a case study of using the traceability recovery tool on software artefacts belonging to different categories of documents, including requirements, design, and testing documents, as well as code components.
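A minimal sketch of the LSI pipeline behind such a tool, assuming scikit-learn (TF-IDF, truncated SVD, cosine similarity), two latent dimensions and toy artefacts; none of these choices reflect the actual ADAMS implementation.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    requirements = ["the user deposits money into the account",
                    "the system prints the monthly statement"]
    code_docs    = ["class Account deposit withdraw balance",
                    "class StatementPrinter print monthly report"]

    tfidf = TfidfVectorizer().fit_transform(requirements + code_docs)
    lsi = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

    # Rank candidate links by cosine similarity in the latent space.
    similarities = cosine_similarity(lsi[:2], lsi[2:])
    for i, req in enumerate(requirements):
        best = similarities[i].argmax()
        print(f"{req!r} -> {code_docs[best]!r} ({similarities[i][best]:.2f})")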
Recently, researchers have addressed the problem of recovering traceability links between code and documentation using information retrieval techniques. We present a case study on the application of Latent Semantic Indexing to recover traceability links between artefacts produced during the requirements phase of a software development process, and we discuss the application of our approach within an artefact management system.