Enabling improved IR-based feature location

Article ID	Journal	Published Year	Pages	File Type
6885666	Journal of Systems and Software	2015	13 Pages	PDF

Abstract

Recent solutions to software engineering problems have incorporated tools and techniques from information retrieval (IR). The use of IR requires choosing an appropriate retrieval model and deciding on a query that best captures a particular information need. Taking feature location as a representative example, three research questions are investigated: (1) the impact of query preprocessing, (2) the impact that different techniques for scraping queries have on retrieval performance, and (3) the impact that the underlying retrieval model has on identifying the correct source-code functions (the correct documents). These research questions are addressed using the five open-source projects released as part of the SEMERU dataset. The experiments consider five methods of scraping queries from modification requests and seven retrieval model instances. Using the standard evaluation metric Mean Reciprocal Rank (MRR), the experimental analysis reveals that the better retrieval models are not the ones commonly used by software engineering researchers. The results show that models based on query-likelihood perform about twice as well as models in common use in software engineering, such as LSI, and thus deserve greater attention. Furthermore, corpus preprocessing has a significant impact: the top-performing setting is over 100% better than the average.
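
For readers unfamiliar with the metric, the following is a minimal sketch of how Mean Reciprocal Rank is typically computed. It is not taken from the article. The function names and the toy function identifiers (f1, f2, ...) are hypothetical, and scoring a query with no relevant result as 0 is a common convention assumed here rather than something the abstract states.

```python
from typing import Hashable, Iterable, Sequence, Set, Tuple


def reciprocal_rank(ranking: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Return 1/rank of the first relevant item, or 0.0 if none appears."""
    for position, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / position
    # Assumed convention: a query that retrieves nothing relevant scores 0.
    return 0.0


def mean_reciprocal_rank(
    runs: Iterable[Tuple[Sequence[Hashable], Set[Hashable]]]
) -> float:
    """Average the reciprocal rank over (ranking, relevant-set) pairs."""
    runs = list(runs)
    if not runs:
        raise ValueError("at least one query is required")
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Toy data (hypothetical): each query is a ranked list of function names
# plus the set of functions actually changed for that modification request.
example_runs = [
    (["f3", "f1", "f7"], {"f1"}),        # first hit at rank 2 -> 0.5
    (["f2", "f9", "f4"], {"f2", "f4"}),  # first hit at rank 1 -> 1.0
    (["f5", "f6", "f8"], {"f0"}),        # no hit              -> 0.0
]

print(mean_reciprocal_rank(example_runs))  # 0.5
```

Under this definition, doubling MRR roughly corresponds to the first correct function appearing at half the rank position on average, which is one way to read the "about twice as well" comparison above.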

Keywords

Query formulation; Feature location