WSDL term tokenization methods for IR-style Web services discovery

Article ID	Journal	Published Year	Pages	File Type
433461	Science of Computer Programming	2012	20 Pages	PDF

Abstract

IR-style Web services discovery is an important approach that applies proven techniques from the field of Information Retrieval (IR). Many studies have exploited the Web Services Description Language (WSDL) syntax to extract useful service metadata for building indexes. However, a fundamental issue with this approach is WSDL term tokenization. This paper proposes applying three statistical methods to WSDL term tokenization: MDL, TP, and PPM. With the increasing need for effective IR-style Web services discovery facilities, term tokenization is of fundamental importance for properly indexing WSDL documents. We compare the applied methods with two baseline methods. The experiment suggests that the MDL and PPM methods are superior according to IR evaluation metrics. To the best of our knowledge, our work is the first to systematically investigate the issue of WSDL term tokenization for Web services discovery. Our solution can benefit source code mining, in which a key step is to tokenize the names (i.e., terms) of variables, functions, classes, modules, etc. for semantic analysis. Our methods could also be used to solve Web-related string tokenization problems such as URL analysis and Web script comprehension.
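To make the term tokenization problem concrete, the following is a minimal sketch of a rule-based splitter that relies only on delimiters, digits, and letter-case changes. It is an illustration under stated assumptions, not the paper's method: the abstract does not describe the two baselines or the MDL, TP, and PPM procedures, and the function name `split_term` and the sample terms are hypothetical.

```python
import re


def split_term(term: str) -> list[str]:
    """Split an identifier-like term using delimiters, digits, and case changes.

    Hypothetical rule-based splitter for illustration only; it is not one of
    the methods evaluated in the paper.
    """
    tokens = []
    # First break on anything that is not a letter or digit ('_', '-', '.', ':').
    for chunk in re.split(r"[^A-Za-z0-9]+", term):
        # Within each chunk, pick out acronym runs (uppercase letters not
        # followed by a lowercase letter), capitalized or lowercase words,
        # and digit runs.
        tokens.extend(
            re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", chunk)
        )
    return [t.lower() for t in tokens]


if __name__ == "__main__":
    for name in ["getWeatherByZipCode", "XMLHttpRequest",
                 "zip_code2city", "getweatherbyzipcode"]:
        print(name, "->", split_term(name))
```

A splitter of this kind works when the naming convention supplies cues, but a term such as `getweatherbyzipcode` comes back as a single token. Terms without such cues are presumably where statistical tokenization methods are most needed.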

► We address a critical issue for Information Retrieval-style Web services discovery.
► We propose the use of statistical methods for WSDL term tokenization.
► We show the superiority of our methods compared to two baseline methods.
► Our methods can be used for source code mining and automated script comprehension.

Keywords

Information retrieval; Data engineering