کد مقاله کد نشریه سال انتشار مقاله انگلیسی نسخه تمام متن
10342798 696280 2005 18 صفحه PDF دانلود رایگان
عنوان انگلیسی مقاله ISI
The BankSearch web document dataset: investigating unsupervised clustering and category similarity
موضوعات مرتبط
مهندسی و علوم پایه مهندسی کامپیوتر شبکه های کامپیوتری و ارتباطات
پیش نمایش صفحه اول مقاله
The BankSearch web document dataset: investigating unsupervised clustering and category similarity
چکیده انگلیسی
Targeting useful and relevant information on the internet is a highly complicated research area, which is served in part by research into document clustering. A foundational aspect of such research (proven over and over again in other research disciplines) is the use of standard datasets, against which different techniques can be properly benchmarked and assessed. We argue that, so far in this broad area of research, as many datasets have been used as research papers written, thus preventing confident reasoning about the relative performance of different techniques used in different publications. We describe a solution to this problem with the compilation of the BankSearch dataset, a proposed standard dataset suitable for a wide range of web-intelligence related research activities. At the time of writing, this dataset has already become a popular download in the Statlib archive, and is in use for benchmarking of a variety of document processing and web search techniques. Herein we also use the dataset in experiments to investigate certain issues in unsupervised web document clustering. Our main interest is how unsupervised clustering performance varies with the relative 'distance' between the categories inherent in the data, and how this is affected by the use of stemming and stoplists. These issues relate to, among other things, the design of useful search engines. We use simple k-means clustering, and find, unsurprisingly, that performance improves as categories become more distant. However, we also find that very close categories can be distinguished with fair accuracy, and there are interesting results concerning the use of stemming. Stop-word removal is confirmed as universally helpful, but stemming is not always to be recommended on 'distant' categories.
ناشر
Database: Elsevier - ScienceDirect (ساینس دایرکت)
Journal: Journal of Network and Computer Applications - Volume 28, Issue 2, April 2005, Pages 129-146
نویسندگان
, ,