Combining text and link analysis for focused crawling—An application for vertical search engines

Article ID	Journal	Published Year	Pages	File Type
396621	Information Systems	2007	23 Pages	PDF

Abstract

The number of vertical search engines and portals has increased rapidly in recent years, making the importance of a topic-driven (focused) crawler self-evident. In this paper, we develop a latent semantic indexing classifier that combines link analysis with text content to retrieve and index domain-specific web documents. Our implementation takes a different approach to focused crawling: it aims to overcome the limitation of requiring initial training data while maintaining high recall and precision. We compare its efficiency with that of other well-known web information retrieval techniques.
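
To make the idea of combining link analysis with text content in a latent semantic space concrete, the following is a minimal sketch and not the paper's actual construction. It assumes one simple way of merging the two signals: each page's term-count column is extended with in-link indicator rows, a rank-k factorisation is computed by power iteration, and pages are scored by cosine similarity to a topic description. The function names (`lsi_scores`, `top_eigenpairs`), the `link_weight` parameter, and the toy pages are all hypothetical.

```python
import math

def gram(cols):
    """Matrix of dot products between every pair of page columns (M^T M)."""
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

def top_eigenpairs(B, k, iters=200):
    """Top-k eigenpairs of a symmetric PSD matrix: power iteration with deflation."""
    n = len(B)
    pairs = []
    for _ in range(k):
        v = [1.0 / (i + 1) for i in range(n)]  # fixed start vector, reproducible runs
        for _step in range(iters):
            # Remove components along eigenvectors that were already found.
            for _lam, u in pairs:
                proj = sum(a * b for a, b in zip(v, u))
                v = [a - proj * b for a, b in zip(v, u)]
            w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(B[i][j] * v[j] for j in range(n)) for i in range(n))
        pairs.append((lam, v))
    return pairs

def cosine(a, b, eps=1e-9):
    """Cosine similarity; vectors with (numerically) zero length score 0."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na < eps or nb < eps:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def lsi_scores(term_doc_cols, links, query, k=2, link_weight=1.0):
    """Score each page against a topic in a rank-k space built from text and links."""
    n = len(term_doc_cols)
    cols = []
    for j, text in enumerate(term_doc_cols):
        # Assumed link features of page j: which pages point to it (in-links).
        inlinks = [link_weight if j in links.get(p, ()) else 0.0 for p in range(n)]
        cols.append(list(text) + inlinks)
    pairs = top_eigenpairs(gram(cols), k)      # eigenvalue = squared singular value
    q = list(query) + [0.0] * n                # the topic has no link features
    t = [sum(a * b for a, b in zip(q, c)) for c in cols]   # M^T q
    # Project the topic and every page onto the top-k left singular vectors.
    q_hat = [sum(a * b for a, b in zip(v, t)) / math.sqrt(lam) for lam, v in pairs]
    scores = []
    for j in range(n):
        d_hat = [math.sqrt(lam) * v[j] for lam, v in pairs]
        scores.append(cosine(q_hat, d_hat))
    return scores

# Toy collection (hypothetical). Term order:
# crawler, search, index, recipe, pasta, spider
docs = [
    [1, 1, 1, 0, 0, 0],  # page 0: on-topic hub
    [0, 1, 1, 0, 0, 0],  # page 1: on-topic
    [0, 0, 0, 1, 1, 0],  # page 2: off-topic
    [0, 0, 0, 2, 1, 0],  # page 3: off-topic
    [0, 0, 0, 0, 0, 1],  # page 4: shares no term with the topic
]
links = {0: [1, 4], 2: [3]}        # source page -> pages it links to
topic = [0, 1, 1, 0, 0, 0]         # topic description: "search index"

with_links = lsi_scores(docs, links, topic, k=2, link_weight=1.0)
text_only = lsi_scores(docs, links, topic, k=2, link_weight=0.0)
```

In this toy setting, page 4 shares no vocabulary with the topic, so the text-only run gives it no score. Once the link rows are included, it is pulled towards the on-topic pages because page 0 links to both page 1 and page 4. A practical crawler would more likely rely on a sparse SVD routine from a numerical library than on the hand-written power iteration used here for self-containment.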

Keywords

Text categorisation; Information retrieval; Focused crawling; Latent Semantic Indexing