Web page title extraction and its application

Article ID	Journal ID	Publication year	English article	Full-text version
515626	867050	2007	16-page PDF	Free download

Keywords

Metadata extraction; Information retrieval

Related topics

Engineering and Basic Sciences; Computer Engineering; Computer Science Software

Abstract

This paper is concerned with the automatic extraction of titles from the bodies of HTML documents (web pages). Titles of HTML documents should be correctly defined in the title fields by the authors; in reality, however, they are often bogus. It is therefore advantageous to extract titles from HTML documents automatically. In this paper, we take a supervised machine learning approach to the problem. We first propose a specification of HTML titles, that is, a 'definition' of HTML titles. Next, we employ two learning methods to perform the task. In one method, we utilize features extracted from the DOM (Document Object Model) tree; in the other, we utilize features based on vision. We also combine the two methods to further enhance extraction accuracy. Our title extraction methods significantly outperform the baseline method of using the lines in the largest font size as the title (22.6–37.4% improvements in terms of F1 score). As an application, we consider web page retrieval, using the TREC Web Track data for evaluation. We propose a new method for HTML document retrieval using extracted titles. Experimental results indicate that using both extracted titles and title fields is almost always better than using title fields alone; extracted titles are particularly helpful in the task of named page finding (25.1–30.3% improvements).
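
The baseline mentioned in the abstract (taking the lines in the largest font size as the title) can be illustrated with a minimal sketch. This is not the authors' implementation: the input format (a list of text lines paired with font sizes) and the function names `largest_font_title` and `f1` are assumptions made for illustration, and the abstract does not say how lines or font sizes are obtained from a page. The `f1` helper is just the standard harmonic mean of precision and recall.

```python
from typing import List, Tuple

# Hypothetical input format: (line text, rendered font size).
Line = Tuple[str, float]


def largest_font_title(lines: List[Line]) -> str:
    """Join all non-empty lines that share the largest font size."""
    candidates = [(text.strip(), size) for text, size in lines if text.strip()]
    if not candidates:
        return ""
    max_size = max(size for _, size in candidates)
    return " ".join(text for text, size in candidates if size == max_size)


def f1(precision: float, recall: float) -> float:
    """Standard F1 score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative page: two lines share the largest font size.
page = [
    ("Home | Example Corp", 10.0),
    ("Annual Report", 24.0),
    ("2007", 24.0),
    ("Body text of the page.", 12.0),
]
print(largest_font_title(page))  # Annual Report 2007
```

A baseline like this fails whenever the largest text on a page is a logo, banner, or section heading, which is one plausible reason a learned model over DOM and visual features could do better.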

Publisher

Database: Elsevier - ScienceDirect
Journal: Information Processing & Management - Volume 43, Issue 5, September 2007, Pages 1332–1347

Authors

Yewei Xue, Yunhua Hu, Guomao Xin, Ruihua Song, Shuming Shi, Yunbo Cao, Chin-Yew Lin, Hang Li
