Content extraction from Chinese web page based on title and content dependency tree

Article code	Journal code	Publication year	English article	Full-text version
725092	1461246	2012	6-page PDF	Free download

Keywords

Content extraction; Dependency distance

Related topics

Engineering and Basic Sciences > Other Engineering Fields > Electrical and Electronic Engineering

Abstract

Content extraction underlies many data mining technologies; it aims to extract the most valuable information from noisy, data-intensive web pages. Traditional statistics-based content extraction cannot handle documents with short content, table text, or long comments. By studying the positional relation between title and content, the paper proposes a new method for extracting web page content: it constructs a title and content dependency tree (TCDT), locates the content with the smallest dependency distance, and extracts the page content accurately by using both the dependency relation between title and content and the statistical information of the page. Experiments on several websites show that the method not only makes up for the deficiencies of the statistical method but also achieves better extraction precision.
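
The idea of locating content by its distance from the title can be illustrated with a minimal Python sketch. This is not the paper's TCDT algorithm, whose details are not given in the abstract. It rests on several assumptions made only for illustration: the page title is taken to be the first h1 element, "dependency distance" is approximated by the number of tree edges between the title node and a candidate block, and the statistical information is reduced to a minimum text length (the hypothetical min_chars parameter). All class and function names are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    """One element of a simplified DOM tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""  # text directly inside this element

    def full_text(self):
        # Own text followed by the text of all descendants.
        return self.text + "".join(c.full_text() for c in self.children)


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML (tolerant, but deliberately simple)."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def ancestors(node):
    """Return [node, parent, grandparent, ..., root]."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain


def tree_distance(a, b):
    """Number of edges on the path from a to b (assumed distance measure)."""
    up_a = {id(n): i for i, n in enumerate(ancestors(a))}
    for j, n in enumerate(ancestors(b)):
        if id(n) in up_a:
            return up_a[id(n)] + j
    raise ValueError("nodes are not in the same tree")


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def extract_content(html, min_chars=10):
    """Return the text of the block closest to the title, or None."""
    builder = TreeBuilder()
    builder.feed(html)
    nodes = list(walk(builder.root))
    title = next((n for n in nodes if n.tag == "h1"), None)  # assumption
    if title is None:
        return None
    title_chain = {id(n) for n in ancestors(title)}
    candidates = [
        n for n in nodes
        if n.tag in {"div", "p", "td"}
        and id(n) not in title_chain          # skip containers of the title
        and len(n.full_text()) >= min_chars   # crude statistical filter
    ]
    if not candidates:
        return None
    # Smallest distance wins; longer text breaks ties.
    best = min(candidates,
               key=lambda n: (tree_distance(title, n), -len(n.full_text())))
    return best.full_text()


SAMPLE_PAGE = """
<html><body>
<div id="nav"><p>Home | About | Contact | Login | Register</p></div>
<div id="main"><h1>Short story</h1>
  <div id="article"><p>Brief text body here.</p></div></div>
<div id="comments"><p>A very long reader comment that is much longer
than the article itself and would win a pure text-length contest.</p></div>
</body></html>
"""

if __name__ == "__main__":
    print(extract_content(SAMPLE_PAGE))  # prints "Brief text body here."
```

In the sample page the comment block holds far more text than the article, so a purely length-based extractor would pick the comments. The distance criterion selects the short article block next to the title, which is the kind of case the abstract says statistical methods handle poorly.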

Publisher

Database: Elsevier - ScienceDirect
Journal: The Journal of China Universities of Posts and Telecommunications - Volume 19, Supplement 2, October 2012, Pages 147-151, 189

Authors

Bin ZHANG, Xiao-fei WANG
