Free article download: Language independent web news extraction system based on text detection framework

Article code	Journal code	Publication year	English article	Full-text version
391808	662002	2016	18-page PDF	Free download

English title of the ISI article

Language independent web news extraction system based on text detection framework

Keywords

HTML Content Extraction, Information Filtering, Web mining

Related topics

Engineering and Basic Sciences, Computer Engineering, Artificial Intelligence

English abstract

Web news provides a direct and efficient way to construct large text corpora. The creation of text data requires an understanding of HTML code and the preparation of customized parsing rules to identify text content in a webpage. Typically, parsing rules are written manually and cannot be applied to pages with different layouts. In this study, we present a web news extraction system that is based on a text detection framework. The proposed method scans the input HTML page and creates text statistics as a projection profile. Then, text block identification is applied to determine a set of content candidates. To filter noise, text verification determines whether a given text block can be included with content. We evaluate the proposed approach with the L3S-GN1 corpus and 3506 multilingual news data items randomly sampled from 325 websites (15 geographic regions and 11 distinct languages). We also compare the proposed method to 23 well-known state-of-the-art techniques. The experimental results show that the proposed method outperforms the second best method (NReadability) by 7.30% in the macro F-measure rate and is 16.91 times faster than NReadability. In terms of the perfect rate, the proposed method demonstrates 46.38% accuracy, whereas the Boilerpipe algorithm demonstrates only 21.54% accuracy. The proposed method is very useful for constructing a multilingual corpus because it requires no language-specific processing component.
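
The abstract names three stages (a projection profile of text statistics, text block identification, and text verification) without giving their details. The following is a minimal sketch of that pipeline shape, not the paper's method. It assumes the text statistic is the count of visible characters per HTML source line, that blocks are runs of text-heavy lines, and that verification keeps blocks whose text mass is close to the largest block's. All function names and the `min_chars`, `max_gap`, and `ratio` thresholds are hypothetical.

```python
import re

# Assumption: tags never span multiple source lines; a real system would use a proper parser.
TAG_RE = re.compile(r"<[^>]+>")
SKIP_RE = re.compile(r"<(script|style)\b.*?</\1>", re.IGNORECASE | re.DOTALL)


def projection_profile(html):
    """Return (lines, profile): visible text and its character count per source line."""
    # Blank out script/style bodies but keep their newlines so line indices stay aligned.
    cleaned = SKIP_RE.sub(lambda m: "\n" * m.group(0).count("\n"), html)
    lines = [TAG_RE.sub("", line).strip() for line in cleaned.split("\n")]
    return lines, [len(text) for text in lines]


def identify_blocks(profile, min_chars=20, max_gap=1):
    """Group lines with at least min_chars into (start, end) blocks.

    Text-heavy lines separated by at most max_gap sparse lines share a block.
    """
    blocks, start, last = [], None, None
    for i, count in enumerate(profile):
        if count < min_chars:
            continue
        if start is not None and i - last - 1 > max_gap:
            blocks.append((start, last))
            start = None
        if start is None:
            start = i
        last = i
    if start is not None:
        blocks.append((start, last))
    return blocks


def verify_blocks(blocks, profile, ratio=0.3):
    """Keep blocks whose total character count is at least ratio * the largest block's."""
    if not blocks:
        return []
    mass = [sum(profile[s:e + 1]) for s, e in blocks]
    top = max(mass)
    return [b for b, m in zip(blocks, mass) if m >= ratio * top]


def extract_content(html):
    """Run the three stages and return the text of the verified blocks."""
    lines, profile = projection_profile(html)
    kept = verify_blocks(identify_blocks(profile), profile)
    return "\n".join(t for s, e in kept for t in lines[s:e + 1] if t)
```

Because the sketch counts characters rather than words, it needs no tokenizer or stop-word list, which mirrors the abstract's point that no language-specific component is required. The fixed thresholds are the weakest part of the sketch and would need tuning per corpus.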

Publisher

Database: Elsevier - ScienceDirect
Journal: Information Sciences - Volume 342, 10 May 2016, Pages 132–149

Authors

Yu-Chieh Wu
