Article code | Journal code | Publication year | English article | Full-text version
425270 | 685710 | 2014 | 16-page PDF | Free download
English title of the ISI article
Temporal representation for mining scientific data provenance
Keywords
Related topics
Engineering and Basic Sciences > Computer Engineering > Computational Theory and Mathematics
English abstract


• We propose a representation of data provenance using logical time that reduces its feature space.
• This temporal representation supports clustering, classification and association rule mining.
• Analysis of the clustering results shows that the k-means algorithm gives the best performance.
• We carry out an evaluation against a multi-gigabyte synthetic provenance dataset.
• We also carry out an evaluation against a real provenance dataset gathered from a satellite instrument.

Provenance of digital scientific data is a distinct piece of metadata about a data object. It can serve as a “ground truth” for determining the cause of an execution failure, for instance, or can explain a particular result to a researcher intending to reuse a data object. Provenance can quickly grow voluminous and feature-rich, requiring new structures and concepts that support data mining. We propose a representation of data provenance using logical time that reduces the feature space of the provenance. The temporal representation supports clustering, classification and association rule mining. This paper studies the full utility of the temporal representation through an empirical evaluation and identifies the data mining algorithms that are most effective when applied to the proposed representation. The evaluation is carried out against a multi-gigabyte semi-synthetic provenance dataset built from a range of scientific workflows, and against a real one-month provenance dataset gathered from a satellite instrument. Through analysis of the results using two clustering metrics, purity and Normalized Mutual Information (NMI), we determine that the k-means algorithm gives the best clustering with the proposed temporal representation, while still yielding provenance-useful information.
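
The two clustering metrics named in the abstract, purity and NMI, can be illustrated with a minimal sketch. This is not the authors' code: the function names are hypothetical, and the NMI normalization used here (mutual information divided by the arithmetic mean of the two entropies) is an assumption, since the abstract does not say which variant the paper uses.

```python
from collections import Counter
from math import log


def purity(clusters, classes):
    """Fraction of points that belong to the majority class of their cluster."""
    joint = Counter(zip(clusters, classes))
    majority = {}
    for (cluster, _), count in joint.items():
        majority[cluster] = max(majority.get(cluster, 0), count)
    return sum(majority.values()) / len(clusters)


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((k / n) * log(k / n) for k in Counter(labels).values())


def nmi(clusters, classes):
    """Mutual information normalized by the mean of the two entropies.

    Assumption: arithmetic-mean normalization; other variants divide by
    the geometric mean or the maximum of the entropies instead.
    """
    n = len(clusters)
    joint = Counter(zip(clusters, classes))
    cluster_sizes = Counter(clusters)
    class_sizes = Counter(classes)
    mi = sum(
        (k / n) * log(k * n / (cluster_sizes[c] * class_sizes[t]))
        for (c, t), k in joint.items()
    )
    denom = (entropy(clusters) + entropy(classes)) / 2
    # Both partitions consist of a single group: treat them as identical.
    return mi / denom if denom > 0 else 1.0


if __name__ == "__main__":
    # Hypothetical labels: cluster ids from an algorithm such as k-means,
    # and ground-truth workflow types for the same six provenance records.
    predicted = [0, 0, 0, 1, 1, 1]
    truth = ["a", "a", "b", "b", "b", "b"]
    print(purity(predicted, truth), nmi(predicted, truth))
```

Purity rewards clusters dominated by a single class but rises trivially as the number of clusters grows, which is one reason it is commonly reported together with NMI.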

Publisher
Database: Elsevier - ScienceDirect
Journal: Future Generation Computer Systems - Volume 36, July 2014, Pages 363–378