Article code | Journal code | Publication year | English article | Full-text version |
---|---|---|---|---|
425270 | 685710 | 2014 | 16-page PDF | Free download |
• We propose a representation of data provenance using logical time that reduces its feature space.
• This temporal representation supports clustering, classification and association rule mining.
• Analysis of the clustering results shows that the k-means algorithm gives the best performance.
• We carry out an evaluation against a multi-gigabyte synthetic provenance dataset.
• We also carry out an evaluation against a real provenance dataset gathered from a satellite instrument.
Provenance of digital scientific data is a distinct piece of metadata about a data object. It can serve, for instance, as a “ground truth” for determining the cause of an execution failure, or it can explain a particular result to a researcher intending to reuse a data object. Provenance can quickly grow voluminous and feature-rich, requiring new structures and concepts that support data mining. We propose a representation of data provenance using logical time that reduces the feature space of the provenance. The temporal representation supports clustering, classification and association rule mining. This paper studies the full utility of the temporal representation through an empirical evaluation and an identification of the data mining algorithms that are most effective when applied to the proposed representation. The evaluation is carried out against a multi-gigabyte semi-synthetic provenance dataset built from a range of scientific workflows, and against a real one-month provenance dataset gathered from a satellite instrument. By analyzing the results with two clustering metrics, purity and Normalized Mutual Information (NMI), we determine that the k-means algorithm gives the best clustering with the proposed temporal representation, while still yielding provenance-useful information.
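The abstract names purity and NMI without giving their formulas. As a rough illustration only, the following is a minimal sketch of the standard textbook definitions, not the authors' evaluation code. It assumes NMI is normalized by the arithmetic mean of the two entropies (other normalizations exist), and the function names and toy labels are hypothetical.

```python
from collections import Counter
from math import log


def purity(labels_true, labels_pred):
    """Fraction of points that belong to the majority true class of their cluster."""
    clusters = {}
    for t, p in zip(labels_true, labels_pred):
        clusters.setdefault(p, Counter())[t] += 1
    majority_total = sum(max(c.values()) for c in clusters.values())
    return majority_total / len(labels_true)


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(labels_true, labels_pred):
    """Mutual information divided by the mean of the two entropies.

    Assumption: arithmetic-mean normalization; the paper may use another variant.
    """
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    mi = sum(
        (c / n) * log(c * n / (count_true[t] * count_pred[p]))
        for (t, p), c in joint.items()
    )
    denom = (entropy(labels_true) + entropy(labels_pred)) / 2
    # Both partitions trivial (one class, one cluster): treat as perfect agreement.
    return mi / denom if denom > 0 else 1.0


if __name__ == "__main__":
    # Hypothetical toy data: true workflow types vs. cluster ids from some algorithm.
    truth = ["wf_a", "wf_a", "wf_b", "wf_b", "wf_c", "wf_c"]
    found = [0, 0, 1, 1, 1, 2]
    print("purity:", purity(truth, found))
    print("NMI:", nmi(truth, found))
```

Both scores lie in [0, 1]. Purity tends to rise as the number of clusters grows, which is one reason it is usually reported together with NMI.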
Journal: Future Generation Computer Systems - Volume 36, July 2014, Pages 363–378