Article ID | Journal ID | Publication year | English article | Full-text version |
---|---|---|---|---|
246743 | 502387 | 2014 | 14-page PDF | Free download |
• A hybrid approach for clustering semantically related project documents is proposed.
• For clustering, dimensionality and similarity threshold are inversely related.
• The tf–idf weighting method yields high precision but only average recall.
• Refining clustering outcome using supervised learning improves accuracy.
• Textual similarities can be used to reveal semantic relations between documents.
Text classifiers, as supervised learning methods, require a comprehensive training set that covers all classes in order to classify new instances. This limits their use for organizing construction project documents, since sufficient samples cannot be guaranteed for every possible document category. To overcome this all-inclusive requirement, an unsupervised learning method was used to cluster documents automatically based on textual similarities. Repeated evaluations on different randomizations of the dataset revealed a region of threshold/dimensionality values that consistently produced high precision and average recall. Accordingly, a hybrid approach was proposed that first uses an unsupervised method to develop core clusters and then trains a text classifier on those core clusters to classify outlier documents in a subsequent refinement step. Evaluation of the hybrid approach demonstrated a significant improvement in recall, resulting in an overall increase in F-measure scores.
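The two-stage idea in the abstract can be illustrated with a minimal sketch. The abstract does not name the clustering algorithm, the classifier, or the dimensionality-reduction step, so everything below is an assumption made for illustration: documents are linked into core clusters when their tf–idf cosine similarity reaches a threshold (single-link grouping), and a nearest-centroid rule stands in for the trained text classifier that assigns outliers in the refinement step. Dimensionality reduction is omitted. The function names, the sample documents, and the threshold value of 0.3 are all hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one unit-length tf-idf vector (as a dict) per document."""
    tokenised = [doc.lower().split() for doc in docs]
    n = len(tokenised)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        vec = {t: tf[t] * math.log(n / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def core_clusters(vectors, threshold):
    """Assumed clustering step: link documents whose similarity >= threshold."""
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    clusters = [g for g in groups.values() if len(g) > 1]
    outliers = [g[0] for g in groups.values() if len(g) == 1]
    return clusters, outliers


def centroid(vectors, members):
    """Unit-length mean direction of the member vectors."""
    total = Counter()
    for i in members:
        for t, w in vectors[i].items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {t: w / norm for t, w in total.items()}


def refine(vectors, clusters, outliers):
    """Assumed refinement step: a nearest-centroid rule stands in for the classifier."""
    refined = [list(c) for c in clusters]
    centroids = [centroid(vectors, c) for c in clusters]
    if not centroids:
        return refined
    for i in outliers:
        best = max(range(len(centroids)),
                   key=lambda k: cosine(vectors[i], centroids[k]))
        refined[best].append(i)
    return refined


# Hypothetical project documents, reduced to a few keywords each.
docs = [
    "concrete slab pour schedule",
    "concrete slab pour inspection",
    "hvac duct installation drawing",
    "hvac duct installation submittal",
    "slab curing notes",
]
vectors = tfidf_vectors(docs)
clusters, outliers = core_clusters(vectors, threshold=0.3)
refined = refine(vectors, clusters, outliers)
print(clusters, outliers, refined)
```

In this toy run the strict threshold leaves the last document unclustered, which mirrors the high-precision, average-recall behaviour described above. The refinement step then attaches it to the cluster it shares vocabulary with, which is the mechanism by which recall would be expected to improve.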
Journal: Automation in Construction - Volume 42, June 2014, Pages 36–49