A fast clustering algorithm based on pruning unnecessary distance computations in DBSCAN for high-dimensional data

Article ID	Journal	Published Year	Pages	File Type
6938705	Pattern Recognition	2018	42 Pages	PDF

Abstract

Clustering is an important technique to deal with large scale data which are explosively created in internet. Most data are high-dimensional with a lot of noise, which brings great challenges to retrieval, classification and understanding. No current existing approach is “optimal” for large scale data. For example, DBSCAN requires O(n2) time, Fast-DBSCAN only works well in 2 dimensions, and Ï-Approximate DBSCAN runs in O(n) expected time which needs dimension D to be a relative small constant for the linear running time to hold. However, we prove theoretically and experimentally that Ï-Approximate DBSCAN degenerates to an O(n2) algorithm in very high dimension such that 2Dâ¯>â¯â¯>â¯n. In this paper, we propose a novel local neighborhood searching technique, and apply it to improve DBSCAN, named as NQ-DBSCAN, such that a large number of unnecessary distance computations can be effectively reduced. Theoretical analysis and experimental results show that NQ-DBSCAN averagely runs in O(n*log(n)) with the help of indexing technique, and the best case is O(n) if proper parameters are used, which makes it suitable for many realtime data.

Keywords

00-01 99-00 DBSCAN