Article ID | Journal | Published Year | Pages | File Type |
---|---|---|---|---|
6938705 | Pattern Recognition | 2018 | 42 Pages |
Abstract
Clustering is an important technique to deal with large scale data which are explosively created in internet. Most data are high-dimensional with a lot of noise, which brings great challenges to retrieval, classification and understanding. No current existing approach is “optimal” for large scale data. For example, DBSCAN requires O(n2) time, Fast-DBSCAN only works well in 2 dimensions, and Ï-Approximate DBSCAN runs in O(n) expected time which needs dimension D to be a relative small constant for the linear running time to hold. However, we prove theoretically and experimentally that Ï-Approximate DBSCAN degenerates to an O(n2) algorithm in very high dimension such that 2Dâ¯>â¯â¯>â¯n. In this paper, we propose a novel local neighborhood searching technique, and apply it to improve DBSCAN, named as NQ-DBSCAN, such that a large number of unnecessary distance computations can be effectively reduced. Theoretical analysis and experimental results show that NQ-DBSCAN averagely runs in O(n*log(n)) with the help of indexing technique, and the best case is O(n) if proper parameters are used, which makes it suitable for many realtime data.
Related Topics
Physical Sciences and Engineering
Computer Science
Computer Vision and Pattern Recognition
Authors
Yewang Chen, Shengyu Tang, Nizar Bouguila, Cheng Wang, Jixiang Du, HaiLin Li,