Spatial Clustering
Problem
The DBSCAN algorithm for density-based spatial clustering by Ester, Kriegel (my PhD supervisor), Sander, and Xu [1] is widely used across disciplines and has been cited more than 25,000 times. A problem of DBSCAN is the superlinear runtime, having to perform an epsilon range query for each database object to identify core objects and expand clusters. This task requires O(n^2) time without index support, and O(n log n ) assuming an index such as an R-Tree that supports range queries in O(log n) (which is a reasonable assumption for low-dimensional data spaces).
I have an idea for a result-sensitive variant of DBSCAN, which may be superlinear only in the number of objects that actually belong to clusters rather than noise. For this purpose, I would like to work with a student to explore using an algorithm to solve the Maximum Range Sum (MaxRS) Problem [2], [3]. The goal of this problem is to find the densest region of the dataspace and can be solved in O(n log n) without index support.
The idea is to iteratively find the densest (unprocessed) area of the dataspace. If the densest area is not dense enough to constitute a DBSCAN cluster, we can terminate and mark all unprocessed points as outliers.
In theory, the resulting algorithm should have a O(K * n log n) runtime, where K is the number of points contained in clusters. In cases where clusters are, thus K << n - without index support.
In addition, there are approximate algorithms for MaxRS that approximately solve MaxRS in O(n log^2 n) time with strong theoretical guarantees [3]. We may be able to leverage these algorithms to obtain an approximate solution for DBSCAN in O(K * n log^2 n).
Funding
No extramural funding is available for this project at this time. But I'd like to explore the ideas above with a motivated PhD student in a rotation project. If successful, I have funding available to fully fund a PhD student (three years, fully funded).
[1] Ester, M., Kriegel, H.P., Sander, J. and Xu, X., 1996, August. A density-based algorithm for discovering clusters in large spatial databases with noise. In kdd (Vol. 96, No. 34, pp. 226-231).
[2] Choi, D.W., Chung, C.W. and Tao, Y., 2012. A scalable algorithm for maximizing range sum in spatial databases. Proceedings of the VLDB Endowment, 5(11), pp.1088-1099.
[3] Tao, Y., Hu, X., Choi, D.W. and Chung, C.W., 2013, August. Approximate MaxRS in spatial databases. In 39th International Conference on Very Large Data Bases 2013, VLDB 2013 (Vol. 6, pp. 1546-1557). International Conference on Very Large Data Bases.