GitHub user james-willis added a comment to the discussion: ST_DBSCAN vs sklearn.DBSCAN: understanding tradeoffs
The execution time of DBSCAN on different implementations/approaches is multifactorial. Specifically, it will be influenced by: * The dataset itself * dataset size * spatial density * spatial patterns in the data (e.g. some severe spatial concentration). In the graphframes connected components algorithm this can lead to a long right tail for execution time vs number of running tasks. In other words you whole cluster is waiting for only a few cores to do work. * The parameters for DBSCAN: * larger values of epsilon result in higher selectivity of the distance join and a larger graph for the connected components calculation * smaller values of min pts means a larger graph for the connected components calculation * cluster characteristics - spark is a multihost solution so the composition of the cluster comes in to play * network speed - the connected components algorithm from graphframes does a lot of shuffles and so network throughput is critical to performance. * host size - for a given cluster size (ie cores and ram) larger hosts will have more [local](https://medium.com/data-engineer/understanding-spark-locality-levels-d4ab14d15be1) tasks which will reduce the amount of network io and speed up shuffle stages. The Sedona implementation of DBSCAN is designed to be performant and robust on large datasets with many core points. It will process datasets that are not feasible on sklearn. It will often not outperform sklearn when the data is smaller, as you've found. sklearn is a single host solution and so saves itself a lot of the overhead that spark has as a distributed compute platform. Among other things that impair its per-core throughput, Spark incurs a lot of overhead when there is data shuffle. But as you've noticed sklearn is memory hungry. As is often the case memory consumption and CPU throughput are directly in tension with each other. Out of the box, graphframes uses the algorithm described [here](https://dl.acm.org/doi/pdf/10.1145/2670979.2670997) to calculate the connected components. In graphx, they use a [message passing approach](https://github.com/apache/spark/blob/v4.0.0/graphx/src/main/scala/org/apache/spark/graphx/lib/ConnectedComponents.scala). In most workloads this connected components calculation will dominate the runtime. When graphframes 0.9.0 releases, you will be able to change which algorithm is being used between these two. See [this PR](https://github.com/graphframes/graphframes/pull/563) I made. Using the graphx algorithm will be faster but more memory hungry and less robust. You can test if this will be a desirable middle ground for your use case. In this write up I mostly focused on the connected components element of DBSCAN. There are characteristics of the data that can make the distance join element slower or faster but since you mention sklearn I assume you are working only with point data and thus are probably getting good performance on that front. GitHub link: https://github.com/apache/sedona/discussions/1965#discussioncomment-13424667 ---- This is an automatically sent email for [email protected]. To unsubscribe, please send an email to: [email protected]
