On Wed, 1 Jun 2016 17:24:47 +0300, Artem Barger wrote:
On Tue, May 31, 2016 at 4:04 PM, Artem Barger <ar...@bargr.net> wrote:

Hi,

Current implementation of kmeans within CM framework, inherently uses
algorithm published by  Arthur, David, and Sergei Vassilvitskii.
"k-means++: The advantages of careful seeding." *Proceedings of the
eighteenth annual ACM-SIAM symposium on Discrete algorithms*. Society for Industrial and Applied Mathematics, 2007. While there other alternative
algorithms for initial seeding is available, for instance:

1. Random initialization (each center picked uniformly at random).
2. Canopy https://en.wikipedia.org/wiki/Canopy_clustering_algorithm
3. Bicriteria  Feldman, Dan, et al. "Bi-criteria linear-time
approximations for generalized k-mean/median/center." *Proceedings of the
twenty-third annual symposium on Computational geometry*. ACM, 2007.

While I understand that kmeans++ is preferable option, others could be
also used for testing, trials and evaluations as well.

I'd like to propose to separate logic of seeding and clustering to
increase flexibility for kmeans clustering. Would be glad to hear your
comments, pros/cons or rejections...


I've found "Scalable KMeans" or kmeans|| as referred in the
http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf, which provides
parallelizable seeding procedure.
​I guess this might serve as additional +1 vote for doing separation
between seeding and LLoyd's iterations in current implementations of kmeans.

I guess that, around here, you are the expert about these algorithms...
So go ahead and (re)write the code as you see fit, while still taking into
account that the code should be self-documenting as much as possible.
And OO (since this is Java).

If you are up for a major refactoring (e.g. for sparse data), I'd suggest to do it in a new package, so that we can easily compare the old and new
codes (e.g. run the tests).

And if you contemplate parallelization, I wonder whether the issue of
switching to Java 8 might not have to be resolved first.


Regards,
Gilles


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

Reply via email to