John,

> Hi, just a general question: when we do hierarchical clustering, should we
> compute the dissimilarity matrix from the scaled or the unscaled dataset?


> daisy() in the cluster package allows standardizing the variables before
> calculating the dissimilarity matrix;

I'd say that depends on your data.

- If your data are all (physically) different kinds of things (and thus on different orders of magnitude), then you should probably scale.

- On the other hand, I cluster spectra, so my variates all share the same unit. Moreover, I'd be afraid that scaling would blow up noise-only variates (the spectra do have regions with low or no intensity), so I usually don't scale.

- It also depends on your distance: e.g. the Mahalanobis distance should do the scaling by itself, if I'm thinking correctly at this time of the day... (a quick sketch follows this list).
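
A minimal sketch of that last point in base R, with made-up data: whitening by the inverse Cholesky factor of the covariance matrix and then calling dist() yields pairwise Mahalanobis distances, and scaling the columns first doesn't change them.

set.seed(1)
x <- matrix(rnorm(50 * 4), ncol = 4)     # hypothetical example data
## Euclidean dist() on whitened data = pairwise Mahalanobis distances
maha_dist <- function(m) dist(m %*% solve(chol(cov(m))))
d_raw    <- maha_dist(x)
d_scaled <- maha_dist(scale(x))          # scale first: same distances
all.equal(c(d_raw), c(d_scaled))         # TRUE (up to rounding)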

What I do frequently, though, is to subtract something like the minimum spectrum (in practice I calculate the 5th percentile for each variate, which is less noisy). You can also center, but I'm strongly in favour of references with a physical meaning, and for my samples the minimum spectrum is easier to interpret: it represents the matrix composition. A short sketch follows.
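
Something along these lines, assuming 'spectra' is a hypothetical matrix with one spectrum per row and one variate (wavelength) per column:

set.seed(2)
spectra  <- matrix(runif(20 * 100), nrow = 20)         # made-up spectra
baseline <- apply(spectra, 2, quantile, probs = 0.05)  # 5th percentile per variate
spectra0 <- sweep(spectra, 2, baseline)                # subtract it from each row
hc <- hclust(dist(spectra0))                           # then cluster as usual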

> but dist() doesn't have that option at all. I'd appreciate it if you could
> share your thoughts.
But you could call scale() and then dist().
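
For example (just a minimal sketch; USArrests is only a convenient built-in all-numeric data set):

x  <- scale(USArrests)     # center and scale each column
hc <- hclust(dist(x))      # Euclidean distance on the standardized data
plot(hc)

## for mixed-type data, daisy() can standardize internally instead:
## library(cluster); d <- daisy(USArrests, stand = TRUE)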

Claudia



> Thanks
>
> John





______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
