John,
> Hi, just a general question: when we do hierarchical clustering, should we
> compute the dissimilarity matrix from the scaled or the non-scaled dataset?
> daisy() in the cluster package allows standardizing the variables before
> calculating the dissimilarity matrix;
I'd say that depends on your data:
- If your data are all (physically) different kinds of things (and thus
  on different orders of magnitude), then you should probably scale.
- On the other hand, I cluster spectra. My variates are thus all in the
  same unit, and moreover I'd be afraid that scaling would blow up
  noise-only variates (the spectra do have regions of low or no
  intensity), so I usually don't scale.
- It also depends on your distance. E.g. Mahalanobis should do the
  scaling by itself, if I'm thinking correctly at this time of the
  day... (see the sketch just below this list).
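
Something like this (my toy example, untested on real data; the
whitening step is exactly why Mahalanobis takes care of the scaling
itself):

## toy data: 100 cases, 5 numeric variates
set.seed(1)
x <- matrix(rnorm(100 * 5), ncol = 5)

## cov(x) = t(R) %*% R, so multiplying by solve(R) whitens the data
## (unit covariance); Euclidean distances on the whitened data are
## then Mahalanobis distances on x
R <- chol(cov(x))
d.mahal <- dist(x %*% solve(R))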
What I do frequently, though, is subtract something like the minimum
spectrum (in practice, I calculate the 5th percentile for each variate -
it's less noisy). You can also center, but I'm strongly in favour of
having a physical meaning, and for my samples the minimum spectrum is
better interpretable (it represents the matrix composition).
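
In code, that preprocessing looks roughly like this (assuming the
spectra are the rows of a matrix spc, one column per wavelength):

## toy stand-in: 50 spectra x 200 wavelengths
spc <- matrix(runif(50 * 200), nrow = 50)

## 5th percentile per variate: a noise-robust "minimum spectrum"
base <- apply(spc, 2, quantile, probs = 0.05)
spc.corr <- sweep(spc, 2, base)   # subtract it from every spectrum
d <- dist(spc.corr)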
> but dist() doesn't have that option at all. I'd appreciate it if you
> could share your thoughts.
But you could call scale() and then dist(), for example:
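
## toy data (note: daisy()'s stand = TRUE standardizes by the mean
## absolute deviation rather than the standard deviation, so the two
## dissimilarities are similar but not identical)
library(cluster)
x <- matrix(rnorm(30 * 4), ncol = 4)

d1 <- dist(scale(x))                # center by mean, scale by sd
d2 <- daisy(x, metric = "euclidean", stand = TRUE)
hc <- hclust(d1, method = "average")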
Claudia
> Thanks
> John
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.