Re: [R] Cluster analysis, factor variables, large data set

Peter Langfelder Thu, 31 Mar 2011 12:23:32 -0700

On Thu, Mar 31, 2011 at 11:48 AM, Hans Ekbrand <h...@sociologi.cjb.net> wrote:
>
> The variables are unordered factors, stored as integers 1:9, where
>
> 1 means "Full-time employment"
> 2 means "Part-time employment"
> 3 means "Student"
> 4 means "Full-time self-employee"
> ...
>
> Do Euclidean distances make sense on unordered factors coded as
> integers?


It probably doesn't. You said you have some 36 observations for each
case, correct? You can turn these 36 observations into a 0/1 vector of
length 36 * 9 on which Euclidean distance does make sense: two cases
that differ in k observations end up at distance sqrt(2*k). For each
observation with value p (p between 1 and 9), create a vector
r = c(0,0,1,0,...,0) of length 9 where the single 1 is in the p-th
component. Hence, if values p1 and p2 are the same, the Euclidean
distance between r1 and r2 is zero; if they are not the same, the
Euclidean distance is sqrt(2).
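
To check the sqrt(2*k) claim numerically, here is a minimal sketch. The
helper name oneHot and the two made-up cases caseA and caseB are only for
illustration; they are not part of the original question.

```r
# Indicator (one-hot) vector for a single category code p in 1..maxVal.
oneHot <- function(p, maxVal) {
  r <- numeric(maxVal)
  r[p] <- 1
  r
}

# Two hypothetical cases with 36 observations each, codes 1..9.
set.seed(1)
caseA <- sample(1:9, 36, replace = TRUE)
caseB <- caseA
changed <- c(2, 5, 11)                        # positions where the cases differ
caseB[changed] <- (caseA[changed] %% 9) + 1   # always a different code

# Concatenate the 36 indicator vectors into one vector of length 36 * 9.
encA <- unlist(lapply(caseA, oneHot, maxVal = 9))
encB <- unlist(lapply(caseB, oneHot, maxVal = 9))

k <- sum(caseA != caseB)              # number of differing observations
d <- sqrt(sum((encA - encB)^2))       # Euclidean distance on the encoding

print(c(k = k, distance = d, expected = sqrt(2 * k)))
```

Each differing observation contributes two mismatched components (one 1
against 0, one 0 against 1), which is where the factor 2 comes from.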

Here's some possible R code:


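# Turn a vector of category codes (integers in 1..maxVal) into one long
# 0/1 indicator vector of length maxVal * length(obsVector).
# Note: this name masks base::transform() for the rest of the session.
transform = function(obsVector, maxVal)
{
  # Identity matrix: column p is the indicator vector for category p.
  templateMat = diag(maxVal);

  # Pick one column per observation and concatenate the columns.
  return(as.vector(templateMat[, obsVector]));
}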

set.seed(10)
n = 4;
m = 5;
max = 4;
data = matrix(sample(c(1:max), n*m, replace = TRUE), m, n);

> data
     [,1] [,2] [,3] [,4]
[1,]    3    3    1    2
[2,]    1    3    3    2
[3,]    3    3    2    4
[4,]    1    2    4    2
[5,]    4    1    4    1


trafoData = apply(data, 2, transform, maxVal = max);

> trafoData
      [,1] [,2] [,3] [,4]
 [1,]    0    0    1    0
 [2,]    0    0    0    1
 [3,]    1    1    0    0
 [4,]    0    0    0    0
 [5,]    1    0    0    0
 [6,]    0    0    0    1
 [7,]    0    1    1    0
 [8,]    0    0    0    0
 [9,]    0    0    0    0
[10,]    0    0    1    0
[11,]    1    1    0    0
[12,]    0    0    0    1
[13,]    1    0    0    0
[14,]    0    1    0    1
[15,]    0    0    0    0
[16,]    0    0    1    0
[17,]    0    1    0    1
[18,]    0    0    0    0
[19,]    0    0    0    0
[20,]    1    0    1    0



The code assumes that cases are in columns and observations in rows of
data. Examine data and trafoData to see how the transformation works.
Once you have the transformed data, simply apply your favorite
clustering method that uses Euclidean distance.
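
As one concrete way to do that last step, here is a minimal sketch using
hierarchical clustering from base R. It rebuilds the toy example so it
runs on its own; the name oneHotCase, the "average" linkage and k = 2 are
arbitrary choices for illustration, not recommendations.

```r
# Same encoding as above, under a hypothetical name that does not mask
# base::transform().
oneHotCase <- function(obsVector, maxVal) as.vector(diag(maxVal)[, obsVector])

# Toy data copied from the example: 5 observations (rows) x 4 cases (columns).
data <- matrix(c(3, 1, 3, 1, 4,
                 3, 3, 3, 2, 1,
                 1, 3, 2, 4, 4,
                 2, 2, 4, 2, 1), nrow = 5, ncol = 4)

trafoData <- apply(data, 2, oneHotCase, maxVal = 4)  # 20 x 4 indicator matrix

# dist() computes distances between rows, so transpose to one row per case.
d <- dist(t(trafoData), method = "euclidean")

# Squared distance / 2 = number of observations in which two cases differ.
nDiff <- as.matrix(d)^2 / 2

# Hierarchical clustering on the Euclidean distances, cut into 2 groups.
hc <- hclust(d, method = "average")
clusters <- cutree(hc, k = 2)
print(nDiff)
print(clusters)
```

With a large number of cases the full distance matrix may not fit in
memory; kmeans() on t(trafoData) is also based on Euclidean distance and
does not need the pairwise matrix.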

HTH,

Peter

>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

