Hi Grant,

Grant Gillis wrote:
> My problem is:
>
> I have a data set for individuals (rows) and values for behaviours
> (columns). I would like to know the proportion of shared behaviours for all
> possible pairs of individuals. The sum of shared behaviours divided by the
> total. There are zeros in the data that I would like treated as the
> behaviour does not exist.
>
> example data format:
>
> ind B1 B2 B3 B4 B5 B6
> w   2  1  5  3  4  4
> x   1  2  3  4  5  6
> y   1  3  5  2  7  6
> z   2  3  2  4  2  6

I hope I understand correctly that the numbers label different behaviours,
so that, for example, individuals 'y' and 'z' have the same level of
behaviour, namely level '3', for the behaviour B2. You may want to look at
R's factors, which allow you to give the levels descriptive names instead
of just numbers.

Let us first make a data frame out of your example:

   t <- data.frame(
      B1 = c(2,1,1,2),
      B2 = c(1,2,3,3),
      B3 = c(5,2,5,3),
      B4 = c(3,4,2,4),
      B5 = c(4,5,7,2),
      B6 = c(4,6,6,6) )
   rownames(t) = c("w","x","y","z")

   > t
     B1 B2 B3 B4 B5 B6
   w  2  1  5  3  4  4
   x  1  2  2  4  5  6
   y  1  3  5  2  7  6
   z  2  3  3  4  2  6

If you now test two rows for equality, the comparison is done element-wise:

   > t["w",] == t["y",]
        B1    B2   B3    B4    B5    B6
   w FALSE FALSE TRUE FALSE FALSE FALSE

You can call 'sum' on this output to get the number of TRUE values:

   > sum( t["w",] == t["y",] )
   [1] 1

As you want to do this for all pairings, we need a nested 'sapply':

   > sapply( rownames(t), function(ind1)
   +    sapply( rownames(t), function(ind2)
   +       sum( t[ind1,] == t[ind2,] ) ) )
     w x y z
   w 6 0 1 1
   x 0 6 2 2
   y 1 2 6 2
   z 1 2 2 6

This table now contains the desired information. Of course, you still have
to divide by the number of behaviours, i.e. by 6, and the format is a bit
different from the one you suggested, but I hope that does not matter.

> Desired output:
>
> w x 0
> w y 0.166667
> w z 0
> x y 0.33333
> x z 0.33333
> etc.

To deal with the missing behaviours, it is better to use 'NA' instead of 0.
R can then help you, because it treats NAs, i.e. values marked as missing,
in a special way. Assume, for example, that you compare the rows

   > r1 <- c( 2, 3, NA, 1, 5 )
   > r2 <- c( 1, 3, 4, NA, 4 )

Calling '==' as above on such data yields:

   > r1==r2
   [1] FALSE  TRUE    NA    NA FALSE

As you can see, every position with a missing behaviour is marked NA,
because it cannot be compared. To get the number of TRUE values, use

   > sum( r1==r2, na.rm=TRUE )
   [1] 1

And to get the number of comparable observations, i.e. those without NA,
use e.g.

   > length( na.omit( r1==r2 ) )
   [1] 3

I hope this helps you to work out your own solution. Otherwise, ask again.

Best
  Simon

+---
| Dr. Simon Anders, Dipl. Phys.
| European Bioinformatics Institute, Hinxton, Cambridgeshire, UK
| preferred (permanent) e-mail: [EMAIL PROTECTED]
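
P.S. In case it helps, here is a minimal sketch of how the pieces above
might be combined to get the long format you asked for. It is only one
possible way, under a few assumptions of mine: zeros are recoded to NA, and
the proportion is taken over the behaviours that are present in both
individuals (divide by ncol(d) instead if you want the total of 6). The
names 'beh', 'shared_prop', 'pairs' and 'result' are made up for this
example; I use 'beh' rather than 't' to avoid confusion with R's transpose
function t().

```r
# Example data as in the message above.
beh <- data.frame(
   B1 = c(2,1,1,2),
   B2 = c(1,2,3,3),
   B3 = c(5,2,5,3),
   B4 = c(3,4,2,4),
   B5 = c(4,5,7,2),
   B6 = c(4,6,6,6) )
rownames(beh) <- c("w","x","y","z")

# Assumption: a zero means "behaviour does not exist", so mark it as missing.
beh[ beh == 0 ] <- NA

# Hypothetical helper: proportion of shared behaviours for one pair of rows,
# counting only the behaviours that are non-NA in both individuals.
shared_prop <- function(d, ind1, ind2) {
   eq <- d[ind1,] == d[ind2,]          # element-wise comparison, NA where missing
   sum( eq, na.rm=TRUE ) / sum( !is.na(eq) )
}

# All unordered pairs of individuals, one pair per row.
pairs <- t( combn( rownames(beh), 2 ) )

# One line of output per pair.
result <- data.frame(
   ind1 = pairs[,1],
   ind2 = pairs[,2],
   prop = mapply( shared_prop, ind1 = pairs[,1], ind2 = pairs[,2],
                  MoreArgs = list( d = beh ), USE.NAMES = FALSE ) )

print( result )
```

For the example data, the 'prop' column should simply be the counts from
the table above divided by 6. Note that this gives 1/6 rather than 0 for
the pair w and z, because both have the value 2 for B1.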