On Mon, Feb 14, 2011 at 12:22 PM, mathijsdevaan <mathijsdev...@gmail.com> wrote: > > Hi, > > I have a large dataset with info on individuals (B) that have been involved > in projects (A) during multiple years (C). The dataset contains three > columns: A, B, C. Example: > > A B C > 1 1 a 1999 > 2 1 b 1999 > 3 1 c 1999 > 4 1 d 1999 > 5 2 c 2001 > 6 2 d 2001 > 7 3 a 2004 > 8 3 c 2004 > 9 3 d 2004 > > I am interested in how well all the individuals in a project know each > other. To calculate this team familiarity measure I want to sum the > familiarity between all individual pairs in a team. The familiarity between > each individual pair in a team is calculated as the summation of each pair's > prior co-appearance in a project divided by the total number of team > members. So the team familiarity in project 3 = (1/4+1/4) + (1/4+1/4+1/2) + > (1/4+1/4+1/2) = 2,5 or a has been in project 1 (of size 4) with c and d > > 1/4+1/4 and c has been in project 1 (of size 4) with 1 and d > 1/4+1/4 and c > has been in project 2 (of size 2) with d > 1/2. > > I think that the best way to do it is to transform the data into an edgelist > (each pair in one row/two columns) and then creating two additional columns > for the strength of the familiarity and the year of the project in which the > pair was active. The problem is that I am stuck already in the first step. > So the question is: how do I go from the current data structure to a list of > projects and the familiarity of its team members? >
First define the data frame, DF. Note we have used column names of proj, pers and year. Then append a size column producing DF2. Using sqldf merge DF2 with itself within project giving one row per pair in M. At the same time we calculate the reciprocal of size for each row or 0 if the two components of the pair are the same person. Next we merge M with itself giving pairs of pairs zeroing out rows that should not contribute to the sum and aggregating the reciprocal sizes by project. The automatic class assignment heuristic does not work well in this case so we use method = "raw" to bypass it. DF <- structure(list(proj = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L), pers = structure(c(1L, 2L, 3L, 4L, 3L, 4L, 1L, 3L, 4L), .Label = c("a", "b", "c", "d"), class = "factor"), year = c(1999L, 1999L, 1999L, 1999L, 2001L, 2001L, 2004L, 2004L, 2004L)), .Names = c("proj", "pers", "year"), row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9"), class = "data.frame") library(sqldf) DF2 <- transform(DF, size = ave(proj, proj, FUN = length)) M <- sqldf("select proj, year, x.pers || ' ' || y.pers pair, (x.pers != y.pers) / (size + 0.0) recip from DF2 x, DF2 y using(proj, year, size)", method = "raw") sqldf("select x.proj, sum((x.year > y.year and x.recip > 0) * y.recip) familiarity from M x, M y using (pair) group by x.proj", method = "raw") The result of the last statement is: proj familiarity 1 1 0.0 2 2 0.5 3 3 2.5 More info on sqldf is available at http://sqldf.googlecode.com The last statement could be replaced by these two in case you want the intermediate PP: PP <- sqldf("select *, (x.year > y.year and x.recip > 0) * y.recip familiarity from M x, M y using (pair)", method = "raw") sqldf("select proj, sum(familiarity) as familiarity from PP group by proj") -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.