This is easier if you read the data into a list instead of a data frame, since the number of values on each row differs. You may be able to modify this to fit your needs. The steps are:
1) read the file with readLines();
2) split each line into a numeric vector (one per line);
3) repeat the first value (the id) once for each brand on the line and build a data.frame with the column names "id" and "brand";
4) use table() to cross-tabulate ids against brands, giving the number of times each brand appears for each id;
5) cluster using the table or, if necessary, convert it to a data frame (this adds an X to the front of each brand number, since data.frame() makes column names syntactically valid and such a name cannot start with a digit).
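Step 5 depends on the clustering method you choose, so the following is only a minimal sketch. It assumes hierarchical clustering with dist() and hclust(), and it converts counts to row proportions with prop.table(); both choices are assumptions, not something the original question specifies. The small data frame is made-up example data in the id/brand layout from step 3, so the sketch runs on its own.

```r
# Hypothetical long-format data: one row per (user id, brand) pair.
a <- data.frame(id    = c(1, 1, 1, 1, 2, 4, 4, 4),
                brand = c(45, 32, 45, 23, 34, 11, 43, 45))

# Users in rows, brands in columns; brands a user never ordered count as 0.
counts <- table(a$id, a$brand)

# Row proportions, so heavy and light buyers are comparable (an assumption;
# raw counts or another scaling may suit your data better).
props <- prop.table(counts, margin = 1)

# Drop the "table" class to get a plain matrix, then cluster users on the
# Euclidean distances between their brand profiles.
m  <- unclass(props)
hc <- hclust(dist(m), method = "average")

# Cut the tree into two groups (the number of groups is arbitrary here).
groups <- cutree(hc, k = 2)
```

plot(hc) draws the dendrogram, which is usually the easiest way to decide how many groups to cut.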
# Paste this at the console; readLines() takes the next three lines as data
dta <- readLines(con=stdin(), n=3)
1 , 45 , 32, 45, 23
2 , 34
4, 11, 43, 45
# Split each line on ", " and convert the pieces to numbers.
# lapply() always returns a list; sapply() would return a matrix
# whenever every line happened to have the same length.
lst <- strsplit(dta, ", ")
lst <- lapply(lst, as.numeric)
# For each line, pair the id (first value) with each of its brands
a <- lapply(seq_along(lst), function(x)
    cbind(rep(lst[[x]][1], length(lst[[x]])-1), lst[[x]][-1]))
a <- data.frame(do.call(rbind, a))
colnames(a) <- c("id", "brand")
# Count how often each brand appears for each id (0 if never)
newdat <- table(a$id, a$brand)
# Optional: convert to a data frame (brand columns become X11, X23, ...)
newdf <- data.frame(unclass(newdat))

-------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77840-4352

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Raphael Bauduin
Sent: Tuesday, November 13, 2012 4:47 AM
To: r-help@r-project.org
Subject: [R] help formatting data for clustering

Hi,

I'm an R beginner. I have data of this form:

user_id, brand_id1, brand_id2, .....

for example:

1 , 45 , 32, 45, 23
2 , 34
4, 11, 43, 45

I'm looking for the right procedure to be able to cluster users. I am especially interested to know which functions to use at each step.

I am currently able to load the data in a data frame, each row's name being the user id.

# extract user brands, i.e. all columns except the first
user_brands <- userclustering[,-1]
# extract user ids, i.e. the first column
user_ids <- userclustering[,1]
# set user ids as row names
row.names(user_brands) <- user_ids

But now I'm stuck replacing the brand ids by a count for each brand the user ordered, all other brand counters being implicitly 0 for that user. Then I'll need to be sure I can use it for clustering (normalising, correct handling of brands absent from a user's list, etc).

Thanks in advance for your help!

Raph

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.