Hi: Is this what you're after?
tq <- with(ds, quantile(freq, seq(0.2, 1, by = 0.2))) ds$int <- with(ds, cut(freq, c(0, tq))) with(ds, table(int)) int (0,1] (1,2] (2,4] (4,7] (7,16] 10 6 7 6 6 HTH, Dennis On Sat, May 14, 2011 at 9:42 AM, gj <gaw...@gmail.com> wrote: > Hi Jim, > Thanks very much for the code. I modified it a bit because I needed to > allocate the track ratings by userid (eg if user 1 plays track x once, he > gets rating 1, user 1 plays track y 100 times, he gets a rating 5) and not > by track (sorry if this wasn't clear in my original post). > > This is almost working! What I can't get right at the moment is the cutoff > interval for the ratings. > Any help please? > > Sample data: > > userid,track,freq > 1,1,1 > 1,2,10 > 1,3,1 > 1,4,1 > > 1,5,15 > 1,6,4 > 1,7,16 > 1,8,6 > 1,9,1 > 1,10,1 > 1,11,2 > 1,12,2 > 1,13,1 > 1,14,6 > 1,15,7 > 1,16,13 > 1,17,3 > 1,18,2 > 1,19,5 > 1,20,2 > 1,21,2 > 1,22,6 > 1,23,4 > 1,24,1 > 1,25,1 > 1,26,16 > 1,27,4 > 1,28,1 > 1,29,4 > 1,30,4 > 1,31,4 > 1,32,1 > 1,33,14 > 1,34,2 > 1,35,7 > > >>tq > > > $'1' > > > 20% 40% 60% 80% 100% > 1 2 4 7 16 > > > >From this distribution I would expect to distribute ratings as follows: > freq: <=1 <=2 <=4 <=7 <=16 > rating: 1 2 3 4 5 > > But my output is (the ratings are correct except at the cut off points): > "userid" "freq" "track" "rating" > 1 1 1 2 > 1 10 2 5 > 1 1 3 2 > 1 1 4 2 > 1 15 5 5 > 1 4 6 4 > 1 16 7 5 > 1 6 8 4 > 1 1 9 2 > 1 1 10 2 > 1 2 11 3 > 1 2 12 3 > 1 1 13 2 > 1 6 14 4 > 1 7 15 5 > 1 13 16 5 > 1 3 17 3 > 1 2 18 3 > 1 5 19 4 > 1 2 20 3 > 1 2 21 3 > 1 6 22 4 > 1 4 23 4 > 1 1 24 2 > 1 1 25 2 > 1 16 26 5 > 1 4 27 4 > 1 1 28 2 > 1 4 29 4 > 1 4 30 4 > 1 4 31 4 > 1 1 32 2 > 1 14 33 5 > 1 2 34 3 > 1 7 35 5 > > This is the code: > # cheers jim holtman > x=read.csv(file="C:\\Data\\lastfm\\ratings\\play_history_3.csv",header=T, > sep=',') > # get the quantiles for each track > tq <- tapply(x$freq,x$userid,quantile,prob=c(0.2,0.4,0.6,0.8,1)) > # create a matrix with the rownames as the tracks to use in the findInterval > tqm <- do.call(rbind, tq) > #now put the ratings > require(data.table) > x.dt <- data.table(x) > x.new <- x.dt[,list(freq = freq,track=track,rating = > findInterval(freq,tqm[as.character(userid[1L]),], rightmost.closed = TRUE) + > 1L),by=userid] > > Regards > Gawesh > > On Sun, May 8, 2011 at 10:42 PM, jim holtman <jholt...@gmail.com> wrote: > >> try this: >> >> > # create some data >> > x <- data.frame(userid = paste('u', rep(1:20, each = 20), sep = '') >> + , track = rep(1:20, 20) >> + , freq = floor(runif(400, 10, 200)) >> + , stringsAsFactors = FALSE >> + ) >> > # get the quantiles for each track >> > tq <- tapply(x$freq, x$track, quantile, prob = c(.2, .4, .6, .8, 1)) >> > # create a matrix with the rownames as the tracks to use in the >> findInterval >> > tqm <- do.call(rbind, tq) >> > # now put the ratings >> > require(data.table) >> > x.dt <- data.table(x) >> > x.new <- x.dt[, >> + list(userid = userid >> + , freq = freq >> + , rating = findInterval(freq >> + # use track as index into >> quantile matrix >> + , tqm[as.character(track[1L]),] >> + , rightmost.closed = TRUE >> + ) + 1L >> + ) >> + , by = track] >> > >> > head(x.new) >> track userid freq rating >> [1,] 1 u1 10 1 >> [2,] 1 u2 15 1 >> [3,] 1 u3 126 4 >> [4,] 1 u4 117 3 >> [5,] 1 u5 76 2 >> [6,] 1 u6 103 3 >> > >> >> >> On Sun, May 8, 2011 at 2:48 PM, gj <gaw...@gmail.com> wrote: >> > Hi, >> > >> > I have a mysql table with fields userid,track,frequency e.g >> > u1,1,10 >> > u1,2,100 >> > u1,3,110 >> > u1,4,200 >> > u1,5,120 >> > u1,6,130 >> > . >> > u2,1,23 >> > . >> > . >> > where "frequency" is the number of times a music track is played by a >> > "userid" >> > >> > I need to turn my 'frequency' table into a rating table (it's for a >> > recommender system). So, for each user, I need to categorise the >> frequency >> > of tracks played by quintile so that each particular track can have 5 >> > ratings (1-5), with the ratings allocated as follows: inter-quintile >> range >> > 100-80% = rating 5, inter-quintile range 80-60% = rating 4, >> > ..., inter-quintile range 20-0% = rating 1) >> > >> > Hence, I need to create a table with fields userid,track,rating: >> > u1,1,1 >> > u1,2, 3 >> > ... >> > >> > Can anybody help me to do this with R? >> > >> > Regards >> > Gawesh >> > >> > [[alternative HTML version deleted]] >> > >> > ______________________________________________ >> > R-help@r-project.org mailing list >> > https://stat.ethz.ch/mailman/listinfo/r-help >> > PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> > and provide commented, minimal, self-contained, reproducible code. >> > >> >> >> >> -- >> Jim Holtman >> Data Munger Guru >> >> What is the problem that you are trying to solve? >> > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.