An easy way is to just offset the quantiles by a small increment so that boundary condition is less likely. If you change the line
tqm <- do.call(rbind, tq) + 0.001 in my example, that should do the trick. On Sat, May 14, 2011 at 6:09 PM, gj <gaw...@gmail.com> wrote: > Hi, > I think I haven't been able to explain correctly what I want. Here another > try: > Given that I have the following input: > userid,track,freq > > 1,1,1 > 1,2,10 > 1,3,1 > 1,4,1 > 1,5,15 > 1,6,4 > 1,7,16 > 1,8,6 > 1,9,1 > 1,10,1 > 1,11,2 > 1,12,2 > 1,13,1 > 1,14,6 > 1,15,7 > 1,16,13 > 1,17,3 > 1,18,2 > 1,19,5 > 1,20,2 > 1,21,2 > 1,22,6 > 1,23,4 > 1,24,1 > 1,25,1 > 1,26,16 > 1,27,4 > 1,28,1 > 1,29,4 > 1,30,4 > 1,31,4 > 1,32,1 > 1,33,14 > 1,34,2 > 1,35,7 > > It is a sample of the history of tracks played: userid,track and frequency. > What I want is to convert the frequency into a rating scale (1-5) based on > the frequency at which a user has played a track, using the following > interquintile ranges for the cfd: > 0%-20% = rating 1, 20%-40% = rating 2, .... ,80%-100%=rating 5 > > Jim kindly provided the following code: > # cheers jim holtman >>x=read.csv(file="C:\\Data\\lastfm\\ratings\\play_history_3.csv",header=T, > sep=',') >># get the quantiles for each user(we want the frequency distribution to be > based on user) >>tq <- tapply(x$freq,x$userid,quantile,prob=c(0.2,0.4,0.6,0.8,1)) >># create a matrix with the rownames as the tracks to use in the > findInterval >>tqm <- do.call(rbind, tq) >>#now put the ratings >>require(data.table) >>x.dt <- data.table(x) >>x.new <- x.dt[,list(freq = freq,track=track,rating = > findInterval(freq,tqm[as.character(userid[1L]),], rightmost.closed = TRUE) + > 1L),by=userid] >>head(x.new) > > > userid freq track rating > [1,] 1 1 1 2 > [2,] 1 10 2 5 > [3,] 1 1 3 2 > [4,] 1 1 4 2 > [5,] 1 15 5 5 > [6,] 1 4 6 4 > > > which is almost what I wanted except that the ratings are 1 point higher for > tracks where the frequency is at the cut-off points in the interquintile > range. > To illustrate the quintiles are: > >> tq$`1` > 20% 40% 60% 80% 100% > 1 2 4 7 16 > > > > So, ideally I want (note the different ratings): > > > userid freq track rating > [1,] 1 1 1 1 > [2,] 1 10 2 5 > [3,] 1 1 3 1 > [4,] 1 1 4 1 > [5,] 1 15 5 5 > [6,] 1 4 6 3 > > > Can anybody help me? I'm new to R (as you have probably guessed). Sorry for > the long explanation. > > Regards > Gawesh > > On Sat, May 14, 2011 at 7:37 PM, Dennis Murphy <djmu...@gmail.com> wrote: > >> Hi: >> >> Is this what you're after? >> >> tq <- with(ds, quantile(freq, seq(0.2, 1, by = 0.2))) >> ds$int <- with(ds, cut(freq, c(0, tq))) >> with(ds, table(int)) >> >> int >> (0,1] (1,2] (2,4] (4,7] (7,16] >> 10 6 7 6 6 >> >> HTH, >> Dennis >> >> On Sat, May 14, 2011 at 9:42 AM, gj <gaw...@gmail.com> wrote: >> > Hi Jim, >> > Thanks very much for the code. I modified it a bit because I needed to >> > allocate the track ratings by userid (eg if user 1 plays track x once, he >> > gets rating 1, user 1 plays track y 100 times, he gets a rating 5) and >> not >> > by track (sorry if this wasn't clear in my original post). >> > >> > This is almost working! What I can't get right at the moment is the >> cutoff >> > interval for the ratings. >> > Any help please? >> > >> > Sample data: >> > >> > userid,track,freq >> > 1,1,1 >> > 1,2,10 >> > 1,3,1 >> > 1,4,1 >> > >> > 1,5,15 >> > 1,6,4 >> > 1,7,16 >> > 1,8,6 >> > 1,9,1 >> > 1,10,1 >> > 1,11,2 >> > 1,12,2 >> > 1,13,1 >> > 1,14,6 >> > 1,15,7 >> > 1,16,13 >> > 1,17,3 >> > 1,18,2 >> > 1,19,5 >> > 1,20,2 >> > 1,21,2 >> > 1,22,6 >> > 1,23,4 >> > 1,24,1 >> > 1,25,1 >> > 1,26,16 >> > 1,27,4 >> > 1,28,1 >> > 1,29,4 >> > 1,30,4 >> > 1,31,4 >> > 1,32,1 >> > 1,33,14 >> > 1,34,2 >> > 1,35,7 >> > >> > >> >>tq >> > >> > >> > $'1' >> > >> > >> > 20% 40% 60% 80% 100% >> > 1 2 4 7 16 >> > >> > >> > >From this distribution I would expect to distribute ratings as follows: >> > freq: <=1 <=2 <=4 <=7 <=16 >> > rating: 1 2 3 4 5 >> > >> > But my output is (the ratings are correct except at the cut off points): >> > "userid" "freq" "track" "rating" >> > 1 1 1 2 >> > 1 10 2 5 >> > 1 1 3 2 >> > 1 1 4 2 >> > 1 15 5 5 >> > 1 4 6 4 >> > 1 16 7 5 >> > 1 6 8 4 >> > 1 1 9 2 >> > 1 1 10 2 >> > 1 2 11 3 >> > 1 2 12 3 >> > 1 1 13 2 >> > 1 6 14 4 >> > 1 7 15 5 >> > 1 13 16 5 >> > 1 3 17 3 >> > 1 2 18 3 >> > 1 5 19 4 >> > 1 2 20 3 >> > 1 2 21 3 >> > 1 6 22 4 >> > 1 4 23 4 >> > 1 1 24 2 >> > 1 1 25 2 >> > 1 16 26 5 >> > 1 4 27 4 >> > 1 1 28 2 >> > 1 4 29 4 >> > 1 4 30 4 >> > 1 4 31 4 >> > 1 1 32 2 >> > 1 14 33 5 >> > 1 2 34 3 >> > 1 7 35 5 >> > >> > This is the code: >> > # cheers jim holtman >> > x=read.csv(file="C:\\Data\\lastfm\\ratings\\play_history_3.csv",header=T, >> > sep=',') >> > # get the quantiles for each track >> > tq <- tapply(x$freq,x$userid,quantile,prob=c(0.2,0.4,0.6,0.8,1)) >> > # create a matrix with the rownames as the tracks to use in the >> findInterval >> > tqm <- do.call(rbind, tq) >> > #now put the ratings >> > require(data.table) >> > x.dt <- data.table(x) >> > x.new <- x.dt[,list(freq = freq,track=track,rating = >> > findInterval(freq,tqm[as.character(userid[1L]),], rightmost.closed = >> TRUE) + >> > 1L),by=userid] >> > >> > Regards >> > Gawesh >> > >> > On Sun, May 8, 2011 at 10:42 PM, jim holtman <jholt...@gmail.com> wrote: >> > >> >> try this: >> >> >> >> > # create some data >> >> > x <- data.frame(userid = paste('u', rep(1:20, each = 20), sep = '') >> >> + , track = rep(1:20, 20) >> >> + , freq = floor(runif(400, 10, 200)) >> >> + , stringsAsFactors = FALSE >> >> + ) >> >> > # get the quantiles for each track >> >> > tq <- tapply(x$freq, x$track, quantile, prob = c(.2, .4, .6, .8, 1)) >> >> > # create a matrix with the rownames as the tracks to use in the >> >> findInterval >> >> > tqm <- do.call(rbind, tq) >> >> > # now put the ratings >> >> > require(data.table) >> >> > x.dt <- data.table(x) >> >> > x.new <- x.dt[, >> >> + list(userid = userid >> >> + , freq = freq >> >> + , rating = findInterval(freq >> >> + # use track as index into >> >> quantile matrix >> >> + , tqm[as.character(track[1L]),] >> >> + , rightmost.closed = TRUE >> >> + ) + 1L >> >> + ) >> >> + , by = track] >> >> > >> >> > head(x.new) >> >> track userid freq rating >> >> [1,] 1 u1 10 1 >> >> [2,] 1 u2 15 1 >> >> [3,] 1 u3 126 4 >> >> [4,] 1 u4 117 3 >> >> [5,] 1 u5 76 2 >> >> [6,] 1 u6 103 3 >> >> > >> >> >> >> >> >> On Sun, May 8, 2011 at 2:48 PM, gj <gaw...@gmail.com> wrote: >> >> > Hi, >> >> > >> >> > I have a mysql table with fields userid,track,frequency e.g >> >> > u1,1,10 >> >> > u1,2,100 >> >> > u1,3,110 >> >> > u1,4,200 >> >> > u1,5,120 >> >> > u1,6,130 >> >> > . >> >> > u2,1,23 >> >> > . >> >> > . >> >> > where "frequency" is the number of times a music track is played by a >> >> > "userid" >> >> > >> >> > I need to turn my 'frequency' table into a rating table (it's for a >> >> > recommender system). So, for each user, I need to categorise the >> >> frequency >> >> > of tracks played by quintile so that each particular track can have 5 >> >> > ratings (1-5), with the ratings allocated as follows: inter-quintile >> >> range >> >> > 100-80% = rating 5, inter-quintile range 80-60% = rating 4, >> >> > ..., inter-quintile range 20-0% = rating 1) >> >> > >> >> > Hence, I need to create a table with fields userid,track,rating: >> >> > u1,1,1 >> >> > u1,2, 3 >> >> > ... >> >> > >> >> > Can anybody help me to do this with R? >> >> > >> >> > Regards >> >> > Gawesh >> >> > >> >> > [[alternative HTML version deleted]] >> >> > >> >> > ______________________________________________ >> >> > R-help@r-project.org mailing list >> >> > https://stat.ethz.ch/mailman/listinfo/r-help >> >> > PLEASE do read the posting guide >> >> http://www.R-project.org/posting-guide.html >> >> > and provide commented, minimal, self-contained, reproducible code. >> >> > >> >> >> >> >> >> >> >> -- >> >> Jim Holtman >> >> Data Munger Guru >> >> >> >> What is the problem that you are trying to solve? >> >> >> > >> > [[alternative HTML version deleted]] >> > >> > ______________________________________________ >> > R-help@r-project.org mailing list >> > https://stat.ethz.ch/mailman/listinfo/r-help >> > PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> > and provide commented, minimal, self-contained, reproducible code. >> > >> > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > -- Jim Holtman Data Munger Guru What is the problem that you are trying to solve? ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.