> * David Winsemius <qjvafrz...@pbzpnfg.arg> [2012-09-06 10:30:16 -0700]:
>
>> these are the results of applying a model to the test data.
>> the first column is the ID
>
> In which case you should be using the 'by' argument to `merge`

I already do! See my initial message!

>> 3. sort by the sum/mean of the V3 columns and evaluate the combined
>> model using the lift quality metric
>> (http://dl.acm.org/citation.cfm?id=380995.381018)
>
> That's going to require more background (or more money since they want $15.00
> for a pdf. :-)

That I have already implemented, and it works just fine:

proficiency <- function (actual, prediction) {
  proficiency1(ea = entropy(table(actual)),
               ep = entropy(table(prediction)),
               ej = entropy(table(actual,prediction)))
}

proficiency1 <- function (ea, ep, ej) {
  mi <- ea + ep - ej
  list(joint = ej, actual = ea, prediction = ep, mutual = mi,
       proficiency = mi / ea, dependency = mi / ej)
}

detector.statistics <- function (tp,fn,fp,tn) {
  observationCount <- tp + fn + fp + tn
  predictedPositive <- tp + fp
  predictedNegative <- fn + tn
  actualPositive <- tp + fn
  actualNegative <- fp + tn
  correct <- tp + tn
  list(baseRate = actualPositive / observationCount,
       precision = if (tp == 0) 0 else tp / predictedPositive,
       specificity = if (tn == 0) 0 else tn / actualNegative,
       recall = if (tp == 0) 0 else tp / actualPositive,
       accuracy = correct / observationCount,
       lift = (tp * observationCount) / (predictedPositive * actualPositive),
       f1score = if (tp == 0) 0 else 2 * tp / (2 * tp + fp + fn),
       proficiency = proficiency1(ej = entropy(c(tp,fn,fp,tn)),
         ea = entropy(c(actualPositive,actualNegative)),
         ep = entropy(c(predictedPositive,predictedNegative))))
}

## v should be vector of 0&1 sorted according to some model
## Gregory Piatetsky-Shapiro, Samuel Steingold
## "Measuring Lift Quality in Database Marketing"
## http://sds.podval.org/data/l-quality.pdf
## http://www.sigkdd.org/explorations/issues/2-2-2000-12/piatetsky-shapiro.pdf
## SIGKDD Explorations, Vol. 2:2, (2000), 81-86
## tests: lift.quality(rbinom(10000,size=1,prob=0.1)) ==> ~0
##        lift.quality(rev(round((1:10000)/12000))) ==> 1
lift.quality <- function (v, plot = TRUE, file = NULL, main = "lift curve",
                          thresholds = NULL) {
  target.count <- sum(v)
  total.count <- length(v)
  base.rate <- target.count / total.count
  target.level <- cumsum(v)/target.count
  lq <- ((2*sum(target.level) - 1)/total.count - 1) / (1 - base.rate)
  if (plot) {
    if (!is.null(file)) {
      pdf(file = file)
      on.exit(dev.off())
    }
    plot(x=(1:total.count)/total.count,y=target.level,type="l",
         main=paste(main,"( lift quality ",lq,")"),
         xlab="% cutoff",ylab="cumulative % hit")
  }
  if (is.null(thresholds))
    thresholds = c(base.rate)
  list(lift.quality = lq,
       detector.statistics = sapply(thresholds, function (l) {
         cutoff <- round(l * total.count)
         tp <- round(target.level[cutoff] * target.count) # = sum(v[1:cutoff])
         fn <- target.count - tp
         fp <- cutoff - tp
         tn <- total.count - target.count - cutoff + tp
         detector.statistics(tp, fn, fp, tn)
       }))
}

>> I have many more score files (not just 4), so it is not practical for me
>> to rename the column to something unique.
>
> Which column?

The 3rd ("score") column.

Meanwhile I realised that the fastest way is actually the shell:
sort+cut+paste produced a csv file which can be loaded into R much
faster than the individual score files, so this issue is now purely
academic.

However, I appreciate the replies I got so far; it was quite
educational, thanks!
(I would also appreciate comments on the code above.)

-- 
Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X 11.0.11103000
http://www.childpsy.net/ http://www.memritv.org http://truepeace.org
http://openvotingconsortium.org http://ffii.org http://mideasttruth.com
Save your burned out bulbs for me, I'm building my own dark room.
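
The code in this message calls entropy() without showing where it comes from. Below is a minimal sketch of one possible definition, assuming a plug-in Shannon entropy (in nats) over a vector or table of counts. That definition is an assumption made for illustration and is not necessarily the function actually used. proficiency() and proficiency1() are repeated from the message so that the block runs on its own.

```r
## ASSUMPTION: hypothetical stand-in for the entropy() used in the message.
## Plug-in (maximum likelihood) Shannon entropy, in nats, of counts.
entropy <- function (counts) {
  p <- as.vector(counts) / sum(counts)
  p <- p[p > 0]                      # treat 0 * log(0) as 0
  -sum(p * log(p))
}

## Repeated from the message so the block is self-contained.
proficiency1 <- function (ea, ep, ej) {
  mi <- ea + ep - ej                 # mutual information
  list(joint = ej, actual = ea, prediction = ep, mutual = mi,
       proficiency = mi / ea, dependency = mi / ej)
}

proficiency <- function (actual, prediction) {
  proficiency1(ea = entropy(table(actual)),
               ep = entropy(table(prediction)),
               ej = entropy(table(actual, prediction)))
}

## Made-up example data: three positives out of eight observations.
actual   <- c(1, 1, 0, 0, 1, 0, 0, 0)
perfect  <- actual                   # prediction identical to the truth
constant <- rep(0, length(actual))   # prediction carrying no information

p.perfect  <- proficiency(actual, perfect)$proficiency
p.constant <- proficiency(actual, constant)$proficiency
```

Under this definition a prediction identical to the truth should give a proficiency of 1, and a constant prediction should give 0, since it shares no information with the actual labels.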
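
For comparison with the shell route, the in-R combination discussed in this thread could look roughly like the following. This is a sketch only: the three data frames stand in for score files, the column layout (V1 = ID, V2 = 0/1 target, V3 = score) is assumed beyond what the thread states about V1 and V3, and the score.N column names are made up for illustration.

```r
## ASSUMPTION: hypothetical in-memory stand-ins for three score files,
## with columns V1 = ID, V2 = 0/1 target, V3 = model score.
scores <- list(
  data.frame(V1 = c("a", "b", "c"), V2 = c(1, 0, 1), V3 = c(0.9, 0.2, 0.7)),
  data.frame(V1 = c("c", "a", "b"), V2 = c(1, 0, 1)[c(3, 1, 2)], V3 = c(0.6, 0.8, 0.1)),
  data.frame(V1 = c("b", "c", "a"), V2 = c(1, 0, 1)[c(2, 3, 1)], V3 = c(0.3, 0.5, 1.0)))

## Rename the score column programmatically, so that no manual renaming
## is needed however many files there are.
for (i in seq_along(scores))
  names(scores[[i]])[3] <- paste0("score.", i)

## Merge on the ID column, keeping only ID and score from files 2..n.
combined <- Reduce(function (x, y) merge(x, y[, c(1, 3)], by = "V1"), scores)

## Average the scores and sort with the best-scored observations first.
score.columns <- grep("^score\\.", names(combined))
combined$mean.score <- rowMeans(combined[, score.columns])
combined <- combined[order(combined$mean.score, decreasing = TRUE), ]
```

After sorting, combined$V2 is a 0/1 vector ordered by the combined model, which is the form of input that lift.quality expects.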
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.