[R] code review: is it too much to ask?

Giovanni Azua Sun, 23 Oct 2011 14:05:39 -0700

Hello all,

I really appreciate how helpful the people on this list are. Would it be too 
much to ask to send a small script to have it peer-reviewed, to make sure I am 
not making blatant mistakes? The script takes an experiment.dat as input and 
plots the system throughput using ggplot2. It works now ... [sigh] but I have 
this nasty feeling that I might be doing something wrong :). Changing "samples", 
i.e. the number of samples per group, produces arbitrarily different results; I 
basically increased it (up to 9) until there were no strongly deterministic 
periodicities. This is not a full-fledged experiment, just a preliminary 
report that will show I have implemented a healthy system. Proper experimental 
analysis comes after varying factors according to the 2^k*r experimental design, 
etc.


Some key points I would like to find out:
- Whether aggregation breaks the natural order of the measurements, i.e. if 
there are 20 runtimes taken in that order and I make groups of 10 measurements 
(to compute statistics on them), the first group must contain the first 10 
runtimes and the second group must contain the second 10 runtimes. I am not 
sure whether my choice of aggregation etc. respects this.
- Whether it is best to fill the bins by time intervals or by number of 
observations.
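
To make the two questions above concrete, here is a minimal sketch in base R. 
The runtimes, the timestamps, and the names `n_per_bin` and `width` are made up 
for illustration and are not taken from experiment.dat. It bins the same 
ordered measurements once by number of observations and once by time interval, 
and checks that the first bin really holds the first 10 runtimes.

```r
# Hypothetical toy data: 20 runtimes with assumed timestamps (3 s apart).
runtime <- c(12, 15, 11, 14, 13, 16, 12, 15, 14, 13,
             22, 25, 21, 24, 23, 26, 22, 25, 24, 23)
time    <- seq(0, by = 3, length.out = length(runtime))

# Sort by time first, so that "natural order" is explicit rather than assumed.
ord     <- order(time)
runtime <- runtime[ord]
time    <- time[ord]

# Binning by number of observations: consecutive blocks of n_per_bin rows.
n_per_bin <- 10
count_bin <- (seq_along(runtime) - 1) %/% n_per_bin + 1

# Binning by time interval: every `width` seconds forms one bin.
width    <- 30
time_bin <- time %/% width + 1

# aggregate() returns one row per bin, sorted by the bin index.
by_count <- aggregate(runtime, by = list(bin = count_bin), FUN = mean)
by_time  <- aggregate(runtime, by = list(bin = time_bin),  FUN = mean)

# The first count-based bin must contain exactly the first 10 runtimes.
stopifnot(identical(runtime[count_bin == 1], runtime[1:10]))
print(by_count)
print(by_time)
```

With evenly spaced timestamps the two schemes agree; with irregular arrival 
times, time-based bins can hold different numbers of observations, which 
changes the standard error of each bin.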

Your help will be greatly appreciated!

I have the data too, and the plots look very nice, but it is a 4 MB file.

TIA
Best regards,
Giovanni

# =========================================================================================
# Advanced Systems Lab 
# Milestone 1
# Author: Giovanni Azua
# Date: 22 October 2011
# =========================================================================================

rm(list=ls())                                                        # clear workspace

library(boot)                                                        # use boot library
library(ggplot2)                                                     # use ggplot2 library
library(doBy)                                                        # use doBy library

# =========================================================================================
# ETL Step
# =========================================================================================

data_file <- file("/Users/bravegag/code/asl11/trunk/report/experiment.dat")
df <- read.table(data_file)                                          # reads the data as data frame
class(df)                                                            # shows the class, 'data.frame'
names(df)                                                            # data is prepared correctly in Python
str(df)
head(df)

names(df)[names(df)=="V1"] <- "Time"                                 # change column names
names(df)[names(df)=="V2"] <- "Partitioning"
names(df)[names(df)=="V3"] <- "Workload"
names(df)[names(df)=="V4"] <- "Runtime"
str(df)
head(df)

# =========================================================================================
# Define utility functions
# =========================================================================================

se <- function(x) sqrt(var(x)/length(x))                             # standard error of the mean
sst <- function(x) sum((x-mean(x))^2)                                # total sum of squares

## ************************************ COPIED FROM ********************************************
## http://wiki.stdout.org/rcookbook/Graphs/Plotting%20means%20and%20error%20bars%20%28ggplot2%29
## *********************************************************************************************
## Summarizes data.
## Gives count, mean, standard deviation, standard error of the mean, and
## confidence interval (default 95%).
## If there are within-subject variables, calculate adjusted values using
## method from Morey (2008).
##   data: a data frame.
##   measurevar: the name of a column that contains the variable to be summarized
##   groupvars: a vector containing names of columns that contain grouping variables
##   na.rm: a boolean that indicates whether to ignore NA's
##   conf.interval: the percent range of the confidence interval (default is 95%)
summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE, 
conf.interval=.95) {
    require(doBy)

    # New version of length which can handle NA's: if na.rm==T, don't count them
    length2 <- function (x, na.rm=FALSE) {
        if (na.rm) sum(!is.na(x))
        else       length(x)
    }

    # Collapse the data
    formula <- as.formula(paste(measurevar, paste(groupvars, collapse=" + "), 
sep=" ~ "))
    datac <- summaryBy(formula, data=data, FUN=c(length2,mean,sd), na.rm=na.rm)

    # Rename columns
    names(datac)[ names(datac) == paste(measurevar, ".mean", sep="") ] <- 
measurevar
    names(datac)[ names(datac) == paste(measurevar, ".sd", sep="") ] <- "sd"
    names(datac)[ names(datac) == paste(measurevar, ".length2", sep="") ] <- "N"
    
    datac$se <- datac$sd / sqrt(datac$N)  # Calculate standard error of the mean
    
    # Confidence interval multiplier for standard error
    # Calculate t-statistic for confidence interval: 
    # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
    ciMult <- qt(conf.interval/2 + .5, datac$N-1)
    datac$ci <- datac$se * ciMult
    
    return(datac)
}

# =========================================================================================
# Prepare the Throughput data
# =========================================================================================

throughput <- aggregate(x=df$Runtime, by=list(df$Time,df$Partitioning), 
FUN=length)
head(throughput)
names(throughput)[names(throughput)=="Group.1"] <- "Time"            # change column names
names(throughput)[names(throughput)=="Group.2"] <- "Partitioning"
names(throughput)[names(throughput)=="x"] <- "Y"
head(throughput)

samples <- 9
throughput$Time_group <- floor(throughput$Time/samples) + 1          # generate Time groups of "samples"

dfc <- summarySE(throughput, measurevar="Y", groupvars=c("Time_group", 
"Partitioning"))
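last <- nrow(dfc)                                                    # number of summary rows
dfc <- dfc[c(-1,-2,-(last-1),-last),]                                # drop the first two and last two rows
dfc$Time <- dfc$Time_group - min(dfc$Time_group) + 1                 # re-base the group index to start at 1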
head(dfc)

# mu + se error bar
ggplot(dfc, aes(x=Time, y=Y, colour=Partitioning, group=Partitioning)) + 
    geom_point(fill="white", size=3) +
    geom_line() + geom_errorbar(aes(ymin=Y-se, ymax=Y+se), width=.5) + 
    theme_bw() +
    xlab("Minutes") + ylab("Throughput (Requests per Minute)") + 
    scale_y_continuous(breaks=seq(0, max(dfc$Y + dfc$se), 50), 
                       limits=c(0, max(dfc$Y + dfc$se))) + 
    ggtitle("System Throughput\n2x Clients 2x Middlewares 2x Databases") + 
    scale_x_continuous(breaks=0:length(dfc$Y), 
                       labels=as.character(0:length(dfc$Y)*samples))
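
As a sanity check on the per-group mean and standard error that the plot 
relies on, the same quantities can be computed with base R alone and compared 
against the summarySE output. The sketch below is only an illustration: the 
counts, the `minute` index, and `bin_size` are hypothetical, and a bin size of 
3 is used instead of the script's `samples` value to keep the numbers easy to 
follow.

```r
# Made-up per-minute request counts for a single partitioning (hypothetical).
counts <- c(100, 104, 96, 110, 114, 106, 90, 94, 86)
minute <- 0:8

bin_size <- 3                                # assumed bin size for this example
group    <- minute %/% bin_size + 1          # same rule as floor(Time/samples) + 1

se <- function(x) sqrt(var(x) / length(x))   # standard error of the mean

grp_mean <- tapply(counts, group, mean)      # mean per bin
grp_se   <- tapply(counts, group, se)        # standard error per bin
grp_n    <- tapply(counts, group, length)    # observations per bin

# 95% confidence half-width from the t distribution with N - 1 degrees of freedom.
grp_ci <- grp_se * qt(0.975, df = grp_n - 1)

summary_tbl <- data.frame(group = as.integer(names(grp_mean)),
                          N     = as.vector(grp_n),
                          mean  = as.vector(grp_mean),
                          se    = as.vector(grp_se),
                          ci    = as.vector(grp_ci))
print(summary_tbl)
```

If the `N` column is not constant across bins, the error bars are not 
comparable from bin to bin, which may be one source of the sensitivity to the 
bin size.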

# =========================================================================================
# Prepare the Response Time data
# =========================================================================================





______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
