Hello, I have simulation results in the form of

Time                V                   I
0.000000000000e+000 7.218354344368e-001 5.224478627497e-006
1.000000000000e-009 7.218354344368e-001 5.224477718002e-006
2.000000000000e-009 7.218354344368e-001 5.224477718002e-006
4.000108361244e-009 7.218354344368e-001 5.224478627497e-006
8.000325083733e-009 7.218354344368e-001 5.224478627497e-006

As the time steps are small, each simulation produces a lot of data,
about 1e5 data points per simulation.

Now I want to plot this data. If I do this with a simple

plot(x=data$Time, y=data$V, type="l")

the resulting file (I plot into PostScript files) is huge and takes a
long time to render, since R creates a new line segment for each time
step.

Of course it makes no sense to plot more than a few hundred data points
in a single plot. However, I don't have a good idea how to remove the
"uninteresting" part of the data, i.e., the data points that lie very
close to the lines that R would draw anyway if there were no data point
for that time value.

Since the values in my simulation are constant most of the time but
sometimes have interesting "spikes", a simple

data <- data[seq(1, nrow(data), by=1000),]

to plot only every 1000th point does not work for me, as it could remove
some "spikes" completely or lead to aliasing problems.

Is there any standard way to do this in R?

The best thing I have come up with so far is a function that decides
whether a row in the data frame should be kept for plotting, based on
each point's difference from its predecessor. However, this function has
two problems:

* It is very slow! (It takes about 4 seconds for each data frame with
  1e5 rows.)
* It does not work well if the values increase or decrease monotonically
  in small steps: it will remove them all, since the difference between
  each point and its predecessor is minimal.

I have included my own function below:

=== cut ===
get_significant_rows_1 <- function (data, threshold) {
  # get the difference between each datapoint and the following datapoint
  # of course this list is one shorter than the input dataset, which does
  # not matter since the first and last datapoint will always be included
  # (note the parentheses: 1:nrow(data)-1 would be parsed as (1:nrow(data))-1)
  diffs <- abs(data[1:(nrow(data)-1),] - data[2:nrow(data),])

  # normalize the differences according to the value range in their column
  col.range <- apply(data, 2, function(d) {abs(max(d) - min(d))})
  normalized_diffs <- t(apply(diffs, 1, function(d) {d/col.range}))
  rm("col.range")

  # get the "biggest difference" in each row
  biggest_difference <- as.vector(apply(normalized_diffs, 1, max))

  # check if the "biggest difference" is above the threshold -
  # that means the row is "significant" in a plot
  signif <- biggest_difference >= threshold
  rm("biggest_difference")

  # the last datapoint/row is always significant, otherwise the plot
  # could become "shorter"
  signif[length(signif)] <- TRUE

  # also the first one - we are adding a TRUE in front of the signif vector
  # now, since it does not include a value for this because the first value
  # naturally doesn't have a predecessor, so there was no entry for it in
  # the diffs array
  signif <- append(signif, TRUE, 0)

  # if a point is significant in a plot, the point before that is also
  # "important", at least for line plots, otherwise we get angled lines
  # where flat ones should be
  signif <- (signif | append(signif[2:length(signif)], FALSE))

  return(data[signif,])
}

# example application (makes no sense for this kind of data though)
data <- data.frame(a=rnorm(10000), b=rnorm(10000))
# dataset, threshold
get_significant_rows_1(data, 0.01)
==== here ====

Thank you for any helpful advice or
comments. :-)

Regards,
Timo

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
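
For reference, the predecessor-difference test described in the post can probably be made much faster by avoiding the row-wise apply() calls. The following is only a sketch under some assumptions: the name get_significant_rows_2 is hypothetical, all columns are assumed to be numeric, and constant columns are given a range of 1 to avoid division by zero (the function in the post does not handle that case).

```r
# Hypothetical vectorized variant of the predecessor-difference test.
# Assumes every column of `data` is numeric.
get_significant_rows_2 <- function(data, threshold) {
  m <- as.matrix(data)
  n <- nrow(m)
  if (n < 3) return(data)

  # value range of each column, used to normalize the differences
  col_range <- apply(m, 2, function(d) diff(range(d)))
  col_range[col_range == 0] <- 1   # assumption: treat constant columns as range 1

  # absolute difference between each row and its predecessor (n - 1 rows),
  # with every column divided by its range
  d <- sweep(abs(diff(m)), 2, col_range, "/")

  # largest normalized difference per row, computed column by column
  biggest <- d[, 1]
  for (j in seq_len(ncol(d))[-1]) biggest <- pmax(biggest, d[, j])

  # the first and last rows are always kept
  keep <- c(TRUE, biggest >= threshold)
  keep[n] <- TRUE

  # also keep the row just before each kept row, so flat segments stay flat
  keep <- keep | c(keep[-1], FALSE)

  data[keep, , drop = FALSE]
}

# small example: a flat signal with a single spike at row 500
df  <- data.frame(Time = 1:1000, V = c(rep(0, 499), 5, rep(0, 500)))
out <- get_significant_rows_2(df, 0.01)
```

This keeps the same selection rule, so it shares the second limitation mentioned in the post: a slow monotonic ramp whose individual steps stay below the threshold is still removed.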