It is a nice surprise to wake up receiving three answers, all producing correct results. Many thanks to all of you.

Jim Holtman solved it with amazing clarity. Gang Peng using a traditioanl C-like pointer style and Arun with awesome tight code thanks to diff().

I am embrassed to see my mis-spellings inherited in the answers ('lenths' should be 'lengths' and 'valida' should be 'valid'). This experience is to behove me to not to code in midnight again.

For anyone wishing to test these methods, I have compiled them all into one R script file, pasted at the end of this email.

Jim Holtman asked me to elaborate the problem:

    It is a common problem in reading sparse variable-lenght record data
    file.  Records are stored in file one next to another. The length of
    each record is known in advance, but a lot of them records are invalid,
    and should be skipped to make efficient use of memory.

    Ideally the datafile-reading routine should receive a skip-table. Before
    reading each wanted/valid record, it seeks forward for the distance
    given in the skip-table. The problem is how to obtain such a skip table.

    What we have at hand to produce the skip table, is a set of two data
    frames: a record.lengths data frame about each record's length, and a
    valid.records data frame about which records are significant and should
    be read.

--

###### input data:

record.lengths <- read.table(text = "    NR     length
         1       100
         2       130
         3       150
         4       148
         5       100
         6        83
         7        60", header = TRUE)

valid.records <- read.table(text = "  NR     factor
         1       3
         2       4
         4       8
         7       9", header = TRUE)

####### Jim Holtman's method:

x <- merge(record.length, valid.records, by = "NR", all.x = TRUE)
x$seq <- cumsum(!is.na(x$factor))

# need to add 1 to lines with NA to associate with next group
x$seq[is.na(x$factor)] <- x$seq[is.na(x$factor)] + 1

# split by 'seq', output last record and sum of preceeding records
skip.table <- do.call(rbind
     , lapply(split(x, x$seq), function(.sk){
         if (nrow(.sk) > 1) .sk$skip <- sum(.sk$length[1:(nrow(.sk) - 1L)])
         else .sk$skip <- 0
         .sk[nrow(.sk), ] # return first value
         })
     )

print(skip.table)


####### Gang Peng's method:

n.record <- length(record.lengths$NR)
index    <- record.lengths$NR %in% valid.records$NR
tmp <- 1:n.record
ind <- tmp[index]
st  <- 1
skip <- rep(0,length(ind))
for (i in 1:length(ind)) {
        if(st<ind[i]){
                skip[i]<-sum(record.lengths$length[st:(ind[i]-1)])
        }
        st <- ind[i]+1
}
print(cbind(valid.records,skip))

####### Arun's method:
indx<-diff(valid.records$NR)-1
skip.table<- within(valid.records, {skip<-
with(record.lengths,tapply(length,c(-1,rep(indx,indx+1)),function(x)
sum(x[-length(x)])))})[,c(1,3,2)]
print(skip.table)

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to