Thank you, Martin! This code is amazing! So fast! Exactly what I was looking for!

Parsing ~8M lines (~600 MB file size) took about 45 s on a 3.4 GHz Xeon (8 GB).

Thank you so much!

Sincerely,
Paolo


Martin Morgan wrote:
Paolo Sonego <[EMAIL PROTECTED]> writes:

I apologize for giving wrong information again ... :-[
The number of files is not a problem (30-40). The real issue is that
some of my files have ~10^6 lines (file size ~300-400 MB) :'(
Thanks again for your help and advice!

If memory is not an issue, then this might be reasonably performant...

process_chunk <- function(txt, rec_sep, keys)
{
    ## filter
    keep_regex <- paste("^(",
                        paste(rec_sep, keys, sep="|", collapse="|"),
                        ")", sep="")
    txt <- txt[grep(keep_regex, txt)]

    ## construct key/value pairs
    splt <- strsplit(txt, "\\W+")
    val <- unlist(lapply(splt, "[", 2))
    names(val) <- unlist(lapply(splt, "[", 1))

    ## break key/value into records
    ends <- c(grep(rec_sep, txt), length(txt))
    grps <- rep(seq_along(ends), c(ends[1], diff(ends)))
    recs <- split(val, grps)

    ## reformat as matrix
    sapply(keys, function(key, recs) {
        res <- sapply(recs, "[", key)
        names(res) <- NULL
        res
    }, recs=recs)
}
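
The grouping step is probably the least obvious part, so here is a toy sketch
of what it does (the numbers are purely illustrative, not from your data):

## toy illustration of the grouping step: with one record separator at line 3
## of a 7-line filtered chunk, 'ends' would be c(3, 7) and every line gets a
## record number
ends <- c(3, 7)
grps <- rep(seq_along(ends), c(ends[1], diff(ends)))
grps  # 1 1 1 2 2 2 2 -- lines 1-3 form record 1, lines 4-7 form record 2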

rec <- "//"
keys <- c("x", "y", "z", "w", "id1", "id2")
process_chunk(readLines("/tmp/tmp.txt"), rec, keys)
     x           y           z           w           id1
[1,] "x_string"  "y_string"  "z_string"  "w_string"  "id1_string"
[2,] "x_string1" "y_string1" "z_string1" "w_string1" NA
[3,] "x_string2" "y_string2" "z_string2" "w_string2" "id1_string1"
     id2
[1,] "id2_string"
[2,] NA
[3,] "id2_string1"
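
For reference, the input format assumed here (inferred from the keys and the
output above, so adjust if your files differ) is one "key value" pair per line,
with records terminated by the separator; /tmp/tmp.txt would then look
something like

x x_string
y y_string
z z_string
w w_string
id1 id1_string
id2 id2_string
//
x x_string1
y y_string1
z z_string1
w w_string1
//
...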

This took about 130 s and no more than 250 MB of memory to process your data
replicated to about 5M lines (~80 MB file size).

I haven't really tested the following, but this might also be useful
for processing in chunks

process <- function(filename, rec_sep="//",
                    keys=c("x", "y", "z", "w", "id1", "id2"),
                    chunk_size = 10^6)
{
    result <- NULL
    resid <- character(0)
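    ## 'resid' buffers the lines that follow the last record separator in a
    ## chunk; they are prepended to the next chunk so a record that is split
    ## across chunk boundaries is still parsed as a single record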
    con <- file(filename, "r")
    while(length(txt <- readLines(con, chunk_size)) != 0) {
        recs <- grep(rec_sep, txt)
        if (length(recs) > 0) {
            maxrec <- max(recs)
            if (maxrec == length(txt)) buf <- character(0)
            else buf <- txt[(maxrec+1):length(txt)]
            txt <- c(resid, txt[-(maxrec:length(txt))])
            resid <- buf
        } else {
            txt <- c(resid, txt)
            resid <- character(0)
        }
        result <-
            rbind(result,
                  process_chunk(txt, rec_sep=rec_sep, keys=keys))
    }
    close(con)
    if (length(resid) != 0) {
        result <-
            rbind(result,
                  process_chunk(resid, rec_sep=rec_sep, keys=keys))
    }
    result
}

process('/tmp/tmp.txt', chunk_size=10L) # make size much larger
     x           y           z           w           id1
[1,] "x_string"  "y_string"  "z_string"  "w_string"  "id1_string"
[2,] "x_string1" "y_string1" "z_string1" "w_string1" NA
[3,] "x_string2" "y_string2" "z_string2" "w_string2" "id1_string1"
     id2
[1,] "id2_string"
[2,] NA
[3,] "id2_string1"
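
If a data frame is more convenient downstream, the character matrix returned by
either function can be converted directly (just a suggestion, not needed for
the parsing itself):

## optional: convert the character matrix of records to a data frame
res <- process("/tmp/tmp.txt")
df <- as.data.frame(res, stringsAsFactors = FALSE)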



Regards,
Paolo


jim holtman wrote:
How much time is it taking on the files and how many files do you have
to process? I tried it with your data duplicated so that I had 57K
lines and it took 27 seconds to process. How much faster do you want?

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
