Hello everyone,

I'm having a problem performing reshape() on a large data frame. The
operation is fairly trivial but it makes R run out of memory.

The data frame has the following structure:

ID                     DATE1             DATE2            VALTYPE
VALUE
'abcd1233'         2009-11-12        2009-12-23      'TYPE1'
123.45
...

VALTYPE is a string and is a factor with only 2 values (say TYPE1 and
TYPE2). I need to transform it into the following data frame ("wide"
transpose) based on common ID and DATEs:

ID                     DATE1             DATE2            VALUE.TYPE1
VALUE.TYPE2
'abcd1233'         2009-11-12       2009-12-23
123.45                     NA
...

This is fairly easily achieved by reshape():

    tbl2 <- reshape(tbl, direction = "wide", idvar = c("ID", "DATE1",
"DATE2"), timevar = "VALTYPE");

This works very well on toy data frames however it fails miserably on a
production dataset. The latter has more than 4,500,000 observations
(although about 70% of VALUEs are NA). When running on a production data
frame, I eventually get:

    Error: cannot allocate vector of size 4.8 Gb

The machine is an Intel-based Linux workstation with 4Gb of RAM. Loading the
data (from a compressed Rdata file) into a fresh R process makes it grow to
about 250Mb which clearly leaves a lot of space for reshaping.

So far, I managed to deal with the issue by performing a manual transpose by
splitting/merging on VALTYPE:

    sp <- by(tbl[c(1,2,3,5)], tbl$VALTYPE, function(x) x);
    tbl <- merge(sp[["TYPE1"]], sp[["TYPE2"]], by = c("ID", "DATE1",
"DATE2"), all = TRUE, sort = TRUE);

..but this is not as elegant, let alone inflexible, should we add a new kind
of VALTYPE.


So I would appreciate help with the following questions:

1. Am I doing something wrong or is reshape() hugely suboptimal in this
particular case? Given that the data set is already sorted by ID, DATE1,
DATE2, I preforming this transformation using a small Python/Perl/AWK script
that wouldn't even require loading the entire dataset instead operating on a
stream of data

2. Is there a better way to deal with this problem? Again, I can come up
with all kinds of Python/Perl patchwork but it would be nice to have R do
the job, especially considering that the problem is not that difficult.

Thanks in advance,
-- Alexander

P.S. If needed, I'll concoct a test dataset demonstrating the problem, in
case someone is willing to debug the issue.

        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to