I have a modest-size XML file (52MB) in a format suited to xmlToDataFrame
(package XML).

I have successfully read it into R by splitting the file into 10 parts,
running xmlToDataFrame on each part, and then combining the results with
rbind.fill (package plyr). This takes about 530 s in total and yields a
data.frame with 71k rows and an object.size of 21 MB.
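
For reference, a minimal sketch of that chunked workflow, assuming the file
has been split into ten well-formed parts named part01.xml .. part10.xml
(the names are made up):

  library(XML)
  library(plyr)

  ## File names are hypothetical; each part must be a complete, well-formed
  ## XML document containing the same kind of records as the original.
  part_files <- sprintf("part%02d.xml", 1:10)

  ## Convert each part separately, then stack the resulting data.frames,
  ## filling columns missing from any part with NA.
  parts <- lapply(part_files, xmlToDataFrame)
  combined <- rbind.fill(parts)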

But running xmlToDataFrame on the whole file takes forever (> 10000 s so
far and still not finished), even though xmlParse on the same file takes
only 0.8 s.
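
Roughly, those two timings correspond to calls like these ("big.xml"
stands in for the real file):

  library(XML)

  system.time(doc <- xmlParse("big.xml"))    # ~0.8 s on this file
  ## system.time(df <- xmlToDataFrame(doc))  # abandoned after > 10000 s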

To investigate, I ran xmlToDataFrame on the first 10% of the file, then on
that same 10% repeated twice, then three times (with the outer tags
adjusted, of course). Timings (setup sketched below the table):

1 copy:   111 s = 111 s per copy
2 copies: 311 s = 155 s per copy
3 copies: 626 s = 209 s per copy
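
The test was set up roughly like this; "chunk.txt" (the first 10% of the
records, without the root element) and the <records> root tag are
placeholders for the real names:

  library(XML)

  chunk <- readLines("chunk.txt")

  ## Build a file containing n copies of the chunk inside one root element
  ## and time how long xmlToDataFrame takes on it.
  time_copies <- function(n) {
    tmp <- tempfile(fileext = ".xml")
    writeLines(c("<records>", rep(chunk, n), "</records>"), tmp)
    system.time(xmlToDataFrame(tmp))[["elapsed"]]
  }

  sapply(1:3, time_copies)   # elapsed seconds for 1, 2 and 3 copies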

The runtime is superlinear in the input size. What is going on here? Is
there a better approach?

Thanks,

          -s
