On 11.08.2011, at 8:50PM, Russell E. Owen wrote:

> It seems a shame that loadtxt has no argument for predicted length,
> which would allow preallocation and less appending/copying of data.
>
> And yes... reading the whole file first to figure out how many elements
> it has seems sensible to me -- at least as a switchable behavior, and
> preferably the default. 1 GB isn't that large on modern systems, but
> loadtxt is filling up all 6 GB of RAM reading it!
1 GB is indeed not much in terms of disk space these days, but using text files for data volumes like that is nonetheless very much non-state-of-the-art ;-) That said, there is of course no justification for using excessive amounts of memory where it can be avoided!

Implementing the above scheme for npyio is not quite as straightforward as in the example I gave before, mainly for the following reason: loadtxt also has to deal with more complex data like structured arrays, plus comments, empty lines etc., meaning it has to find and count the lines that actually contain valid data. Ideally genfromtxt, which offers yet more functionality for dealing with missing data, should offer the same options, but they would certainly be more difficult to implement there.

More than 6 GB is still remarkable: from what I found on the web, lists seem to consume ~24 bytes/element, i.e. 3 times as much as the final float64 array. The text representation typically takes 10-20 characters per float (though with <12 digits, the values could usually be read as float32 without loss of precision). A factor of >6 therefore seems quite extreme, unless the file is full of (relatively) short integers... But this also means that copying the final array still has a relatively low memory footprint compared to the buffer list, so using some kind of mutable array type for reading should be a reasonable solution as well. Unfortunately fromiter is not of much use here, since it only reads 1D arrays.

I haven't tried Chris' accumulator class yet, so for now I went with the 2x-read approach in loadtxt; it turned out to add only ~10% to the read-in time. For compressed files this goes up to 30-50%, but once physical memory is exhausted it should probably actually become faster.

I've made a pull request implementing that option as a switch 'prescan' (a stripped-down sketch of the scheme is appended below as a P.S.):

https://github.com/numpy/numpy/pull/144

Could you review it, in particular regarding the following points:

- Is the option reasonably named and documented?
- If the allocated array does not match the input data (which really should never happen), currently only a warning is issued, and any excess buffer is filled with zeros or the remaining input data discarded - should this rather raise an IndexError?

No prediction if/when I might be able to provide this for genfromtxt, sorry!

Cheers,
Derek
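
P.S.: In case it helps the review, here is what the two-pass scheme boils down to, stripped of all the real loadtxt machinery (structured dtypes, converters, usecols, skiprows etc.). This is a sketch for illustration only, not the code in the pull request; the function name and signature are made up:

import numpy as np

def loadtxt_prescan(fname, dtype=np.float64, comments='#', delimiter=None):
    def data_lines():
        # Both passes must agree on what counts as a valid data line,
        # so comment stripping and empty-line skipping live here.
        with open(fname) as f:
            for line in f:
                line = line.split(comments, 1)[0].strip()
                if line:
                    yield line

    # Pass 1: count the valid rows (and take the column count from the first).
    nrows = ncols = 0
    for line in data_lines():
        if not nrows:
            ncols = len(line.split(delimiter))
        nrows += 1

    # Preallocate the final array; no list of row lists is ever built, so
    # the peak footprint is essentially the 8 bytes/value of the float64
    # result rather than the ~24 bytes/element of the buffer list.
    out = np.empty((nrows, ncols), dtype=dtype)

    # Pass 2: parse straight into the preallocated array.
    for i, line in enumerate(data_lines()):
        out[i] = [float(v) for v in line.split(delimiter)]
    return out

The "mutable array type" alternative would look roughly like the following - I still haven't seen Chris' accumulator class, so this is only my guess at the general idea, not his code (the class name is made up). It keeps an amortised-doubling ndarray buffer, so the final copy costs one array of the actual data size rather than a 3x larger list:

class GrowableArray:
    def __init__(self, ncols, dtype=np.float64):
        self._buf = np.empty((1024, ncols), dtype=dtype)
        self._n = 0

    def append(self, row):
        if self._n == len(self._buf):
            # Double the buffer; np.resize copies, but only array-sized data,
            # and the stale tail is overwritten before it is ever read.
            self._buf = np.resize(self._buf, (2 * len(self._buf), self._buf.shape[1]))
        self._buf[self._n] = row
        self._n += 1

    def toarray(self):
        return self._buf[:self._n].copy()

And for completeness, fromiter can be coaxed into 2D duty via a structured dtype and a view, though that did not seem worth pursuing here (assumes a hypothetical 'data.txt' with three whitespace-separated float columns and no comment lines):

rows = (tuple(float(v) for v in line.split()) for line in open('data.txt'))
arr = np.fromiter(rows, dtype='f8,f8,f8').view(np.float64).reshape(-1, 3)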