On 26 October 2014 12:54, Jeff Reback <jeffreb...@gmail.com> wrote:

> you should have a read here:
> http://wesmckinney.com/blog/?p=543
>
> going below the 2x memory usage on read in is non trivial and costly in
> terms of performance
>


If you know the number of rows in advance (because it is in the header,
counted with wc -l, or known from some other prior information), you can
preallocate the array and fill in the values as you read, with virtually no
memory overhead.
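A minimal sketch of the preallocation idea, using an in-memory file and a hypothetical two-column float format for illustration:

```python
import io
import numpy as np

# Stand-in for a real file; in practice the row count would come from
# the header or a prior `wc -l`.
data = io.StringIO("1.0,2.0\n3.0,4.0\n5.0,6.0\n")
n_rows, n_cols = 3, 2

# Allocate the destination once, then fill it row by row -- no
# intermediate Python lists, so no ~2x peak memory.
out = np.empty((n_rows, n_cols), dtype=np.float64)
for i, line in enumerate(data):
    out[i] = [float(x) for x in line.split(",")]

print(out.shape)
```

The only transient allocations are per-line, so peak memory stays essentially at the size of the final array.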

If the number of rows is unknown, an alternative is to use a chunked data
container like Bcolz [1] (formerly carray) instead of Python structures. It
can be used as-is, or copied back to an ndarray if we want the memory to
be aligned. With a bit of compression, the memory overhead can be brought
somewhere under 2x (depending on the dataset), at the cost of a modest
amount of CPU time, and this can be very useful for large data and slow
filesystems.
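To make the idea concrete, here is a pure-NumPy sketch of the chunked-growth scheme (the function name and chunk size are illustrative; bcolz implements the same pattern but additionally keeps each chunk Blosc-compressed):

```python
import numpy as np

def read_unknown_length(line_iter, n_cols, chunk_rows=1024):
    """Accumulate fixed-size chunks instead of resizing one big array.

    This mimics a chunked container in miniature: each full chunk is
    set aside, and only one final copy is paid to get a contiguous
    (aligned) ndarray at the end.
    """
    chunks = []
    buf = np.empty((chunk_rows, n_cols), dtype=np.float64)
    filled = 0
    for line in line_iter:
        buf[filled] = [float(x) for x in line.split(",")]
        filled += 1
        if filled == chunk_rows:
            chunks.append(buf)
            buf = np.empty((chunk_rows, n_cols), dtype=np.float64)
            filled = 0
    chunks.append(buf[:filled].copy())  # partial last chunk
    return np.concatenate(chunks)

rows = ("%d,%d" % (i, i * 2) for i in range(2500))
arr = read_unknown_length(rows, 2)
print(arr.shape)
```

Peak memory here is the final array plus one chunk, rather than roughly double the data size; a compressed container like Bcolz shrinks the set-aside chunks further.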


/David.

[1] http://bcolz.blosc.org/
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
