On 26 Oct 2014, at 02:21 pm, Eelco Hoogendoorn <hoogendoorn.ee...@gmail.com> 
wrote:

> I'm not sure why the memory doubling is necessary. Isn't it possible to 
> preallocate the arrays and write to them? I suppose this might be inefficient 
> though, in case you end up reading only a small subset of rows out of a 
> mostly corrupt file? But that seems to be a rather uncommon corner case.
> 
> Either way, I'd say a doubling of memory use is fair game for numpy. 
> Generality is more important than absolute performance. The most important 
> thing is that temporary Python data structures are avoided. That shouldn't be 
> too hard to accomplish, and would realize most of the performance and memory 
> gains, I imagine.

Preallocation is not straightforward because, in general, the parser needs to be
able to work with streamed input.
I think I even still have a branch on GitHub bypassing this on request (via a
keyword argument).
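Roughly, the issue is this (just a toy sketch of my own, not the actual branch;
the row_count keyword is made up for illustration): with the row count supplied
up front you can preallocate and fill in place, but with a generic stream you
cannot know the size in advance.

    import numpy as np

    def read_floats(stream, n_cols, row_count=None):
        # Toy reader: preallocate if the row count is known, otherwise fall back.
        if row_count is not None:
            # Row count supplied up front: fill the array in place, keeping
            # only one line of text in Python at any time.
            out = np.empty((row_count, n_cols), dtype=float)
            for i, line in zip(range(row_count), stream):
                out[i] = np.array(line.split(), dtype=float)
            return out
        # Generic streamed input: the length is unknown, so some form of
        # growing storage is needed (a list here; see the resizable-array
        # sketch further down).
        rows = [np.array(line.split(), dtype=float) for line in stream]
        return np.vstack(rows) if rows else np.empty((0, n_cols), dtype=float)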
But a factor of 2 is already a huge improvement over the factor of ~6 coming from
the current text readers buffering the entire input as a list of lists of Python
strings, not to speak of the vast performance gain from using a parser implemented
in C like pandas' - in fact, one of the last times this subject came up, one
suggestion was to steal pandas.read_csv and adapt it as required.
Someone also posted some code, or a draft thereof, for using resizable arrays
quite a while ago, which would reduce the memory overhead for very large arrays.
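Roughly along these lines (again my own sketch, not the code that was posted):
grow the target array geometrically and trim it at the end, so the peak overhead
stays bounded by the resize factor instead of the ~6x of nested Python lists.

    import numpy as np

    def read_floats_resizable(stream, n_cols):
        # Parse whitespace-separated floats, growing the array by doubling.
        buf = np.empty((1024, n_cols), dtype=float)
        n = 0
        for line in stream:
            if n == buf.shape[0]:
                # Double the capacity; one copy per doubling keeps the
                # amortized cost linear in the number of rows.
                buf = np.resize(buf, (2 * buf.shape[0], n_cols))
            buf[n] = np.array(line.split(), dtype=float)
            n += 1
        return buf[:n].copy()  # trim the unused tail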

Cheers,
                                                Derek


