As a bit of an aside, I have just discovered that for fixed-width text data, numpy's text readers seem to edge out pandas' read_fwf(), and numpy has the advantage of letting you specify the dtypes ahead of time (the pandas version just won't allow it, which means I end up with float64 and object dtypes instead of the float32 and |S12 dtypes I want).
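For illustration, something along these lines -- the file name, field widths, names, and dtypes below are just made-up examples:

    import numpy as np
    import pandas as pd

    # Hypothetical fixed-width file with three columns of widths 8, 12, 6.
    widths = [8, 12, 6]
    dtypes = [("temp", "f4"), ("station", "S12"), ("count", "i4")]

    # numpy: delimiter can be a list of field widths, and the dtype is
    # honored as given, so the result really is float32 / |S12 / int32.
    arr = np.genfromtxt("data.txt", delimiter=widths, dtype=dtypes)

    # pandas: in my hands read_fwf() infers the dtypes itself, so I end up
    # with float64 and object columns and have to downcast afterwards.
    df = pd.read_fwf("data.txt", widths=widths,
                     names=[name for name, _ in dtypes])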
Cheers!
Ben Root

On Tue, Oct 28, 2014 at 4:09 PM, Chris Barker <chris.bar...@noaa.gov> wrote:

> A few thoughts:
>
> 1) Yes, a faster, more memory-efficient text file parser would be great.
> Sure, if your workflow relies on parsing lots of huge text files, you
> probably need another workflow. But it's a really, really common thing to
> need to do -- why not do it fast?
>
> 2) """you are describing a special case where you know the data size
> a priori (e.g. not streaming), dtypes are readily apparent from a small
> sample case, and in general your data is not messy"""
>
> Sure -- that's a special case, but it's a really common special case
> (OK -- without knowing your data size, anyway...).
>
> 3)
>
>> Someone also posted some code, or the draft thereof, for using resizable
>> arrays quite a while ago, which would reduce the memory overhead for
>> very large arrays.
>
> That may have been me -- I have a resizable array class, both a pure-Python
> version and a not-quite-finished Cython version. In practice, if you add
> stuff to the array row by row (or item by item), it's no faster than
> putting it all in a list and then converting to an array -- but it IS more
> memory efficient, which seems to be the issue here. Let me know if you want
> it -- I really need to get it up on GitHub one of these days.
>
> My take: for fast parsing of big files you need:
>
> To do the parsing/converting in C -- what's wrong with good old fscanf, at
> least for the basic types? It's pretty darn fast.
>
> Memory efficiency -- something like my growable array is not all that hard
> to implement and pretty darn quick -- you just do the usual trick:
> over-allocate a bit of memory, and when it gets full, re-allocate a larger
> chunk. It turns out, at least on the hardware I tested on, that the
> performance is not very sensitive to how much you over-allocate -- if it's
> tiny (1 element) performance really sucks, but once you get to 10% or so
> (maybe less) of over-allocation, you don't notice the difference.
>
> Keep the auto-detection of the structure/dtypes separate from the
> high-speed parsing code. I'd say write the high-speed parsing code first --
> that requires specification of the data types and structure -- and then,
> if you want, write some nice pure-Python code that tries to auto-detect all
> that. If it's a small file, it's fast regardless. If it's a large file,
> then the overhead of the fancy parsing will be lost, and you'll want the
> line-by-line parsing to be as fast as possible.
>
> From a quick look, it seems that the pandas code is pretty nice -- maybe
> the 2X memory footprint should be ignored.
>
> -Chris
>
>> Cheers,
>> Derek
>
> --
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959 voice
> 7600 Sand Point Way NE   (206) 526-6329 fax
> Seattle, WA 98115        (206) 526-6317 main reception
>
> chris.bar...@noaa.gov
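For reference, the over-allocation trick described above looks roughly like
the following -- a bare-bones, pure-Python sketch rather than Chris's actual
class, with the growth factor and initial capacity chosen arbitrarily:

    import numpy as np

    class GrowableArray:
        """1-D array that over-allocates so appends are amortized O(1)."""

        def __init__(self, dtype="f8", grow=1.25):
            self._data = np.empty(16, dtype=dtype)  # initial over-allocation
            self._size = 0                          # number of valid elements
            self._grow = grow                       # e.g. 1.25 == 25% extra

        def append(self, value):
            if self._size == len(self._data):
                # Full: re-allocate a larger chunk and copy the old data over.
                new_cap = int(len(self._data) * self._grow) + 1
                new_data = np.empty(new_cap, dtype=self._data.dtype)
                new_data[:self._size] = self._data[:self._size]
                self._data = new_data
            self._data[self._size] = value
            self._size += 1

        def asarray(self):
            # Trim the unused tail; copy so later appends don't alias the result.
            return self._data[:self._size].copy()

    g = GrowableArray("f4")
    for v in (1.0, 2.0, 3.0):
        g.append(v)
    print(g.asarray())   # array([1., 2., 3.], dtype=float32)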
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion