Excerpts from Wes McKinney's message of Thu Feb 23 15:24:44 -0500 2012:
> On Thu, Feb 23, 2012 at 3:23 PM, Erin Sheldon <erin.shel...@gmail.com> wrote:
> > I designed the recfile package to fill this need. It might be a start.
> Can you relicense as BSD-compatible?

If required, that would be fine with me.
-e
> >
> > Excerpts from Wes McKinney's message of Thu Feb 23 14:32:13 -0500 2012:
> >> dear all,
> >>
> >> I haven't read all 180 e-mails, but I didn't see this on Travis's
> >> initial list.
> >>
> >> All of the existing flat file reading solutions I have seen are
> >> not suitable for many applications, and they compare very unfavorably
> >> to tools present in other languages, like R. Here are some of the
> >> main issues I see:
> >>
> >> - Memory usage: creating millions of Python objects when reading
> >>   a large file results in horrendously bad memory utilization,
> >>   which the Python interpreter is loath to return to the
> >>   operating system. Any solution using the CSV module (like
> >>   pandas's parsers-- which are a lot faster than anything else I
> >>   know of in Python) suffers from this problem because the data
> >>   come out boxed in tuples of PyObjects. Try loading a 1,000,000
> >>   x 20 CSV file into a structured array using np.genfromtxt or
> >>   into a DataFrame using pandas.read_csv and you will immediately
> >>   see the problem. R, by contrast, uses very little memory.
> >>
> >> - Performance: post-processing of Python objects results in poor
> >>   performance. Also, for the actual parsing, anything regular
> >>   expression based (like the loadtable effort over the summer,
> >>   all apologies to those who worked on it), is doomed to
> >>   failure. I think having a tool with a high degree of
> >>   compatibility and intelligence for parsing unruly small files
> >>   does make sense though, but it's not appropriate for large,
> >>   well-behaved files.
> >>
> >> - Need to "factorize": as soon as there is an enum dtype in
> >>   NumPy, we will want to enable the file parsers for structured
> >>   arrays and DataFrame to be able to "factorize" / convert to
> >>   enum certain columns (for example, all string columns) during
> >>   the parsing process, and not afterward.
> >>   This is very important for enabling fast groupby on large
> >>   datasets and reducing unnecessary memory usage up front (imagine
> >>   a column with a million values, with only 10 unique values
> >>   occurring). This would be trivial to implement using a C hash
> >>   table implementation like khash.h
> >>
> >> To be clear: I'm going to do this eventually whether or not it
> >> happens in NumPy because it's an existing problem for heavy
> >> pandas users. I see no reason why the code can't emit structured
> >> arrays, too, so we might as well have a common library component
> >> that I can use in pandas and specialize to the DataFrame internal
> >> structure.
> >>
> >> It seems clear to me that this work needs to be done at the
> >> lowest level possible, probably all in C (or C++?) or maybe
> >> Cython plus C utilities.
> >>
> >> If anyone wants to get involved in this particular problem right
> >> now, let me know!
> >>
> >> best,
> >> Wes
> > --
> > Erin Scott Sheldon
> > Brookhaven National Laboratory
--
Erin Scott Sheldon
Brookhaven National Laboratory
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
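
For readers unfamiliar with the "factorize" step described in Wes's
message, here is a minimal Python sketch of the idea: each string
column is converted to small integer codes while the file is being
read, so the repeated strings are never accumulated in a column of
Python objects. Everything here is an assumption made for illustration
and not code from pandas, NumPy, or recfile: the helper name
read_factorized is hypothetical, a Python dict stands in for the C hash
table (khash.h) the message proposes, the csv module is used only for
brevity even though the message argues against it for large files,
non-factorized columns are assumed to be numeric, and the sample data
is invented.

```python
import csv
import io
from array import array


def read_factorized(fileobj, string_columns):
    """Parse CSV text, turning the named columns into integer codes on the fly.

    Hypothetical helper for illustration only. A dict plays the role of
    the C hash table; all other columns are assumed to hold floats.
    """
    reader = csv.reader(fileobj)
    header = next(reader)
    factor_idx = {header.index(name) for name in string_columns}

    # One hash table per factorized column: string value -> integer code.
    tables = {i: {} for i in factor_idx}

    # Typed, unboxed storage: C ints for codes, C doubles for numbers.
    columns = [
        array("i") if i in factor_idx else array("d")
        for i in range(len(header))
    ]

    for row in reader:
        for i, field in enumerate(row):
            if i in factor_idx:
                table = tables[i]
                # First sighting of a value assigns it the next unused code.
                columns[i].append(table.setdefault(field, len(table)))
            else:
                columns[i].append(float(field))

    result = {}
    for i, name in enumerate(header):
        if i in factor_idx:
            # Dicts preserve insertion order, so list position == code.
            result[name] = (columns[i], list(tables[i]))
        else:
            result[name] = columns[i]
    return result


# Invented sample data: one string column, one numeric column.
sample = io.StringIO("city,price\nNYC,1.5\nSF,2.0\nNYC,3.25\nSF,4.0\n")
data = read_factorized(sample, ["city"])
codes, levels = data["city"]
```

In this sketch each distinct string is stored once in levels, and the
column itself holds only C integers. The arrays expose the buffer
interface, so they could be wrapped with np.frombuffer (dtype np.intc
for the codes) without copying. A real implementation along the lines
the message describes would also do the tokenizing in C rather than
through the csv module.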