Dear all,

I haven't read all 180 e-mails, but I didn't see this on Travis's initial list.
All of the existing flat-file reading solutions I have seen are unsuitable for many applications, and they compare very unfavorably to the tools available in other languages, like R. Here are some of the main issues I see:

- Memory usage: creating millions of Python objects when reading a large file results in horrendously bad memory utilization, which the Python interpreter is loath to return to the operating system. Any solution built on the csv module (like pandas's parsers -- which are a lot faster than anything else I know of in Python) suffers from this problem because the data come out boxed in tuples of PyObjects. Try loading a 1,000,000 x 20 CSV file into a structured array using np.genfromtxt or into a DataFrame using pandas.read_csv and you will immediately see the problem. R, by contrast, uses very little memory. (Sketch 1 at the end of this message makes the overhead concrete.)

- Performance: post-processing of Python objects results in poor performance. As for the actual parsing, anything based on regular expressions (like the loadtable effort over the summer -- all apologies to those who worked on it) is doomed to failure. A tool with a high degree of compatibility and intelligence for parsing unruly small files does make sense, but it's not appropriate for large, well-behaved files. (Sketch 2 below gives a rough timing comparison.)

- Need to "factorize": as soon as there is an enum dtype in NumPy, we will want the file parsers for structured arrays and DataFrame to be able to "factorize" certain columns (for example, all string columns) -- that is, convert them to enum during the parsing process, not afterward. This is very important for enabling fast groupby on large datasets and for reducing unnecessary memory usage up front (imagine a column with a million values, with only 10 unique values occurring). This would be trivial to implement in C using a hash table implementation like khash.h. (Sketch 3 below shows the idea.)

To be clear: I'm going to do this eventually whether or not it happens in NumPy, because it's an existing problem for heavy pandas users. I see no reason why the code can't emit structured arrays, too, so we might as well have a common library component that I can use in pandas and specialize to the DataFrame internal structure.

It seems clear to me that this work needs to be done at the lowest level possible, probably all in C (or C++?), or maybe Cython plus C utilities.

If anyone wants to get involved in this particular problem right now, let me know!

best,
Wes
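P.S. Three rough sketches to back up the points above. The numbers and names are illustrative only, not from any existing library.

Sketch 1 -- the per-object boxing overhead you get from the csv module. Scaled down to 10,000 x 20 so it runs quickly anywhere; the 1,000,000 x 20 case just multiplies everything by 100, and this lower bound ignores allocator overhead, so reality is worse:

    import csv
    import io
    import sys

    # Build a small in-memory CSV: 10,000 rows x 20 columns of short strings.
    nrows, ncols = 10000, 20
    buf = io.StringIO()
    writer = csv.writer(buf)
    for i in range(nrows):
        writer.writerow([str(i + j) for j in range(ncols)])
    buf.seek(0)

    # What csv-module-based readers hand you: tuples of PyObjects.
    rows = [tuple(row) for row in csv.reader(buf)]

    # Lower bound on the boxed representation: one tuple per row plus
    # one str object per field.
    tuple_bytes = sum(sys.getsizeof(t) for t in rows)
    str_bytes = sum(sys.getsizeof(f) for t in rows for f in t)
    boxed = tuple_bytes + str_bytes

    # The same data as 8-byte values in a flat, contiguous array.
    unboxed = nrows * ncols * 8

    print("boxed:    %.1f MB" % (boxed / 1e6))
    print("unboxed:  %.1f MB" % (unboxed / 1e6))
    print("overhead: %.1fx" % (boxed / unboxed))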
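Sketch 2 -- why regex-driven tokenization loses to a C parsing loop, even in the trivial no-quoting case (a real regex-based reader does strictly more work per line than this, so the gap only widens). Not a rigorous benchmark, just an indication:

    import csv
    import re
    import timeit

    # 100,000 comma-separated lines of 20 fields each.
    lines = [",".join(str(i + j) for j in range(20)) for i in range(100000)]
    comma = re.compile(",")

    def parse_regex():
        # Regex-driven tokenization, one line at a time.
        return [comma.split(line) for line in lines]

    def parse_csv():
        # The csv module's tokenizer loop is implemented in C.
        return list(csv.reader(lines))

    for name, func in [("regex", parse_regex), ("csv (C)", parse_csv)]:
        print("%-8s %.2f s" % (name, timeit.timeit(func, number=5)))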
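Sketch 3 -- the "factorize during parsing" idea. The real thing would live in the C tokenizer and use khash.h; here a plain Python dict plays the hash table, and the function name is just for illustration:

    import numpy as np

    def factorize(values):
        # Map each value to a small integer code, building the table of
        # uniques as we go -- what a khash-based C loop would do per
        # column as tokens stream in.
        codes = np.empty(len(values), dtype=np.int32)
        uniques = {}  # value -> code
        for i, v in enumerate(values):
            # setdefault computes len(uniques) before inserting, so a
            # new value gets the next unused code.
            codes[i] = uniques.setdefault(v, len(uniques))
        # Invert the table to recover uniques in first-seen order.
        table = [None] * len(uniques)
        for v, code in uniques.items():
            table[code] = v
        return codes, np.asarray(table, dtype=object)

    # A million values, only 10 distinct: 4 bytes per value instead of a
    # boxed string per value, and groupby can bucket on small integers.
    values = ["label_%d" % (i % 10) for i in range(1000000)]
    codes, table = factorize(values)
    print(codes[:15])
    print(table)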