I am not (yet) very familiar with much of the functionality introduced in your script, Torgil (izip, imap, etc.), but I really appreciate you taking the time to look at this!

The program stopped with the following error:

  File "load_iter.py", line 48, in <genexpr>
    convert_row=lambda r: tuple(fn(x) for fn,x in izip(conversion_functions,r))
ValueError: invalid literal for int() with base 10: '2174.875'

A lot of the data I use can have a column that starts with a set of int's (e.g., 0's), but then the rest of that same column could be floats. I guess finding the right conversion function is the tricky part. I was thinking about sampling, say, every 10th observation to test which function to use. Not sure how that would work, however.

If I ignore the option of an int (i.e., everything is a float, date, or string) then your script is about twice as fast as mine!!

Question: If you do ignore the int's initially, once the rec array is in memory, would there be a quick way to check if the floats could pass as int's? This may seem like a backwards approach but it might be 'safer' if you really want to preserve the int's.

Thanks again!

Vincent


On 7/8/07 5:52 AM, "Torgil Svensson" <[EMAIL PROTECTED]> wrote:

> Given that both your script and the mlab version preload the whole
> file before calling the numpy constructor, I'm curious how that compares in
> speed to using numpy's fromiter function on your data. Using fromiter
> should improve on memory usage (~50% ?).
>
> The drawback is for string columns where we no longer know the
> width of the largest item. I made it fall back to "object" in this
> case.
>
> Attached is a fromiter version of your script. Possible speedups could
> be done by trying different approaches to the "convert_row" function,
> for example using "zip" or "enumerate" instead of "izip".
>
> Best Regards,
>
> //Torgil
>
>
> On 7/8/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
>> Thanks for the reference John! csv2rec is about 30% faster than my code on
>> the same data.
>>
>> If I read the code in csv2rec correctly it converts the data as it is being
>> read using the csv module. My setup reads in the whole dataset into an
>> array of strings and then converts the columns as appropriate.
>>
>> Best,
>>
>> Vincent
>>
>>
>> On 7/6/07 8:53 PM, "John Hunter" <[EMAIL PROTECTED]> wrote:
>>
>>> On 7/6/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
>>>> I wrote the attached (small) program to read in a text/csv file with
>>>> different data types and convert it into a recarray without having to
>>>> pre-specify the dtypes or variable names. I am just too lazy to type in
>>>> stuff like that :) The supported types are int, float, dates, and strings.
>>>>
>>>> It works pretty well but it is not (yet) as fast as I would like so I was
>>>> wondering if any of the numpy experts on this list might have some suggestion
>>>> on how to speed it up. I need to read 500MB-1GB files so speed is important
>>>> for me.
>>>
>>> In matplotlib.mlab svn, there is a function csv2rec that does the
>>> same. You may want to compare implementations in case we can
>>> fruitfully cross pollinate them. In the examples directory, there is an
>>> example script examples/loadrec.py
>>> _______________________________________________
>>> Numpy-discussion mailing list
>>> Numpy-discussion@scipy.org
>>> http://projects.scipy.org/mailman/listinfo/numpy-discussion

--
Vincent R. Nijs
Assistant Professor of Marketing
Kellogg School of Management, Northwestern University
2001 Sheridan Road, Evanston, IL 60208-2001
Phone: +1-847-491-4574 Fax: +1-847-491-2498
E-mail: [EMAIL PROTECTED]
Skype: vincentnijs
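
On the int-versus-float problem behind the ValueError above, here is one possible way to pick a column's conversion function, offered as a rough sketch and not as what load_iter.py actually does. It scans a column's raw strings once and demotes the converter from int to float to str as soon as a value is rejected, so a column that starts with 0's and later contains '2174.875' ends up as float. The name pick_converter and the sample columns are made up for illustration, dates are left out, and the code assumes Python 3, where the built-in zip is lazy like izip. Sampling only every 10th row could miss a lone float that falls between samples, which is why this sketch looks at every value.

```python
def pick_converter(values):
    """Return the narrowest of int, float, str that accepts every value.

    `values` is an iterable of raw strings from one column. The name
    pick_converter is hypothetical, not part of the attached scripts.
    """
    candidates = [int, float]      # tried in order, narrowest first
    level = 0                      # index of the candidate currently in use
    for text in values:
        while level < len(candidates):
            try:
                candidates[level](text)
                break              # current candidate accepts this value
            except ValueError:
                level += 1         # demote: int -> float -> str
        if level == len(candidates):
            return str             # nothing numeric fits; stop scanning
    return candidates[level]


# Made-up sample data: "price" starts with ints and then contains a float.
columns = {
    "count": ["0", "0", "12"],
    "price": ["0", "0", "2174.875"],
    "label": ["a", "b", "c"],
}
converters = {name: pick_converter(vals) for name, vals in columns.items()}

# Same shape as the convert_row lambda in the traceback, with zip for izip.
row = ("12", "2174.875", "c")
converted = tuple(fn(x) for fn, x in zip(converters.values(), row))
print(converted)
```

The cost is one extra pass over the strings of each column, so it trades some speed for not having to guess. For the after-the-fact question, a float column `col` that is already in memory could be tested with something like `(col == np.floor(col)).all()`, assuming the column holds no infinities; a NaN makes that test fail, which is the safe outcome.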