Hi Torgil,

1. I got an email from Tim about this issue:

"I finally got around to doing some more quantitative comparisons between
your code and the more complicated version that I proposed. The idea behind
my code was to minimize memory usage -- I figured that keeping the memory
usage low would make up for any inefficiencies in the conversion process,
since it's been my experience that memory bandwidth dominates a lot of
numeric problems as problem sizes get reasonably large. I was mostly wrong.
While it's true that for very large file sizes I can get my code to
outperform yours, in most instances it lags behind. And the range where it
does better is a fairly small range right before the machine dies with a
memory error. So my conclusion is that the extra hoops my code goes through
to avoid allocating extra memory isn't worth it for you to bother with."

The approach in my code is simple and robust to most data issues I could
come up with. It will actually do an appropriate conversion if there are
missing values or ints and floats in the same column. It will select an
appropriate string length as well. It may not be the most memory-efficient
setup, but given Tim's comments it is a pretty decent solution for the types
of data I have access to.

2. Fixed the spelling error :)

3. I guess that is the same thing. I am not very familiar with zip, izip,
map, etc. just yet :) Thanks for the tip!

4. iter() is the name I gave to the function generated using exec. I need
that function to transform the data using the types provided by the user.

Best,

Vincent

On 7/18/07 7:57 PM, "Torgil Svensson" <[EMAIL PROTECTED]> wrote:

> Nice,
>
> I haven't gone through all details. That's a nice new "missing"
> feature, maybe all instances where we can't find a conversion should
> be "nan". A few comments:
>
> 1. The "load_search" functions contains all memory/performance
> overhead that we wanted to avoid with the fromiter function. Does this
> mean that you no longer have large text-files that change string
> representation in the columns (aka "0" floats) ?
>
> 2. ident=" "*4
> This has the same spelling error as in my first compile try .. it was
> meant to be "indent"
>
> 3. types = list((i,j) for i, j in zip(varnm, types2))
> Isn't this the same as "types = zip(varnm, types2)" ?
>
> 4. return N.fromiter(iter(reader),dtype = types)
> Isn't "reader" an iterator already? What does the "iter()" operator do
> in this case?
>
> Best regards,
>
> //Torgil
>
>
> On 7/18/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
>>
>> I combined some of the very useful comments/code from Tim and Torgil and
>> came up with the attached program to read csv files and convert the data
>> into a recarray. I couldn't use all of their suggestions because, frankly, I
>> didn't understand all of them :)
>>
>> The program uses variable names if provided in the csv-file and can
>> auto-detect data types. However, I also wanted to make it easy to specify
>> data types and/or variable names if so desired. Examples are at the bottom
>> of the file. Comments are very welcome.
>>
>> Thanks,
>>
>> Vincent
>> _______________________________________________
>> Numpy-discussion mailing list
>> Numpy-discussion@scipy.org
>> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>>
>

--
Vincent R. Nijs
Assistant Professor of Marketing
Kellogg School of Management, Northwestern University
2001 Sheridan Road, Evanston, IL 60208-2001
Phone: +1-847-491-4574 Fax: +1-847-491-2498
E-mail: [EMAIL PROTECTED]
Skype: vincentnijs
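
Points 3 and 4 of the exchange can be checked with a small sketch. It assumes Python 3, where zip() is lazy and needs list() around it (under the Python 2 of this thread, zip() already returned a list). The column names, type codes, and sample data are made up, and typed_rows is a hypothetical stand-in for the exec-generated conversion function, not the code from the attached program.

```python
import csv
import io

# Made-up column names and type codes, standing in for varnm and types2.
varnm = ["id", "price"]
types2 = ["i8", "f8"]

# Point 3: the generator expression and plain zip() build the same pairs.
types_a = list((i, j) for i, j in zip(varnm, types2))
types_b = list(zip(varnm, types2))

# Point 4: the built-in iter() applied to something that is already an
# iterator hands back the same object, so by itself it changes nothing.
reader = csv.reader(io.StringIO("1,2.5\n2,3.5\n"))
reader_is_own_iterator = iter(reader) is reader


def typed_rows(rows, converters):
    """Yield one tuple per row, converting each field with its converter.

    Hypothetical helper: a distinct name avoids shadowing the built-in iter().
    """
    for row in rows:
        yield tuple(conv(field) for conv, field in zip(converters, row))


# csv.reader yields lists of strings; the wrapper yields typed tuples.
rows = list(typed_rows(reader, (int, float)))
```

A generator like typed_rows(reader, ...) is the kind of object that could be handed to N.fromiter(..., dtype=types) as in point 4, since it yields one already-converted tuple per record.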