Thanks for looking into this Torgil! I agree that this is a much more complicated setup. I'll check if there is anything I can do on the data end. Otherwise I'll go with Timothy's suggestion and read in numbers as floats and convert to int later as needed.
Vincent


On 7/8/07 5:40 PM, "Torgil Svensson" <[EMAIL PROTECTED]> wrote:

>> Question: If you do ignore the int's initially, once the rec array is in
>> memory, would there be a quick way to check if the floats could pass as
>> int's? This may seem like a backwards approach but it might be 'safer' if
>> you really want to preserve the int's.
>
> In your case the floats don't pass as ints since you have decimals.
> The attached file takes another approach (sorry for lack of comments).
> If the conversion fails, the current row is stored and the iterator
> exits (without setting a 'finished' parameter to true). The program
> then re-calculates the conversion functions and checks for changes. If
> the changes are supported (= we have a conversion function for old data
> in the format_changes dictionary) it calls fromiter again with an
> iterator like this:
>
> def get_data_iterator(row_iter,delim,res):
>     for x0,x1,x2,x3,x4,x5 in res['data']:
>         x0=float(x0)
>         print (x0,x1,x2,x3,x4,x5)
>         yield (x0,x1,x2,x3,x4,x5)
>     yield (float('2.0'),int('2'),datestr2num('4/97'),float('1.33'),float('2.26'),float('1.23'))
>     for row in row_iter:
>         x0,x1,x2,x3,x4,x5=row.split(delim)
>         try:
>             yield (float(x0),int(x1),datestr2num(x2),float(x3),float(x4),float(x5))
>         except:
>             res['row']=row
>             return
>     res['finished']=True
>
> res['data'] is the previously converted data. This has the obvious
> disadvantage that if only the last row has fractions in a column,
> it'll cost double memory. Also if many columns change format at
> different places it has to re-convert every time.
>
> I don't recommend this because of the drawbacks and extra complexity.
> I think it is best to convert your files (or file generation) so that
> float columns are represented with 0.0 instead of 0.
>
> Best Regards,
>
> //Torgil
>
> On 7/8/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
>> I am not (yet) very familiar with much of the functionality introduced in
>> your script Torgil (izip, imap, etc.), but I really appreciate you taking
>> the time to look at this!
>>
>> The program stopped with the following error:
>>
>> File "load_iter.py", line 48, in <genexpr>
>> convert_row=lambda r: tuple(fn(x) for fn,x in izip(conversion_functions,r))
>> ValueError: invalid literal for int() with base 10: '2174.875'
>>
>> A lot of the data I use can have a column with a set of int's (e.g., 0's),
>> but then the rest of that same column could be floats. I guess finding the
>> right conversion function is the tricky part. I was thinking about sampling
>> each, say, 10th obs to test which function to use. Not sure how that would
>> work however.
>>
>> If I ignore the option of an int (i.e., everything is a float, date, or
>> string) then your script is about twice as fast as mine!!
>>
>> Question: If you do ignore the int's initially, once the rec array is in
>> memory, would there be a quick way to check if the floats could pass as
>> int's? This may seem like a backwards approach but it might be 'safer' if
>> you really want to preserve the int's.
>>
>> Thanks again!
>>
>> Vincent
>>
>>
>> On 7/8/07 5:52 AM, "Torgil Svensson" <[EMAIL PROTECTED]> wrote:
>>
>>> Given that both your script and the mlab version preload the whole
>>> file before calling the numpy constructor I'm curious how that compares in
>>> speed to using numpy's fromiter function on your data. Using fromiter
>>> should improve on memory usage (~50% ?).
>>>
>>> The drawback is for string columns where we no longer know the
>>> width of the largest item. I made it fall back to "object" in this
>>> case.
>>>
>>> Attached is a fromiter version of your script.
>>> Possible speedups could
>>> be done by trying different approaches to the "convert_row" function,
>>> for example using "zip" or "enumerate" instead of "izip".
>>>
>>> Best Regards,
>>>
>>> //Torgil
>>>
>>>
>>> On 7/8/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
>>>> Thanks for the reference John! csv2rec is about 30% faster than my code on
>>>> the same data.
>>>>
>>>> If I read the code in csv2rec correctly it converts the data as it is being
>>>> read using the csv module. My setup reads in the whole dataset into an
>>>> array of strings and then converts the columns as appropriate.
>>>>
>>>> Best,
>>>>
>>>> Vincent
>>>>
>>>>
>>>> On 7/6/07 8:53 PM, "John Hunter" <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> On 7/6/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
>>>>>> I wrote the attached (small) program to read in a text/csv file with
>>>>>> different data types and convert it into a recarray without having to
>>>>>> pre-specify the dtypes or variable names. I am just too lazy to type in
>>>>>> stuff like that :) The supported types are int, float, dates, and
>>>>>> strings.
>>>>>>
>>>>>> It works pretty well but it is not (yet) as fast as I would like so I was
>>>>>> wondering if any of the numpy experts on this list might have some
>>>>>> suggestions on how to speed it up. I need to read 500MB-1GB files so
>>>>>> speed is important for me.
>>>>>
>>>>> In matplotlib.mlab svn, there is a function csv2rec that does the
>>>>> same. You may want to compare implementations in case we can
>>>>> fruitfully cross pollinate them. In the examples directory, there is an
>>>>> example script examples/loadrec.py
>>>>> _______________________________________________
>>>>> Numpy-discussion mailing list
>>>>> Numpy-discussion@scipy.org
>>>>> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>>
>> --
>> Vincent R. Nijs
>> Assistant Professor of Marketing
>> Kellogg School of Management, Northwestern University
>> 2001 Sheridan Road, Evanston, IL 60208-2001
>> Phone: +1-847-491-4574 Fax: +1-847-491-2498
>> E-mail: [EMAIL PROTECTED]
>> Skype: vincentnijs
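
The question raised in the thread (once everything has been read in as floats, can a column still pass as ints?) could be checked along the following lines. This is only a minimal sketch, assuming the data already sits in a NumPy structured array; the helper names floats_pass_as_ints and downcast_int_columns are hypothetical and do not come from the attached scripts or from csv2rec.

```python
import numpy as np


def floats_pass_as_ints(column):
    """Return True if every value in a float column is finite and whole."""
    column = np.asarray(column, dtype=float)
    # NaN and inf cannot be represented as ints, so they rule the column out.
    if not np.all(np.isfinite(column)):
        return False
    # A value such as 2174.875 differs from its floor; 0.0 or 2.0 does not.
    return bool(np.all(column == np.floor(column)))


def downcast_int_columns(rec):
    """Return a copy of a structured array in which whole-valued float
    fields are stored as int64 (hypothetical helper, illustration only)."""
    new_descr = []
    for name in rec.dtype.names:
        field_dtype = rec.dtype[name]
        if field_dtype.kind == "f" and floats_pass_as_ints(rec[name]):
            new_descr.append((name, np.int64))
        else:
            new_descr.append((name, field_dtype))
    out = np.empty(rec.shape, dtype=new_descr)
    # Copy field by field; assignment casts the whole-valued floats to ints.
    for name in rec.dtype.names:
        out[name] = rec[name]
    return out


if __name__ == "__main__":
    # Column "a" holds only whole numbers, column "b" has decimals.
    rec = np.array([(0.0, 1.33), (2.0, 2174.875)],
                   dtype=[("a", float), ("b", float)])
    converted = downcast_int_columns(rec)
    print(converted.dtype)
```

The copy costs extra memory while both arrays exist, so for 500MB-1GB files it may be preferable to convert one column at a time. Very large floats (beyond the int64 range) are not handled by this sketch.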