Tim, I do want to auto-detect. Reading numbers in as floats is probably not a huge penalty.
Is there an easy way that you know of to change the type of one column in a recarray? I tried this:

ra.col1 = ra.col1.astype('i')

but that didn't seem to work. I assume that means you would have to create a new array from the old one with an updated dtype list.

Thanks,

Vincent


On 7/8/07 4:51 PM, "Timothy Hochberg" <[EMAIL PROTECTED]> wrote:

> On 7/8/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
>> Torgil,
>>
>> The function seems to work well and is slightly faster than your previous
>> version (about 1/6th faster).
>>
>> Yes, I do have columns that start with what look like ints and then turn
>> out to be floats. Something like below (col6).
>>
>> data = [['col1', 'col2', 'col3', 'col4', 'col5', 'col6'],
>>         ['1','3','1/97','1.12','2.11','0'],
>>         ['1','2','3/97',' 1.21','3.12','0'],
>>         ['2','1','2/97','1.12','2.11','0'],
>>         ['2','2','4/97','1.33','2.26',' 1.23'],
>>         ['2','2','5/97','1.73','2.42','1.26']]
>>
>> I think what your function assumes is that the 1st element will be the
>> appropriate type. That may not hold if you have missing values or 'mixed
>> types'.
>
> Vincent,
>
> Do you need to auto detect the column types? Things get a lot simpler if you
> have some known schema for each file; then you can simply pass that to some
> reader function. It's also more robust since there's no way in general to
> differentiate a column of integers from a column of floats with no decimal
> part.
>
> If you do need to auto detect, one approach would be to always read both
> int-like stuff and float-like stuff in as floats. Then after you get the
> array, check over the various columns and if any have no fractional parts,
> make a new array where those columns are integers.
>
> -tim
>
>> Best,
>>
>> Vincent
>>
>>
>> On 7/8/07 3:31 PM, "Torgil Svensson" <[EMAIL PROTECTED]> wrote:
>>
>>> Hi
>>>
>>> I stumble on these types of problems from time to time so I'm
>>> interested in efficient solutions myself.
>>>
>>> Do you have a column which starts with something suitable for int on
>>> the first row (without decimal separator) but has decimals further
>>> down?
>>>
>>> This will be a little tricky to support. One solution could be to yield
>>> StopIteration, calculate new type-conversion-functions and start over
>>> iterating over both the old data and the rest of the iterator.
>>>
>>> It'd be great if you could try the load_gen_iter.py I've attached to
>>> my response to Tim.
>>>
>>> Best Regards,
>>>
>>> //Torgil
>>>
>>> On 7/8/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
>>>> I am not (yet) very familiar with much of the functionality introduced
>>>> in your script Torgil (izip, imap, etc.), but I really appreciate you
>>>> taking the time to look at this!
>>>>
>>>> The program stopped with the following error:
>>>>
>>>> File "load_iter.py", line 48, in <genexpr>
>>>> convert_row=lambda r: tuple(fn(x) for fn,x in izip(conversion_functions,r))
>>>> ValueError: invalid literal for int() with base 10: '2174.875'
>>>>
>>>> A lot of the data I use can have a column that starts with a set of
>>>> ints (e.g., 0's), but then the rest of that same column could be
>>>> floats. I guess finding the right conversion function is the tricky
>>>> part. I was thinking about sampling each, say, 10th obs to test which
>>>> function to use. Not sure how that would work however.
>>>>
>>>> If I ignore the option of an int (i.e., everything is a float, date, or
>>>> string) then your script is about twice as fast as mine!!
>>>>
>>>> Question: If you do ignore the ints initially, once the rec array is in
>>>> memory, would there be a quick way to check if the floats could pass as
>>>> ints? This may seem like a backwards approach but it might be 'safer'
>>>> if you really want to preserve the ints.
>>>>
>>>> Thanks again!
>>>>
>>>> Vincent
>>>>
>>>>
>>>> On 7/8/07 5:52 AM, "Torgil Svensson" <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Given that both your script and the mlab version preload the whole
>>>>> file before calling the numpy constructor, I'm curious how that
>>>>> compares in speed to using numpy's fromiter function on your data.
>>>>> Using fromiter should improve on memory usage (~50% ?).
>>>>>
>>>>> The drawback is for string columns, where we no longer know the
>>>>> width of the largest item. I made it fall back to "object" in this
>>>>> case.
>>>>>
>>>>> Attached is a fromiter version of your script. Possible speedups could
>>>>> be done by trying different approaches to the "convert_row" function,
>>>>> for example using "zip" or "enumerate" instead of "izip".
>>>>>
>>>>> Best Regards,
>>>>>
>>>>> //Torgil
>>>>>
>>>>>
>>>>> On 7/8/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
>>>>>> Thanks for the reference John! csv2rec is about 30% faster than my
>>>>>> code on the same data.
>>>>>>
>>>>>> If I read the code in csv2rec correctly it converts the data as it
>>>>>> is being read using the csv module. My setup reads the whole
>>>>>> dataset into an array of strings and then converts the columns as
>>>>>> appropriate.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Vincent
>>>>>>
>>>>>>
>>>>>> On 7/6/07 8:53 PM, "John Hunter" <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>> On 7/6/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
>>>>>>>> I wrote the attached (small) program to read in a text/csv file
>>>>>>>> with different data types and convert it into a recarray without
>>>>>>>> having to pre-specify the dtypes or variable names. I am just too
>>>>>>>> lazy to type in stuff like that :) The supported types are int,
>>>>>>>> float, dates, and strings.
>>>>>>>>
>>>>>>>> It works pretty well but it is not (yet) as fast as I would like,
>>>>>>>> so I was wondering if any of the numpy experts on this list might
>>>>>>>> have some suggestions on how to speed it up. I need to read
>>>>>>>> 500MB-1GB files so speed is important for me.
>>>>>>>
>>>>>>> In matplotlib.mlab svn, there is a function csv2rec that does the
>>>>>>> same. You may want to compare implementations in case we can
>>>>>>> fruitfully cross pollinate them. In the examples directory, there
>>>>>>> is an example script examples/loadrec.py

--
Vincent R. Nijs
Assistant Professor of Marketing
Kellogg School of Management, Northwestern University
2001 Sheridan Road, Evanston, IL 60208-2001
Phone: +1-847-491-4574 Fax: +1-847-491-2498
E-mail: [EMAIL PROTECTED]
Skype: vincentnijs
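
On the column-type question at the top of this message: assigning to ra.col1 writes back into the existing field, so the values are cast to that column's original dtype again and the column type does not change. Below is a minimal sketch of the "new array with an updated dtype list" idea, combined with Tim's suggestion of reading numbers as floats and then converting the columns that have no fractional part. The helper name floats_to_ints and the choice of int64 are assumptions made for illustration; they are not taken from any of the attached scripts.

```python
import numpy as np


def floats_to_ints(rec):
    """Return a copy of `rec` in which float columns that hold only
    whole, finite numbers are stored as int64 (hypothetical helper)."""
    new_descr = []
    for name in rec.dtype.names:
        col = rec[name]
        is_whole_float = (
            col.dtype.kind == "f"
            and bool(np.all(np.isfinite(col)))       # NaN/inf cannot become ints
            and bool(np.all(col == np.floor(col)))   # no fractional part anywhere
        )
        # Keep the original field dtype unless the column qualifies as int.
        new_descr.append((name, np.int64 if is_whole_float else rec.dtype[name]))

    # Build a new array with the updated dtype list and copy column by column.
    out = np.empty(rec.shape, dtype=new_descr)
    for name in rec.dtype.names:
        out[name] = rec[name]
    return out.view(np.recarray)


# Small example: col1 is integer-valued, col6 only looks like ints at first.
ra = np.array(
    [(1.0, 1.12, 0.0), (2.0, 1.33, 1.23)],
    dtype=[("col1", float), ("col4", float), ("col6", float)],
).view(np.recarray)

fixed = floats_to_ints(ra)
print(fixed.dtype)
```

This copies the data once, so for 500MB-1GB files it costs an extra array's worth of memory while both versions are alive.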
_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion