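FWIW

>>> import numpy as N            # assumed alias, matching N.dtype below
>>> descr = ra.dtype.descr       # assumed: descr comes from the existing recarray
>>> n,dt=descr[0]
>>> new_dt=dt.replace('f','i')
>>> descr[0]=(n,new_dt)
>>> data=ra.col1.astype(new_dt)
>>> ra.dtype=N.dtype(descr)
>>> ra.col1=data
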
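As far as I can tell, the in-place swap above relies on the old and new field types having the same item size (as 'f8' and 'i8' do), since it reinterprets the existing buffer before the column is overwritten. A plain ra.col1 = ra.col1.astype('i') changes nothing visible because the values are cast straight back into the float field. Below is a rough alternative sketch that copies into a new array instead. The name retype_column and the sample data are hypothetical, made up for illustration, and not taken from any script in this thread.

```python
import numpy as np

def retype_column(ra, name, new_type):
    """Return a copy of record array `ra` with field `name` stored as `new_type`.

    Hypothetical helper for illustration; assumes a flat structured dtype.
    """
    # Rebuild the field list, swapping the type of the chosen field only.
    new_descr = [(n, new_type if n == name else ra.dtype[n])
                 for n in ra.dtype.names]
    out = np.empty(ra.shape, dtype=new_descr)
    # Copy field by field; assignment casts the values to the new field type.
    for n in ra.dtype.names:
        out[n] = ra[n]
    return out.view(np.recarray)

# Made-up sample data: col1 holds whole numbers that were read in as floats.
ra = np.rec.fromrecords([(1.0, 1.12), (2.0, 1.33)], names="col1,col2")
ra2 = retype_column(ra, "col1", "i8")
```

This costs one extra copy of the data, but it does not depend on the item sizes matching.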
//Torgil

On 7/9/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
>
> Tim,
>
> I do want to auto-detect. Reading numbers in as floats is probably not a
> huge penalty.
>
> Is there an easy way to change the type of one column in a recarray that
> you know?
>
> I tried this:
>
> ra.col1 = ra.col1.astype('i')
>
> but that didn't seem to work. I assume that means you would have to create
> a new array from the old one with an updated dtype list.
>
> Thanks,
>
> Vincent
>
>
> On 7/8/07 4:51 PM, "Timothy Hochberg" <[EMAIL PROTECTED]> wrote:
>
>
>
> On 7/8/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
>
> Torgil,
>
> The function seems to work well and is slightly faster than your previous
> version (about 1/6th faster).
>
> Yes, I do have columns that start with, what looks like, int's and then
> turn out to be floats. Something like below (col6).
>
> data = [['col1', 'col2', 'col3', 'col4', 'col5', 'col6'],
>         ['1','3','1/97','1.12','2.11','0'],
>         ['1','2','3/97',' 1.21','3.12','0'],
>         ['2','1','2/97','1.12','2.11','0'],
>         ['2','2','4/97','1.33','2.26',' 1.23'],
>         ['2','2','5/97','1.73','2.42','1.26']]
>
> I think what your function assumes is that the 1st element will be the
> appropriate type. That may not hold if you have missing values or 'mixed
> types'.
>
>
>
> Vincent,
>
> Do you need to auto detect the column types? Things get a lot simpler if
> you have some known schema for each file; then you can simply pass that to
> some reader function. It's also more robust since there's no way in general
> to differentiate a column of integers from a column of floats with no
> decimal part.
>
> If you do need to auto detect, one approach would be to always read both
> int-like stuff and float-like stuff in as floats. Then after you get the
> array check over the various columns and if any have no fractional parts,
> make a new array where those columns are integers.
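The float-first approach described just above could be sketched roughly as follows. This is only an illustration under stated assumptions: ints_where_possible is a hypothetical name, int64 is an arbitrary choice of integer type, and the sample values are made up in the style of the data shown earlier rather than taken from a real file.

```python
import numpy as np

def ints_where_possible(arr):
    """Return a copy of structured array `arr` in which float fields that
    contain only whole numbers are stored as int64 instead.

    Hypothetical helper for illustration; assumes a flat structured dtype.
    """
    new_descr = []
    for name in arr.dtype.names:
        col = arr[name]
        is_float = np.issubdtype(col.dtype, np.floating)
        # NaN and inf cannot be stored in an integer field, so a column
        # containing either one is left as float.
        if is_float and np.all(np.isfinite(col)) and np.all(col == np.floor(col)):
            new_descr.append((name, np.int64))
        else:
            new_descr.append((name, col.dtype))
    out = np.empty(arr.shape, dtype=new_descr)
    # Copy field by field; assignment casts the values to the new field type.
    for name in arr.dtype.names:
        out[name] = arr[name]
    return out

# Everything was read in as float; only col1 turns out to be integer-valued.
raw = np.array([(1.0, 1.12, 0.0), (2.0, 1.33, 1.23), (2.0, 1.73, 1.26)],
               dtype=[("col1", "f8"), ("col4", "f8"), ("col6", "f8")])
clean = ints_where_possible(raw)
```

Because the check runs over the whole column after loading, a column like col6 that starts with 0 and only later shows decimals stays float. Very large whole numbers outside the int64 range are not handled by this sketch.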
>
> -tim
>
>
> Best,
>
> Vincent
>
>
> On 7/8/07 3:31 PM, "Torgil Svensson" <[EMAIL PROTECTED]> wrote:
>
> > Hi
> >
> > I stumble on these types of problems from time to time so I'm
> > interested in efficient solutions myself.
> >
> > Do you have a column which starts with something suitable for int on
> > the first row (without decimal separator) but has decimals further
> > down?
> >
> > This will be a little tricky to support. One solution could be to yield
> > StopIteration, calculate new type-conversion-functions and start over
> > iterating over both the old data and the rest of the iterator.
> >
> > It'd be great if you could try the load_gen_iter.py I've attached to
> > my response to Tim.
> >
> > Best Regards,
> >
> > //Torgil
> >
> > On 7/8/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
> >> I am not (yet) very familiar with much of the functionality introduced
> >> in your script Torgil (izip, imap, etc.), but I really appreciate you
> >> taking the time to look at this!
> >>
> >> The program stopped with the following error:
> >>
> >>   File "load_iter.py", line 48, in <genexpr>
> >>     convert_row=lambda r: tuple(fn(x) for fn,x in izip(conversion_functions,r))
> >> ValueError: invalid literal for int() with base 10: '2174.875'
> >>
> >> A lot of the data I use can have a column with a set of int's (e.g.,
> >> 0's), but then the rest of that same column could be floats. I guess
> >> finding the right conversion function is the tricky part. I was thinking
> >> about sampling each, say, 10th obs to test which function to use. Not
> >> sure how that would work however.
> >>
> >> If I ignore the option of an int (i.e., everything is a float, date, or
> >> string) then your script is about twice as fast as mine!!
> >>
> >> Question: If you do ignore the int's initially, once the rec array is in
> >> memory, would there be a quick way to check if the floats could pass as
> >> int's?
> >> This may seem like a backwards approach but it might be 'safer' if
> >> you really want to preserve the int's.
> >>
> >> Thanks again!
> >>
> >> Vincent
> >>
> >>
> >> On 7/8/07 5:52 AM, "Torgil Svensson" <[EMAIL PROTECTED]> wrote:
> >>
> >>> Given that both your script and the mlab version preload the whole
> >>> file before calling the numpy constructor I'm curious how that compares
> >>> in speed to using numpy's fromiter function on your data. Using fromiter
> >>> should improve on memory usage (~50% ?).
> >>>
> >>> The drawback is for string columns where we no longer know the
> >>> width of the largest item. I made it fall back to "object" in this
> >>> case.
> >>>
> >>> Attached is a fromiter version of your script. Possible speedups could
> >>> be done by trying different approaches to the "convert_row" function,
> >>> for example using "zip" or "enumerate" instead of "izip".
> >>>
> >>> Best Regards,
> >>>
> >>> //Torgil
> >>>
> >>>
> >>> On 7/8/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
> >>>> Thanks for the reference John! csv2rec is about 30% faster than my
> >>>> code on the same data.
> >>>>
> >>>> If I read the code in csv2rec correctly it converts the data as it is
> >>>> being read using the csv module. My setup reads in the whole dataset
> >>>> into an array of strings and then converts the columns as appropriate.
> >>>>
> >>>> Best,
> >>>>
> >>>> Vincent
> >>>>
> >>>>
> >>>> On 7/6/07 8:53 PM, "John Hunter" <[EMAIL PROTECTED]> wrote:
> >>>>
> >>>>> On 7/6/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
> >>>>>> I wrote the attached (small) program to read in a text/csv file with
> >>>>>> different data types and convert it into a recarray without having
> >>>>>> to pre-specify the dtypes or variable names. I am just too lazy to
> >>>>>> type in stuff like that :) The supported types are int, float, dates,
> >>>>>> and strings.
> >>>>>>
> >>>>>> It works pretty well but it is not (yet) as fast as I would like so I
> >>>>>> was wondering if any of the numpy experts on this list might have some
> >>>>>> suggestions on how to speed it up. I need to read 500MB-1GB files so
> >>>>>> speed is important for me.
> >>>>>
> >>>>> In matplotlib.mlab svn, there is a function csv2rec that does the
> >>>>> same. You may want to compare implementations in case we can
> >>>>> fruitfully cross pollinate them. In the examples directory, there is
> >>>>> an example script examples/loadrec.py
> >>>>> _______________________________________________
> >>>>> Numpy-discussion mailing list
> >>>>> Numpy-discussion@scipy.org
> >>>>> http://projects.scipy.org/mailman/listinfo/numpy-discussion
> >>
> >> --
> >> Vincent R. Nijs
> >> Assistant Professor of Marketing
> >> Kellogg School of Management, Northwestern University
> >> 2001 Sheridan Road, Evanston, IL 60208-2001
> >> Phone: +1-847-491-4574 Fax: +1-847-491-2498
> >> E-mail: [EMAIL PROTECTED]
> >> Skype: vincentnijs