On 7/8/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:

Torgil,

The function seems to work well and is slightly faster than your previous
version (about 1/6th faster).

Yes, I do have columns that start with what look like ints and then
turn out to be floats. Something like the data below (col6).

    data = [['col1', 'col2', 'col3', 'col4', 'col5', 'col6'],
            ['1','3','1/97','1.12','2.11','0'],
            ['1','2','3/97','1.21','3.12','0'],
            ['2','1','2/97','1.12','2.11','0'],
            ['2','2','4/97','1.33','2.26','1.23'],
            ['2','2','5/97','1.73','2.42','1.26']]

I think your function assumes that the first element of a column has the
appropriate type. That may not hold if you have missing values or mixed
types.



Vincent,

Do you need to auto-detect the column types? Things get a lot simpler if you
have a known schema for each file; then you can simply pass that to a
reader function. It's also more robust, since there's no way in general to
distinguish a column of integers from a column of floats that happens to
have no fractional part.

If you do need to auto-detect, one approach would be to read both int-like
and float-like values in as floats. Then, after you get the array, check
the columns and, for any that have no fractional parts, make a new array
where those columns are integers.
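
Roughly, for that second step, something like this (just a sketch, assuming
the data is already in a structured array whose numeric columns all came in
as float64; the function name is made up):

    import numpy as np

    def demote_float_columns(arr):
        # Build a new dtype: float columns with no fractional part (and no
        # NaNs, since NaN != floor(NaN)) become int; everything else keeps
        # its existing dtype.
        descr = []
        for name in arr.dtype.names:
            col = arr[name]
            if col.dtype.kind == 'f' and np.all(col == np.floor(col)):
                descr.append((name, int))
            else:
                descr.append((name, col.dtype))
        out = np.empty(arr.shape, dtype=descr)
        for name in arr.dtype.names:
            out[name] = arr[name]   # float->int columns are cast here
        return out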

-tim

Best,

Vincent


On 7/8/07 3:31 PM, "Torgil Svensson" <[EMAIL PROTECTED]> wrote:

> Hi
>
> I stumble on these types of problems from time to time so I'm
> interested in efficient solutions myself.
>
> Do you have a column which starts with something suitable for int on
> the first row (without a decimal separator) but has decimals further
> down?
>
> This will be a little tricky to support. One solution could be to raise
> StopIteration, calculate new type-conversion functions, and start over,
> iterating over both the old data and the rest of the iterator.
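>
> A minimal sketch of that restart idea (hypothetical names, not the
> attached script): keep the raw rows seen so far, and when a converter
> fails, hand the caller a chained iterator plus widened converters so it
> can start over:
>
>     from itertools import chain
>
>     class RestartNeeded(Exception):
>         def __init__(self, rows, converters):
>             self.rows, self.converters = rows, converters
>
>     def convert_rows(rows, converters):
>         it, seen = iter(rows), []
>         for row in it:
>             seen.append(row)
>             try:
>                 yield tuple(f(x) for f, x in zip(converters, row))
>             except ValueError:
>                 # e.g. int() hit '2174.875': widen int columns to float,
>                 # then the caller discards its partial result and reruns
>                 # convert_rows(chain(seen, it), widened)
>                 widened = [float if f is int else f for f in converters]
>                 raise RestartNeeded(chain(seen, it), widened)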
>
> It'd be great if you could try the load_gen_iter.py I've attached to
> my response to Tim.
>
> Best Regards,
>
> //Torgil
>
> On 7/8/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
>> I am not (yet) very familiar with much of the functionality used in
>> your script, Torgil (izip, imap, etc.), but I really appreciate you
>> taking the time to look at this!
>>
>> The program stopped with the following error:
>>
>>   File "load_iter.py", line 48, in <genexpr>
>>     convert_row=lambda r: tuple(fn(x) for fn,x in
>> izip(conversion_functions,r))
>> ValueError: invalid literal for int() with base 10: '2174.875'
>>
>> A lot of the data I use can have a column with a set of ints (e.g.,
>> zeros), but then the rest of that same column could be floats. I guess
>> finding the right conversion function is the tricky part. I was thinking
>> about sampling, say, every 10th observation to test which function to
>> use. Not sure how that would work, however.
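>>
>> Maybe something like this (just a sketch; it assumes the raw data is
>> already a list of string tuples, and sampling can of course still miss a
>> stray float hiding in an int-looking column):
>>
>>     def guess_converters(rows, step=10):
>>         # Pick int, float, or str per column from every `step`-th row.
>>         def best(values):
>>             for func in (int, float):
>>                 try:
>>                     for v in values:
>>                         func(v)
>>                     return func
>>                 except ValueError:
>>                     pass
>>             return str
>>         sample = rows[::step]
>>         return [best(col) for col in zip(*sample)]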
>>
>> If I ignore the option of an int (i.e., everything is a float, date, or
>> string) then your script is about twice as fast as mine!!
>>
>> Question: If you do ignore the ints initially, once the recarray is in
>> memory, would there be a quick way to check whether the floats could pass
>> as ints? This may seem like a backwards approach, but it might be 'safer'
>> if you really want to preserve the ints.
>>
>> Thanks again!
>>
>> Vincent
>>
>>
>> On 7/8/07 5:52 AM, "Torgil Svensson" <[EMAIL PROTECTED]> wrote:
>>
>>> Given that both your script and the mlab version preload the whole
>>> file before calling the numpy constructor, I'm curious how that compares
>>> in speed to using numpy's fromiter function on your data. Using fromiter
>>> should improve memory usage (by roughly 50%?).
>>>
>>> The drawback is string columns, where we no longer know the width of
>>> the largest item. I made it fall back to "object" in this case.
>>>
>>> Attached is a fromiter version of your script. Possible speedups could
>>> come from trying different approaches to the "convert_row" function, for
>>> example using "zip" or "enumerate" instead of "izip".
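>>>
>>> For reference, the core of the fromiter approach looks roughly like
>>> this (a sketch with made-up column names, not the attached script; the
>>> real version detects the converters and dtype instead of hard-coding
>>> them, and string columns here get a fixed width since fromiter needs
>>> the complete dtype up front):
>>>
>>>     import csv
>>>     import numpy as np
>>>
>>>     def load_csv(filename):
>>>         converters = [int, int, str, float, float, float]
>>>         dt = np.dtype([('col1', 'i8'), ('col2', 'i8'), ('col3', 'S8'),
>>>                        ('col4', 'f8'), ('col5', 'f8'), ('col6', 'f8')])
>>>         reader = csv.reader(open(filename))
>>>         next(reader)                 # skip the header row
>>>         rows = (tuple(f(x) for f, x in zip(converters, row))
>>>                 for row in reader)
>>>         # fromiter never holds the whole file as Python strings
>>>         return np.fromiter(rows, dtype=dt)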
>>>
>>> Best Regards,
>>>
>>> //Torgil
>>>
>>>
>>> On 7/8/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
>>>> Thanks for the reference, John! csv2rec is about 30% faster than my
>>>> code on the same data.
>>>>
>>>> If I read the code in csv2rec correctly, it converts the data as it is
>>>> being read, using the csv module. My setup reads the whole dataset into
>>>> an array of strings and then converts the columns as appropriate.
>>>>
>>>> Best,
>>>>
>>>> Vincent
>>>>
>>>>
>>>> On 7/6/07 8:53 PM, "John Hunter" <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> On 7/6/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
>>>>>> I wrote the attached (small) program to read in a text/csv file with
>>>>>> different data types and convert it into a recarray without having to
>>>>>> pre-specify the dtypes or variable names. I am just too lazy to type
>>>>>> in stuff like that :) The supported types are int, float, dates, and
>>>>>> strings.
>>>>>>
>>>>>> It works pretty well, but it is not (yet) as fast as I would like, so
>>>>>> I was wondering if any of the numpy experts on this list might have
>>>>>> some suggestions on how to speed it up. I need to read 500MB-1GB
>>>>>> files, so speed is important for me.
>>>>>
>>>>> In matplotlib.mlab svn, there is a function csv2rec that does the
>>>>> same.  You may want to compare implementations in case we can
>>>>> fruitfully cross-pollinate them.  In the examples directory, there is
>>>>> an example script, examples/loadrec.py.

--
Vincent R. Nijs
Assistant Professor of Marketing
Kellogg School of Management, Northwestern University
2001 Sheridan Road, Evanston, IL 60208-2001
Phone: +1-847-491-4574 Fax: +1-847-491-2498
E-mail: [EMAIL PROTECTED]
Skype: vincentnijs



--
.  __
.   |-\
.
.  [EMAIL PROTECTED]