Re: [Numpy-discussion] convert csv file into recarray without pre-specifying dtypes and variable names

Torgil Svensson Sun, 08 Jul 2007 15:40:19 -0700

> Question: If you do ignore the int's initially, once the rec array is in
> memory, would there be a quick way to check if the floats could pass as
> int's? This may seem like a backwards approach but it might be 'safer' if
> you really want to preserve the int's.


In your case the floats don't pass as ints since you have decimals.
The attached file takes another approach (sorry for the lack of comments).
If a conversion fails, the current row is stored and the iterator
exits (without setting the 'finished' flag to true). The program
then recalculates the conversion functions and checks for changes. If
the changes are supported (i.e., the format_change dictionary has a
conversion function for the old data), it calls fromiter again with an
iterator like this:

def get_data_iterator(row_iter,delim,res):
   for x0,x1,x2,x3,x4,x5 in res['data']:
       x0=float(x0)
       print (x0,x1,x2,x3,x4,x5)
       yield (x0,x1,x2,x3,x4,x5)
   yield (float('2.0'),int('2'),datestr2num('4/97'),float('1.33'),float('2.26'),float('1.23'))
   for row in row_iter:
       x0,x1,x2,x3,x4,x5=row.split(delim)
       try:
           yield (float(x0),int(x1),datestr2num(x2),float(x3),float(x4),float(x5))
       except:
           res['row']=row
           return
   res['finished']=True

res['data'] is the previously converted data. This has the obvious
disadvantage that if only the last row has fractions in a column,
it'll cost double memory. Also if many columns change format at
different places it has to re-convert every time.
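A rough sketch of the same restart idea without generating source code, in case it helps to see the control flow. It is written for Python 3 and a recent NumPy rather than the Python 2 of the attached script, it handles only int and float columns, and the names `guess` and `load_numeric` are made up for this illustration:

```python
import numpy as np


def guess(text):
    """Return int if the text parses as an int, otherwise float."""
    try:
        int(text)
        return int
    except ValueError:
        float(text)  # raises ValueError for non-numeric text
        return float


def load_numeric(lines, delim=","):
    """Load delimited numeric rows, restarting fromiter when a column widens."""
    rows = (line.strip().split(delim) for line in lines)
    failed = next(rows)                  # the first row seeds the guess
    fns = [guess(s) for s in failed]
    old = None                           # records converted so far
    while failed is not None:
        state = {"failed": None}
        start = failed

        def gen():
            # Re-emit the rows converted earlier, using the current converters.
            if old is not None:
                for rec in old.tolist():
                    yield tuple(fn(v) for fn, v in zip(fns, rec))
            yield tuple(fn(s) for fn, s in zip(fns, start))
            for row in rows:
                try:
                    yield tuple(fn(s) for fn, s in zip(fns, row))
                except ValueError:
                    state["failed"] = row    # remember the offending row
                    return

        dtype = [("x%d" % i, np.int64 if fn is int else np.float64)
                 for i, fn in enumerate(fns)]
        old = np.fromiter(gen(), dtype=dtype)
        failed = state["failed"]
        if failed is not None:
            new = [guess(s) for s in failed]
            # Only widen: a column that became float stays float.
            fns = [float if float in (a, b) else int for a, b in zip(fns, new)]
    return old


lines = ["1,3,1.12", "1,2,1.21", "2.0,2,1.33", "2.1,2,1.73"]
data = load_numeric(lines)
print(data.dtype)
```

As in the description above, every restart copies the rows converted so far, so a late format change still costs roughly double memory.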

I don't recommend this because of the drawbacks and extra complexity.
I think it is best to convert your files (or file generation) so that
float columns are represented with 0.0 instead of 0.

Best Regards,

//Torgil

On 7/8/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:

I am not (yet) very familiar with much of the functionality introduced in
your script Torgil (izip, imap, etc.), but I really appreciate you taking
the time to look at this!

The program stopped with the following error:

  File "load_iter.py", line 48, in <genexpr>
    convert_row=lambda r: tuple(fn(x) for fn,x in izip(conversion_functions,r))
ValueError: invalid literal for int() with base 10: '2174.875'

A lot of the data I use can have a column with a set of int's (e.g., 0's),
but then the rest of that same column could be floats. I guess finding the
right conversion function is the tricky part. I was thinking about sampling
each, say, 10th obs to test which function to use. Not sure how that would
work however.
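The sampling idea could look roughly like the sketch below. It is plain Python 3, the helper names `infer_type` and `sample_types` are invented here, and it only distinguishes int, float and str. Sampling can miss the one value that would force a wider type, so this is a guess at the column types, not a guarantee:

```python
def infer_type(text):
    """Return the narrowest of int, float, str that can parse the text."""
    for fn in (int, float):
        try:
            fn(text)
            return fn
        except ValueError:
            pass
    return str


# Wider types get a higher rank, so max() picks the widest type seen.
RANK = {int: 0, float: 1, str: 2}


def sample_types(rows, step=10):
    """Guess one type per column from every step-th row."""
    widest = None
    for row in rows[::step]:
        guessed = [infer_type(s) for s in row]
        if widest is None:
            widest = guessed
        else:
            widest = [max(a, b, key=RANK.get) for a, b in zip(widest, guessed)]
    return widest


rows = [[str(i), "0"] for i in range(25)]
rows[10][1] = "2174.875"      # falls on a sampled row
print(sample_types(rows))
```

If the float had been on row 11 instead of row 10, the sample would not have seen it and the second column would still be guessed as int.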

If I ignore the option of an int (i.e., everything is a float, date, or
string) then your script is about twice as fast as mine!!

Question: If you do ignore the int's initially, once the rec array is in
memory, would there be a quick way to check if the floats could pass as
int's? This may seem like a backwards approach but it might be 'safer' if
you really want to preserve the int's.
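One possible way to do that check once the array is in memory, sketched for Python 3 and a recent NumPy. The function names `floats_pass_as_ints` and `downcast_int_columns` are hypothetical, and the sketch assumes the values fit in a 64-bit integer:

```python
import numpy as np


def floats_pass_as_ints(column):
    """True if every value is finite and has no fractional part."""
    col = np.asarray(column, dtype=float)
    return bool(np.all(np.isfinite(col)) and np.all(col == np.round(col)))


def downcast_int_columns(rec):
    """Copy a structured array, turning whole-number float fields into ints."""
    descr = []
    for name in rec.dtype.names:
        field = rec[name]
        if field.dtype.kind == "f" and floats_pass_as_ints(field):
            descr.append((name, np.int64))
        else:
            descr.append((name, field.dtype))
    out = np.empty(rec.shape, dtype=descr)
    for name in rec.dtype.names:
        out[name] = rec[name]     # field-by-field copy with casting
    return out


rec = np.array([(1.0, 2.5), (2.0, 3.0)], dtype=[("a", float), ("b", float)])
out = downcast_int_columns(rec)
print(out.dtype)
```

This makes a second copy of the data, so for files of the size mentioned in this thread the memory cost may matter.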

Thanks again!

Vincent


On 7/8/07 5:52 AM, "Torgil Svensson" <[EMAIL PROTECTED]> wrote:

> Given that both your script and the mlab version preload the whole
> file before calling the numpy constructor, I'm curious how that compares in
> speed to using numpy's fromiter function on your data. Using fromiter
> should improve on memory usage (~50% ?).
>
> The drawback is for string columns, where we no longer know the
> width of the largest item. I made it fall back to "object" in this
> case.
>
> Attached is a fromiter version of your script. Possible speedups could
> be done by trying different approaches to the "convert_row" function,
> for example using "zip" or "enumerate" instead of "izip".
>
> Best Regards,
>
> //Torgil
>
>
> On 7/8/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
>> Thanks for the reference John! csv2rec is about 30% faster than my code on
>> the same data.
>>
>> If I read the code in csv2rec correctly it converts the data as it is being
>> read using the csv modules. My setup reads in the whole dataset into an
>> array of strings and then converts the columns as appropriate.
>>
>> Best,
>>
>> Vincent
>>
>>
>> On 7/6/07 8:53 PM, "John Hunter" <[EMAIL PROTECTED]> wrote:
>>
>>> On 7/6/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
>>>> I wrote the attached (small) program to read in a text/csv file with
>>>> different data types and convert it into a recarray without having to
>>>> pre-specify the dtypes or variable names. I am just too lazy to type in
>>>> stuff like that :) The supported types are int, float, dates, and strings.
>>>>
>>>> It works pretty well but it is not (yet) as fast as I would like, so I was
>>>> wondering if any of the numpy experts on this list might have some suggestions
>>>> on how to speed it up. I need to read 500MB-1GB files so speed is important
>>>> for me.
>>>
>>> In matplotlib.mlab svn, there is a function csv2rec that does the
>>> same.  You may want to compare implementations in case we can
>>> fruitfully cross pollinate them.  In the examples directory, there is an
>>> example script examples/loadrec.py
>>> _______________________________________________
>>> Numpy-discussion mailing list
>>> Numpy-discussion@scipy.org
>>> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>>>

--
Vincent R. Nijs
Assistant Professor of Marketing
Kellogg School of Management, Northwestern University
2001 Sheridan Road, Evanston, IL 60208-2001
Phone: +1-847-491-4574 Fax: +1-847-491-2498
E-mail: [EMAIL PROTECTED]
Skype: vincentnijs




import numpy as N
import itertools,csv
from matplotlib.dates import datestr2num
from itertools import imap,izip,chain

string_conversions=[
    # conversion function,  numpy dtype
    ( int,                  N.dtype(int)    ),
    ( float,                N.dtype(float)  ),
    ( datestr2num,          N.dtype(float)  ),
    ]
format_change={ # (from_fn,to_fn) : conversion_fn for old converted
    (int,float) : float,
    }
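def string_to_dt_cvt(s):
    """
    Return the first (conversion function, numpy dtype) pair that can
    parse the string s, falling back to (str, object).
    """
    for fn,dt in string_conversions:
        try:
            fn(s)               # only checking that the conversion succeeds
            return fn,dt
        except Exception:       # the converters may raise different exception types
            pass
    return str,N.dtype(object)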


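def get_generator_fn(str_cvt,dt_row,cvt=None):
    """
    Build the source code of get_data_iterator for the current
    conversion functions. cvt lists (column index, function name)
    pairs used to re-convert the rows already stored in res['data'].
    """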
    var_nm=["x%d" % i for i,x in enumerate(str_cvt)]
    fn_nm=[fn.__name__ for fn,dt in str_cvt]

    indent=" "*4
    header=["def get_data_iterator(row_iter,delim,res):"]
    olditer=[]
    if cvt:
        olditer=[indent+"for "+",".join(var_nm)+" in res['data']:"]
        olditer.extend([indent*2+"x%d=%s(x%d)" % (i,fn,i) for i,fn in cvt])
        olditer.append(indent*2+"print ("+",".join(var_nm)+")")
        olditer.append(indent*2+"yield ("+",".join(var_nm)+")")
    rowiter=[
        indent+"yield ("+",".join(["%s('%s')" % (f,r) for f,r in zip(fn_nm,dt_row)])+")",
        indent+"for row in row_iter:",
        indent*2+",".join(var_nm)+"=row.split(delim)",
        indent*2+"try:",
        indent*3+"yield ("+",".join(["%s(%s)" % (f,v) for f,v in zip(fn_nm,var_nm)])+")",
        indent*2+"except:",
        indent*3+"res['row']=row",
        indent*3+"return",
        indent+"res['finished']=True",
        ]
    return "\n".join(header+olditer+rowiter)

def load(fname,delim = ',',has_varnm = True, prn_report = True):
    """
    Loading data from a file using fromiter. Returns a recarray.
    """

    row_iter=open(fname,'rb')
    row0=map(str.strip,row_iter.next().split(delim))
    if not has_varnm:
        varnm = ['col%s' % str(i+1) for i in xrange(len(row0))]
        dt_row=row0
    else:
        varnm = [i.strip() for i in row0]
        dt_row=map(str.strip,row_iter.next().split(delim))

    str_cvt=[string_to_dt_cvt(item) for item in dt_row]

    res={}
    res['finished']=False
    cvt=None
    while not res['finished']:
        descr=[(name,dt) for name,(fn,dt) in zip(varnm,str_cvt)]
        generator_fn=get_generator_fn(str_cvt,dt_row,cvt)
        exec(compile(generator_fn,'<string>','exec'))
        data=N.fromiter(get_data_iterator(row_iter,delim,res),dtype=descr).view(N.recarray)
        if not res['finished']:
            res['data']=data
            dt_row=map(str.strip,res['row'].split(delim))
            new_cvt=[string_to_dt_cvt(item) for item in dt_row]
            conv=[(i,(f1,f2)) for i,((f1,d1),(f2,d2)) in enumerate(zip(str_cvt,new_cvt)) if f1!=f2]
            if set([x for i,x in conv])-set(format_change):
                raise ValueError("Unsupported string representation change in columns")
            cvt=[(i,format_change[x].__name__) for i,x in conv]
            str_cvt=new_cvt
            for i,(f1,f2) in conv:
                print "Inconsistent string representation: converting %s from %s to %s" % (varnm[i],f1.__name__,f2.__name__)

    # load report
    if prn_report:
        print "##########################################\n"
        print "Loaded file: %s\n" % fname
        print "Nr obs: %s\n" % data.shape[0]
        print "Variables and datatypes:\n"
        for i in data.dtype.descr:
            print "Varname: %s, Type: %s, Sample: %s" % (i[0], i[1], str(data[i[0]][0:3]))
        print "\n##########################################\n"

    return data

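def show_dates(dates):
    # num2date comes from matplotlib.dates; the pylab name used originally was never imported
    from matplotlib.dates import num2date
    return N.array([i.strftime('%d %b %y') for i in num2date(dates)])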

if __name__ == '__main__':

    # creating data
    data = [['col1', 'col2', 'col3', 'col4', 'col5', 'col6'],
            ['1','3','1/97','1.12','2.11','1.2'],
            ['1','2','3/97','1.21','3.12','1.43'],
            ['2','1','2/97','1.12','2.11','1.28'],
            ['2.0','2','4/97','1.33','2.26','1.23'],
            ['2.1','2','5/97','1.73','2.42','1.26']]

    # saving data to csv file
    f = open('testdata.csv','wb')
    output = csv.writer(f)
    for i in data:
        output.writerow(i)
    f.close()

    # opening data file with variable names
    ra = load('testdata.csv')
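
For comparison, and not part of the original thread: NumPy releases much newer than the one discussed here provide `numpy.genfromtxt`, which can infer per-column types when called with `dtype=None` and take field names from the header row with `names=True`. A minimal sketch, assuming a recent NumPy under Python 3 and using a cut-down, in-memory version of the test data above:

```python
import io

import numpy as np

# A shortened copy of the test data; col1 switches from "2" style ints
# to "2.0" style floats partway down, as in the thread.
text = """col1,col2,col3
1,3,1/97
1,2,3/97
2.0,2,4/97
2.1,2,5/97
"""

rec = np.genfromtxt(io.StringIO(text), delimiter=",", names=True,
                    dtype=None, encoding="utf-8").view(np.recarray)
print(rec.dtype)
```

Here the date-like column comes back as strings; converting it to numbers would still need something like datestr2num, for example through the `converters` argument.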
