On 7/8/07, Timothy Hochberg <[EMAIL PROTECTED]> wrote:


On 7/8/07, Torgil Svensson <[EMAIL PROTECTED]> wrote:
> Given that both your script and the mlab version preload the whole
> file before calling the numpy constructor, I'm curious how that compares
> in speed to using numpy's fromiter function on your data. Using fromiter
> should improve memory usage (~50%?).
>
> The drawback is for string columns, where we no longer know the
> width of the largest item. I made it fall back to "object" in this
> case.
>
> Attached is a fromiter version of your script. Possible speedups could
> come from trying different approaches in the "convert_row" function,
> for example using "zip" or "enumerate" instead of "izip".
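The fromiter approach described above boils down to feeding a generator of already-converted tuples into numpy together with a structured dtype. A minimal sketch (modern Python 3 syntax, with made-up column names and values -- not the attached script itself):

```python
import numpy as np

# Rows as strings, as they would come from splitting CSV lines.
rows = [("1", "1.12"), ("2", "2.11"), ("2", "1.33")]

# fromiter consumes the generator item by item, so no intermediate
# Python list of converted tuples is ever materialized.
arr = np.fromiter(((int(a), float(b)) for a, b in rows),
                  dtype=[("col1", int), ("col2", float)])
```

The memory saving comes from the fact that only the final typed array is held, rather than the whole file plus a list of parsed rows.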

I suspect that you'd do better here if you removed a bunch of layers from
the conversion functions. Right now it looks like:
imap->chain->convert_row->tuple->generator->izip. That's
five levels deep and Python functions are reasonably expensive. I would try
to be a lot less clever and do something like:

    def data_iterator(row_iter, delim):
        row0 = row_iter.next().split(delim)
        converters = find_formats(row0) # left as an exercise
        yield tuple(f(x) for f, x in zip(converters, row0))
        for row in row_iter:
            yield tuple(f(x) for f, x in zip(converters, row.split(delim)))



That sounds sane. I may have picked up bad habits here and gotten away
with them since I'm very I/O-bound in these cases. My main objective has
been reducing the memory footprint to reduce swapping.


That's just a sketch and I haven't timed it, but it cuts a few levels out of
the call chain, so has a reasonable chance of being faster. If you wanted to
be really clever, you could use some exec magic after you figure out the
conversion functions to compile a special function that generates the tuples
directly without any use of tuple or zip. I don't have time to work through
the details right now, but the code you would compile would end up looking
like this:

for (x0, x1, x2) in row_iter:
   yield (int(x0), float(x1), float(x2))

Here we've assumed that find_formats determined that there are three fields,
an int and two floats. Once you have this info you can build an appropriate
function and exec it. This would cut another couple levels out of the call
chain. Again, I haven't timed it, or tried it, but it looks like it would be
fun to try.
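For illustration, the exec trick above could be assembled along these lines (a hypothetical Python 3 sketch; the field list stands in for whatever find_formats would have determined):

```python
# Assume the format-sniffing step decided on one int and two floats.
fields = [("x0", "int"), ("x1", "float"), ("x2", "float")]

names = ", ".join(name for name, _ in fields)                 # "x0, x1, x2"
calls = ", ".join("%s(%s)" % (fn, name) for name, fn in fields)

# Build the specialized generator source, then compile and exec it.
src = ("def convert(row_iter):\n"
       "    for (%s) in row_iter:\n"
       "        yield (%s)\n" % (names, calls))
ns = {}
exec(compile(src, "<generated>", "exec"), ns)
convert = ns["convert"]
```

The generated function unpacks and converts each row in a single loop body, with no per-row zip or tuple() call.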

-tim



Thank you for the lesson!  Great tip. This opens up a variety of
new coding options. I've made an attempt at the fun part. Attached is
a version that generates the following generator code for Vincent's
__name__ == '__main__' code:

def get_data_iterator(row_iter,delim):
    yield (int('1'),int('3'),datestr2num('1/97'),float('1.12'),float('2.11'),float('1.2'))
    for row in row_iter:
        x0,x1,x2,x3,x4,x5=row.split(delim)
        yield (int(x0),int(x1),datestr2num(x2),float(x3),float(x4),float(x5))

Best Regards,

//Torgil
import numpy as N
import itertools,csv
from matplotlib.dates import datestr2num
from itertools import imap,izip,chain

string_conversions=[
    # conversion function,  numpy dtype
    ( int,                  N.dtype(int)    ),
    ( float,                N.dtype(float)  ),
    ( datestr2num,          N.dtype(float)  ),
    ]

def string_to_dt_cvt(s):
    """
    Return (numpy dtype, conversion function) for the first converter
    that accepts s; fall back to (object, str).
    """
    for fn,dt in string_conversions:
        try:
            fn(s)
            return dt,fn
        except Exception:
            pass
    return N.dtype(object),str


def load(fname,delim = ',',has_varnm = True, prn_report = True):
    """
    Loading data from a file using fromiter. Returns a recarray.
    """

    row_iter=open(fname,'rb')
    row0=map(str.strip,row_iter.next().split(delim))
    if not has_varnm:
        varnm = ['col%s' % str(i+1) for i in xrange(len(row0))]
        dt_row=row0
    else:
        varnm = [i.strip() for i in row0]
        dt_row=map(str.strip,row_iter.next().split(delim))

    str_cvt=[string_to_dt_cvt(item) for item in dt_row]
    descr=[(name,dt) for name,(dt,cvt_fn) in zip(varnm,str_cvt)]
    var_nm=["x%d" % i for i,(dt,cvt_fn) in enumerate(str_cvt)]
    fn_nm=[fn.__name__ for dt,fn in str_cvt]

    indent=" "*4
    generator_code="\n".join([
        "def get_data_iterator(row_iter,delim):",
        indent+"yield (" + ",".join(["%s('%s')" % (f,r) for f,r in zip(fn_nm,dt_row)])+")",
        indent+"for row in row_iter:",
        indent*2+",".join(var_nm)+"=row.split(delim)",
        indent*2+"yield (" + ",".join(["%s(%s)" % (f,v) for f,v in zip(fn_nm,var_nm)])+")",
        ])

    exec(compile(generator_code,'<string>','exec'))

    data=N.fromiter(get_data_iterator(row_iter,delim),dtype=descr).view(N.recarray)

    # load report
    if prn_report:
        print "##########################################\n"
        print "Loaded file: %s\n" % fname
        print "Nr obs: %s\n" % data.shape[0]
        print "Variables and datatypes:\n"
        for i in data.dtype.descr:
            print "Varname: %s, Type: %s, Sample: %s" % (i[0], i[1], str(data[i[0]][0:3]))
        print "\n##########################################\n"

    return data

def show_dates(dates):
    # num2date is the inverse of datestr2num; imported here since the
    # rest of the script doesn't need it
    from matplotlib.dates import num2date
    return N.array([i.strftime('%d %b %y') for i in num2date(dates)])

if __name__ == '__main__':

    # creating data
    data = [['col1', 'col2', 'col3', 'col4', 'col5', 'col6'],
            ['1','3','1/97','1.12','2.11','1.2'],
            ['1','2','3/97','1.21','3.12','1.43'],
            ['2','1','2/97','1.12','2.11','1.28'],
            ['2','2','4/97','1.33','2.26','1.23'],
            ['2','2','5/97','1.73','2.42','1.26']]

    # saving data to csv file
    f = open('testdata.csv','wb')
    output = csv.writer(f)
    for i in data:
        output.writerow(i)
    f.close()

    # opening data file with variable names
    ra = load('testdata.csv')
_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion
