Given that both your script and the mlab version preload the whole
file before calling the numpy constructor, I'm curious how that compares
in speed to using numpy's fromiter function on your data. Using fromiter
should improve memory usage (~50% ?).
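
A minimal sketch of the difference, written for Python 3 and a current
NumPy rather than the Python 2 of the attached script. The two-column
layout and the names "descr", "lines" and "rows" are made up for
illustration; the memory figure above is a guess and is not measured here.

```python
import numpy as np

# Hypothetical two-column layout, chosen only for illustration.
descr = [("id", np.int64), ("value", np.float64)]
lines = ["1,1.12", "2,1.21", "3,1.33"]

def rows(lines):
    # Yield one converted tuple per line; nothing is accumulated here.
    for line in lines:
        a, b = line.split(",")
        yield int(a), float(b)

# Preloading: a complete list of Python tuples exists before the array is built.
preloaded = np.array(list(rows(lines)), dtype=descr)

# Streaming: fromiter pulls rows from the generator one at a time.
streamed = np.fromiter(rows(lines), dtype=descr)

print(streamed["id"], streamed["value"])
```

If the number of rows is known in advance, passing count= to fromiter
lets it allocate the output once instead of growing it as it goes.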

The drawback is for string columns, where we no longer know the
width of the largest item. I made it fall back to "object" in this
case.

Attached is a fromiter version of your script. Further speedups might
come from trying different approaches to the "convert_row" function,
for example using "zip" or "enumerate" instead of "izip".
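
A rough way to compare such variants, sketched for Python 3 (where izip
is gone and the built-in zip is already lazy). The converter list and
sample row are assumptions based on the test data in the script, with
str standing in for pylab.datestr2num so the sketch needs only the
standard library. Timings depend on the machine, so none are quoted.

```python
from timeit import timeit

# Stand-in converters; str replaces pylab.datestr2num for the date column.
conversion_functions = [int, int, str, float, float, float]
row = ["1", "3", "1/97", "1.12", "2.11", "1.2"]

def convert_zip(r):
    # Pair each converter with its field.
    return tuple(fn(x) for fn, x in zip(conversion_functions, r))

def convert_enumerate(r):
    # Look the converter up by column index instead.
    return tuple(conversion_functions[i](x) for i, x in enumerate(r))

def convert_map(r):
    # Let map walk both sequences in parallel.
    return tuple(map(lambda fn, x: fn(x), conversion_functions, r))

for f in (convert_zip, convert_enumerate, convert_map):
    seconds = timeit(lambda: f(row), number=10000)
    print("%-18s %.4f s" % (f.__name__, seconds))
```

All three return the same tuple for a given row, so whichever is
fastest on the real data can be dropped in as convert_row.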

Best Regards,

//Torgil


On 7/8/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
Thanks for the reference John! csv2rec is about 30% faster than my code on
the same data.

If I read the code in csv2rec correctly, it converts the data as it is being
read using the csv module. My setup reads the whole dataset into an
array of strings and then converts the columns as appropriate.

Best,

Vincent


On 7/6/07 8:53 PM, "John Hunter" <[EMAIL PROTECTED]> wrote:

> On 7/6/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
>> I wrote the attached (small) program to read in a text/csv file with
>> different data types and convert it into a recarray without having to
>> pre-specify the dtypes or variable names. I am just too lazy to type in
>> stuff like that :) The supported types are int, float, dates, and strings.
>>
>> It works pretty well but it is not (yet) as fast as I would like so I was
>> wondering if any of the numpy experts on this list might have some suggestions
>> on how to speed it up. I need to read 500MB-1GB files so speed is important
>> for me.
>
> In matplotlib.mlab svn, there is a function csv2rec that does the
> same.  You may want to compare implementations in case we can
> fruitfully cross pollinate them.  In the examples directory, there is an
> example script examples/loadrec.py
> _______________________________________________
> Numpy-discussion mailing list
> Numpy-discussion@scipy.org
> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>



import numpy as N
import pylab,itertools,csv
from itertools import imap,izip,chain



string_conversions=[
    # conversion function,  numpy dtype
    ( int,                  N.dtype(int)    ),
    ( float,                N.dtype(float)  ),
    ( pylab.datestr2num,    N.dtype(float)  ),
    ]

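def string_to_dt_cvt(s):
    """
    Return (numpy dtype, conversion function) for the first conversion
    that accepts s; fall back to (object dtype, str).
    """
    for fn,dt in string_conversions:
        try:
            fn(s)
            return dt,fn
        except Exception:
            pass
    # same (dtype, function) order as the return above
    return N.dtype(object),str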


def load(fname,delim = ',',has_varnm = True, prn_report = True):
    """
    Loading data from a delimited text file by splitting each line on
    delim. Returns a recarray.
    """
    f=open(fname,'rb')
    # strip the line terminator so the last column does not keep it
    row_iterator=itertools.imap(lambda x: x.rstrip('\r\n').split(delim),f)

    first_row=row_iterator.next()
    cols=len(first_row)
    if not has_varnm:
        varnm = ['col%s' % str(i+1) for i in xrange(cols)]
        dt_row=first_row
    else:
        varnm = [i.strip() for i in first_row]
        dt_row=row_iterator.next()

    descr=[]
    conversion_functions=[]
    for name,item in zip(varnm,dt_row):
        dtype,cvt_fn=string_to_dt_cvt(item)
        descr.append((name,dtype))
        conversion_functions.append(cvt_fn)
    convert_row=lambda r: tuple(fn(x) for fn,x in izip(conversion_functions,r))

    data_iterator=imap(convert_row,chain([dt_row],row_iterator))
    data=N.fromiter(data_iterator,dtype=descr).view(N.recarray)
    f.close()

    # load report
    if prn_report:
        print "##########################################\n"
        print "Loaded file: %s\n" % fname
        print "Nr obs: %s\n" % data.shape[0]
        print "Variables and datatypes:\n"
        for i in data.dtype.descr:
            print "Varname: %s, Type: %s, Sample: %s" % (i[0], i[1], str(data[i[0]][0:3]))
        # closing separator, printed once after all variables
        print "\n##########################################\n"

    return data

def show_dates(dates):
    return N.array([i.strftime('%d %b %y') for i in pylab.num2date(dates)])

if __name__ == '__main__':

    # creating data
    data = [['col1', 'col2', 'col3', 'col4', 'col5', 'col6'],
            ['1','3','1/97','1.12','2.11','1.2'],
            ['1','2','3/97','1.21','3.12','1.43'],
            ['2','1','2/97','1.12','2.11','1.28'],
            ['2','2','4/97','1.33','2.26','1.23'],
            ['2','2','5/97','1.73','2.42','1.26']]

    # saving data to csv file
    f = open('testdata.csv','wb')
    output = csv.writer(f)
    for i in data:
        output.writerow(i)
    f.close()

    # opening data file with variable names
    ra = load('testdata.csv')