Hi,

1. Your code is fast because you convert whole columns at once in
numpy. The first step with the lists is also very fast (python
implements lists as arrays). I like your version; I think it's as fast
as it gets in pure python, and it only has to keep two copies of the
data in memory at once (since the string versions can be garbage
collected).
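
For reference, a minimal sketch of that column-at-a-time idea (the
names and the float/string fallback here are placeholders, not your
actual functions):

import numpy as N

def load_columns(lines, delim=','):
    # first step: split every line into fields (plain python lists)
    rows = [line.rstrip('\n').split(delim) for line in lines]
    # transpose so that each column is one tuple of strings
    columns = zip(*rows)
    converted = []
    for col in columns:
        try:
            # convert the whole column at once in numpy
            converted.append(N.array(col, dtype=float))
        except ValueError:
            # non-numeric column: keep it as strings
            converted.append(N.array(col))
    return converted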

If memory really is an issue, you have the nice "load_spec" version
and can always convert a file once up front by iterating over it
twice, like the attached script does.


4. Okay, that makes sense. I was confused by the fact that your
generated function had the same name as the built-in iter() function.
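
For anyone reading along, this is roughly the pattern I had pictured
(a made-up example, not your actual code; giving the generated
function its own name, say row_convert, avoids shadowing the builtin):

types = [int, float, str]
indent = " " * 4
body = ", ".join("%s(row[%d])" % (fn.__name__, i)
                 for i, fn in enumerate(types))
src = "def row_convert(row):\n" + indent + "return (" + body + ",)\n"
exec src  # defines row_convert in the current namespace
print row_convert(["1", "2.5", "abc"])  # -> (1, 2.5, 'abc')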


//Torgil


On 7/19/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:

 Hi Torgil,

 1. I got an email from Tim about this issue:

 "I finally got around to doing some more quantitative comparisons between
your code and the more complicated version that I proposed. The idea behind
my code was to minimize memory usage -- I figured that keeping the memory
usage low would make up for any inefficiencies in the conversion process
since it's been my experience that memory bandwidth dominates a lot of
numeric problems as problem sizes get reasonably large. I was mostly wrong.
While it's true that for very large file sizes I can get my code to
outperform yours, in most instances it lags behind. And the range where it
does better is a fairly small range right before the machine dies with a
memory error. So my conclusion is that the extra hoops my code goes through
to avoid allocating extra memory aren't worth it for you to bother with."

 The approach in my code is simple and robust to most data issues I could
come up with. It will do an appropriate conversion if there are missing
values, or ints and floats mixed in the same column, and it will select an
appropriate string length as well. It may not be the most memory-efficient
setup, but given Tim's comments it is a pretty decent solution for the
types of data I have access to.
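
 Roughly, the type inference amounts to something like this (a
simplified sketch, not the exact code in my file):

import numpy as N

def infer_column_dtype(values):
    missing = set(['', '.'])
    clean = [v for v in values if v not in missing]
    has_missing = len(clean) < len(values)
    for pytype in (int, float):
        try:
            [pytype(v) for v in clean]
            # missing values force a float column so they become nan
            if pytype is int and has_missing:
                return N.dtype(float)
            return N.dtype(pytype)
        except ValueError:
            pass
    # fall back to a string column wide enough for the longest entry
    return N.dtype('S%d' % max(len(v) for v in values))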

 2. Fixed the spelling error :)

 3. I guess that is the same thing. I am not very familiar with zip, izip,
map etc. just yet :) Thanks for the tip!

 4. I called the function generated using exec, iter(). I need that function
to transform the data using the types provided by the user.

 Best,

 Vincent


 On 7/18/07 7:57 PM, "Torgil Svensson" <[EMAIL PROTECTED]> wrote:

 > Nice,
 >
 > I haven't gone through all the details. That's a nice new "missing"
 > feature; maybe all instances where we can't find a conversion should
 > be "nan". A few comments:
 >
 > 1. The "load_search" functions contains all memory/performance
 > overhead that we wanted to avoid with the fromiter function. Does this
 > mean that you no longer have large text-files that change sting
 > representation in the columns (aka "0" floats) ?
 >
 > 2. ident=" "*4
 > This has the same spelling error as in my first compile try .. it was
 > meant to be "indent"
 >
 > 3. types = list((i,j) for i, j in zip(varnm, types2))
 > Isn't this the same as "types = zip(varnm, types2)" ?
 >
 > 4.  return N.fromiter(iter(reader),dtype = types)
 > Isn't "reader" an iterator already? What does the "iter()" operator do
 > in this case?
 >
 > Best regards,
 >
 > //Torgil
 >
 >
 > On 7/18/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
 >>
 >>  I combined some of the very useful comments/code from Tim and Torgil
 >> and came up with the attached program to read csv files and convert the
 >> data into a recarray. I couldn't use all of their suggestions because,
 >> frankly, I didn't understand all of them :)
 >>
 >>  The program uses variable names if provided in the csv file and can
 >> auto-detect data types. However, I also wanted to make it easy to
 >> specify data types and/or variable names if so desired. Examples are at
 >> the bottom of the file. Comments are very welcome.
 >>
 >>  Thanks,
 >>
 >>  Vincent

 --
 Vincent R. Nijs
 Assistant Professor of Marketing
 Kellogg School of Management, Northwestern University
 2001 Sheridan Road, Evanston, IL 60208-2001
 Phone: +1-847-491-4574 Fax: +1-847-491-2498
 E-mail: [EMAIL PROTECTED]
 Skype: vincentnijs


_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


import pylab
from itertools import imap

# conversion functions to try, in order of increasing generality
string_conversions = [int, float, pylab.datestr2num]

def string_to_cvt(s):
    """Return the first conversion function that accepts the string s."""
    for fn in string_conversions:
        try:
            fn(s)
            return fn
        except Exception:
            pass
    return str

def int_to_float_upcast(x):
    # append ".0" to integer literals so that they re-parse as floats
    return x if "." in x else x + ".0"

upcast_functions = {
    (int, float): int_to_float_upcast,
    (float, int): int_to_float_upcast,
    }

# strings (with or without a trailing newline) treated as missing values
missing_items = set(['', '.', '\n', '.\n'])

def replace_missing(x):
    return 'nan' if x in missing_items else x

def find_upcast((upcasts, cvt), row):
    """Accumulate, per column, the set of upcast functions needed so far."""
    new_cvt = [string_to_cvt(x) for x in row]
    new_upcasts = []
    for i, (old_fn, new_fn) in enumerate(zip(cvt, new_cvt)):
        if old_fn == new_fn:
            upcast = set()
        elif (old_fn, new_fn) in upcast_functions:
            upcast = set([upcast_functions[(old_fn, new_fn)]])
        else:
            raise ValueError("Unable to upcast %s to %s for column %d"
                             % (old_fn.__name__, new_fn.__name__, i))
        new_upcasts.append(upcast)
    return map(set.union, upcasts, new_upcasts), new_cvt

def assimilate_csv_file(fpath, delim=',', has_varnm=True):
    """Write fpath + "_fixed", upcasting columns that mix representations.

    The file is read twice: a first pass determines which columns need an
    upcast (e.g. int -> float), a second pass writes the converted rows.
    Note that the header row, if present, is skipped in both passes.
    """
    def get_row_iter(f):
        row_iter = imap(lambda x: map(replace_missing,
                                      x.rstrip('\r\n').split(delim)), f)
        if has_varnm:
            row_iter.next()     # skip the header row
        return row_iter

    # first pass: find the conversion function and needed upcasts per column
    fr = open(fpath, 'r')
    row_iter = get_row_iter(fr)
    row0 = row_iter.next()

    initial_upcasts = [set() for x in row0]
    initial_cvt = [string_to_cvt(x) for x in row0]
    upcasts, functions = reduce(find_upcast, row_iter,
                                (initial_upcasts, initial_cvt))
    fr.close()

    if not any(upcasts):
        print "Nothing done to file."
        return

    # second pass: apply the upcast functions cell by cell and write the result
    fr = open(fpath, 'r')
    fw = open(fpath + "_fixed", 'w')
    for row in get_row_iter(fr):
        fw.write(delim.join([reduce(lambda x, y: y(x), u, c)
                             for u, c in zip(upcasts, row)]) + "\n")
    fw.close()
    fr.close()

if __name__ == '__main__':

    import csv

    # creating data
    data = [['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7'],
            ['1','3','1/97','1.12','2.11','001','bla1'],
            ['1.3','2','3/97','1.21','3.12','002','bla2'],
            ['2','1','2/97','1.12','2.11','003','bla3'],
            ['2','2','4/97','1.33','2.26','004','bla4'],
            ['2','2','5/97','1.73','2.42','005','bla15']]
    # saving data to csv file
    f = open('testdata_with_varnm.csv','wb')
    output = csv.writer(f)
    for i in data:
        output.writerow(i)
    f.close()

    assimilate_csv_file('testdata_with_varnm.csv')
    