Re: [Numpy-discussion] convert csv file into recarray without pre-specifying dtypes and variable names

Timothy Hochberg Mon, 09 Jul 2007 20:18:11 -0700

On 7/9/07, Timothy Hochberg <[EMAIL PROTECTED]> wrote:

On 7/9/07, Torgil Svensson <[EMAIL PROTECTED]> wrote:
>
> Elegant solution. Very readable and takes care of row0 nicely.
>
> I want to point out that this is much more efficient than my version
> for random/late string representation changes throughout the
> conversion, but it suffers from a 2*n memory footprint and large block
> copying if the string rep changes arrive very early on huge datasets.

Yep.

> I think we can't have the best of both, and Tim's solution is better in the
> general case.

It probably would not be hard to do a hybrid version. One issue is that
one doesn't, in general, know the size of the dataset in advance, so you'd
have to use an absolute criterion (less than 100 lines) instead of a relative
criterion (less than 20% done). I suppose you could stat the file or
something, but that seems like overkill.
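
For what it's worth, the "stat the file" variant can be sketched in a few lines. This is only an illustration, assuming the file is read in binary mode so that line lengths are byte counts; the names lines_with_restart_flag and max_fraction are made up here and are not part of the loader in this thread.

```python
import os

def lines_with_restart_flag(path, max_fraction=0.2):
    """Yield (line, can_restart) pairs for a file read in binary mode.

    can_restart is True while the bytes consumed before the line are
    less than max_fraction of the file size reported by os.stat.
    """
    total = os.stat(path).st_size
    consumed = 0
    with open(path, "rb") as f:
        for line in f:
            yield line, consumed < max_fraction * total
            consumed += len(line)
```

A loader could consult the flag when a conversion fails and restart only while it is still True, falling back to a fixed line count when the input is a stream whose size cannot be determined.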

> Maybe "use one_alt if rownumber < xxx else use other_alt" can
> fine-tune performance for some cases, but even then, with many cols,
> it's nearly impossible to know.

That sounds sensible. I have an interesting thought on how to do this that's
a bit hard to describe. I'll try to throw it together and post another
version today or tomorrow.


OK, as promised, here's an approach that rebuilds the array if the format
changes, as long as fewer than 'restart_length' lines have been processed.
Otherwise, it uses the old strategy. Perhaps not the most efficient way, but
it reuses what I'd already written with minimal changes. It's still pretty
rough -- once again I didn't bother to polish it.
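
The control flow may be easier to see in a plain-Python model that uses lists instead of numpy arrays. This is a sketch under assumptions, not the loader itself: guess_converter, WIDTH and load_rows are hypothetical names, rows are already split into fields, and the float/int rule is generalised to "never narrow a column" (int, then float, then str).

```python
WIDTH = {int: 0, float: 1, str: 2}   # assumed widening order


def guess_converter(text):
    """Hypothetical type guess: try int, then float, else fall back to str."""
    for cvt in (int, float):
        try:
            cvt(text)
            return cvt
        except ValueError:
            pass
    return str


def load_rows(rows, restart_length=20):
    """Convert rows of strings, re-guessing converters whenever a row fails."""
    rows = iter(rows)
    pending = next(rows)                 # row that decides the converters
    converters = [int] * len(pending)    # start from the narrowest type
    chunks, done = [], False
    while not done:
        guessed = [guess_converter(x) for x in pending]
        converters = [max(old, new, key=WIDTH.get)
                      for old, new in zip(converters, guessed)]
        if len(chunks) == 1 and len(chunks[0]) < restart_length:
            # Early failure: recast the few rows converted so far (restart).
            chunk = [tuple(f(v) for f, v in zip(converters, r))
                     for r in chunks.pop()]
        else:
            # First pass or late failure: keep old chunks, start a new one.
            chunk = []
        chunk.append(tuple(f(x) for f, x in zip(converters, pending)))
        done = True
        for row in rows:
            try:
                chunk.append(tuple(f(x) for f, x in zip(converters, row)))
            except ValueError:
                pending, done = row, False
                break
        chunks.append(chunk)
    # Final pass: cast every chunk to the last (widest) converters.
    return [tuple(f(v) for f, v in zip(converters, r))
            for c in chunks for r in c]
```

Note that a restart recasts values that were already converted, so a field read as the integer 2 and later widened to float and then to string comes out as '2.0'. When the rows are instead kept in separate chunks and cast once at the end, the same field comes out as '2'.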


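import numpy as N
# string_to_dt_cvt is not defined in this message (reused from the earlier version)

def find_formats(items, last):
   # Guess a (dtype, converter) pair for each field of a split row.
   formats = []
   for i, x in enumerate(items):
       dt, cvt = string_to_dt_cvt(x)
       if last is not None:
           last_dt, last_cvt = last[i]
           # Never narrow a column that has already been seen as float.
           if last_cvt is float and cvt is int:
               dt, cvt = last_dt, float
       formats.append((dt, cvt))
   return formats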

class LoadInfo(object):
   def __init__(self, row0):
       self.done = False
       self.lastcols = None
       self.row0 = row0
       self.predata = ()

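def data_iterator(lines, converters, delim, info):
   # Replay rows kept from a restarted chunk, then the row that set the formats.
   for x in info.predata:
       yield x
   info.predata = ()
   yield tuple(f(x) for f, x in zip(converters, info.row0.split(delim)))
   try:
       for row in lines:
           yield tuple(f(x) for f, x in zip(converters, row.split(delim)))
   except Exception:
       # A converter rejected this row: keep it so the caller can re-guess formats.
       info.row0 = row
   else:
       info.done = True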

def load2(fname, delim=',', has_varnm=True, prn_report=True,
          restart_length=20):
   """
   Load delimited text from a file, guessing column types and names.
   Returns a structured (record) array.
   """
   f = open(fname, 'r')

   if has_varnm:
       varnames = [i.strip() for i in next(f).split(delim)]
   else:
       varnames = None

   info = LoadInfo(next(f))
   chunks = []

   while not info.done:

       row0 = info.row0.split(delim)
       formats = find_formats(row0, info.lastcols)
       info.lastcols = formats  # remembered for the float/int check on the next pass
       if varnames is None:
           varnames = ['col%s' % str(i+1) for i, _ in enumerate(formats)]
       descr=[]
       conversion_functions=[]
       for name, (dtype, cvt_fn) in zip(varnames, formats):
           descr.append((name,dtype))
           conversion_functions.append(cvt_fn)

       if len(chunks) == 1 and len(chunks[0]) < restart_length:
           info.predata = chunks[0].astype(descr)
           chunks = []

       chunks.append(N.fromiter(data_iterator(f, conversion_functions,
                                              delim, info), descr))

   if len(chunks) > 1:
       n = sum(len(x) for x in chunks)
       data = N.zeros([n], chunks[-1].dtype)
       offset = 0
       for x in chunks:
           delta = len(x)
           data[offset:offset+delta] = x
           offset += delta
   else:
       [data] = chunks

   # load report
   if prn_report:
       print("##########################################\n")
       print("Loaded file: %s\n" % fname)
       print("Nr obs: %s\n" % data.shape[0])
       print("Variables and datatypes:\n")
       for i in data.dtype.descr:
           print("Varname: %s, Type: %s, Sample: %s" % (i[0], i[1],
                                                        str(data[i[0]][0:3])))
       print("\n##########################################\n")

   return data
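
The listing calls string_to_dt_cvt, which is not defined in this message. As a rough idea of what such a helper could look like, here is a guess written for current Python. The real helper may well differ, and the choice of a fixed-width 'S<n>' dtype for strings is purely an assumption made for illustration.

```python
def string_to_dt_cvt(s):
    """Hypothetical stand-in: return (dtype, converter) for one text field.

    Tries int, then float; anything else is treated as a fixed-width string
    whose width is taken from this one sample (an assumption, not a given).
    """
    s = s.strip()
    for cvt in (int, float):
        try:
            cvt(s)
        except ValueError:
            continue
        return cvt, cvt
    return "S%d" % max(len(s), 1), str
```

One consequence of this guess is that str never raises, so a later, longer string would not trigger a format change and would be cut to the sampled width once stored in the array. A real helper would need to deal with that in some way.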

_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion
