Hi All -

I've added the relevant code to my numpy fork here

    https://github.com/esheldon/numpy

The python module and C file are at

    /numpy/lib/recfile.py
    /numpy/lib/src/_recfile.c

Access from python is numpy.recfile

See below for the doc string for the main class, Recfile.  Some example
usage is shown.

As listed in the limitations section below, quoted strings are not yet
supported for text files.  This can be addressed by optionally using some
smarter code when reading strings from these types of files.  I'd greatly
appreciate some help with that aspect.

There is a test suite in numpy.recfile.test()

    A class for reading and writing structured arrays to and from files.

    Both binary and text files are supported.  Any subset of the data can be
    read without loading the whole file.  See the limitations section below
    for caveats.

    parameters
    ----------
    fobj: file or string
        A string or file object.
    mode: string
        Mode for opening when fobj is a string
    dtype:
        A numpy dtype or descriptor describing each line of the file.  The
        dtype must contain fields.  This is a required parameter; it is a
        keyword only for clarity.

        Note for text files the dtype will be converted to native byte
        ordering.  Any data written to the file must also be in the native
        byte ordering.
    nrows: int, optional
        Number of rows in the file.  If not entered, the rows will be
        counted from the file itself.  This is a simple calculation for
        binary files, but can be slow for text files.
    delim: string, optional
        The delimiter for text files.  If None or "" the file is assumed to
        be binary.  Should be a single character.
    skipheader: int, optional
        Skip this many lines in the header.
    offset: int, optional
        Move to this offset in the file.  Reads will all be relative to this
        location.  If not sent, it is taken from the current position in the
        input file object or 0 if a filename was entered.
    string_newlines: bool, optional
        If true, strings in text files may contain newlines.  This is only
        relevant for text files when the nrows= keyword is not sent, because
        the number of lines must be counted.
        In this case the full text reading code is used to count rows
        instead of a simple newline count.  Because the text is fully
        processed twice, this can double the time to read files.
    padnull: bool
        If True, nulls in strings are replaced with spaces when writing text
    ignorenull: bool
        If True, nulls in strings are not written when writing text.  This
        results in string fields that are not fixed width, so cannot be
        read back in using recfile

    limitations
    -----------
    Currently, only fixed width string fields are supported.  String fields
    can contain any characters, including newlines, but for text files
    quoted strings are not currently supported: the quotes will be part of
    the result.  For binary files, structured sub-arrays and complex can be
    written and read, but this is not supported yet for text files.

    examples
    --------
        # read from binary file
        dtype=[('id','i4'),('x','f8'),('y','f8'),('arr','f4',(2,2))]
        rec=numpy.recfile.Recfile(fname,dtype=dtype)

        # read all data using either slice or method notation
        data=rec[:]
        data=rec.read()

        # read row slices
        data=rec[8:55:3]

        # read subset of columns and possibly rows
        # can use either slice or method notation
        data=rec['x'][:]
        data=rec['id','x'][:]
        data=rec[col_list][row_list]
        data=rec.read(columns=col_list, rows=row_list)

        # for text files, just send the delimiter string
        # all the above calls will also work
        rec=numpy.recfile.Recfile(fname,dtype=dtype,delim=',')

        # save time for text files by sending row count
        rec=numpy.recfile.Recfile(fname,dtype=dtype,delim=',',nrows=10000)

        # write some data
        rec=numpy.recfile.Recfile(fname,mode='w',dtype=dtype,delim=',')
        rec.write(data)

        # append some data
        rec.write(more_data)

        # print metadata about the file
        print(rec)
        Recfile  nrows: 345472 ncols: 6 mode: 'w'

          id                 <i4
          x                  <f8
          y                  <f8
          arr                <f4  array[2,2]

Excerpts from Erin Sheldon's message of Mon Feb 27 09:44:52 -0500 2012:
> Excerpts from Jay Bourque's message of Mon Feb 27 00:24:25 -0500 2012:
> > Hi Erin,
> >
> > I'm the one Travis mentioned earlier about
> > working on this. I was planning on diving into it this week, but it
> > sounds like you may have some code already that fits the requirements?
> > If so, I would be available to help you with porting/testing your code
> > with numpy, or I can take what you have and build on it in my numpy
> > fork on github.
>
> Hi Jay,all -
>
> What I've got is a solution for writing and reading structured arrays to
> and from files, both in text files and binary files.  It is written in C
> and python.  It allows reading arbitrary subsets of the data efficiently
> without reading in the whole file.  It defines a class Recfile that
> exposes an array like interface for reading, e.g. x=rf[columns][rows].
>
> Limitations: Because it was designed with arrays in mind, it doesn't
> deal with not fixed-width string fields.  Also, it doesn't deal with
> quoted strings, as those are not necessary for writing or reading arrays
> with fixed length strings.  Doesn't deal with missing data.  This is
> where Wes' tokenizing-oriented code might be useful.  So there is a fair
> amount of functionality to be added for edge cases, but it provides a
> framework.  I think some of this can be written into the C code, others
> will have to be done at the python level.
>
> I've forked numpy on my github account, and should have the code added
> in a few days.  I'll send mail when it is ready.  Help will be greatly
> appreciated getting this to work with loadtxt, adding functionality from
> Wes' and others code, and testing.
>
> Also, because it works on binary files too, I think it might be worth it
> to make numpy.fromfile a python function, and to use a Recfile object
> when reading subsets of the data.  For example numpy.fromfile(f,
> rows=rows, columns=columns, dtype=dtype) could instantiate a Recfile
> object to read the column and row subsets.
> We could rename the C fromfile to something appropriate, and call it
> when the whole file is being read (recfile uses it internally when
> reading ranges).
>
> thanks,
> -e
--
Erin Scott Sheldon
Brookhaven National Laboratory
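
For reference, the idea of reading row and column subsets of a binary file
without loading the whole thing can be sketched with stock numpy alone.  This
is not the recfile code: it leans on numpy.memmap instead of the C reader, it
only covers binary files, and read_subset is a hypothetical helper name made
up for this illustration.  The dtype is the one from the docstring example.

```python
import os
import tempfile

import numpy as np

# Field layout taken from the docstring example above.
dtype = np.dtype([('id', 'i4'), ('x', 'f8'), ('y', 'f8'), ('arr', 'f4', (2, 2))])


def read_subset(fname, dtype, rows=None, columns=None, offset=0):
    """Hypothetical helper (not part of numpy or recfile).

    Maps a file of fixed-size binary records and copies only the
    requested rows and columns into memory.
    """
    # The record count is inferred from the file size and dtype.itemsize.
    mm = np.memmap(fname, dtype=dtype, mode='r', offset=offset)
    sel = mm if rows is None else mm[rows]
    if columns is not None:
        sel = sel[columns]
    # np.array() makes an in-memory copy of just the selection.
    return np.array(sel)


# Write a small demo file of 10 records, then read part of it back.
data = np.zeros(10, dtype=dtype)
data['id'] = np.arange(10)
data['x'] = np.arange(10) * 0.5

fname = os.path.join(tempfile.mkdtemp(), 'demo.rec')
data.tofile(fname)

# Rows 2, 5 and 8 of column 'x'.
sub = read_subset(fname, dtype, rows=slice(2, 9, 3), columns='x')
print(sub)
```

The memmap route leaves paging to the operating system, so it says nothing
about how recfile performs; it is only meant to show the calling pattern
(rows plus columns) that a python-level numpy.fromfile could expose.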