determining available space for Float32, for instance

2006-05-23 Thread David Socha
I am looking for a way to determine the maximum array size I can allocate
for arrays of Float32 values (or Int32, or Int8, ...) at an arbitrary
point in the program's execution.  This is needed because Python cannot
allocate enough memory for all of the data we need to process, so we
need to "chunk" the processing, as described below.

Python's memory management process makes this more complicated, since
once memory is allocated for Float32, it cannot be used for any other
data type, such as Int32.  I'd like a solution that includes either
memory that is not yet allocated, or memory that used to be allocated
for that type, but is no longer used.

We do not want a solution that requires recompiling Python, since we
cannot expect our end users to do that.  

Does anyone know how to do this?
 
The following describes our application context in more detail.
 
Our application is UrbanSim (www.urbansim.org), a micro-simulation
application for urban planning.  It uses "datasets," where each dataset
may have millions of entities (e.g. households), and each entity (e.g.
household) may have dozens of attributes (e.g. number_of_cars, income,
etc.).  Attributes can be any of the standard Python "base" types,
though most attributes are Float32 or Int32 values.  Our models often
create a set of 2D arrays with one dimension being agents, and the
second dimension being choices from another dataset.  For instance, the
agents may be households that choose a new gridcell to live in.  For our
Puget Sound application, there are 1 to 2 million households, and 800K
gridcells.  Each attribute of a dataset has such a 2D array.  Given that
we may have dozens of attributes, they can eat up a lot of memory,
quickly.
 
Given the sizes of these arrays, and Python's limited address space,
Python usually cannot allocate enough memory for us to create the entire
set of 2D arrays at once.  Instead, we "chunk" the model along the
agents dimension, processing a chunk of agents at a time.  Some of our
models can do their work in a single chunk.  Others require hundreds of
chunks.  It depends upon the number of agents, the number of locations,
the number of agent attributes, and the number of location attributes
used by that particular model.
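
For concreteness, here is a rough sketch of that chunked pattern (the
function and variable names are invented for illustration, not
UrbanSim's actual code):

```python
import numpy as np

def run_chunked(agent_values, choice_values, chunk_size):
    """Process agents chunk-by-chunk along the agents dimension, so the
    2D agents-by-choices array exists only for one chunk at a time.

    agent_values:  1D array, one value per agent (a made-up attribute)
    choice_values: 1D array, one value per choice (e.g. per gridcell)
    """
    results = np.empty(len(agent_values), dtype=np.intp)
    for start in range(0, len(agent_values), chunk_size):
        stop = min(start + chunk_size, len(agent_values))
        # the (chunk x n_choices) 2D array exists for this chunk only
        utilities = agent_values[start:stop, np.newaxis] * choice_values
        results[start:stop] = utilities.argmax(axis=1)  # chosen location
    return results
```

Only chunk_size * len(choice_values) values are live at any moment, so
the chunk size directly controls peak memory use.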
 
What we would like is for the code to be able to automatically determine
the number of agents that can be in a single chunk.  This requires that
we solve two sub-problems.  
 
First, we need to know how many attributes of each type (Float32, Int32,
etc.) will be used by this model.  We can do that.  
 
Second, we need to know how much space is available for an array of a
particular type of values, e.g. for Float32 values.  Is there a way to
get this information from Python?  
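One empirical way to attack the second sub-problem (only a sketch; the
function name and search bound are mine, not part of any library) is to
probe with trial allocations, binary-searching on MemoryError:

```python
import numpy as np

def largest_allocatable(dtype, upper=10**8):
    """Binary-search for the largest 1-D array of `dtype` that can be
    allocated right now, by attempting allocations and catching
    MemoryError.  np.ones (rather than np.empty) is used so the pages
    are actually touched.  The answer reflects the current state of the
    process address space, which changes as other objects come and go."""
    lower = 0
    while lower < upper:
        mid = (lower + upper + 1) // 2
        try:
            trial = np.ones(mid, dtype=dtype)
            del trial            # release the probe immediately
            lower = mid
        except MemoryError:
            upper = mid - 1
    return lower
```

The probe is O(log upper) allocations, but note it briefly allocates the
very memory being measured, so it should run before the real arrays are
built.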

Cheers,

David Socha
Center for Urban Simulation and Policy Analysis
University of Washington
www.urbansim.org
-- 
http://mail.python.org/mailman/listinfo/python-list


RE: determining available space for Float32, for instance

2006-05-24 Thread David Socha
Robert Kern wrote: 
> David Socha wrote:
> > I am looking for a way to determine the maximum array size I can
> > allocate for arrays of Float32 values (or Int32, or Int8, ...) at an
> > arbitrary point in the program's execution.  This is needed because
> > Python cannot allocate enough memory for all of the data we need to
> > process, so we need to "chunk" the processing, as described below.
> > 
> > Python's memory management process makes this more complicated, since
> > once memory is allocated for Float32, it cannot be used for any other
> > data type, such as Int32.
> 
> Just for clarification, you're talking about Numeric arrays 
> here (judging from the names, you still haven't upgraded to 
> numpy), not general Python. Python itself has no notion of 
> Float32 or Int32 or allocating chunks of memory for those two 
> datatypes.

Yes, I am talking about numarray arrays, not general Python.
 
> > I'd like a solution that includes either memory that is not yet
> > allocated, or memory that used to be allocated for that type, but is
> > no longer used.
> > 
> > We do not want a solution that requires recompiling Python, since we
> > cannot expect our end users to do that.
> 
> OTOH, *you* could recompile Python and distribute your Python 
> with your application. We do that at Enthought although for 
> different reasons. However, I don't think it will come to that.

We could, but that seems like it would simply create a secondary
problem, since then the user would have to choose between installing our
version of Python or Enthought's version, for instance. 

> > Does anyone know how to do this?
> 
> With numpy, it's easy enough to change the datatype of an 
> array on-the-fly as long as the sizes match up.
> 
> In [8]: from numpy import *
> 
> In [9]: a = ones(10, dtype=float32)
> 
> In [10]: a
> Out[10]: array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.], dtype=float32)
> 
> In [11]: a.dtype = int32
> 
> In [12]: a
> Out[12]:
> array([1065353216, 1065353216, 1065353216, 1065353216, 1065353216,
>        1065353216, 1065353216, 1065353216, 1065353216, 1065353216], dtype=int32)
> 
> However, keeping track of the sizes of your arrays and the 
> size of your datatypes may be a bit much to ask.

Exactly.  Building a duplicate mechanism for tracking this information
would be a sad solution.  Surely Python has access to the amount of
memory being used by the different data types.  How can I get to that
information?
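
For what it's worth, with numpy one way to sidestep the per-type pools
entirely is to keep a single raw byte buffer and reinterpret it per
type with .view(); a sketch (the pool size is arbitrary):

```python
import numpy as np

# A single raw pool of bytes, reinterpreted per attribute type as
# needed; the pool size here is illustrative.
pool = np.empty(4 * 1000000, dtype=np.uint8)

floats = pool.view(np.float32)   # 1,000,000 float32 slots, no copy made
floats[:] = 1.0

ints = pool.view(np.int32)       # the very same bytes, read as int32
# ints[0] is now 1065353216, the bit pattern of float32 1.0 -- the same
# numbers that appear in the In[12] output above.
```

Both views alias the same memory, so only one of them holds meaningful
values at a time; the point is that the bytes need never be returned to
a type-specific free list.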
 
> [snip]
> numpy (definitely not Numeric) does have a feature called 
> record arrays which will allow you to deal with your agents 
> much more conveniently:
> 
>   http://www.scipy.org/RecordArrays
> 
> Also, you will certainly want to look at using PyTables to 
> store and access your data. With PyTables you can leave all 
> of your data on disk and access arbitrary parts of it in a 
> relatively clean fashion without doing the fiddly work of 
> swapping chunks of memory from disk and back again:
> 
>   http://www.pytables.org/moin

Do RecordArrays and PyTables work well together?  

Thanks for the info.  PyTables looks quite promising for our
application (I had been looking for an HDF5 interface, but couldn't
recall the 'HDF5' name).
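
A minimal sketch of the record-array idea with plain numpy structured
arrays (the field names and values are made up for illustration;
PyTables tables store and return arrays of exactly this kind, which is
how the two interoperate):

```python
import numpy as np

# Hypothetical household dataset as a structured ("record") array:
# one row per household, one named field per attribute.
households = np.zeros(4, dtype=[('income', np.float32),
                                ('number_of_cars', np.int32)])
households['income'] = [30e3, 45e3, 60e3, 75e3]
households['number_of_cars'] = [1, 2, 2, 3]

# Attributes are accessed by field name; boolean selection works as
# usual and yields another structured array.
high_income = households[households['income'] > 50e3]
```

Because each row packs all of an agent's attributes together, one such
array replaces the parallel per-attribute arrays, and a chunk of agents
is just a slice of rows.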

David Socha
Center for Urban Simulation and Policy Analysis
University of Washington
206 616-4495

