On 06/11/2010 10:26 AM, Wes McKinney wrote:
On Fri, Jun 11, 2010 at 9:46 AM, Bruce Southey<bsout...@gmail.com> wrote:
On 06/09/2010 03:40 PM, Wes McKinney wrote:
Dear all,
We've been having discussions on the pystatsmodels mailing list
recently regarding data structures and other tools for statistics /
other related data analysis applications. I believe we're trying to
answer a number of different, but related questions:
1. What are the sets of functionality (and use cases) which would be
desirable for the scientific (or statistical) Python programmer?
Things like groupby
(http://projects.scipy.org/numpy/browser/trunk/doc/neps/groupby_additions.rst)
fall into this category.
2. Do we really need to build custom data structures (larry, pandas,
tabular, etc.) or are structured ndarrays enough? (My conclusion is
that we do need to, but others might disagree). If so, how much
performance are we willing to trade for functionality?
3. What needs to happen for Python / NumPy / SciPy to really "break
in" to the statistical computing field? In other words, could a
Python-based stack one day be a competitive alternative to R?
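The kind of groupby functionality in point 1 can be sketched with plain NumPy; the function name and signature below are illustrative, not an existing NumPy API:

```python
import numpy as np

def group_means(keys, values):
    """Mean of `values` within each group defined by `keys`.

    A minimal groupby sketch built on np.unique and np.bincount.
    """
    uniques, inverse = np.unique(keys, return_inverse=True)
    sums = np.bincount(inverse, weights=values)   # per-group sums
    counts = np.bincount(inverse)                 # per-group sizes
    return uniques, sums / counts

keys = np.array(["a", "b", "a", "b", "a"])
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
labels, means = group_means(keys, values)
# group "a" holds 1, 3, 5 and group "b" holds 2, 4
```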
These are just some ideas for collecting community input. Of course as
we're all working in different problem domains, the needs of users
will vary quite a bit across the board. We've started to collect some
thoughts, links, etc. on the scipy.org wiki:
http://scipy.org/StatisticalDataStructures
A lot of what's there already is commentary and comparison on the
functionality provided by pandas and la / larry (since Keith and I
wrote most of the stuff there). But I think we're trying to identify
more generally the things that are lacking in NumPy/SciPy and related
libraries for particular applications. At minimum it should be good
fodder for the SciPy conferences this year and afterward (I am
submitting a paper on this subject based on my experiences).
- Wes
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
If you need pure data storage then all you require is a timeseries,
masked structured ndarray. That will handle times/dates, missing values
and named variables. This is probably the basis of all statistical
packages, databases and spreadsheets. But the real problem is the
blas/lapack usage that prevents anything but a standard ndarray.
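A bare-bones version of such a container, a masked structured ndarray with a date column, might look like this (the field names and records are made up for illustration):

```python
import numpy as np
import numpy.ma as ma
from datetime import date

# Named columns (including a date column) plus per-field missing values.
dtype = [("when", object), ("price", float), ("volume", int)]
records = [(date(2010, 6, 9), 12.5, 100),
           (date(2010, 6, 10), 0.0, 0),     # numbers on this row are missing
           (date(2010, 6, 11), 13.1, 150)]
mask = [(False, False, False),
        (False, True, True),
        (False, False, False)]
data = ma.masked_array(records, mask=mask, dtype=dtype)

print(data["price"].mean())   # masked statistics skip the missing row
```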
For storing data sets I can agree that a structured / masked ndarray
is sufficient. But I think a lot of people are primarily concerned
about data manipulations in memory (which can be currently quite
obtuse).
Well that is not storage :-)
Data manipulations are too case dependent and full of compromises between
flexibility, memory usage and CPU time. For example, do I create a
design matrix X so I can compute np.dot(X.T, X), or directly form the
product as I read the data? The former is a memory hog because I have a
potentially huge X array as well as the smaller product array; this
holds for any solving approach that works on X. Not to mention that X.T*X
is symmetric, which yields further savings, especially if you can use the
symmetric functions of blas/lapack.
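The streaming alternative described above can be sketched as a rank-1 accumulation of X'X and X'y, checked against the explicit product (the function name and the random data are illustrative):

```python
import numpy as np

def accumulate_normal_equations(row_iter, k):
    """Accumulate X'X and X'y one observation at a time, so the full
    design matrix X never has to be held in memory.  `row_iter`
    yields (x_row, y) pairs."""
    xtx = np.zeros((k, k))
    xty = np.zeros(k)
    for x, y in row_iter:
        xtx += np.outer(x, x)        # rank-1 update; xtx stays symmetric
        xty += np.asarray(x) * y
    return xtx, xty

# Check against forming X explicitly:
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = rng.normal(size=1000)
xtx, xty = accumulate_normal_equations(zip(X, y), 3)
assert np.allclose(xtx, X.T @ X)
assert np.allclose(xty, X.T @ y)
```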
If you are referring to scikits.timeseries -- it expects data
to be fixed frequency, which is too rigid an assumption for many
applications (like mine).
I am referring to any container that holds a date/time variable such as
the datetime module.
The issue that I have with all these packages like tabular, la and
pandas that extend ndarrays is the 'upstream'/'downstream' problem of
open source development. The real problem with these extensions of numpy
is that while you can have whatever storage you like, you either need to
write your own functions or preprocess the storage into an acceptable
form. So you have to rely on those extensions being updated with
numpy/scipy, since a 'fix' upstream can cause havoc downstream. I
In theory this could be a problem but of all packages to depend on in
the Python ecosystem, NumPy seems pretty safe. How many API breakages
have there been in ndarray in the last few years? Inherently this is a
risk of participating in open-source. After more than 2 years of
running a NumPy-SciPy based stack in production applications I feel
pretty comfortable. And besides, we write unit tests for a reason,
right?
subscribe to what others have said elsewhere in the open source community:
it is very important to get your desired features upstream to
the original project source, preferably numpy, but scipy also counts.
From my experience developing pandas it's not clear to me what I've
done that _should_ make its way "upstream" into NumPy and / or SciPy.
You could imagine some form of high-level statistical data structure
making its way into scipy.stats but I'm not sure.
As I indicated above, you have to rewrite the functions to use some new
data structure and I think that would be a negative-sum game.
If NumPy could
incorporate something like R's NA value without substantially
degrading performance then that would be a boon to the issue of
handling missing data (which MaskedArray does do for us-- but at
non-trivial performance loss).
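The two approaches to missing data can be contrasted in a few lines: NaN-based routines are float-only but avoid carrying a separate mask array, while MaskedArray is dtype-agnostic at the cost of that extra bookkeeping:

```python
import numpy as np
import numpy.ma as ma

x = np.array([1.0, np.nan, 3.0, 4.0])

# NaN as the missing-value sentinel: fast, but only works for floats.
print(np.nansum(x))      # 8.0
print(np.nanmean(x))     # 8/3

# MaskedArray: works for any dtype, but carries a separate mask.
mx = ma.masked_invalid(x)
print(mx.mean())         # also 8/3
```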
Numpy is not oriented to the same goals as S (or SAS or any other
stats application), so it is not a valid comparison to make. For example,
S was designed from the start to " support serious data analysis"
http://cm.bell-labs.com/cm/ms/departments/sia/S/history.html
and "[f]rom the beginning, S was designed to provide a complete
environment for data analysis"
http://cm.bell-labs.com/stat/doc/96.7.ps
There is also the issue of how S/R handles missing values as well.
Data alignment routines, groupby (which
is already on the table), and NaN / missing data sensitive moving
window functions (mean, median, std, etc.) would be nice general
additions as well. Any other ideas?
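A NaN-sensitive moving mean of the kind suggested might be sketched as follows; this is a simple O(n * window) loop, not an optimized implementation:

```python
import numpy as np

def rolling_nanmean(x, window):
    """Moving mean that skips NaNs inside each window.
    Positions before the first full window are left as NaN."""
    x = np.asarray(x, dtype=float)
    out = np.full(len(x), np.nan)
    for i in range(window - 1, len(x)):
        chunk = x[i - window + 1 : i + 1]
        if not np.all(np.isnan(chunk)):   # guard the all-NaN case
            out[i] = np.nanmean(chunk)
    return out

print(rolling_nanmean([1.0, np.nan, 3.0, 4.0], 2))
# first entry stays NaN (incomplete window); NaNs are skipped elsewhere
```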
At present I am waiting to see what happens with pystatsmodels, as Python
stats analysis is not as high on my list as other Python things.
Bruce