Greetings,
Speaking of brute force... I've attached a rather ugly module that
let's you do things with a pretty simple interface (session shown
below). I haven't fully tested the performance, but a million
records with 5 fields takes about 11 seconds on my Mac to do a
'mean'. I'm not sure what your performance considerations are, but
this may be useful. Record arrays are really nice if they make sense
for your data.
Travis
(from an ipython command prompt)
In [1]: import testpivot as p
In [2]: a = p.sample_data()
In [3]: a
Out[3]:
recarray([('ACorp', 'Region 1', 'Q1', 20000.0),
('ACorp', 'Region 1', 'Q2', 22000.0),
('ACorp', 'Region 1', 'Q3', 21000.0),
('ACorp', 'Region 1', 'Q4', 26000.0),
('ACorp', 'Region 2', 'Q1', 23000.0),
('ACorp', 'Region 2', 'Q2', 20000.0),
('ACorp', 'Region 2', 'Q3', 22000.0),
('ACorp', 'Region 2', 'Q4', 21000.0),
('ACorp', 'Region 3', 'Q1', 26000.0),
('ACorp', 'Region 3', 'Q2', 23000.0),
('ACorp', 'Region 3', 'Q3', 29000.0),
('ACorp', 'Region 3', 'Q4', 27000.0),
('BCorp', 'Region 1', 'Q1', 20000.0),
('BCorp', 'Region 1', 'Q2', 20000.0),
('BCorp', 'Region 1', 'Q3', 24000.0),
('BCorp', 'Region 1', 'Q4', 24000.0),
('BCorp', 'Region 2', 'Q1', 21000.0),
('BCorp', 'Region 2', 'Q2', 21000.0),
('BCorp', 'Region 2', 'Q3', 22000.0),
('BCorp', 'Region 2', 'Q4', 29000.0),
('BCorp', 'Region 3', 'Q1', 28000.0),
('BCorp', 'Region 3', 'Q2', 25000.0),
('BCorp', 'Region 3', 'Q3', 22000.0),
('BCorp', 'Region 3', 'Q4', 21000.0)],
dtype=[('company', '|S5'), ('region', '|S8'), ('quarter', '|
S2'), ('income', '<f8')])
In [4]: p.pivot(a, 'company', 'region', 'income', p.psum)
######## Summary by company and region ##########
cols:['ACorp' 'BCorp']
rows:['Region 1' 'Region 2' 'Region 3']
[[ 89000. 88000.]
[ 86000. 93000.]
[ 105000. 96000.]]
In [5]: p.pivot(a, 'company', 'quarter', 'income', p.psum)
######## Summary by company and quarter ##########
cols:['ACorp' 'BCorp']
rows:['Q1' 'Q2' 'Q3' 'Q4']
[[ 69000. 69000.]
[ 65000. 66000.]
[ 72000. 68000.]
[ 74000. 74000.]]
In [6]: p.pivot(a, 'company', 'quarter', 'income', p.pmean)
######## Summary by company and quarter ##########
cols:['ACorp' 'BCorp']
rows:['Q1' 'Q2' 'Q3' 'Q4']
[[ 23000. 23000. ]
[ 21666.66666667 22000. ]
[ 24000. 22666.66666667]
[ 24666.66666667 24666.66666667]]
import numpy as N
from pprint import pprint
def sample_data():
# return a 'big-enough' sample record array to play with
# note: the strings should be int keys in any real table
company_column = ['ACorp', 'ACorp', 'ACorp', 'ACorp',
'ACorp', 'ACorp', 'ACorp', 'ACorp',
'ACorp', 'ACorp', 'ACorp', 'ACorp',
'BCorp', 'BCorp', 'BCorp', 'BCorp',
'BCorp', 'BCorp', 'BCorp', 'BCorp',
'BCorp', 'BCorp', 'BCorp', 'BCorp' ]
region_column = ['Region 1', 'Region 1', 'Region 1', 'Region 1',
'Region 2', 'Region 2', 'Region 2', 'Region 2',
'Region 3', 'Region 3', 'Region 3', 'Region 3',
'Region 1', 'Region 1', 'Region 1', 'Region 1',
'Region 2', 'Region 2', 'Region 2', 'Region 2',
'Region 3', 'Region 3', 'Region 3', 'Region 3']
quarter_column = ['Q1', 'Q2', 'Q3', 'Q4',
'Q1', 'Q2', 'Q3', 'Q4',
'Q1', 'Q2', 'Q3', 'Q4',
'Q1', 'Q2', 'Q3', 'Q4',
'Q1', 'Q2', 'Q3', 'Q4',
'Q1', 'Q2', 'Q3', 'Q4']
income_column = [20000., 22000., 21000., 26000.,
23000., 20000., 22000., 21000.,
26000., 23000., 29000., 27000.,
20000., 20000., 24000., 24000.,
21000., 21000., 22000., 29000.,
28000., 25000., 22000., 21000.]
return N.rec.fromarrays([company_column, region_column, quarter_column, income_column],
names='company,region,quarter,income')
def big_sample_data(rows, cols):
""" make a big recarray with specified row count (rows), two 'name'
columns and specified data column count (cols) """
namelst = []
for i in range(rows):
namelst.append(str(N.random.randint(1,10)))
# make first two columns text that may be crosstab headings
name1 = N.array(namelst)
name2 = name1.copy()
name1.sort()
datacols=[]
collist = ['name1', 'name2']
for j in range(cols):
datacols.append(N.random.randn(rows))
collist.append('data' + str(j))
#print name1, name2, datacols
reclist = [name1, name2]
#print reclist
reclist.extend(datacols)
#print reclist
colnames = ','.join(collist)
# dump all the names and data into a recarray
return N.rec.fromarrays(reclist, names=colnames)
#return reclist, colnames
def subtotal(ary, col1, col2):
# Subtotals by col1
print '######### Subtotals by %s: ###########' % col1
for itm in N.unique(ary.field(col1)):
msk = N.array(ary.field(col1)==itm)
print "%s: %s" % (itm, ary.field(col2)[msk].sum())
def psum(ary):
return ary.sum()
def pmean(ary):
return ary.mean()
def pmax(ary):
return ary.max()
def pmin(ary):
return ary.min()
def pivot(ary, col1, col2, col3, func=psum):
""" inputs: ary - numpy record array
col1 - column from rec array in pivot columns
col2 - column from rec array in pivot rows
col3 - column from rec array to aggregate
func - callable to perform aggregate operation
"""
print '######## Summary by %s and %s ##########' % (col1, col2)
i=0
# get list of unique values
top_row = N.unique(ary.field(col2))
left_col = N.unique(ary.field(col1))
pt = N.zeros((len(top_row), len(left_col)))
for itm in left_col:
msk = N.array(ary.field(col1)==itm)
j=0
for col in top_row:
msk2 = msk&N.array(ary.field(col2)==col)
val = func(ary.field(col3)[msk2])
pt[j,i] = val
j+=1
i+=1
print 'cols:%s' % (left_col)
print 'rows:%s' % (top_row)
print pt
On Aug 1, 2007, at 2:02 PM, Bruce Southey wrote:
Hi,
The hard part is knowing what aggregate function that you want. So a
hard way, even after cheating, to take the data provided is given
below. (The Numpy Example List was very useful especially on the where
function)!
I tried to be a little generic so you can replace the sum by any
suitable function and probably the array type as well. Of course it is
not complete because you still need to know the levels of the 'rows'
and 'columns' and also is not efficient as it has loops.
Bruce
from numpy import *
A=array([[1,1,10],
[1,1,20],
[1,2,30],
[2,1,40],
[2,2,50],
[2,2,60] ])
C = zeros((2,2))
for i in range(2):
crit1 = (A[:,0]==1+i)
subA=A[crit1,1:]
for j in range(2):
crit2 = (subA[:,0]==1+j)
subB=subA[crit2,1:]
C[i,j]=subB.sum()
print C
On 7/30/07, Geoffrey Zhu <[EMAIL PROTECTED]> wrote:
Hi Everyone,
I am wondering what is the best (and fast) way to build a pivot table
aside from the 'brute force way?'
I want to transform an numpy array into a pivot table. For
example, if
I have a numpy array like below:
Region Date # of Units
---------- ---------- --------------
East 1/1 10
East 1/1 20
East 1/2 30
West 1/1 40
West 1/2 50
West 1/2 60
I want to transform this into the following table, where f() is a
given aggregate function:
Date
Region 1/1 1/2
----------
East f(10,20) f(30)
West f(40) f(50,60)
I can regroup them into 'sets' and do it the brute force way, but
that
is kind of slow to execute. Does anyone know a better way?
Thanks,
Geoffrey
_______________________________________________
Numpy-discussion mailing list
[email protected]
http://projects.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________
Numpy-discussion mailing list
[email protected]
http://projects.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________
Numpy-discussion mailing list
[email protected]
http://projects.scipy.org/mailman/listinfo/numpy-discussion