Thanks guys... very handy examples from Francesc. I need to bookmark them for when I reach that point.
best,
-Abhi

On Tue, Mar 13, 2012 at 9:24 AM, Francesc Alted <[email protected]> wrote:
> On Mar 13, 2012, at 7:31 AM, Sturla Molden wrote:
>
>> On 12.03.2012 23:23, Abhishek Pratap wrote:
>>> Super awesome. I love how the Python community in general keeps the
>>> recordings available for free.
>>>
>>> @Adam: I do have some problems that I can hit NumPy with, mainly
>>> big-data based. In summary, I have millions/billions of rows of
>>> biological data on which I want to run some computation, but at the
>>> same time I need the capability to do quick lookups. I am not sure
>>> whether NumPy is applicable for quick lookups by a string-based key,
>>> right?
>>
>> Jason Kinser's book on Python for bioinformatics might be of interest,
>> though I don't always agree with his NumPy coding style.
>>
>> As for "big data", it is a problem regardless of language. The HDF5
>> library might be of help (cf. PyTables or h5py; I actually prefer the
>> latter).
>
> Yes, however IMO PyTables adapts better to the OP's lookup use case.
> For example, let's suppose a very simple key-value problem, where we
> need to locate a certain value by its key. Using h5py I get:
>
> In [1]: import numpy as np
>
> In [2]: N = 100*1000
>
> In [3]: sa = np.fromiter((('key'+str(i), i) for i in xrange(N)),
>                          dtype="S8,i4")
>
> In [4]: import h5py
>
> In [5]: f = h5py.File('h5py.h5', 'w')
>
> In [6]: d = f.create_dataset('sa', data=sa)
>
> In [7]: time [val for val in d if val[0] == 'key500']
> CPU times: user 28.34 s, sys: 0.06 s, total: 28.40 s
> Wall time: 29.25 s
> Out[7]: [('key500', 500)]
>
> Another option is to use fancy selection:
>
> In [8]: time d[d['f0']=='key500']
> CPU times: user 0.01 s, sys: 0.00 s, total: 0.01 s
> Wall time: 0.01 s
> Out[8]:
> array([('key500', 500)],
>       dtype=[('f0', 'S8'), ('f1', '<i4')])
>
> Hmm, the time resolution is too poor here. Let's use the %timeit magic:
>
> In [9]: timeit d[d['f0']=='key500']
> 100 loops, best of 3: 9.3 ms per loop
>
> which is much better. But in this case you need to load the column
> d['f0'] completely into memory, and this is *not* what you want when
> you have large tables that do not fit in memory.
>
> Using PyTables:
>
> In [10]: import tables
>
> In [11]: ft = tables.openFile('pytables.h5', 'w')
>
> In [12]: dt = ft.createTable(ft.root, 'sa', sa)
>
> In [13]: time [val[:] for val in dt if val[0] == 'key500']
> CPU times: user 0.04 s, sys: 0.01 s, total: 0.05 s
> Wall time: 0.04 s
> Out[13]: [('key500', 500)]
>
> That's almost a 100x speed-up compared with h5py. But in addition,
> PyTables has specific machinery to optimize these queries by using
> numexpr behind the scenes:
>
> In [14]: time [val[:] for val in dt.where("f0=='key500'")]
> CPU times: user 0.01 s, sys: 0.00 s, total: 0.01 s
> Wall time: 0.00 s
> Out[14]: [('key500', 500)]
>
> Again, the time resolution is too poor here. Let's use the %timeit magic:
>
> In [15]: timeit [val[:] for val in dt.where("f0=='key500'")]
> 100 loops, best of 3: 2.36 ms per loop
>
> This is an additional 10x speed-up. In fact, this is almost as fast as
> performing the query using NumPy directly:
>
> In [16]: timeit sa[sa['f0']=='key500']
> 100 loops, best of 3: 2.14 ms per loop
>
> with the difference that PyTables uses an out-of-core paradigm (i.e. it
> does not need to load the datasets completely into memory).
> And finally, PyTables does support true indexing capabilities, so that
> you do not have to read the complete dataset to get results:
>
> In [17]: dt.cols.f0.createIndex()
> Out[17]: 100000
>
> In [18]: timeit [val[:] for val in dt.where("f0=='key500'")]
> 1000 loops, best of 3: 213 us per loop
>
> which accounts for another additional 10x speedup. Of course, this
> speed-up can be *much* larger for bigger datasets, and especially for
> those that do not fit in memory. See:
>
> http://pytables.github.com/usersguide/optimization.html#accelerating-your-searches
>
> for a more detailed rationale and benchmarks on big datasets.
>
> -- Francesc Alted
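
For anyone who wants to try this end to end, here is a minimal standalone
sketch of the h5py side of the quoted session. It is just an illustration
under the same assumptions as above; the dtype "S8,i4", the file name
'h5py.h5' and the lookup key 'key500' are simply reused from Francesc's
example:

    # Sketch: key lookup in a compound HDF5 dataset via an in-memory boolean mask.
    import numpy as np
    import h5py

    N = 100 * 1000
    sa = np.fromiter((('key' + str(i), i) for i in range(N)), dtype="S8,i4")

    with h5py.File('h5py.h5', 'w') as f:
        d = f.create_dataset('sa', data=sa)
        # Reading d['f0'] pulls the whole key column into memory to build the
        # mask -- this is the in-memory cost pointed out above.
        mask = d['f0'] == b'key500'   # 'S8' data reads back as bytes on Python 3
        print(d[mask])                # prints the single matching row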
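
And a similar sketch of the PyTables side, combining the in-kernel where()
query with the column index. It uses the pre-3.0 camelCase names that appear
in the quoted session (openFile, createTable, createIndex); in PyTables 3.x
these became open_file, create_table and create_index:

    # Sketch: the same lookup with PyTables, first as an in-kernel (numexpr)
    # query and then again after indexing the key column.
    import numpy as np
    import tables

    N = 100 * 1000
    sa = np.fromiter((('key' + str(i), i) for i in range(N)), dtype="S8,i4")

    ft = tables.openFile('pytables.h5', 'w')   # tables.open_file() in 3.x
    dt = ft.createTable(ft.root, 'sa', sa)     # ft.create_table() in 3.x

    # In-kernel query: the condition is evaluated chunk by chunk on disk, so
    # the key column never has to be loaded completely into memory.
    print([row[:] for row in dt.where("f0 == 'key500'")])

    # Index the key column; subsequent where() queries on f0 can use the index.
    dt.cols.f0.createIndex()                   # .create_index() in 3.x
    print([row[:] for row in dt.where("f0 == 'key500'")])

    ft.close()

(Both sketches write small throwaway files, h5py.h5 and pytables.h5, in the
current directory.)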
