2008/9/19 Paul Moore <[EMAIL PROTECTED]>: > Anne Archibald <peridot.faceted <at> gmail.com> writes: > >> This was discussed on one of the mailing lists several months ago. It >> turns out that there is no simple way to efficiently choose without >> replacement in numpy/scipy. > > That reassures me that I'm not missing something obvious! I'm pretty new with > numpy (I've lurked here for a number of years, but never had a real-life need > to use numpy until now). > >> I posted a hack that does this somewhat >> efficiently (if SAMPLESIZE>M/2, choose the first SAMPLESIZE of a >> permutation; if SAMPLESIZE<M/2, choose with replacement and redraw any >> duplicates) but it's not vectorized across many sample sets. Is your >> problem large M or large N? what is SAMPLESIZE/M? > > It's actually large SAMPLESIZE. As an example, I'm simulating repeated deals > of poker hands from a deck of cards: M=52, N=5, SAMPLESIZE=1000000.
Er. I must have got my variable names wrong. Anyway, since you are choosing 5 from 52, repeats should be quite rare, and generating a (52,1000000) temporary array is expensive. So I would generate a (5,1000000) array with replacement, then use sort and diff (or unique) to look for any duplicates along the short axis; then fill those in with random values drawn with replacement. Repeat until there aren't any more duplicates. This is roughly the approach my code took, but my code did it for a single draw, and iterating it a million times will be expensive. If you redraw for the whole large array at once, it should be fairly cheap. The only cost will be re-sorting and re-diffing all those already-sorted hands that didn't need replacement. Even there you should be able to speed things with judicious use of fancy indexing. Anne P.S. here's an implementation, seems to work all right: def choose_without_replacement(m,n,repeats=None): """Choose n nonnegative integers less than m without replacement Returns an array of shape n, or (n,repeats). """ if repeats is None: r = 1 else: r = repeats if n>m: raise ValueError, "Cannot find %d nonnegative integers less than %d" % (n,m) if n>m/2: res = np.sort(np.random.rand(m,r).argsort(axis=0)[:n,:],axis=0) else: res = np.random.random_integers(m,size=(n,r)) while True: res = np.sort(res,axis=0) w = np.nonzero(np.diff(res,axis=0)==0) nr = len(w[0]) if nr==0: break res[w] = np.random.random_integers(m,size=nr) if repeats is None: return res[:,0] else: return res For really large values of repeats it does too much sorting; I didn't have the energy to make it pull all the ones with repeats to the beginning so that only they need to be re-sorted the next time through. Still, the expected number of trips through the while loop grows only logarithmically with repeats, so it shouldn't be too bad. -A _______________________________________________ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion