Hi, On Tue, Jul 31, 2012 at 7:20 PM, Vlastimil Brom <vlastimil.b...@gmail.com>wrote:
> 2012/7/31 eat <e.antero.ta...@gmail.com>: > > Hi, > > > > On Tue, Jul 31, 2012 at 5:01 PM, Vlastimil Brom < > vlastimil.b...@gmail.com> > > wrote: > >> > >> 2012/7/31 eat <e.antero.ta...@gmail.com>: > >> > Hi, > >> > > >> > On Tue, Jul 31, 2012 at 10:23 AM, Vlastimil Brom > >> > <vlastimil.b...@gmail.com> > >> > wrote: > >> >> > >> >> 2012/7/30 eat <e.antero.ta...@gmail.com>: > >> >> > Hi, > >> >> > > >> >> > A partial answer to your questions: > >> >> > > >> >> > On Mon, Jul 30, 2012 at 10:33 PM, Vlastimil Brom > >> >> > <vlastimil.b...@gmail.com> > >> >> > wrote: > >> >> >> > >> >> >> Hi all, > >> >> >> I'd like to ask for some hints or advice regarding the usage of > >> >> >> numpy.array and especially slicing. > >> >> >> > >> >> >> I only recently tried numpy and was impressed by the speedup in > some > >> >> >> parts of the code, hence I suspect, that I might miss some other > >> >> >> oportunities in this area. > >> >> >> > >> >> >> I currently use the following code for a simple visualisation of > the > >> >> >> search matches within the text, the arrays are generally much > larger > >> >> >> than the sample - the texts size is generally hundreds of > kilobytes > >> >> >> up > >> >> >> to a few MB - with an index position for each character. > >> >> >> First there is a list of spans(obtained form the regex match > >> >> >> objects), > >> >> >> the respective character indices in between these slices should be > >> >> >> set > >> >> >> to 1: > >> >> >> > >> >> >> >>> import numpy > >> >> >> >>> characters_matches = numpy.zeros(10) > >> >> >> >>> matches_spans = numpy.array([[2,4], [5,9]]) > >> >> >> >>> for start, stop in matches_spans: > >> >> >> ... characters_matches[start:stop] = 1 > >> >> >> ... > >> >> >> >>> characters_matches > >> >> >> array([ 0., 0., 1., 1., 0., 1., 1., 1., 1., 0.]) > >> >> >> > >> >> >> Is there maybe a way tu achieve this in a numpy-only way - without > >> >> >> the > >> >> >> python loop? > >> >> >> (I got the impression, the powerful slicing capabilities could > make > >> >> >> it > >> >> >> possible, bud haven't found this kind of solution.) > >> >> >> > >> >> >> > >> >> >> In the next piece of code all the character positions are > evaluated > >> >> >> with their "neighbourhood" and a kind of running proportions of > the > >> >> >> matched text parts are computed (the checks_distance could be > >> >> >> generally up to the order of the half the text length, usually > less > >> >> >> : > >> >> >> > >> >> >> >>> > >> >> >> >>> check_distance = 1 > >> >> >> >>> floating_checks_proportions = [] > >> >> >> >>> for i in numpy.arange(len(characters_matches)): > >> >> >> ... lo = i - check_distance > >> >> >> ... if lo < 0: > >> >> >> ... lo = None > >> >> >> ... hi = i + check_distance + 1 > >> >> >> ... checked_sublist = characters_matches[lo:hi] > >> >> >> ... proportion = (checked_sublist.sum() / (check_distance * 2 > + > >> >> >> 1.0)) > >> >> >> ... floating_checks_proportions.append(proportion) > >> >> >> ... > >> >> >> >>> floating_checks_proportions > >> >> >> [0.0, 0.33333333333333331, 0.66666666666666663, > 0.66666666666666663, > >> >> >> 0.66666666666666663, 0.66666666666666663, 1.0, 1.0, > >> >> >> 0.66666666666666663, 0.33333333333333331] > >> >> >> >>> > >> >> > > >> >> > Define a function for proportions: > >> >> > > >> >> > from numpy import r_ > >> >> > > >> >> > from numpy.lib.stride_tricks import as_strided as ast > >> >> > > >> >> > def proportions(matches, distance= 1): > >> >> > > >> >> > cd, cd2p1, s= distance, 2* distance+ 1, matches.strides[0] > >> >> > > >> >> > # pad > >> >> > > >> >> > m= r_[[0.]* cd, matches, [0.]* cd] > >> >> > > >> >> > # create a suitable view > >> >> > > >> >> > m= ast(m, shape= (m.shape[0], cd2p1), strides= (s, s)) > >> >> > > >> >> > # average > >> >> > > >> >> > return m[:-2* cd].sum(1)/ cd2p1 > >> >> > and use it like: > >> >> > In []: matches > >> >> > Out[]: array([ 0., 0., 1., 1., 0., 1., 1., 1., 1., 0.]) > >> >> > > >> >> > In []: proportions(matches).round(2) > >> >> > Out[]: array([ 0. , 0.33, 0.67, 0.67, 0.67, 0.67, 1. , 1. > , > >> >> > 0.67, > >> >> > 0.33]) > >> >> > In []: proportions(matches, 5).round(2) > >> >> > Out[]: array([ 0.27, 0.36, 0.45, 0.55, 0.55, 0.55, 0.55, > 0.55, > >> >> > 0.45, > >> >> > 0.36]) > >> >> >> > >> >> >> > >> >> >> I'd like to ask about the possible better approaches, as it > doesn't > >> >> >> look very elegant to me, and I obviously don't know the > implications > >> >> >> or possible drawbacks of numpy arrays in some scenarios. > >> >> >> > >> >> >> the pattern > >> >> >> for i in range(len(...)): is usually considered inadequate in > >> >> >> python, > >> >> >> but what should be used in this case as the indices are primarily > >> >> >> needed? > >> >> >> is something to be gained or lost using (x)range or np.arange as > the > >> >> >> python loop is (probably?) inevitable anyway? > >> >> > > >> >> > Here np.arange(.) will create a new array and potentially wasting > >> >> > memory > >> >> > if > >> >> > it's not otherwise used. IMO nothing wrong looping with xrange(.) > (if > >> >> > you > >> >> > really need to loop ;). > >> >> >> > >> >> >> Is there some mor elegant way to check for the "underflowing" > lower > >> >> >> bound "lo" to replace with None? > >> >> >> > >> >> >> Is it significant, which container is used to collect the results > of > >> >> >> the computation in the python loop - i.e. python list or a numpy > >> >> >> array? > >> >> >> (Could possibly matplotlib cooperate better with either > container?) > >> >> >> > >> >> >> And of course, are there maybe other things, which should be made > >> >> >> better/differently? > >> >> >> > >> >> >> (using Numpy 1.6.2, python 2.7.3, win XP) > >> >> > > >> >> > > >> >> > My 2 cents, > >> >> > -eat > >> >> >> > >> >> >> Thanks in advance for any hints or suggestions, > >> >> >> regards, > >> >> >> Vlastimil Brom > >> >> >> _______________________________________________ > >> >> >> NumPy-Discussion mailing list > >> >> >> NumPy-Discussion@scipy.org > >> >> >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > >> >> > > >> >> Hi, > >> >> thank you very much for your suggestions! > >> >> > >> >> do I understand it correctly, that I have to special-case the > function > >> >> for distance = 0 (which should return the matches themselves without > >> >> recalculation)? > >> > > >> > Yes. > >> >> > >> >> > >> >> However, more importantly, I am getting a ValueError for some larger, > >> >> (but not completely unreasonable) "distance" > >> >> > >> >> >>> proportions(matches, distance= 8190) > >> >> Traceback (most recent call last): > >> >> File "<input>", line 1, in <module> > >> >> File "<input>", line 11, in proportions > >> >> File "C:\Python27\lib\site-packages\numpy\lib\stride_tricks.py", > >> >> line 28, in as_strided > >> >> return np.asarray(DummyArray(interface, base=x)) > >> >> File "C:\Python27\lib\site-packages\numpy\core\numeric.py", line > >> >> 235, in asarray > >> >> return array(a, dtype, copy=False, order=order) > >> >> ValueError: array is too big. > >> >> >>> > >> >> > >> >> the distance= 8189 was the largest which worked in this snippet, > >> >> however, it might be data-dependent, as I got this error as well e.g. > >> >> for distance=4529 for a 20k text. > >> >> > >> >> Is this implementation-limited, or could it be solved in some > >> >> alternative way which wouldn't have such limits (up to the order of, > >> >> say, millions)? > >> > > >> > Apparently ast(.) does not return a view of the original matches > rather > >> > a > >> > copy of size (n* (2* distance+ 1)), thus you may run out of memory. > >> > > >> > Surely it can be solved up to millions of matches, but perhaps much > >> > slower > >> > speed. > >> > > >> > > >> > Regards, > >> > -eat > >> >> > >> >> > >> >> Thanks again > >> >> regards > >> >> vbr > >> >> > >> >> _______________________________________________ > >> >> NumPy-Discussion mailing list > >> >> NumPy-Discussion@scipy.org > >> >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > >> > > >> > >> Thank you for the confirmation, > >> I'll wait and see, whether the current speed isn't actually already > >> acceptable for the most cases... > >> I could already gain a speedup by using the array.sum() and other > >> features, maybe I will find yet other possibilities. > > > > I just cooked up some pure pyhton and running sum based solution, which > > actually may be faster and it scales quite well up to millions of > matches: > > > > def proportions_p(matches, distance= 1): > > > > cd, cd2p1= distance, 2* distance+ 1 > > > > m, r= [0]* cd+ matches+ [0]* (cd+ 1), [0]* len(matches) > > > > s= sum(m[:cd2p1]) > > > > for k in xrange(len(matches)): > > > > r[k]= s/ cd2p1 > > > > s-= m[k] > > > > s+= m[cd2p1+ k] > > > > return r > > > > > > Some verification and timings: > > In []: a= arange(1, 100000, dtype= float) > > In []: allclose(proportions(a, 1000), proportions_p(a.tolist(), 1000)) > > Out[]: True > > > > In []: %timeit proportions(a, 1000) > > 1 loops, best of 3: 288 ms per loop > > In []: %timeit proportions_p(a.tolist(), 1000) > > 10 loops, best of 3: 66.2 ms per loop > > > > In []: a= arange(1, 1000000, dtype= float) > > In []: %timeit proportions(a, 10000) > > ------------------------------------------------------------ > > Traceback (most recent call last): > > [snip] > > ValueError: array is too big. > > > > In []: %timeit proportions_p(a.tolist(), 10000) > > 1 loops, best of 3: 680 ms per loop > > > > > > Regards, > > -eat > >> > >> > >> regards, > >> > >> vbr > >> _______________________________________________ > >> NumPy-Discussion mailing list > >> NumPy-Discussion@scipy.org > >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > > Thanks for further assistance; > I hope, I am not misunderstanding something in the code, but this > calculation of proportions is supposed to be run over an array of > either 0 or 1, rather than a range; > a test data would be something like: > It really shouldn't matter what the test case numbers are. We are just calculating averages over of certain 'window', like: In []: proportions(arange(1, 4, dtype= float)) Out[]: array([ 1. , 2. , 1.66666667]) In []: (0.+ 1.+ 2.)/ 3 Out[]: 1.0 In []: (1.+ 2.+ 3.)/ 3 Out[]: 2.0 In []: (2.+ 3.+ 0.)/ 3 Out[]: 1.6666666666666667 Regards, -eat > > import random > test_lst = [0,1]*500 > random.shuffle(test_lst) > > For this data, I am not getting the same results like with my > previously posted python function. > > Thanks and regards > vbr > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion@scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion >
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion