2012/7/31 eat <e.antero.ta...@gmail.com>:
> Hi,
>
> On Tue, Jul 31, 2012 at 10:23 AM, Vlastimil Brom <vlastimil.b...@gmail.com>
> wrote:
>>
>> 2012/7/30 eat <e.antero.ta...@gmail.com>:
>> > Hi,
>> >
>> > A partial answer to your questions:
>> >
>> > On Mon, Jul 30, 2012 at 10:33 PM, Vlastimil Brom
>> > <vlastimil.b...@gmail.com>
>> > wrote:
>> >>
>> >> Hi all,
>> >> I'd like to ask for some hints or advice regarding the usage of
>> >> numpy.array and especially slicing.
>> >>
>> >> I only recently tried numpy and was impressed by the speedup in some
>> >> parts of the code, hence I suspect that I might be missing some other
>> >> opportunities in this area.
>> >>
>> >> I currently use the following code for a simple visualisation of the
>> >> search matches within the text. The arrays are generally much larger
>> >> than the sample - the text size is generally hundreds of kilobytes up
>> >> to a few MB - with an index position for each character.
>> >> First there is a list of spans (obtained from the regex match objects);
>> >> the respective character indices within these spans should be set
>> >> to 1:
>> >>
>> >> >>> import numpy
>> >> >>> characters_matches = numpy.zeros(10)
>> >> >>> matches_spans = numpy.array([[2,4], [5,9]])
>> >> >>> for start, stop in matches_spans:
>> >> ...     characters_matches[start:stop] = 1
>> >> ...
>> >> >>> characters_matches
>> >> array([ 0., 0., 1., 1., 0., 1., 1., 1., 1., 0.])
>> >>
>> >> Is there maybe a way to achieve this in a numpy-only way - without the
>> >> python loop?
>> >> (I got the impression that the powerful slicing capabilities could make
>> >> it possible, but haven't found this kind of solution.)
>> >>
>> >>
>> >> In the next piece of code all the character positions are evaluated
>> >> with their "neighbourhood" and a kind of running proportion of the
>> >> matched text parts is computed (the check_distance could be
>> >> generally up to the order of half the text length, usually less):
>> >>
>> >> >>> check_distance = 1
>> >> >>> floating_checks_proportions = []
>> >> >>> for i in numpy.arange(len(characters_matches)):
>> >> ...     lo = i - check_distance
>> >> ...     if lo < 0:
>> >> ...         lo = None
>> >> ...     hi = i + check_distance + 1
>> >> ...     checked_sublist = characters_matches[lo:hi]
>> >> ...     proportion = (checked_sublist.sum() / (check_distance * 2 + 1.0))
>> >> ...     floating_checks_proportions.append(proportion)
>> >> ...
>> >> >>> floating_checks_proportions
>> >> [0.0, 0.33333333333333331, 0.66666666666666663, 0.66666666666666663,
>> >> 0.66666666666666663, 0.66666666666666663, 1.0, 1.0,
>> >> 0.66666666666666663, 0.33333333333333331]
>> >> >>>
>> >
>> > Define a function for proportions:
>> >
>> > from numpy import r_
>> > from numpy.lib.stride_tricks import as_strided as ast
>> >
>> > def proportions(matches, distance= 1):
>> >     cd, cd2p1, s= distance, 2* distance+ 1, matches.strides[0]
>> >     # pad
>> >     m= r_[[0.]* cd, matches, [0.]* cd]
>> >     # create a suitable view
>> >     m= ast(m, shape= (m.shape[0], cd2p1), strides= (s, s))
>> >     # average
>> >     return m[:-2* cd].sum(1)/ cd2p1
>> >
>> > and use it like:
>> > In []: matches
>> > Out[]: array([ 0., 0., 1., 1., 0., 1., 1., 1., 1., 0.])
>> >
>> > In []: proportions(matches).round(2)
>> > Out[]: array([ 0. , 0.33, 0.67, 0.67, 0.67, 0.67, 1. , 1. , 0.67, 0.33])
>> > In []: proportions(matches, 5).round(2)
>> > Out[]: array([ 0.27, 0.36, 0.45, 0.55, 0.55, 0.55, 0.55, 0.55, 0.45, 0.36])
>> >>
>> >>
>> >> I'd like to ask about possible better approaches, as it doesn't
>> >> look very elegant to me, and I obviously don't know the implications
>> >> or possible drawbacks of numpy arrays in some scenarios.
>> >>
>> >> The pattern
>> >> for i in range(len(...)): is usually considered inadequate in python,
>> >> but what should be used in this case, as the indices are primarily
>> >> needed?
>> >> Is something to be gained or lost using (x)range or np.arange, as the
>> >> python loop is (probably?) inevitable anyway?
>> >
>> > Here np.arange(.) will create a new array, potentially wasting memory
>> > if it's not otherwise used. IMO there is nothing wrong with looping
>> > with xrange(.) (if you really need to loop ;).
>> >>
>> >> Is there some more elegant way to check for the "underflowing" lower
>> >> bound "lo" to replace with None?
>> >>
>> >> Is it significant which container is used to collect the results of
>> >> the computation in the python loop - i.e. python list or a numpy
>> >> array?
>> >> (Could matplotlib possibly cooperate better with either container?)
>> >>
>> >> And of course, are there maybe other things which should be done
>> >> better/differently?
>> >>
>> >> (using Numpy 1.6.2, python 2.7.3, win XP)
>> >
>> >
>> > My 2 cents,
>> > -eat
>> >>
>> >> Thanks in advance for any hints or suggestions,
>> >> regards,
>> >> Vlastimil Brom
>> >> _______________________________________________
>> >> NumPy-Discussion mailing list
>> >> NumPy-Discussion@scipy.org
>> >> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>> >
>> Hi,
>> thank you very much for your suggestions!
>>
>> Do I understand it correctly that I have to special-case the function
>> for distance = 0 (which should return the matches themselves without
>> recalculation)?
>
> Yes.
>>
>>
>> However, more importantly, I am getting a ValueError for some larger
>> (but not completely unreasonable) "distance":
>>
>> >>> proportions(matches, distance= 8190)
>> Traceback (most recent call last):
>>   File "<input>", line 1, in <module>
>>   File "<input>", line 11, in proportions
>>   File "C:\Python27\lib\site-packages\numpy\lib\stride_tricks.py", line 28, in as_strided
>>     return np.asarray(DummyArray(interface, base=x))
>>   File "C:\Python27\lib\site-packages\numpy\core\numeric.py", line 235, in asarray
>>     return array(a, dtype, copy=False, order=order)
>> ValueError: array is too big.
>> >>>
>>
>> distance= 8189 was the largest which worked in this snippet;
>> however, it might be data-dependent, as I also got this error e.g.
>> for distance=4529 for a 20k text.
>>
>> Is this implementation-limited, or could it be solved in some
>> alternative way which wouldn't have such limits (up to the order of,
>> say, millions)?
>
> Apparently ast(.) does not return a view of the original matches, but
> rather a copy of size (n* (2* distance+ 1)), thus you may run out of memory.
>
> Surely it can be solved for up to millions of matches, but perhaps at a
> much slower speed.
>
>
> Regards,
> -eat
>>
>>
>> Thanks again
>> regards
>> vbr
>
Thank you for the confirmation.
I'll wait and see whether the current speed isn't actually already
acceptable for most cases...
I have already gained a speedup by using array.sum() and other
features; maybe I will find yet other possibilities.

regards,
vbr
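
For reference, one possible direction for the two open questions in this thread (marking the spans without a python loop, and computing the running proportions for a large "distance") is sketched below. This is only a sketch under assumptions, not something proposed in the messages above: it swaps the as_strided window for a cumulative sum, so only arrays of about the text length are allocated and the (n, 2*distance+1) window is never built. The name mark_spans is hypothetical; proportions reuses the name from the thread but is a different implementation. It assumes half-open [start, stop) spans with 0 <= start <= stop <= length, and a NumPy whose bincount accepts minlength.

```python
import numpy as np


def mark_spans(length, spans):
    """Return a float array with 1.0 inside any half-open [start, stop) span.

    Hypothetical helper: assumes 0 <= start <= stop <= length for every span.
    """
    spans = np.asarray(spans, dtype=np.intp).reshape(-1, 2)
    # Count how many spans open and close at each position.
    opens = np.bincount(spans[:, 0], minlength=length + 1)
    closes = np.bincount(spans[:, 1], minlength=length + 1)
    # The running sum of (opens - closes) is the number of spans covering
    # each position; anything above zero is "matched" (overlaps are fine).
    depth = np.cumsum(opens[:length] - closes[:length])
    return (depth > 0).astype(float)


def proportions(matches, distance=1):
    """Mean of matches over [i - distance, i + distance] for every i.

    Positions outside the array count as 0, as in the padded version
    quoted in the thread. Memory use is O(len(matches)).
    """
    matches = np.asarray(matches, dtype=float)
    n = len(matches)
    # csum[k] holds the sum of matches[:k], so a window sum is a difference.
    csum = np.concatenate(([0.0], np.cumsum(matches)))
    idx = np.arange(n)
    lo = np.clip(idx - distance, 0, n)       # window start, clipped at 0
    hi = np.clip(idx + distance + 1, 0, n)   # window end (exclusive), clipped at n
    return (csum[hi] - csum[lo]) / (2 * distance + 1.0)


matches = mark_spans(10, [[2, 4], [5, 9]])
print(matches)
print(proportions(matches).round(2))
print(proportions(matches, 5).round(2))
```

With this formulation distance = 0 should not need special-casing (each window then contains just the element itself), and a distance larger than the text simply clips to the whole array.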