Hi,

On Tue, Jul 31, 2012 at 10:23 AM, Vlastimil Brom <vlastimil.b...@gmail.com> wrote:
> 2012/7/30 eat <e.antero.ta...@gmail.com>:
> > Hi,
> >
> > A partial answer to your questions:
> >
> > On Mon, Jul 30, 2012 at 10:33 PM, Vlastimil Brom <vlastimil.b...@gmail.com>
> > wrote:
> >>
> >> Hi all,
> >> I'd like to ask for some hints or advice regarding the usage of
> >> numpy.array and especially slicing.
> >>
> >> I only recently tried numpy and was impressed by the speedup in some
> >> parts of the code, hence I suspect that I might be missing some other
> >> opportunities in this area.
> >>
> >> I currently use the following code for a simple visualisation of the
> >> search matches within the text, the arrays are generally much larger
> >> than the sample - the text size is generally hundreds of kilobytes up
> >> to a few MB - with an index position for each character.
> >> First there is a list of spans (obtained from the regex match objects),
> >> the respective character indices in between these slices should be set
> >> to 1:
> >>
> >> >>> import numpy
> >> >>> characters_matches = numpy.zeros(10)
> >> >>> matches_spans = numpy.array([[2,4], [5,9]])
> >> >>> for start, stop in matches_spans:
> >> ...     characters_matches[start:stop] = 1
> >> ...
> >> >>> characters_matches
> >> array([ 0., 0., 1., 1., 0., 1., 1., 1., 1., 0.])
> >>
> >> Is there maybe a way to achieve this in a numpy-only way - without the
> >> python loop?
> >> (I got the impression the powerful slicing capabilities could make it
> >> possible, but haven't found this kind of solution.)
> >>
> >>
> >> In the next piece of code all the character positions are evaluated
> >> with their "neighbourhood" and a kind of running proportions of the
> >> matched text parts are computed (the check_distance could be
> >> generally up to the order of half the text length, usually less):
> >>
> >> >>>
> >> >>> check_distance = 1
> >> >>> floating_checks_proportions = []
> >> >>> for i in numpy.arange(len(characters_matches)):
> >> ...     lo = i - check_distance
> >> ...     if lo < 0:
> >> ...         lo = None
> >> ...     hi = i + check_distance + 1
> >> ...     checked_sublist = characters_matches[lo:hi]
> >> ...     proportion = (checked_sublist.sum() / (check_distance * 2 + 1.0))
> >> ...     floating_checks_proportions.append(proportion)
> >> ...
> >> >>> floating_checks_proportions
> >> [0.0, 0.33333333333333331, 0.66666666666666663, 0.66666666666666663,
> >> 0.66666666666666663, 0.66666666666666663, 1.0, 1.0,
> >> 0.66666666666666663, 0.33333333333333331]
> >> >>>
> >
> > Define a function for proportions:
> >
> > from numpy import r_
> > from numpy.lib.stride_tricks import as_strided as ast
> >
> > def proportions(matches, distance= 1):
> >     cd, cd2p1, s= distance, 2* distance+ 1, matches.strides[0]
> >     # pad
> >     m= r_[[0.]* cd, matches, [0.]* cd]
> >     # create a suitable view
> >     m= ast(m, shape= (m.shape[0], cd2p1), strides= (s, s))
> >     # average
> >     return m[:-2* cd].sum(1)/ cd2p1
> >
> > and use it like:
> >
> > In []: matches
> > Out[]: array([ 0., 0., 1., 1., 0., 1., 1., 1., 1., 0.])
> >
> > In []: proportions(matches).round(2)
> > Out[]: array([ 0. , 0.33, 0.67, 0.67, 0.67, 0.67, 1. , 1. , 0.67, 0.33])
> >
> > In []: proportions(matches, 5).round(2)
> > Out[]: array([ 0.27, 0.36, 0.45, 0.55, 0.55, 0.55, 0.55, 0.55, 0.45, 0.36])
> >>
> >>
> >> I'd like to ask about possible better approaches, as it doesn't
> >> look very elegant to me, and I obviously don't know the implications
> >> or possible drawbacks of numpy arrays in some scenarios.
> >>
> >> the pattern
> >> for i in range(len(...)): is usually considered inadequate in python,
> >> but what should be used in this case as the indices are primarily
> >> needed?
> >> is something to be gained or lost using (x)range or np.arange as the
> >> python loop is (probably?) inevitable anyway?
> >
> > Here np.arange(.) will create a new array and potentially waste memory if
> > it's not otherwise used. IMO nothing wrong with looping over xrange(.) (if you
> > really need to loop ;).
> >>
> >> Is there some more elegant way to check for the "underflowing" lower
> >> bound "lo" to replace with None?
> >>
> >> Is it significant which container is used to collect the results of
> >> the computation in the python loop - i.e. python list or a numpy
> >> array?
> >> (Could possibly matplotlib cooperate better with either container?)
> >>
> >> And of course, are there maybe other things which should be done
> >> better/differently?
> >>
> >> (using Numpy 1.6.2, python 2.7.3, win XP)
> >
> >
> > My 2 cents,
> > -eat
> >>
> >> Thanks in advance for any hints or suggestions,
> >> regards,
> >> Vlastimil Brom
> >> _______________________________________________
> >> NumPy-Discussion mailing list
> >> NumPy-Discussion@scipy.org
> >> http://mail.scipy.org/mailman/listinfo/numpy-discussion
> >
>
> Hi,
> thank you very much for your suggestions!
>
> do I understand it correctly, that I have to special-case the function
> for distance = 0 (which should return the matches themselves without
> recalculation)?
>
Yes.

>
> However, more importantly, I am getting a ValueError for some larger
> (but not completely unreasonable) "distance"
>
> >>> proportions(matches, distance= 8190)
> Traceback (most recent call last):
>   File "<input>", line 1, in <module>
>   File "<input>", line 11, in proportions
>   File "C:\Python27\lib\site-packages\numpy\lib\stride_tricks.py", line 28, in as_strided
>     return np.asarray(DummyArray(interface, base=x))
>   File "C:\Python27\lib\site-packages\numpy\core\numeric.py", line 235, in asarray
>     return array(a, dtype, copy=False, order=order)
> ValueError: array is too big.
> >>>
>
> the distance= 8189 was the largest which worked in this snippet,
> however, it might be data-dependent, as I got this error as well e.g.
> for distance=4529 for a 20k text.
>
> Is this implementation-limited, or could it be solved in some
> alternative way which wouldn't have such limits (up to the order of,
> say, millions)?
>
Apparently ast(.) does not return a view of the original matches but rather a
copy of size (n* (2* distance+ 1)), thus you may run out of memory.

Surely it can be solved for up to millions of matches, but perhaps at a much
slower speed.

Regards,
-eat

>
> Thanks again
> regards
> vbr
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
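
One possible way around the size limit asked about above, offered as a minimal sketch rather than a drop-in replacement for the proportions function: take a cumulative sum of the matches once and subtract two of its entries per position. No array of shape (n, 2* distance+ 1) is ever built, so memory stays proportional to n whatever the exact cause of the "array is too big" error was. The name running_proportions is made up for this illustration; the sketch assumes matches is a 1-D array of 0/1 values and that positions outside the text count as 0, as with the zero padding above.

```python
import numpy as np

def running_proportions(matches, distance=1):
    # Hypothetical helper, not part of NumPy.
    matches = np.asarray(matches, dtype=float)
    n = matches.shape[0]
    # csum[k] == matches[:k].sum(); the leading 0 makes csum[0] == 0
    csum = np.concatenate(([0.0], np.cumsum(matches)))
    idx = np.arange(n)
    # window [i - distance, i + distance + 1), clipped to the text bounds
    lo = np.clip(idx - distance, 0, n)
    hi = np.clip(idx + distance + 1, 0, n)
    # the divisor is the full window width, as if zero-padded at both ends
    return (csum[hi] - csum[lo]) / (2 * distance + 1.0)

matches = np.array([0., 0., 1., 1., 0., 1., 1., 1., 1., 0.])
print(running_proportions(matches).round(2))
print(running_proportions(matches, 5).round(2))
```

Because the window bounds are clipped rather than padded, distance = 0 should simply return the matches themselves without a special case, and a distance larger than the text only needs a few extra arrays of length n.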
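
For the first question in the thread (setting the matched spans to 1 without the Python loop), one possible sketch is a "difference array": count +1 at every start index and -1 at every stop index, then take a cumulative sum, which is positive exactly inside the spans. The name mark_spans is hypothetical, and the sketch assumes the spans are (start, stop) pairs with 0 <= start <= stop <= length, as regex match spans are.

```python
import numpy as np

def mark_spans(length, spans):
    # Hypothetical helper, not part of NumPy.
    spans = np.asarray(spans, dtype=np.intp).reshape(-1, 2)
    # one extra slot so that stop == length is a valid index
    starts = np.bincount(spans[:, 0], minlength=length + 1)
    stops = np.bincount(spans[:, 1], minlength=length + 1)
    # the running count of "open" spans is > 0 exactly inside a span
    depth = np.cumsum(starts - stops)[:length]
    return (depth > 0).astype(float)

print(mark_spans(10, [[2, 4], [5, 9]]))
```

Whether this beats the plain loop probably depends on the number of spans relative to the text length: with a handful of long spans the slice assignment in the loop is already cheap, while with very many short spans the vectorised version avoids the per-span Python overhead.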