On 4 April 2011 19:18, Dag Sverre Seljebotn <d.s.seljeb...@astro.uio.no> wrote:
> On 04/04/2011 05:22 PM, mark florisson wrote:
>>
>> On 4 April 2011 13:53, Dag Sverre Seljebotn<d.s.seljeb...@astro.uio.no>
>>  wrote:
>>>
>>> On 04/04/2011 01:23 PM, Stefan Behnel wrote:
>>>>
>>>> Dag Sverre Seljebotn, 04.04.2011 12:17:
>>>>>
>>>>> CEP up at http://wiki.cython.org/enhancements/prange
>>>>
>>>> """
>>>> Variable handling
>>>>
>>>> Rather than explicit declaration of shared/private variables we rely on
>>>> conventions:
>>>>
>>>>    * Thread-shared: Variables that are only read and not written in the
>>>> loop body are shared across threads. Variables that are only used in the
>>>> else block are considered shared as well.
>>>>
>>>>    * Thread-private: Variables that are assigned to in the loop body are
>>>> thread-private. Obviously, the iteration counter is thread-private as
>>>> well.
>>>>
>>>>    * Reduction: Variables that are only used on the LHS of an inplace
>>>> operator, such as s above, are marked as targets for reduction. If the
>>>> variable is also used in other ways (LHS of assignment or in an
>>>> expression)
>>>> it does instead turn into a thread-private variable. Note: This means
>>>> that
>>>> if one, e.g., inserts printf(... s) above, s is turned into a
>>>> thread-local
>>>> variable. OTOH, there is simply no way to correctly emulate the effect
>>>> printf(... s) would have in a sequential loop, so such code must be
>>>> discouraged anyway.
>>>> """
>>>>
>>>> What about simply (ab-)using Python semantics and creating a new inner
>>>> scope for the prange loop body? That would basically make the loop
>>>> behave
>>>> like a closure function, but with the looping header at the 'right'
>>>> place
>>>> rather than after the closure.
>>>
>>> I'm not quite sure what the concrete changes to the CEP this would lead
>>> to
>>> (assuming you mean this as a proposal for alternative semantics, and not
>>> an
>>> implementation detail).
>>>
>>> How would we treat reduction variables? They need to be supported, and
>>> there's nothing in Python semantics to support reduction variables, they
>>> are
>>> a rather special case everywhere. I suppose keeping the reduction clause
>>> above, or use the "nonlocal" keyword in the loop body...
>>>
>>> Also there's the else:-block, although we could make that part of the
>>> scope.
>>> And the "lastprivate" functionality, although that could be dropped
>>> without
>>> much loss.
>>>
>>>> Also, in the example, the local variable declaration of "tmp" outside of
>>>> the loop looks somewhat misplaced, although it's precedented by
>>>> comprehensions (which also have their own local scope in Cython).
>>>
>>> Well, depending on the decision of lastprivate, the declaration would
>>> need
>>> to be outside; I really like the idea of moving "cdef", and am prepared
>>> to
>>> drop lastprivate for this.
>>>
>>> Being explicit about thread-local variables does make things a lot safer
>>> to
>>> use.
>>>
>>> (One problem is that switching between serial and parallel one needs to
>>> move
>>> variable declarations. But that only happens once, and one can use
>>> "nthreads=1" to disable parallel after that.)
>>>
>>> An example would then be:
>>>
>>> def f(np.ndarray[double] x, double alpha):
>>>    cdef double s = 0, globtmp
>>>    with nogil:
>>>        for i in prange(x.shape[0]):
>>>            cdef double tmp # thread-private
>>>            tmp = alpha * i # alpha available from global scope
>>>            s += x[i] * tmp # still automatic reduction for inplace operators
>>>            # printf(...s) ->  now leads to error, since s is not declared
>>>            # thread-private but is read
>>>        else:
>>>            # tmp still available here...looks a bit strange, but useful
>>>            s += tmp * 10
>>>            globtmp = tmp # we save tmp for later
>>>        # tmp not available here, globtmp is
>>>    return s
>>>
>>> Or, we just drop support for the else block on these loops.
>>
>> I think that since we are disallowing break (for now) we shouldn't support
>> the else clause. Basically, I think we can make the CEP a tad simpler.
>>
>> I think we could declare everything outside of the prange body. Then,
>> in the prange loop body:
>>
>>     if a variable is assigned to anywhere ->  make it lastprivate
>>         - if a variable is read before assigned to ->  make it
>> firstprivate in addition to lastprivate (raise compiler error if the
>> variable is not initialized outside of the loop body)
>>
>>     if a variable is only ever read ->  make it shared (the default for
>> OpenMP)
>>
>>     if a variable has an inplace operator ->  make it a reduction
>>
>> There is really no reason to disallow reading of the reduction
>> variable (in e.g. a printf). The reduction should also be initialized
>> outside of the prange body.
>
> The reason for disallowing reading the reduction variable is that otherwise
> you have a contradiction above, since a reduction variable may also be a
> thread-local variable. Or, you disable inplace operators for thread-local
> variables? (ugh)

Yes, an inplace operator would make it a reduction variable, just like
assigning something makes it lastprivate, only reading makes it shared
and reading before writing makes it firstprivate in addition to
lastprivate. This is all implicit.
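
To make these rules concrete, here is a rough sketch in plain Python that
applies them to the body of a for loop using the standard ast module. It is
only an illustration of the classification, not how the compiler does or
would do it. The function name classify_loop_variables and the label strings
are made up for this example. As a simplification, reads within one
top-level statement are treated as happening before writes in that statement.

```python
import ast

def classify_loop_variables(source):
    """Classify names in the body of the first `for` loop found in `source`.

    Hypothetical helper: it only mirrors the implicit rules described in
    this message (reduction, lastprivate, firstprivate, shared).
    """
    tree = ast.parse(source)
    loop = next(n for n in ast.walk(tree) if isinstance(n, ast.For))
    # The iteration counter is always private, so it is not reported.
    counter = {n.id for n in ast.walk(loop.target) if isinstance(n, ast.Name)}

    kinds = {}
    assigned = set()
    for stmt in loop.body:
        nodes = list(ast.walk(stmt))
        # Names used as the callee of a call (e.g. printf) are not variables.
        callees = {id(c.func) for c in nodes if isinstance(c, ast.Call)}
        names = [n for n in nodes
                 if isinstance(n, ast.Name)
                 and id(n) not in callees
                 and n.id not in counter]

        # An inplace operator makes the target a reduction variable.
        for node in nodes:
            if isinstance(node, ast.AugAssign) and isinstance(node.target, ast.Name):
                kinds[node.target.id] = "reduction"

        # Reads: a name that has not been assigned yet starts out as shared.
        for n in names:
            if isinstance(n.ctx, ast.Load) and n.id not in assigned:
                kinds.setdefault(n.id, "shared")

        # Writes: assignment makes it lastprivate; if it was read before the
        # first assignment it is firstprivate as well.
        for n in names:
            if isinstance(n.ctx, ast.Store) and n.id not in assigned:
                if kinds.get(n.id) == "reduction":
                    continue
                read_first = kinds.get(n.id) == "shared"
                kinds[n.id] = "firstprivate+lastprivate" if read_first else "lastprivate"
                assigned.add(n.id)
    return kinds


example = """
for i in prange(n):
    show(tmp, s, other)
    tmp = alpha * i
    s += x[i] * tmp
"""
print(classify_loop_variables(example))
```

Here 'tmp' is read before it is assigned, 's' is the target of '+=', and
'other', 'alpha' and 'x' are only read.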

Alternatively, if you want it more explicit, then instead of the
inplace operator you could allow something like

    sum = cython.parallel.reduction('+', sum) + var1 * var2

instead of

    sum += var1 * var2
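
Either spelling is meant to compute the same thing. As a rough sketch of the
intended semantics in plain Python (not Cython, and not a proposed
implementation): every worker accumulates into its own private partial that
starts at the identity of the operator, and the partials are folded into the
value the variable had before the loop once all workers are done. The
function name and the chunking scheme below are assumptions made for
illustration only.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum_of_products(xs, ys, initial=0.0, nthreads=4):
    """Emulate `s += xs[i] * ys[i]` as a '+' reduction over worker threads."""
    n = len(xs)
    # Contiguous chunks, one per worker (at least 1 to keep range() valid).
    chunk = max(1, (n + nthreads - 1) // nthreads)

    def partial(start):
        acc = 0.0  # private accumulator, initialised to the identity of '+'
        for i in range(start, min(start + chunk, n)):
            acc += xs[i] * ys[i]
        return acc

    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        partials = list(pool.map(partial, range(0, n, chunk)))

    # Combine step: fold the partials into the value from before the loop.
    return initial + sum(partials)


print(parallel_sum_of_products([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```

Note that the partials are combined in a different order than a sequential
loop would use, so floating point results can differ in the last bits.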

> That's the main reason I'm leaning towards explicit declaring local
> variables using "cdef".
>
> If we're reducing complexity BTW, I'd rather remove firstprivate/lastprivate
> altogether, see below.
>> Then prange() could be implemented in pure mode as simply the
>> sequential version, i.e. range() with some more arguments.
>>
>> For any scratch space buffers etc, I'd prefer something like
>>
>>
>> with cython.parallel:
>>     cdef char *buf = malloc(100)
>>
>>     for i in prange(n):
>>         use buf
>>
>>     free(buf)
>>
>> At least it fits my brain pretty well :) (this code does however
>> assume that malloc is thread-safe).
>
> Yes...perhaps a cython.parallel block will make everybody happy:
>
>  - It's more obvious that we create a new scope, which at least answers some
> of Stefan's complaints
>
>  - We can use normal "for i in range", and put scheduling params on
> parallel(), which makes Nathaniel happy

That doesn't sound intuitive, as the scheduling pertains to the
worksharing 'for' construct, and not the entire parallel region. So
scheduling parameters should be provided to e.g.
cython.parallel.range() (or cython.prange, cython.parallel_range,
whatever).

Then if cython.parallel.range() is in a 'with cython.parallel' block,
it would have '#pragma omp for' semantics (considering OpenMP),
whereas it would be a '#pragma omp parallel for' if not closely nested
in such a block.
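
Spelled out as a small Python sketch, the choice of pragma would look
roughly like this. The function omp_pragma_for and its parameters are
hypothetical and exist only to illustrate the mapping. The generated
strings use ordinary OpenMP syntax.

```python
def omp_pragma_for(nested_in_parallel, schedule=None, chunksize=None):
    """Return the OpenMP pragma a parallel range would map to.

    nested_in_parallel: True if the loop sits directly inside a
    'with cython.parallel' block, which already opened the parallel region.
    """
    clauses = ""
    if schedule is not None:
        args = schedule if chunksize is None else f"{schedule}, {chunksize}"
        # Scheduling belongs to the worksharing 'for', not the region.
        clauses = f" schedule({args})"
    construct = "for" if nested_in_parallel else "parallel for"
    return f"#pragma omp {construct}{clauses}"


print(omp_pragma_for(True, "static"))
print(omp_pragma_for(False, "dynamic", 4))
```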

> In this case I'd say we simply do not support firstprivate, all thread-local
> variables must be declared in the block, and for firstprivate behaviour you
> just initialize them yourself which is more explicit and Pythonic. The
> "else:"-block on loops is still useful for lastprivate behaviour -- the
> point of executing the else block in one of the threads is that you can then
> copy thread-local variables of the "last" thread into shared variables to
> get lastprivate behaviour (again, more explicit and Pythonic).

Why? They are entirely implicit in my proposal, and intuitively so.
Having the parallel range match the sequential range semantics in this
way feels much more Pythonic than having to copy things over in an else
block and having to declare and define simple variables in a special
place.

So basically you keep your options open: a simple and very concise way
to do a parallel range, and a slightly more convoluted way if you need
to initialize some thread-local buffers.
And the good thing is, you can move back to the sequential range by
simply renaming cython.parallel.range to range.
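
This is also what would make a pure-mode fallback trivial. A minimal sketch,
assuming the parallel range accepts keyword arguments such as schedule and
nthreads (the exact names are not settled, so treat them as placeholders)
and assuming the explicit reduction() marker from earlier in this message:

```python
def prange(*args, schedule=None, nthreads=None):
    """Pure-Python stand-in: the sequential range, extra arguments ignored."""
    # Scheduling hints only matter in compiled code, so they are dropped here.
    return range(*args)

def reduction(op, value):
    """Pure-Python stand-in for the explicit reduction marker: a no-op."""
    return value

def dot(xs, ys):
    s = 0.0
    for i in prange(len(xs), schedule="static"):
        # Equivalent to `s += xs[i] * ys[i]` when run sequentially.
        s = reduction('+', s) + xs[i] * ys[i]
    return s


print(dot([1.0, 2.0], [3.0, 4.0]))
```

Since the stand-ins do nothing beyond what range() does, the loop body runs
unchanged whether it is compiled or interpreted.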


> If we allow "with cython.nogil, cython.parallel" we can keep the same number
> of indentation levels in some cases.

Yeah that would be nice. We could also make cython.parallel implicitly
nogil, but your approach is more flexible if we want to allow this
construct with the gil in the future.

> Also, I think there's still a use for my num_threads_that_would_spawn(), so
> that the malloc can be moved out to a GIL-holding section if one wants to --
> I may want to allocate with a NumPy array instead of malloc.

Yeah we can keep that in for full flexibility.

> Dag Sverre
> _______________________________________________
> cython-devel mailing list
> cython-devel@python.org
> http://mail.python.org/mailman/listinfo/cython-devel
>

For clarity, I'll add an example:

def f(np.ndarray[double] x, double alpha):
    cdef double s = 0
    cdef double tmp = 2
    cdef double other = 6.6

    with nogil:
        for i in prange(x.shape[0]):
            # reading 'tmp' before it is assigned makes it firstprivate in
            # addition to lastprivate
            # 'other' is only ever read, so it's shared
            printf("%lf %lf %lf\n", tmp, s, other)

            # assigning 'tmp' makes it lastprivate
            tmp = alpha * i

            # using += on 's' makes it a reduction variable with operator '+'
            s += x[i] * tmp

    # at this point, all variables s, tmp and other are well defined

    return s

NOTE: any variable that is determined to be firstprivate, shared or
reduction must be defined (initialized) before the loop, so there is no
room for implicit behaviour biting you in the behind.
