Hi Lars,

Thank you for the detailed motivating examples describing the need for a conflict detection capability in the language!
Currently there's no way to generate the vpconflictd instruction in ISPC. I see two paths to generating it: (1) introduce a library function that has the semantics of vpconflictd/q and emulate it on the platforms that don't support it, as you suggested, or (2) introduce a generic way to call arbitrary intrinsics, as proposed in https://github.com/ispc/ispc/issues/1803. Given the importance of the instruction for the many applications that you outlined, and since you see value in emulating it, I would suggest going with a dedicated library function.

I would say that your description of the problem even sounds like we are missing a special type of "foreach" loop that would serialize conflicting iterations. If we get enough motivating examples and prove that we can implement it efficiently enough, I would support that as well.

I can implement the library function pretty easily. Could you please file an issue for that? Also, if you could contribute an example of an algorithm where this instruction is useful (as a self-sufficient program that we can distribute as part of our example set), that would be really cool. All the cases that you've described make sense and sound very convincing, but I personally don't have experience with these kinds of algorithms. This also sounds like an interesting assignment for someone taking an algorithms course.

And one more question: do you see a use for the VP2INTERSECTD instruction in these algorithms?

Dmitry.

On Sat, Jul 18, 2020 at 5:23 PM Lars Friend <[email protected]> wrote:

> Hello All,
>
> I'm new to the ispc mailing list but I've been following the project itself with great interest since its first public release.
> I've been interested in using SIMD parallelism for years at this point (since I read the first public spec for the AVX2 instruction set) and have a particular interest in using it to more efficiently perform tasks which, while inherently parallel, fall outside the computation-heavy-inner-loop category which, as low-hanging fruit, received the most initial attention as CPUs began to include wider and more general SIMD units.
>
> Does anyone know if there is a way to induce ispc to emit the *vpconflictd* instruction (or its equivalent on targets without such an instruction)? If not, I think it would be a worthwhile thing to add to the Cross-Program Instance Operations already supported, in the form of a function that, given a *varying* parameter, would return a lanemask of "conflicting" lower-numbered lanes.
>
> In particular, the SPMD model is attractive for realizing significant performance gains for a certain class of problems where the core activity involves pulling incoming events off a queue, looking up the correct state machine instance, and processing the event in the context of its state machine instance, potentially producing an output event and generally updating the state machine instance as a result. Some practical examples of this type of problem are:
>
> - Packet processing (where the state machine that is looked up and updated may be, for instance, the connection the packet belongs to).
> - Simulations (where the state machine instances could be the individual agents whose behavior the simulation models).
> - Scanning streams of events for specified patterns (where each state machine instance is a potential match-in-progress).
> - and many more...
>
> One peculiarity about this class of applications is that while they are *almost* "embarrassingly parallel," there is the problem of data-dependent resource conflicts: if you're processing packets, for instance, and you have thousands of active connections, *most of the time* you can grab the next eight packets off the queue and each will belong to a different connection (thus they can be processed completely in parallel). Sometimes, however rarely, you will grab the next eight packets off the queue and discover that two of them belong to the same connection and thus must be serialized with respect to one another to avoid incorrect results (i.e. the state machine for that connection must be updated with the result of processing the first packet *before attempting to process the second packet*, since the first packet may initiate a state transition that fully changes the context in which the second must be evaluated).
>
> Unlike many problems to which SIMD parallelism is applied, this class of problem does not have static loop-carried dependencies that are predictable in advance. Rather, it has dynamic (data-dependent) dependencies which are generally rare, but correctness requires that they be identified so that no two events requiring serialization are processed concurrently. Intel was clearly aware of the degree to which this class of problem would benefit from SIMD parallelism, and in the AVX-512CD instruction set extension they provided instructions like *vpconflictd*/*vpconflictq* which, in combination with a few other AVX-512 instructions, provide an effective way to detect when it is required to serialize two work items based on their shared need of a resource. (For instance, if you're looking up state machine instances in a hash table, you could use *vpconflictd* to detect when two lanes are going to require access to the same bucket.)
> Detecting such lane-to-lane hazards early and dispatching a batch as two smaller batches to avoid the hazard allows the program logic for processing each event to remain simple and efficient.
>
> One might wonder why it makes sense to go through the effort of parallelizing a seemingly computationally trivial task like updating simple state machines in light of a stream of events. There are two reasons, which surprised me quite a bit when I first started looking into this:
>
> 1. When you're processing a stream of events, each of which can update one state machine instance out of thousands, it doesn't take too many state machine instances to reach the point where your cache hit rate in looking up the state machines becomes unavoidably low. This means that beyond some threshold number of state machine instances, your event processing rate is ultimately limited by DRAM latency. If you can use a gather to look up eight state machines from your hash table in parallel, the component loads which the CPU breaks the gather into end up, on average, being dispatched to different DRAM channels and banks. My experiments have shown that the upshot of this is that the average time it takes to fetch eight items at random from a gigabyte-sized hash table is only about 2x the average time to fetch a single item with a normal scalar load instruction.
> 2. Many simple state machines (if implemented in the most elementary way, with *if*/*else* or *switch*/*case* constructs) end up being computationally trivial but very "branchy," which is especially hard on modern CPUs with deep pipelines and a relatively high cost for mispredicted branches. This makes it advantageous to re-state (no pun intended) these state machines in terms of data flow rather than control flow (e.g. use look-up tables, or compute both sides of an *if*/*else* and use a conditional move or similar to select the correct one.
> Modern CPUs can easily exploit the instruction-level parallelism generated by computing both branches and selecting one later). Once you've re-stated most of your control flow in terms of data flow already, it is only a small step to use SIMD instructions to process multiple event+instance pairs in parallel.
>
> I feel that as more of the low-hanging fruit is picked, and as SIMD instruction sets that are more general and closer to orthogonal (compare AVX-512 with SSE, for instance) become more widely available, it will become more and more important to apply this sort of parallelism to general-purpose computing tasks as well (rather than only number crunching).
>
> Cheers,
> -lars
>
> --
> You received this message because you are subscribed to the Google Groups "Intel SPMD Program Compiler Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
> To view this discussion on the web visit https://groups.google.com/d/msgid/ispc-users/2c895a62-f77f-47c7-a187-c52fce4cc807o%40googlegroups.com.
