Hi Chris,
On 14.11.2016 at 09:31, Chris Wilson wrote:
> The primary operation on the shared fence arrays is insertion and
> retrieval. Retrieval is reasonably fast, as we just copy the array, but
> insertion into the shared fence array is slow as we must iterate over all
> current fences to discard an older fence. (Note also that since the
> array is not autopruning, we cannot hook in the fence signal callback due
> to locking constraints, thus it can grow very, very large.) This, coupled
> with the atomics to acquire and release the shared fence, causes a severe
> performance degradation and scaling bottleneck, as felt by i915.ko in
> commit d07f0e59b2c7 ("drm/i915: Move GEM activity tracking into a common
> struct reservation_object"). To speed up insertion, we want a direct
> replacement of the per-context slot. Enter the radix tree.
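For illustration, the slow path described above boils down to a linear scan over the whole array to replace an older fence from the same context. A minimal sketch of that pattern (the types and names here are illustrative, not the actual reservation_object code):

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a dma_fence: only the context id matters here. */
struct fence {
    uint64_t context;
};

struct shared_fence_list {
    size_t count;
    struct fence *fences[64]; /* illustrative fixed capacity */
};

/*
 * Insert a fence, replacing any older fence from the same context.
 * The linear scan over all current fences is the O(n) insertion cost
 * described above.
 */
static void shared_fence_add(struct shared_fence_list *list,
                             struct fence *fence)
{
    size_t i;

    for (i = 0; i < list->count; i++) {
        if (list->fences[i]->context == fence->context) {
            list->fences[i] = fence; /* replace the older fence */
            return;
        }
    }
    list->fences[list->count++] = fence; /* no match: append */
}
```

Every insertion pays for a full scan, so building the array up across N active contexts is quadratic overall; a per-context slot makes each insertion a direct replacement.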
>
> The kernel already has a couple of options for integer radix trees, but
> neither is suitable for us, for a couple of reasons. First, we need a
> per-tree preallocation in order to ensure that adding a fence does not
> fail. Both radixtree and idr only offer preallocation under a
> preempt-disabled critical section, unsuitable for our existing users. idr has
> an internal __idr_pre_get that could be exported - but idr does not yet
> support inserting entries at an arbitrary location. Secondly, neither
> tree supports full u64 integers as used by the fence context identifiers
> (though radixtree would support that on 64bit platforms). And finally,
> if we ignore the API shortcomings, radixtree still appears too high in
> the profiles!
>
> So what are our requirements for the shared fence array?
>
> - guaranteed insertion
> - fast insertion
> - RCU-safe, fast traversal
> - 64bit identifiers
>
> To guarantee insertion, we need to preallocate enough storage to satisfy
> a later insertion. The reservation object API requires every
> add_shared_fence to be preceded by a call to reserve_shared_fence. For an
> uncompressed radix tree we would need to preallocate enough layers to cover
> the full 64bit height, but with a compressed radix tree we only require a
> maximum of 2 spare layers.
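To put a number on that preallocation cost: with the kernel's default of 6 bits per radix-tree level (64 slots per node), an uncompressed tree spanning the full 64-bit key space needs ceil(64/6) = 11 layers, which is what the compressed tree's worst case of 2 spare layers avoids. A quick illustrative helper (MAP_SHIFT matches the kernel default; the function itself is made up for this example):

```c
#include <stdint.h>

/* Bits consumed per radix-tree level; 6 is the kernel's default
 * RADIX_TREE_MAP_SHIFT, i.e. 64 slots per node. */
#define MAP_SHIFT 6

/* Layers an uncompressed radix tree needs to cover keys of the
 * given width in bits (illustrative helper, not kernel code). */
static unsigned int layers_for_bits(unsigned int bits)
{
    return (bits + MAP_SHIFT - 1) / MAP_SHIFT;
}
```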
>
> The compressed tree also offers a useful property with respect to the
> pattern of
> fences used. The fence contexts are allocated from a plain incrementing
> atomic64, i.e. the id space is sparse and ever growing. However,
> timelines tend to
> cluster (i.e. they will allocate a cluster of fence contexts for
> themselves and primarily use these). So we may see that we have a
> mixture of low fence contexts and a group of very high contexts (e.g.
> a group at id<16 and a group at id>65536). The compressed tree avoids
> having to allocate the intermediate layers (reducing the pointer dancing
> on lookup and traversal) - still not as dense as the current compact
> array, but under typical conditions nearly as good. (However, the current
> array requires contiguous allocations - a scarce resource for the ever
> expanding array!) The density could be improved by switching from a
> simple dangerously wrapping atomic64 to an ida for fence context
> allocation.
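The clustering argument can be made concrete by looking at shared key prefixes: contexts from one cluster differ only in their low bits, so a path-compressed tree hangs the whole cluster off a single node and never allocates the intermediate layers. A small sketch of the prefix computation (uses the GCC/Clang `__builtin_clzll` builtin; purely illustrative):

```c
#include <stdint.h>

/* Length in bits of the common high-order prefix of two 64-bit keys.
 * Keys sharing a long prefix share a single path in a compressed
 * radix tree, so the intermediate layers are never allocated. */
static unsigned int common_prefix_bits(uint64_t a, uint64_t b)
{
    uint64_t diff = a ^ b;

    return diff ? (unsigned int)__builtin_clzll(diff) : 64;
}
```

For instance, two contexts inside the id>65536 cluster share 63 of 64 prefix bits, and even a key from the id<16 cluster and one from the id>65536 cluster still share the top 47 bits, so only a handful of levels near the leaves ever get populated.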
>
> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
Yeah, I have been thinking about a similar change for a while now because
we clearly see problems with this on amdgpu as well.
That's also the reason amdgpu internally uses a separate container for
fences. That implementation is based on a fixed-size hash table and
performs well enough for our use cases. It might be a good idea to check
that one as well, because it doesn't introduce a new container object just for
this use case.
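For readers unfamiliar with that approach: the idea is a fixed-size table keyed by fence context, so replacing an older fence from the same context is O(1) expected rather than a linear scan. A rough sketch (bucket count, probing scheme and all names here are assumptions, not the actual amdgpu code):

```c
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 128 /* power of two so masking works as a cheap hash */

struct fence {
    uint64_t context;
};

/* Fixed-size, open-addressed table keyed by fence context; an
 * illustrative sketch in the spirit of the amdgpu-internal container,
 * not its real implementation. */
struct fence_table {
    struct fence *slots[NBUCKETS];
};

/* Insert a fence, replacing an older fence from the same context.
 * Linear probing on collision; returns 0 on success, -1 if full. */
static int fence_table_add(struct fence_table *t, struct fence *f)
{
    size_t i, idx = (size_t)(f->context & (NBUCKETS - 1));

    for (i = 0; i < NBUCKETS; i++) {
        size_t slot = (idx + i) & (NBUCKETS - 1);

        if (!t->slots[slot] || t->slots[slot]->context == f->context) {
            t->slots[slot] = f;
            return 0;
        }
    }
    return -1; /* table full */
}
```

Replacement only touches the bucket for the colliding context instead of scanning a whole array, at the price of a fixed capacity.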
In general, for this kind of change I prefer that the implementation of
the container code is added separately and the existing code is then
changed to use it. That makes reviewing the changes much easier.
Regards,
Christian.