On Tue, Apr 04, 2017 at 05:10:30PM -0700, Kenneth Graunke wrote:
> The execbuf2 kernel API requires us to construct two kinds of lists.
> First is a "validation list" (struct drm_i915_gem_exec_object2[])
> containing each BO referenced by the batch.  (The batch buffer itself
> must be the last entry in this list.)  Each validation list entry
> contains a pointer to the second kind of list: a relocation list.
> The relocation list contains information about pointers to BOs that
> the kernel may need to patch up if it relocates objects within the VMA.
>
> This is a very general mechanism, allowing every BO to contain pointers
> to other BOs.  libdrm_intel models this by giving each drm_intel_bo a
> list of relocations to other BOs.  Together, these form "reloc trees".
>
> Processing relocations involves a depth-first-search of the relocation
> trees, starting from the batch buffer.  Care has to be taken not to
> double-visit buffers.  Creating the validation list has to be deferred
> until the last minute, after all relocations are emitted, so we have the
> full tree present.  Calculating the amount of aperture space required to
> pin those BOs also involves tree walking, which is expensive, so libdrm
> has hacks to try and perform less expensive estimates.
>
> For some reason, it also stored the validation list in the global
> (per-screen) bufmgr structure, rather than as a local variable in the
> execbuffer function, requiring locking for no good reason.
>
> It also assumed that the batch would probably contain a relocation
> every 2 DWords - which is absurdly high - and simply aborted if there
> were more relocations than the max.  This meant the first relocation
> from a BO would allocate 180kB of data structures!
>
> This is way too complicated for our needs.  i965 only emits relocations
> from the batchbuffer - all GPU commands and state such as SURFACE_STATE
> live in the batch BO.  No other buffer uses relocations.  This means we
> can have a single relocation list for the batchbuffer.  We can add a BO
> to the validation list (set) the first time we emit a relocation to it.
> We can easily keep a running tally of the aperture space required for
> that list by adding the BO size when we add it to the validation list.
>
> This patch overhauls the relocation system to do exactly that.  There
> are many nice benefits:
>
> - We have a flat relocation list instead of trees.
> - We can produce the validation list up front.
> - We can allocate smaller arrays and dynamically grow them.
> - Aperture space checks are now (a + b <= c) instead of a tree walk.
> - brw_batch_references() is a trivial validation list walk.
>   It should be straightforward to make it O(1) in the future.
> - We don't need to bloat each drm_bacon_bo with 32B of reloc data.
> - We don't need to lock in execbuffer, as the data structures are
>   context-local, and not per-screen.
> - Significantly less code and a better match for what we're doing.
> - The simpler system should make it easier to take advantage of
>   I915_EXEC_NO_RELOC in a future patch.
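The flat scheme the commit message describes — add each BO to the validation set the first time a relocation targets it, and keep a running tally of required aperture space — can be condensed into a standalone sketch. The structure and function names below are simplified stand-ins, not the patch's actual types:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_EXEC 16

/* Hypothetical stand-in for drm_bacon_bo: just a handle and a size. */
struct fake_bo { uint32_t handle; uint64_t size; };

struct batch {
   struct fake_bo *exec_bos[MAX_EXEC];   /* the validation list (as a set) */
   int exec_count;
   uint64_t aperture_space;              /* running tally of BO sizes */
};

/* Add a BO to the validation set the first time it is referenced.
 * Because each BO is counted exactly once, the aperture estimate is
 * just a running sum rather than a reloc-tree walk. */
static void
add_exec_bo(struct batch *batch, struct fake_bo *bo)
{
   for (int i = 0; i < batch->exec_count; i++)
      if (batch->exec_bos[i] == bo)
         return;   /* already in the set; nothing to do */

   batch->exec_bos[batch->exec_count++] = bo;
   batch->aperture_space += bo->size;
}

/* The aperture check becomes (a + b <= c) instead of a tree walk. */
static bool
has_aperture_space(const struct batch *batch, uint64_t extra,
                   uint64_t threshold)
{
   return batch->aperture_space + extra <= threshold;
}
```

Emitting the same relocation target twice costs only a short list scan and adds nothing to the tally, which is what makes the up-front validation list cheap to maintain.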
> ---
>  src/mesa/drivers/dri/i965/brw_bufmgr.h        |  53 +-
>  src/mesa/drivers/dri/i965/brw_compute.c       |   2 +-
>  src/mesa/drivers/dri/i965/brw_context.h       |  12 +
>  src/mesa/drivers/dri/i965/brw_draw.c          |   2 +-
>  src/mesa/drivers/dri/i965/genX_blorp_exec.c   |   2 +-
>  src/mesa/drivers/dri/i965/intel_batchbuffer.c | 215 +++++++-
>  src/mesa/drivers/dri/i965/intel_batchbuffer.h |   3 +
>  src/mesa/drivers/dri/i965/intel_blit.c        |  30 +-
>  src/mesa/drivers/dri/i965/intel_bufmgr_gem.c  | 718 +-------------------------
>  9 files changed, 229 insertions(+), 808 deletions(-)
>
> diff --git a/src/mesa/drivers/dri/i965/brw_bufmgr.h b/src/mesa/drivers/dri/i965/brw_bufmgr.h
> index 237f39bb078..d3db6a3967b 100644
> --- a/src/mesa/drivers/dri/i965/brw_bufmgr.h
> +++ b/src/mesa/drivers/dri/i965/brw_bufmgr.h
> @@ -88,6 +88,15 @@ struct _drm_bacon_bo {
>      * entries when calling drm_bacon_bo_emit_reloc()
>      */
>     uint64_t offset64;
> +
> +   /**
> +    * Boolean of whether the GPU is definitely not accessing the buffer.
> +    *
> +    * This is only valid when reusable, since non-reusable
> +    * buffers are those that have been shared with other
> +    * processes, so we don't know their state.
> +    */
> +   bool idle;
>  };
>
>  #define BO_ALLOC_FOR_RENDER (1<<0)
> @@ -178,37 +187,6 @@ void drm_bacon_bo_wait_rendering(drm_bacon_bo *bo);
>   */
>  void drm_bacon_bufmgr_destroy(drm_bacon_bufmgr *bufmgr);
>
> -/** Executes the command buffer pointed to by bo. */
> -int drm_bacon_bo_exec(drm_bacon_bo *bo, int used);
> -
> -/** Executes the command buffer pointed to by bo on the selected ring buffer */
> -int drm_bacon_bo_mrb_exec(drm_bacon_bo *bo, int used, unsigned int flags);
> -int drm_bacon_bufmgr_check_aperture_space(drm_bacon_bo **bo_array, int count);
> -
> -/**
> - * Add relocation entry in reloc_buf, which will be updated with the
> - * target buffer's real offset on command submission.
> - *
> - * Relocations remain in place for the lifetime of the buffer object.
> - *
> - * \param bo Buffer to write the relocation into.
> - * \param offset Byte offset within reloc_bo of the pointer to
> - *       target_bo.
> - * \param target_bo Buffer whose offset should be written into the
> - *       relocation entry.
> - * \param target_offset Constant value to be added to target_bo's
> - *       offset in relocation entry.
> - * \param read_domains GEM read domains which the buffer will be
> - *       read into by the command that this relocation
> - *       is part of.
> - * \param write_domain GEM write domain which the buffer will be
> - *       dirtied in by the command that this
> - *       relocation is part of.
> - */
> -int drm_bacon_bo_emit_reloc(drm_bacon_bo *bo, uint32_t offset,
> -                            drm_bacon_bo *target_bo, uint32_t target_offset,
> -                            uint32_t read_domains, uint32_t write_domain);
> -
>  /**
>   * Ask that the buffer be placed in tiling mode
>   *
> @@ -271,9 +249,6 @@ int drm_bacon_bo_disable_reuse(drm_bacon_bo *bo);
>   */
>  int drm_bacon_bo_is_reusable(drm_bacon_bo *bo);
>
> -/** Returns true if target_bo is in the relocation tree rooted at bo. */
> -int drm_bacon_bo_references(drm_bacon_bo *bo, drm_bacon_bo *target_bo);
> -
>  /* drm_bacon_bufmgr_gem.c */
>  drm_bacon_bufmgr *drm_bacon_bufmgr_gem_init(struct gen_device_info *devinfo,
>                                              int fd, int batch_size);
> @@ -290,8 +265,6 @@ void *drm_bacon_gem_bo_map__cpu(drm_bacon_bo *bo);
>  void *drm_bacon_gem_bo_map__gtt(drm_bacon_bo *bo);
>  void *drm_bacon_gem_bo_map__wc(drm_bacon_bo *bo);
>
> -int drm_bacon_gem_bo_get_reloc_count(drm_bacon_bo *bo);
> -void drm_bacon_gem_bo_clear_relocs(drm_bacon_bo *bo, int start);
>  void drm_bacon_gem_bo_start_gtt_access(drm_bacon_bo *bo, int write_enable);
>
>  int drm_bacon_gem_bo_wait(drm_bacon_bo *bo, int64_t timeout_ns);
> @@ -300,14 +273,6 @@ drm_bacon_context *drm_bacon_gem_context_create(drm_bacon_bufmgr *bufmgr);
>  int drm_bacon_gem_context_get_id(drm_bacon_context *ctx,
>                                   uint32_t *ctx_id);
>  void drm_bacon_gem_context_destroy(drm_bacon_context *ctx);
> -int drm_bacon_gem_bo_context_exec(drm_bacon_bo *bo, drm_bacon_context *ctx,
> -                                  int used, unsigned int flags);
> -int drm_bacon_gem_bo_fence_exec(drm_bacon_bo *bo,
> -                                drm_bacon_context *ctx,
> -                                int used,
> -                                int in_fence,
> -                                int *out_fence,
> -                                unsigned int flags);
>
>  int drm_bacon_bo_gem_export_to_prime(drm_bacon_bo *bo, int *prime_fd);
>  drm_bacon_bo *drm_bacon_bo_gem_create_from_prime(drm_bacon_bufmgr *bufmgr,
> diff --git a/src/mesa/drivers/dri/i965/brw_compute.c b/src/mesa/drivers/dri/i965/brw_compute.c
> index d816d056aad..e924401c3af 100644
> --- a/src/mesa/drivers/dri/i965/brw_compute.c
> +++ b/src/mesa/drivers/dri/i965/brw_compute.c
> @@ -212,7 +212,7 @@ brw_dispatch_compute_common(struct gl_context *ctx)
>
>     brw->no_batch_wrap = false;
>
> -   if (drm_bacon_bufmgr_check_aperture_space(&brw->batch.bo, 1)) {
> +   if (!brw_batch_has_aperture_space(brw, 0)) {
>        if (!fail_next) {
>           intel_batchbuffer_reset_to_saved(brw);
>           intel_batchbuffer_flush(brw);
> diff --git a/src/mesa/drivers/dri/i965/brw_context.h b/src/mesa/drivers/dri/i965/brw_context.h
> index 00e9224d7d7..186ce826801 100644
> --- a/src/mesa/drivers/dri/i965/brw_context.h
> +++ b/src/mesa/drivers/dri/i965/brw_context.h
> @@ -477,9 +477,21 @@ struct intel_batchbuffer {
>     bool needs_sol_reset;
>     bool state_base_address_emitted;
>
> +   struct drm_i915_gem_relocation_entry *relocs;
> +   int reloc_count;
> +   int reloc_array_size;
> +   /** The validation list */
> +   struct drm_i915_gem_exec_object2 *exec_objects;
> +   drm_bacon_bo **exec_bos;
> +   int exec_count;
> +   int exec_array_size;
> +   /** The amount of aperture space (in bytes) used by all exec_bos */
> +   int aperture_space;
You are not using the aperture (that is the CPU-addressable BAR). This is
just the batch_size, and not to be confused with the RSS that you may also
want to track in future.

> +
>     struct {
>        uint32_t *map_next;
>        int reloc_count;
> +      int exec_count;
>     } saved;
>
>     /** Map from batch offset to brw_state_batch data (with DEBUG_BATCH) */
> diff --git a/src/mesa/drivers/dri/i965/brw_draw.c b/src/mesa/drivers/dri/i965/brw_draw.c
> index a6e229f2210..bf09915d0c3 100644
> --- a/src/mesa/drivers/dri/i965/brw_draw.c
> +++ b/src/mesa/drivers/dri/i965/brw_draw.c
> @@ -601,7 +601,7 @@ retry:
>
>     brw->no_batch_wrap = false;
>
> -   if (drm_bacon_bufmgr_check_aperture_space(&brw->batch.bo, 1)) {
> +   if (!brw_batch_has_aperture_space(brw, 0)) {
>        if (!fail_next) {
>           intel_batchbuffer_reset_to_saved(brw);
>           intel_batchbuffer_flush(brw);
> diff --git a/src/mesa/drivers/dri/i965/genX_blorp_exec.c b/src/mesa/drivers/dri/i965/genX_blorp_exec.c
> index dc1c1ea2993..e1a799457a2 100644
> --- a/src/mesa/drivers/dri/i965/genX_blorp_exec.c
> +++ b/src/mesa/drivers/dri/i965/genX_blorp_exec.c
> @@ -231,7 +231,7 @@ retry:
>      * map all the BOs into the GPU at batch exec time later.  If so, flush the
>      * batch and try again with nothing else in the batch.
>      */
> -   if (drm_bacon_bufmgr_check_aperture_space(&brw->batch.bo, 1)) {
> +   if (!brw_batch_has_aperture_space(brw, 0)) {
>        if (!check_aperture_failed_once) {
>           check_aperture_failed_once = true;
>           intel_batchbuffer_reset_to_saved(brw);
> diff --git a/src/mesa/drivers/dri/i965/intel_batchbuffer.c b/src/mesa/drivers/dri/i965/intel_batchbuffer.c
> index 59ab55d31f7..444e33c48ee 100644
> --- a/src/mesa/drivers/dri/i965/intel_batchbuffer.c
> +++ b/src/mesa/drivers/dri/i965/intel_batchbuffer.c
> @@ -38,6 +38,8 @@
>  #include <xf86drm.h>
>  #include <i915_drm.h>
>
> +#define FILE_DEBUG_FLAG DEBUG_BUFMGR
> +
>  static void
>  intel_batchbuffer_reset(struct intel_batchbuffer *batch,
>                          drm_bacon_bufmgr *bufmgr,
> @@ -68,6 +70,17 @@ intel_batchbuffer_init(struct intel_batchbuffer *batch,
>        batch->map_next = batch->cpu_map;
>     }
>
> +   batch->reloc_count = 0;
> +   batch->reloc_array_size = 768;

Wowser!

> +   batch->relocs = malloc(batch->reloc_array_size *
> +                          sizeof(struct drm_i915_gem_relocation_entry));

Allocation failure handling? Deferred to use? Even so, you will need to
record that you didn't allocate the array_size.

>  void
>  intel_batchbuffer_reset_to_saved(struct brw_context *brw)
>  {
> -   drm_bacon_gem_bo_clear_relocs(brw->batch.bo, brw->batch.saved.reloc_count);
> +   for (int i = brw->batch.saved.exec_count;
> +        i < brw->batch.exec_count; i++) {
> +      if (brw->batch.exec_bos[i] != brw->batch.bo) {
> +         drm_bacon_bo_unreference(brw->batch.exec_bos[i]);
> +      }

There's a trick to add to your todo list - treat being busy as an implicit
reference.
> +static void
> +add_exec_bo(struct intel_batchbuffer *batch, drm_bacon_bo *bo)
> +{
> +   if (bo != batch->bo) {
> +      for (int i = 0; i < batch->exec_count; i++) {
> +         if (batch->exec_bos[i] == bo)
> +            return;
> +      }
> +
> +      drm_bacon_bo_reference(bo);
> +   }
> +
> +   if (batch->exec_count == batch->exec_array_size) {
> +      batch->exec_array_size *= 2;
> +      batch->exec_bos =
> +         realloc(batch->exec_bos,
> +                 batch->exec_array_size * sizeof(batch->exec_bos[0]));
> +      batch->exec_objects =
> +         realloc(batch->exec_objects,
> +                 batch->exec_array_size * sizeof(batch->exec_objects[0]));

This has to be robust! What I suggested was to install a setjmp during
batch_begin so that you could unwind and escape after an allocation failure
and have it reported back to GL. Those patches to refactor the aperture
checking / error checking around batch buffer handling can be applied to
this series.

> +   }
> +
> +   struct drm_i915_gem_exec_object2 *validation_entry =
> +      &batch->exec_objects[batch->exec_count];
> +   validation_entry->handle = bo->handle;
> +   if (bo == batch->bo) {
> +      validation_entry->relocation_count = batch->reloc_count;
> +      validation_entry->relocs_ptr = (uintptr_t) batch->relocs;
> +   } else {
> +      validation_entry->relocation_count = 0;
> +      validation_entry->relocs_ptr = 0;
> +   }
> +   validation_entry->alignment = bo->align;
> +   validation_entry->offset = bo->offset64;
> +   validation_entry->flags = 0;
> +   validation_entry->rsvd1 = 0;
> +   validation_entry->rsvd2 = 0;
> +
> +   batch->exec_bos[batch->exec_count] = bo;
> +   batch->exec_count++;
> +   batch->aperture_space += bo->size;
> +}
> +
> +static int
> +execbuffer(int fd,
> +           struct intel_batchbuffer *batch,
> +           drm_bacon_context *ctx,
> +           int used,
> +           int in_fence,
> +           int *out_fence,
> +           int flags)
> +{
> +   struct drm_i915_gem_execbuffer2 execbuf = {
> +      .buffers_ptr = (uintptr_t) batch->exec_objects,
> +      .buffer_count = batch->exec_count,
> +      .batch_start_offset = 0,
> +      .batch_len = used,
> +      .cliprects_ptr = 0,
> +      .num_cliprects = 0,
> +      .DR1 = 0,
> +      .DR4 = 0,

Just don't mention the garbage.

> +      .flags = flags,
> +      .rsvd2 = 0,
> +   };
> +
> +   uint32_t ctx_id = 0;
> +   drm_bacon_gem_context_get_id(ctx, &ctx_id);
> +   i915_execbuffer2_set_context_id(execbuf, ctx_id);

Urgh.

> +
> +   if (in_fence != -1) {
> +      execbuf.rsvd2 = in_fence;
> +      execbuf.flags |= I915_EXEC_FENCE_IN;
> +   }
> +
> +   if (out_fence != NULL) {
> +      *out_fence = -1;
> +      execbuf.flags |= I915_EXEC_FENCE_OUT;
> +   }
> +
> +   int ret = drmIoctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2_WR, &execbuf);

You could improve this by

	cmd = DRM_IOCTL_I915_GEM_EXECBUFFER2;
	if (out_fence)
		cmd |= _IOC_READ << _IOC_DIRSHIFT;

or

	cmd = DRM_IOCTL_I915_GEM_EXECBUFFER2_WR;

> +   if (ret != 0)
> +      ret = -errno;
> +
> +   for (int i = 0; i < batch->exec_count; i++) {
> +      drm_bacon_bo *bo = batch->exec_bos[i];
> +
> +      bo->idle = false;
> +
> +      /* Update drm_bacon_bo::offset64 */
> +      if (batch->exec_objects[i].offset != bo->offset64) {
> +         DBG("BO %d migrated: 0x%" PRIx64 " -> 0x%llx\n",
> +             bo->handle, bo->offset64, batch->exec_objects[i].offset);
> +         bo->offset64 = batch->exec_objects[i].offset;
> +      }
> +   }
> +
> +   if (ret == 0 && out_fence != NULL)
> +      *out_fence = execbuf.rsvd2 >> 32;
> +
> +   return ret;
> +}
> +
>  static int
>  do_flush_locked(struct brw_context *brw, int in_fence_fd, int *out_fence_fd)
>  {
> +   __DRIscreen *dri_screen = brw->screen->driScrnPriv;
>     struct intel_batchbuffer *batch = &brw->batch;
>     int ret = 0;
>
> @@ -484,17 +625,18 @@ do_flush_locked(struct brw_context *brw, int in_fence_fd, int *out_fence_fd)
>        flags |= I915_EXEC_GEN7_SOL_RESET;
>
>     if (ret == 0) {
> -      if (brw->hw_ctx == NULL || batch->ring != RENDER_RING) {
> +      void *hw_ctx = batch->ring != RENDER_RING ? NULL : brw->hw_ctx;
> +      if (hw_ctx == NULL) {
>           assert(in_fence_fd == -1);
>           assert(out_fence_fd == NULL);

These assertions are no longer relevant; they only applied to the
libdrm_intel API.
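The command-number trick suggested above works because `DRM_IOCTL_I915_GEM_EXECBUFFER2` and its `_WR` variant share the same ioctl number and differ only in the encoded direction: the base is write-only, while `_WR` also sets the read bit so the kernel copies the struct (and thus the output fence in `rsvd2`) back to userspace. A standalone sketch of that arithmetic — the encoding macros mirror Linux's `<asm-generic/ioctl.h>` conventions, and the type/nr/size values here are illustrative stand-ins, not the real DRM numbers:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified ioctl number encoding, following Linux <asm-generic/ioctl.h>:
 * [dir:2][size:14][type:8][nr:8], low bits first. */
#define _IOC_NRBITS    8
#define _IOC_TYPEBITS  8
#define _IOC_SIZEBITS 14
#define _IOC_NRSHIFT   0
#define _IOC_TYPESHIFT (_IOC_NRSHIFT + _IOC_NRBITS)
#define _IOC_SIZESHIFT (_IOC_TYPESHIFT + _IOC_TYPEBITS)
#define _IOC_DIRSHIFT  (_IOC_SIZESHIFT + _IOC_SIZEBITS)
#define _IOC_WRITE 1U
#define _IOC_READ  2U
#define _IOC(dir, type, nr, size) \
   (((dir) << _IOC_DIRSHIFT) | ((type) << _IOC_TYPESHIFT) | \
    ((nr) << _IOC_NRSHIFT) | ((size) << _IOC_SIZESHIFT))

/* Hypothetical stand-ins for the DRM definitions: same nr and size,
 * only the direction bits differ (write-only vs. read+write). */
#define EXECBUFFER2     _IOC(_IOC_WRITE, 'd', 0x69, 64)
#define EXECBUFFER2_WR  _IOC(_IOC_READ | _IOC_WRITE, 'd', 0x69, 64)

/* Pick the command: only request the copy-back variant when the caller
 * actually wants an output fence returned. */
static uint32_t
execbuf_cmd(bool want_out_fence)
{
   uint32_t cmd = EXECBUFFER2;
   if (want_out_fence)
      cmd |= _IOC_READ << _IOC_DIRSHIFT;   /* promote to the _WR variant */
   return cmd;
}
```

The benefit of the conditional form is that the (slightly more expensive) copy-back from the kernel only happens when an out-fence was requested; always using the `_WR` command is the simpler but unconditional alternative.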
> -      ret = drm_bacon_bo_mrb_exec(batch->bo, 4 * USED_BATCH(*batch),
> -                                  flags);
> -   } else {
> -      ret = drm_bacon_gem_bo_fence_exec(batch->bo, brw->hw_ctx,
> -                                        4 * USED_BATCH(*batch),
> -                                        in_fence_fd, out_fence_fd,
> -                                        flags);
> -   }
> +      }
> +
> +      /* Add the batch itself to the end of the validation list */
> +      add_exec_bo(batch, batch->bo);
> +
> +      ret = execbuffer(dri_screen->fd, batch, hw_ctx,
> +                       4 * USED_BATCH(*batch),
> +                       in_fence_fd, out_fence_fd, flags);
>     }
>
>     throttle(brw);
> @@ -577,9 +719,20 @@ _intel_batchbuffer_flush_fence(struct brw_context *brw,
>  }
>
>  bool
> +brw_batch_has_aperture_space(struct brw_context *brw, unsigned extra_space)
> +{
> +   return brw->batch.aperture_space + extra_space <=
> +          brw->screen->aperture_threshold;
> +}
> +
> +bool
>  brw_batch_references(struct intel_batchbuffer *batch, drm_bacon_bo *bo)
>  {
> -   return drm_bacon_bo_references(batch->bo, bo);
> +   for (int i = 0; i < batch->exec_count; i++) {
> +      if (batch->exec_bos[i] == bo)
> +         return true;
> +   }
> +   return false;
>  }
>
>  /* This is the only way buffers get added to the validate list.
> @@ -589,13 +742,31 @@ brw_emit_reloc(struct intel_batchbuffer *batch, uint32_t batch_offset,
>                 drm_bacon_bo *target, uint32_t target_offset,
>                 uint32_t read_domains, uint32_t write_domain)
>  {
> -   int ret;
> +   if (batch->reloc_count == batch->reloc_array_size) {
> +      batch->reloc_array_size *= 2;
> +      batch->relocs = realloc(batch->relocs,
> +                              batch->reloc_array_size *
> +                              sizeof(struct drm_i915_gem_relocation_entry));

See above for ideas on handling malloc failures.
> +   }
> +
> +   /* Check args */
> +   assert(batch_offset <= BATCH_SZ - sizeof(uint32_t));
> +   assert(_mesa_bitcount(write_domain) <= 1);
> +
> +   if (target != batch->bo)
> +      add_exec_bo(batch, target);
> +
> +   struct drm_i915_gem_relocation_entry *reloc =
> +      &batch->relocs[batch->reloc_count];
> +
> +   batch->reloc_count++;
>
> -   ret = drm_bacon_bo_emit_reloc(batch->bo, batch_offset,
> -                                 target, target_offset,
> -                                 read_domains, write_domain);
> -   assert(ret == 0);
> -   (void)ret;
> +   reloc->offset = batch_offset;
> +   reloc->delta = target_offset;
> +   reloc->target_handle = target->handle;
> +   reloc->read_domains = read_domains;
> +   reloc->write_domain = write_domain;
> +   reloc->presumed_offset = target->offset64;

Hmm, this is the wrong offset (it's global, not per-context). This is not
the final version of relocation handling...
-Chris
-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev