On 26/11/2017 12:20, Chris Wilson wrote:
Looking at the distribution of i915_wait_request for a set of GL
benchmarks, we see:

broadwell# python bcc/tools/funclatency.py -u i915_wait_request
    usecs               : count     distribution
        0 -> 1          : 29184    |****************************************|
        2 -> 3          : 5767     |*******                                 |
        4 -> 7          : 3000     |****                                    |
        8 -> 15         : 491      |                                        |
       16 -> 31         : 140      |                                        |
       32 -> 63         : 203      |                                        |
       64 -> 127        : 543      |                                        |
      128 -> 255        : 881      |*                                       |
      256 -> 511        : 1209     |*                                       |
      512 -> 1023       : 1739     |**                                      |
     1024 -> 2047       : 22855    |*******************************         |
     2048 -> 4095       : 1725     |**                                      |
     4096 -> 8191       : 5813     |*******                                 |
     8192 -> 16383      : 5348     |*******                                 |
    16384 -> 32767      : 1000     |*                                       |
    32768 -> 65535      : 4400     |******                                  |
    65536 -> 131071     : 296      |                                        |
   131072 -> 262143     : 225      |                                        |
   262144 -> 524287     : 4        |                                        |
   524288 -> 1048575    : 1        |                                        |
  1048576 -> 2097151    : 1        |                                        |
  2097152 -> 4194303    : 1        |                                        |

broxton# python bcc/tools/funclatency.py -u i915_wait_request
    usecs               : count     distribution
        0 -> 1          : 5523     |*************************************   |
        2 -> 3          : 1340     |*********                               |
        4 -> 7          : 2100     |**************                          |
        8 -> 15         : 755      |*****                                   |
       16 -> 31         : 211      |*                                       |
       32 -> 63         : 53       |                                        |
       64 -> 127        : 71       |                                        |
      128 -> 255        : 113      |                                        |
      256 -> 511        : 262      |*                                       |
      512 -> 1023       : 358      |**                                      |
     1024 -> 2047       : 1105     |*******                                 |
     2048 -> 4095       : 848      |*****                                   |
     4096 -> 8191       : 1295     |********                                |
     8192 -> 16383      : 5894     |****************************************|
    16384 -> 32767      : 4270     |****************************            |
    32768 -> 65535      : 5622     |**************************************  |
    65536 -> 131071     : 306      |**                                      |
   131072 -> 262143     : 50       |                                        |
   262144 -> 524287     : 76       |                                        |
   524288 -> 1048575    : 34       |                                        |
  1048576 -> 2097151    : 0        |                                        |
  2097152 -> 4194303    : 1        |                                        |

Picking 20us for the context-switch busy-spin has the dual advantage of
catching the most frequent short waits while avoiding the cost of a
context switch. 20us is a typical latency of two context switches, i.e.
the cost of taking the sleep, without the secondary effects of cache
flushing.

The next thing I wanted to ask about is the cumulative time spent spinning versus the test duration, or in other words, CPU usage before and after the change.

And of course, was the benefit measurable in the benchmark results, by how much, and what do the performance-per-Watt numbers say?

Regards,

Tvrtko

Signed-off-by: Chris Wilson <[email protected]>
Cc: Sagar Kamble <[email protected]>
Cc: Eero Tamminen <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Cc: Ben Widawsky <[email protected]>
Cc: Joonas Lahtinen <[email protected]>
Cc: Michał Winiarski <[email protected]>
---
  drivers/gpu/drm/i915/Kconfig.profile | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
index a1aed0e2aad5..c8fe5754466c 100644
--- a/drivers/gpu/drm/i915/Kconfig.profile
+++ b/drivers/gpu/drm/i915/Kconfig.profile
@@ -11,7 +11,7 @@ config DRM_I915_SPIN_REQUEST_IRQ
config DRM_I915_SPIN_REQUEST_CS
        int
-       default 2 # microseconds
+       default 20 # microseconds
        help
          After sleeping for a request (GPU operation) to complete, we will
          be woken up on the completion of every request prior to the one

_______________________________________________
Intel-gfx mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
