https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82591

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2017-10-18
                 CC|                            |skimo at kotnet dot org,
                   |                            |spop at gcc dot gnu.org
   Target Milestone|---                         |8.0
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
With GCC 7 we have zero SCOPs detected, with GCC 8 we detect one SCOP.  What
takes all the time is code generation(!), namely
isl_ast_build_node_from_schedule.  The
schedule isn't that complicated:

domain: "[P_32] -> { S_6[i1, i2] : 0 <= P_32 <= 65535 and 0 <= i1 <= 127 and 0
<= i2 <= 2 and 65536*floor((-513 + P_32 - 1539i1 - 513i2)/65536) >= -514 + P_32
- 1539i1 - 513i2 }"
child:
  schedule: "[P_32] -> [{ S_6[i1, i2] -> [(floor((i1)/51))] }, { S_6[i1, i2] ->
[(floor((i2)/51))] }]"
  permutable: 1
  coincident: [ 1, 1 ]
  options: "{ separate[0] }"
  child:
    schedule: "[P_32] -> [{ S_6[i1, i2] -> [(i1 - 51*floor((i1)/51))] }, {
S_6[i1, i2] -> [(i2 - 51*floor((i2)/51))] }]"
    permutable: 1
    coincident: [ 1, 1 ]
    options: "{ separate[0] }"

note there's no limiting in place on the AST generator.

I don't immediately see what we can do about this - it looks like a bug in ISL
to me.  Sven?  It looks like the above might be enough to turn this into a
ISL testcase for test_inputs/codegen?  The separate option is not necessary to
trigger the problem.  The AST generated is

for (int c0 = 0; c0 <= 2; c0 += 1)
  for (int c2 = 0; c2 <= min(50, -51 * c0 + 127); c2 += 1)
    for (int c3 = 0; c3 <= 2; c3 += 1)
      if ((-P_32 + 78489 * c0 + 1539 * c2 + 513 * c3 + 66048) % 65536 >= 65534)
        S_6(51 * c0 + c2, c3);

The AST generated for the original schedule (before ISL optimizing and
blocking) is

for (int c0 = 0; c0 <= 127; c0 += 1)
  for (int c1 = 0; c1 <= 2; c1 += 1)
    if ((-P_32 + 1539 * c0 + 513 * c1 + 66048) % 65536 >= 65534)
      S_6(c0, c1);

and the corresponding schedule is

domain: "[P_32] -> { S_6[i1, i2] : 0 <= P_32 <= 65535 and 0 <= i1 <= 127 and 0
<= i2 <= 2 and 65536*floor((-513 + P_32 - 1539i1 - 513i2)/65536) >= -514 + P_32
- 1539i1 - 513i2 }"
child:
  schedule: "[P_32] -> L_1[{ S_6[i1, i2] -> [(i1)] }]"
  child:
    schedule: "[P_32] -> L_2[{ S_6[i1, i2] -> [(i2)] }]"

On the GCC side we can avoid the slowness with --param loop-block-tile-size=0
thus turn off forced blocking (we're doing quite stupid blocking by 51 even on
loops that we know iterate less...).

Our tiling code is doing

  long tile_size = PARAM_VALUE (PARAM_LOOP_BLOCK_TILE_SIZE);
  if (dims <= 1
      || tile_size == 0
      || !isl_schedule_node_band_get_permutable (node))
    {
      if (dump_file && dump_flags)
        fprintf (dump_file, "not tiled\n");
      return node;
    }

  /* Tile loops.  */
  space = isl_schedule_node_band_get_space (node);
  isl_multi_val *sizes = isl_multi_val_zero (space);
  isl_ctx *ctx = isl_schedule_node_get_ctx (node);

  for (unsigned i = 0; i < dims; i++)
    {
      sizes = isl_multi_val_set_val (sizes, i,
                                     isl_val_int_from_si (ctx, tile_size));
      if (dump_file && dump_flags)
        fprintf (dump_file, "tiled by %ld\n", tile_size);
    }

  node = isl_schedule_node_band_tile (node, sizes);
  node = isl_schedule_node_child (node, 0);

any hint where we can avoid tiling with obviously too big tiles?

Reply via email to