I'm working on implementing hardware loops for the CORE-V CV32E40P https://docs.openhwgroup.org/projects/cv32e40p-user-manual/en/latest/corev_hw_loop.html
This core supports nested hardware loops, but does not allow any other flow control inside hardware loops. I found that our existing interfaces do not allow sufficient control over when to emit doloop patterns, i.e. allowing nested doloops while rejecting other flow control inside the loop.

TARGET_CAN_USE_DOLOOP_P does not get passed anything to look at the individual loop. Most convenient would be the loop structure, although that would cause tight coupling of the target port with the internal data structures of the loop optimizers. OTOH we already have a precedent with TARGET_PREDICT_DOLOOP_P.

TARGET_INVALID_WITHIN_DOLOOP is missing context. We know neither the loop nesting depth, nor whether a jump instruction under consideration is the final branch to jump back to the loop latch. Actually, the second part is the main problem for the CV32E40P: inner doloops that have been transformed can be recognized as such, but an un-transformed condjump could be either spaghetti code inside the loop or the final jump instruction of the loop.

The doloop_end pattern is also missing context to make meaningful decisions. Although we know the label where the pattern is supposed to jump to, we don't know where the original branch is. Even if we scan the insn stream, this is ambiguous, since there can be two (or more) nested doloop candidates.

What we could do here is add optional arguments; there is precedent, e.g. for the call pattern. The advantage of this approach is that ports that are fine with the current interface need not be patched. To make it possible to scrutinize the control flow of the loop, the branch at the end of the loop makes a good optional argument.

There is also the issue that loop setup is a bit more costly for large loops, and it would be nice to weigh that against the iteration count. We had information about the iteration count at TARGET_CAN_USE_DOLOOP_P, but nothing to allow us to analyze the loop body.
Although the port could stash away the iteration count in a global variable or machine_function member, it would be more straightforward and robust to pass the information together so that it can be considered in context.

Attached is a patch for an optional 3rd parameter to doloop_end.
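To illustrate the kind of decision a port could make once it is handed the final branch, here is a minimal standalone sketch. It is a toy model, not GCC code: InsnKind, Insn, body_ok_for_doloop, doloop_profitable and the cost parameters are all hypothetical names invented for this illustration, and the loop body is simplified to a flat vector whose final branch is identified by index.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-ins for insn kinds; these are NOT GCC data structures.
enum class InsnKind { Other, CondJump, Jump, InnerDoloopEnd };

struct Insn
{
  InsnKind kind;
};

// Hypothetical check: with the loop's final branch known, every other
// jump in the body must be an already-transformed inner doloop.  Any
// remaining plain jump or condjump is flow control the core cannot
// handle inside a hardware loop.
bool
body_ok_for_doloop (const std::vector<Insn> &body, std::size_t final_branch)
{
  assert (final_branch < body.size ());
  for (std::size_t i = 0; i < body.size (); ++i)
    {
      if (i == final_branch)
        continue; // The branch that closes this loop is expected.
      InsnKind k = body[i].kind;
      if (k == InsnKind::CondJump || k == InsnKind::Jump)
        return false; // Other flow control inside the loop.
    }
  return true;
}

// Hypothetical cost check: setup_cost and saving_per_iter are made-up
// units, only meant to show weighing loop setup against iteration count.
bool
doloop_profitable (unsigned long long iterations, unsigned setup_cost,
                   unsigned saving_per_iter)
{
  return iterations * saving_per_iter > setup_cost;
}
```

In this model, a body consisting of ordinary insns, a transformed inner doloop and the final condjump is accepted, while a body with an extra condjump before the final branch is rejected. Without knowing which branch is the final one, the two condjumps in the second case could not be told apart.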
2023-10-05  Joern Rennecke  <joern.renne...@embecosm.com>

gcc/
        * doc/md.texi (doloop_end): Document optional operand 2.
        * loop-doloop.cc (doloop_optimize): Provide 3rd operand to
        gen_doloop_end.
        * target-insns.def (doloop_end): Add optional 3rd operand.

diff --git a/gcc/config.gcc b/gcc/config.gcc
index ee46d96bf62..ba42ac3d425 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -544,7 +544,7 @@ riscv*)
 	extra_objs="riscv-builtins.o riscv-c.o riscv-sr.o riscv-shorten-memrefs.o riscv-selftests.o riscv-string.o"
 	extra_objs="${extra_objs} riscv-v.o riscv-vsetvl.o riscv-vector-costs.o"
 	extra_objs="${extra_objs} riscv-vector-builtins.o riscv-vector-builtins-shapes.o riscv-vector-builtins-bases.o"
-	extra_objs="${extra_objs} thead.o"
+	extra_objs="${extra_objs} thead.o corev.o"
 	d_target_objs="riscv-d.o"
 	extra_headers="riscv_vector.h"
 	target_gtfiles="$target_gtfiles \$(srcdir)/config/riscv/riscv-vector-builtins.cc"
diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
index 4feb0a25ab9..d703cb5f2af 100644
--- a/gcc/loop-doloop.cc
+++ b/gcc/loop-doloop.cc
@@ -720,7 +720,8 @@ doloop_optimize (class loop *loop)
   count = copy_rtx (desc->niter_expr);
   start_label = block_label (desc->in_edge->dest);
   doloop_reg = gen_reg_rtx (mode);
-  rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
+  rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label,
+						 BB_END (desc->in_edge->src));
 
   word_mode_size = GET_MODE_PRECISION (word_mode);
   word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) - 1;
@@ -737,7 +738,8 @@ doloop_optimize (class loop *loop)
       else
 	count = lowpart_subreg (word_mode, count, mode);
       PUT_MODE (doloop_reg, word_mode);
-      doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
+      doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label,
+					   BB_END (desc->in_edge->src));
     }
   if (! doloop_seq)
     {
diff --git a/gcc/target-insns.def b/gcc/target-insns.def
index c4415d00735..962c5cc51d1 100644
--- a/gcc/target-insns.def
+++ b/gcc/target-insns.def
@@ -48,7 +48,7 @@ DEF_TARGET_INSN (casesi, (rtx x0, rtx x1, rtx x2, rtx x3, rtx x4))
 DEF_TARGET_INSN (check_stack, (rtx x0))
 DEF_TARGET_INSN (clear_cache, (rtx x0, rtx x1))
 DEF_TARGET_INSN (doloop_begin, (rtx x0, rtx x1))
-DEF_TARGET_INSN (doloop_end, (rtx x0, rtx x1))
+DEF_TARGET_INSN (doloop_end, (rtx x0, rtx x1, rtx opt2))
 DEF_TARGET_INSN (eh_return, (rtx x0))
 DEF_TARGET_INSN (epilogue, (void))
 DEF_TARGET_INSN (exception_receiver, (void))