How to efficiently implement builtins for dual-result instructions?

Dmitry Cheresiz Mon, 04 Feb 2008 02:14:56 -0800

Hi,

I am implementing a GCC backend for a target architecture that has
assembly instructions writing two result registers. I am having
difficulty implementing builtins for such instructions efficiently.


For example, the "super-load" instruction has the form
    super_ld32 rA -> rX, rY
This operation retrieves two consecutive 32-bit values from the address
given by register rA and writes them to the two 32-bit registers rX and
rY. The registers rX and rY need not be consecutive.

To invoke this instruction from the source level, a compiler builtin is
provided. Since C has no functions with two results, the builtin
returns them through pointers:
    __super_ld32(int *x, int *y, int *a)

For example, let sampleC1, sampleC2, and currFrame be local variables. Then
    __super_ld32(&sampleC1, &sampleC2, &currFrame[index_xy]);
means that the two values loaded from the address &currFrame[index_xy]
are assigned to the variables sampleC1 and sampleC2.
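
The intended semantics of the call above can be modelled in portable C.
This is only a sketch: super_ld32_model and demo_sum are hypothetical
names, not part of any GCC port, and the model assumes that int is 32
bits wide and that the first loaded value goes to x and the second to y.

```c
#include <assert.h>

/* Hypothetical reference model of __super_ld32 (not the real builtin):
   read two consecutive 32-bit values starting at a and deliver them
   through the result pointers x and y.  Assumes a 32-bit int. */
static void super_ld32_model(int *x, int *y, const int *a)
{
    /* Read both values before writing, so the model stays correct
       even if x or y happens to alias the source location. */
    int first = a[0];   /* value destined for rX */
    int second = a[1];  /* value destined for rY */
    *x = first;
    *y = second;
}

/* Usage mirroring the call from the text. */
static int demo_sum(const int *currFrame, int index_xy)
{
    int sampleC1, sampleC2;
    super_ld32_model(&sampleC1, &sampleC2, &currFrame[index_xy]);
    assert(sampleC1 == currFrame[index_xy]);
    assert(sampleC2 == currFrame[index_xy + 1]);
    return sampleC1 + sampleC2;
}
```

Note that sampleC1 and sampleC2 have their addresses taken here, just
as with the real builtin.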

I expand this builtin as follows. First, I generate two new pseudo
registers and an RTL insn that sets them to the results of the
super-load. This insn is matched by the following definition:

;; Operands 0 and 1 receive the two results; operand 2 holds the address.
(define_insn "customop_super_ld32"
  [(set (match_operand:SI 0 "register_operand" "=r")
        (unspec:SI [(match_operand:SI 2 "register_operand" "r")]
                   UNSPEC_customop_super_ld32))
   (set (match_operand:SI 1 "register_operand" "=r")
        (unspec:SI [(match_dup 2)]
                   UNSPEC_customop_super_ld32_2))]
  ""
  "super_ld32 %2 -> %0 %1"
)

To generate correct code, the builtin has to be expanded to a
semantically equivalent sequence of RTL insns. Therefore, I also
generate two store insns, which write the generated pseudos to the
addresses given by the x and y parameters of the builtin.

Compiling the code
    __super_ld32(&sampleC1, &sampleC2, &currFrame[index_xy]);
I observed that when the variable sampleC1 is used relatively far away
from its definition (e.g. in a different basic block), GCC is not able
to determine that its value is still held in the destination register
of the super_ld32, although that is the case. Instead, GCC reloads the
variable from the stack. When the use is close to the definition, on
the other hand, GCC avoids the load. In that case the store generated
during the builtin expansion is also often eliminated by the dse pass,
resulting in efficient code.

I would like to achieve such efficient code generation in more complex
cases as well. I would appreciate it if somebody could suggest a GCC
mechanism that is useful for this, or comment on the following
approaches I am currently considering.

 Approach A.
    Replace the builtin that returns two results through pointers
    with two builtins that each have a single result:
        sampleC1 = __super_ld32_part1(&currFrame[index_xy]);
        sampleC2 = __super_ld32_part2(&currFrame[index_xy], sampleC1);

    Each of these two builtins can be expanded to a single unspec RTL
    insn. I force __super_ld32_part2 to use sampleC1 in order to create
    a dependency and to be able to identify that the two form a pair.
    At a later stage, I would like to identify such insn pairs and
    replace each pair with a single RTL insn, which should eventually
    produce the desired super_ld32 rA -> rX rY insn. I wonder whether
    the combine pass would be able to do this, or whether it is better
    to implement it manually, for example in pass_final?

 Approach B.
    Attach REG_EQUIV notes to the destination registers of the RTL for
    customop_super_ld32, hoping that this will help the optimization
    passes realize that these registers contain the values of the
    variables whose addresses are given to the builtin, and let these
    passes remove the unnecessary ld/st insns. I see, however, that
    such notes are supposed to be applied only to insns that have a
    single destination register.
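
At the source level, the interface proposed in Approach A could be
modelled as below. This is only a sketch of the semantics: the _model
functions and demo_pair are hypothetical stand-ins, and the real
builtins would each expand to an unspec insn rather than to an ordinary
load. The model assumes that part1 yields the first loaded value and
part2 the second.

```c
#include <assert.h>

/* Hypothetical model of __super_ld32_part1: yields the first value. */
static int super_ld32_part1_model(const int *a)
{
    return a[0];
}

/* Hypothetical model of __super_ld32_part2: yields the second value.
   part1_result does not affect the value returned; the parameter
   exists only to make part2 depend on part1, so that the two calls
   can later be recognized as a pair. */
static int super_ld32_part2_model(const int *a, int part1_result)
{
    (void)part1_result;
    return a[1];
}

/* Usage mirroring the two assignments from Approach A. */
static int demo_pair(const int *currFrame, int index_xy)
{
    int sampleC1 = super_ld32_part1_model(&currFrame[index_xy]);
    int sampleC2 = super_ld32_part2_model(&currFrame[index_xy], sampleC1);
    assert(sampleC1 == currFrame[index_xy]);
    return sampleC1 * 100 + sampleC2;
}
```

With this interface the addresses of sampleC1 and sampleC2 are never
taken, so neither variable has to be treated as address-taken.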
