Hi,

I am implementing a GCC backend for a target architecture that has assembly instructions which write two result registers. I am having difficulty implementing builtins for such instructions efficiently.

For example, the "super-load" instruction has the form

    super_ld32 rA -> rX, rY

This operation retrieves two consecutive 32-bit values from the address given by register rA and writes them to the two 32-bit registers rX and rY. The registers rX and rY need not be consecutive.

To invoke this instruction from the source level, a compiler builtin is provided. Since C syntax doesn't provide functions with two results, this builtin refers to them via pointers:

    __super_ld32(int *x, int *y, int *a)

For example, let sampleC1, sampleC2, and currFrame be local variables. Then

    __super_ld32(&sampleC1, &sampleC2, &currFrame[index_xy]);

means that the result of a super load from address &currFrame[index_xy] should be assigned to the variables sampleC1 and sampleC2.

I expand this builtin as follows. First, I generate two new pseudo regs and an RTL insn which assigns the results of the super load to them. This insn is matched by the following definition:

    (define_insn "customop_super_ld32"
      [(set (match_operand:SI 0 "register_operand" "=r")
            (unspec:SI [(match_operand:SI 2 "register_operand" "r")]
                       UNSPEC_customop_super_ld32))
       (set (match_operand:SI 1 "register_operand" "=r")
            (unspec:SI [(match_dup 2)]
                       UNSPEC_customop_super_ld32_2))]
      ""
      "super_ld32 %2 -> %0 %1")

To generate correct code, the builtin has to be expanded to a semantically equivalent sequence of RTL insns. Therefore, I also generate two store insns, which write the generated pseudos to the addresses given by the x and y parameters of the builtin.

Compiling the code

    __super_ld32(&sampleC1, &sampleC2, &currFrame[index_xy]);

I have observed that when the variable sampleC1 is used relatively far away from its definition (e.g. in a different basic block), GCC is not able to determine that its value is still held in the destination register of the super_ld32, although that is the case. Instead, GCC reloads the variable from the stack.

On the other hand, when the use is close to the definition, GCC avoids the load. In that case the store generated during the builtin expansion is also often eliminated by the dse pass, resulting in efficient code.

I would like to achieve such efficient code generation in more complex cases as well. I would appreciate it if somebody could suggest a mechanism in GCC that would be useful for this, or comment on the following approaches I am currently considering.

Approach A. Replace the builtin that returns two results through pointers with two builtins that each return a single result:

    sampleC1 = __super_ld32_part1(&currFrame[index_xy]);
    sampleC2 = __super_ld32_part2(&currFrame[index_xy], sampleC1);

Each of these two builtins can be expanded to a single unspec RTL insn. I force __super_ld32_part2 to take sampleC1 as an argument in order to create a dependency between the two and to be able to identify that they form a pair. At a later stage, I would like to identify such insn pairs and replace each pair with a single RTL insn, which should eventually produce the desired super_ld32 rA -> rX rY instruction. I wonder whether the combine pass would be able to do this, or whether it is better to implement it manually, for example in pass_final.

Approach B. Attach REG_EQUIV notes to the destination registers of the RTL for customop_super_ld32, hoping that this will help the optimization passes realize that these regs contain the values of the variables whose addresses are given to the builtin, and let these passes remove the unnecessary ld/st insns. I see, however, that such notes are supposed to be applied only to insns which have a single destination register.
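
To make the intended equivalence between the pointer-based builtin and the Approach A pair concrete, here is a minimal sketch in portable C. The model_* and check_models_agree names are hypothetical stand-ins invented for illustration; they only model the semantics described above and are not the real builtins or their expansion.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of __super_ld32(x, y, a): two consecutive 32-bit
   loads from a, delivered through pointers. */
void model_super_ld32(int32_t *x, int32_t *y, const int32_t *a)
{
    *x = a[0];
    *y = a[1];
}

/* Hypothetical model of __super_ld32_part1: first word at the address. */
int32_t model_super_ld32_part1(const int32_t *a)
{
    return a[0];
}

/* Hypothetical model of __super_ld32_part2: second word at the address.
   The value of first is not needed to compute the result; the parameter
   exists only to tie this call to the matching part1 call. */
int32_t model_super_ld32_part2(const int32_t *a, int32_t first)
{
    (void)first;
    return a[1];
}

/* Checks that both interfaces yield the same pair for the words at a. */
void check_models_agree(const int32_t *a)
{
    int32_t x, y;
    model_super_ld32(&x, &y, a);

    int32_t p1 = model_super_ld32_part1(a);
    int32_t p2 = model_super_ld32_part2(a, p1);

    assert(x == p1);
    assert(y == p2);
}
```

In the pointer-based form the two result variables have their addresses taken, whereas in the two-builtin form they are ordinary values returned by calls. That difference is presumably what Approach A relies on to keep the results in registers.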