Hi
In a previous post I pointed to some strange code generation by gcc for the
riscv-64 targets.
To summarize:
Suppose a 64 bit operation: c = a OP b;
Gcc does the following:
Instead of loading 64 bits from memory, gcc loads 8 bytes into 8
separate registers for each operand. Then it ORs the 8 bytes into a single 64
bit number. Then it executes the 64 bit operation. And lastly, it splits the
64 bit result into 8 bytes in 8 different registers, and stores these 8 bytes
one after the other.
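A sketch in C of the sequence described above, taking OP to be an addition. The names load64_bytewise, store64_bytewise and add64_bytewise are made up for illustration, and little-endian byte order is assumed; this mirrors the shape of the generated code, not the actual gcc output.

```c
#include <stdint.h>

/* Hypothetical helper: build a 64 bit value from 8 separate byte loads
   (little-endian order assumed). */
static uint64_t load64_bytewise(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v |= (uint64_t)p[i] << (8 * i);   /* shift each byte into place, then OR */
    return v;
}

/* Hypothetical helper: split a 64 bit value into 8 separate byte stores. */
static void store64_bytewise(unsigned char *p, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        p[i] = (unsigned char)(v >> (8 * i));
}

/* c = a + b, done one byte at a time on the memory side. */
void add64_bytewise(unsigned char *c, const unsigned char *a,
                    const unsigned char *b)
{
    store64_bytewise(c, load64_bytewise(a) + load64_bytewise(b));
}
```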
When I saw this I was impressed that this utterly bloated code ran faster
than a hastily written assembly program I did in 10 minutes. Obviously I hadn't
taken any pipeline turbulence into account and my program was slower. When I did
take pipeline turbulence into account, I managed to write a program that runs
several times faster than the bloated code.
Note that for the example above, the direct way would be:
1) Load each 64 bit operand into a register (2 loads)
2) Do the operation
3) Store the result
That is 2 loads, 1 operation and 1 store: 4 instructions, compared to 46
operations for the « gcc way » (16 loads of a byte, 7 x 2 OR operations, 8
shifts to split the result and 8 stores of a byte each).
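For comparison, the direct version can be sketched as below, again taking OP to be an addition. The name add64_direct is made up for illustration. The pointers are typed as uint64_t, so the operands are assumed to be naturally aligned; under that assumption one would expect roughly two 64 bit loads, one add and one 64 bit store, though what gcc actually emits depends on the target flags.

```c
#include <stdint.h>

/* Hypothetical helper: c = a + b with whole 64 bit loads and a single store.
   Assumes a, b and c point to naturally aligned 64 bit objects. */
void add64_direct(uint64_t *c, const uint64_t *a, const uint64_t *b)
{
    *c = *a + *b;
}
```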
I think this is a BUG, but I’m still not convinced that it is one, and I do
not have a clue WHY you do this.
Is anyone here working on the riscv backend? This happens only with -O3, by the way.
Sample code:
#define ACCUM_LENGTH 9
#define WORDSIZE 64
typedef struct {
    int sign, exponent;
    long long mantissa[ACCUM_LENGTH];
} QfloatAccum,*QfloatAccump;
void shup1(QfloatAccump x)
{
    QELT newbits,bits;
    int i;
    bits = x->mantissa[ACCUM_LENGTH] >> (WORDSIZE-1);
    x->mantissa[ACCUM_LENGTH] <<= 1;
    for( i=ACCUM_LENGTH-1; i>0; i-- ) {
        newbits = x->mantissa[i] >> (WORDSIZE - 1);
        x->mantissa[i] <<= 1;
        x->mantissa[i] |= bits;
        bits = newbits;
    }
    x->mantissa[0] <<= 1;
    x->mantissa[0] |= bits;
}
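For anyone who wants something that compiles on its own, here is a stand-alone sketch of the same one-bit left shift over a multiword value. It is not the code above: the name shup1_words and the macro NWORDS are made up, the element type is assumed to be an unsigned 64 bit integer (QELT is not defined in the excerpt), and the array is assumed to hold ACCUM_LENGTH + 1 words so that index ACCUM_LENGTH is valid. The generated code for this variant may well differ from what the original produces.

```c
#include <stdint.h>

#define NWORDS 10   /* assumption: ACCUM_LENGTH + 1 words */

/* Hypothetical stand-alone variant: shift a multiword value left by one bit.
   m[0] is the most significant word, m[NWORDS-1] the least significant. */
void shup1_words(uint64_t m[NWORDS])
{
    uint64_t bits = 0;                    /* carry coming in from the word below */
    for (int i = NWORDS - 1; i >= 0; i--) {
        uint64_t newbits = m[i] >> 63;    /* top bit that will be shifted out */
        m[i] = (m[i] << 1) | bits;
        bits = newbits;
    }
}
```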
Please point me to the right person. Thanks