Hi Folks,
GCC 4.5.1 20100924 "-Os -minline-all-stringops" on Core i7
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main( int argc, char *argv[] )
{
    int i, a[256], b[256];
    for( i = 0; i < 256; ++i ) // discourage optimization
        a[i] = rand();
    memcpy( b, a, argc * sizeof(int) );
    printf( "%d\n", b[rand() & 255] ); // discourage optimization; index masked to stay in bounds
    return 0;
}
I wonder if it's possible to improve the code generation for inline
stringops when the length is known to be a multiple of 4 bytes?
That is, instead of:
movsx rcx, ebp # argc
sal rcx, 2
rep movsb
it would be nice to see:
movsx rcx, ebp # argc
rep movsd
Note that memcpy( b, a, 1024 ) generates:
mov ecx, 256
rep movsd
The reason I think this might be possible is this:
Use -mstringop-strategy=rep_4byte to force the use of movsd.
For memcpy( b, a, argc * sizeof(int) ) we get:
movsx rcx, ebp # argc
sal rcx, 2
cmp rcx, 4
jb .L5 #,
shr rcx, 2
rep movsd
.L5:
For memcpy( b, a, argc ) we get:
movsx rax, ebp # argc, argc
mov rdi, rsp # tmp76,
lea rsi, [rsp+1024] # tmp77,
cmp rax, 4 # argc,
jb .L3 #,
mov rcx, rax # tmp78, argc
shr rcx, 2 # tmp78,
rep movsd
.L3:
xor edx, edx # tmp80
test al, 2 # argc,
je .L4 #,
mov dx, WORD PTR [rsi] # tmp82,
mov WORD PTR [rdi], dx #, tmp82
mov edx, 2 # tmp80,
.L4:
test al, 1 # argc,
je .L5 #,
mov al, BYTE PTR [rsi+rdx] # tmp85,
mov BYTE PTR [rdi+rdx], al #, tmp85
.L5:
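In the former case (* sizeof(int)) gcc has omitted all the code to deal with
the trailing 1, 2, and 3 bytes, so the stringop code generation has apparently
spotted that the length is a multiple of 4 bytes. It still multiplies by 4,
compares against 4, and shifts right by 2 before the rep movsd, though.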
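To make the comparison concrete, here is a rough C sketch of what the two
expansions above appear to do. It is only my reading of the assembly, not
GCC's actual expander, and the function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the memcpy( b, a, argc ) expansion:
   a dword loop (rep movsd) followed by a 2-byte and a 1-byte tail. */
static void copy_rep4_with_tail(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t off = 0;                 /* plays the role of edx */
    size_t i;

    for (i = n >> 2; i != 0; --i) { /* shr rcx, 2 ; rep movsd */
        memcpy(d, s, 4);
        d += 4;
        s += 4;
    }
    if (n & 2) {                    /* test al, 2 */
        memcpy(d, s, 2);
        off = 2;
    }
    if (n & 1)                      /* test al, 1 */
        d[off] = s[off];
}

/* Hypothetical model of the argc * sizeof(int) expansion:
   the length is known to be a multiple of 4, so no tail is needed. */
static void copy_rep4_multiple_of_4(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t i;

    assert(n % 4 == 0);
    for (i = n >> 2; i != 0; --i) { /* rep movsd */
        memcpy(d, s, 4);
        d += 4;
        s += 4;
    }
}
```

The second function is the shape I would hope for: once the multiple-of-4
property is known, the element count could feed rep movsd directly.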
I can see that the code generated for the length expression is separate from
the stringop expansion, though it does do the right thing with a literal length.
Incidentally, for the second case, memcpy( b, a, argc ), the Visual Studio
compiler generates code like this:
mov eax, ecx
shr ecx, 2
rep movsd
mov ecx, eax
and ecx, 3
rep movsb
which seems cleaner (no jumps) than the GCC code, though knowing GCC there is
probably a good reason for its choice as it generally seems to have a far more
sophisticated optimizer.
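For reference, the Visual Studio sequence above corresponds roughly to the
following C. Again this is just a sketch of my reading of the assembly, with
a made-up function name, and the loops stand in for the rep-prefixed
instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the branch-free sequence:
   copy n / 4 dwords, then the remaining n % 4 bytes. */
static void copy_dwords_then_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t dwords = n >> 2;         /* shr ecx, 2 */
    size_t tail = n & 3;            /* and ecx, 3 */

    assert(n == 0 || (dst != NULL && src != NULL));

    for (; dwords != 0; --dwords) { /* rep movsd */
        memcpy(d, s, 4);
        d += 4;
        s += 4;
    }
    for (; tail != 0; --tail)       /* rep movsb */
        *d++ = *s++;
}
```

There are no explicit branches in the assembly because a rep prefix with a
zero count simply copies nothing.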
Best regards,
Jeremy