> This contradicts
>
> /* X86_TUNE_READ_MODIFY_WRITE: Enable use of read modify write instructions
> such as "add $1, mem". */
> DEF_TUNE (X86_TUNE_READ_MODIFY_WRITE, "read_modify_write",
> ~(m_PENT | m_LAKEMONT))
>
> which enables "andl $0, (%edx)" for PentiumPro. "andl $0, (%edx)" works
> well on PentiumPro.
It is also enabled for Zen, but that does not mean that andl $0, (%edx)
is a good way of clearing memory when optimizing for speed.
jan@padlo:/tmp> cat t.c
int mem;
int
main()
{
  for (int i = 0; i < 1000000000; i++)
#ifdef AND
    asm volatile ("andl $0, %0":"=m"(mem));
#else
#ifdef SPLIT
    asm volatile ("xorl %%eax, %%eax; movl $0, %0":"=m"(mem)::"eax");
#else
    asm volatile ("movl $0, %0":"=m"(mem));
#endif
#endif
  return 0;
}
jan@padlo:/tmp> gcc -O2 t.c ; time ./a.out
real 0m0.405s
user 0m0.403s
sys 0m0.002s
jan@padlo:/tmp> gcc -O2 -DSPLIT t.c ; time ./a.out
real 0m0.406s
user 0m0.404s
sys 0m0.001s
jan@padlo:/tmp> gcc -O2 -DAND t.c ; time ./a.out
real 0m2.824s
user 0m2.822s
sys 0m0.001s
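andl is slower than movl because it introduces an unnecessary memory read.
In this loop each of those reads also depends on the store from the
previous iteration, so the iterations are serialized through memory.
I don't have a PentiumPro to test on, but there the -DSPLIT variant should
be a bit better, since the single instruction exceeds 7 bytes.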
Looking into the history of that knob, it was added by me
https://gcc.gnu.org/pipermail/gcc-patches/1999-July/014219.html
to control the behaviour of the splitter that split the move if it was
longer than 7 bytes, which implemented the following recommendation of
the Intel optimization manual:
"Avoid instructions that contain four or more micro-ops or instructions
that are more than seven bytes long. If possible, use instructions that
require one micro-op"
So the comment on SPLIT_LONG_MOVES is a bit incorrect: it does not
mention that the move needs to exceed the long_insn threshold.
I am not sure how much we need to care about PPro performance these days,
though.
Honza