On Wednesday 08 March 2006 02:29, Benjamin LaHaise wrote:
> On Tue, Mar 07, 2006 at 05:27:37PM +0100, Andi Kleen wrote:
> > On Wednesday 08 March 2006 00:26, Benjamin LaHaise wrote:
> > > Hi Andi,
> > > 
> > > On x86-64 one inefficiency that shows up on profiles is the handling of 
> > > struct page conversion to/from idx and addresses.  This is mostly due to 
> > > the fact that struct page is currently 56 bytes on x86-64, so gcc has to 
> > > emit a slow division or multiplication to convert. 
> > 
> > Huh? 
> 
> You used an unsigned long, but ptrdiff_t is signed.  gcc cannot use any 
> shifting tricks because they round incorrectly in the signed case.
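For reference, the rounding issue Ben refers to is easy to see in isolation. A toy sketch of generic gcc behaviour (not kernel code, and the function names are just illustrative):

/* C division truncates toward zero, but an arithmetic shift rounds
 * toward minus infinity, so a signed divide by a power of two needs a
 * fixup while the unsigned one compiles to a single shr.
 */
long sdiv8(long x)                   { return x / 8; }  /* sar plus add-7-if-negative fixup */
unsigned long udiv8(unsigned long x) { return x / 8; }  /* single shr $3 */

/* Example: sdiv8(-7) must be 0, but (-7) >> 3 is -1. */

That only covers a general signed divide, though; the listings below show what gcc actually emits for the conversions.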


My vmlinux has

ffffffff80278382 <pfn_to_page>:
ffffffff80278382:       8b 0d 78 ea 41 00       mov    4319864(%rip),%ecx        # ffffffff80696e00 <memnode_shift>
ffffffff80278388:       48 89 f8                mov    %rdi,%rax
ffffffff8027838b:       48 c1 e0 0c             shl    $0xc,%rax
ffffffff8027838f:       48 d3 e8                shr    %cl,%rax
ffffffff80278392:       48 0f b6 80 00 5e 69    movzbq 0xffffffff80695e00(%rax),%rax
ffffffff80278399:       80
ffffffff8027839a:       48 8b 14 c5 40 93 71    mov    0xffffffff80719340(,%rax,8),%rdx
ffffffff802783a1:       80
ffffffff802783a2:       48 2b ba 40 36 00 00    sub    0x3640(%rdx),%rdi
ffffffff802783a9:       48 6b c7 38             imul   $0x38,%rdi,%rax
ffffffff802783ad:       48 03 82 30 36 00 00    add    0x3630(%rdx),%rax
ffffffff802783b4:       c3                      retq   


and

ffffffff802783b5 <page_to_pfn>:
ffffffff802783b5:       48 8b 07                mov    (%rdi),%rax
ffffffff802783b8:       48 c1 e8 38             shr    $0x38,%rax
ffffffff802783bc:       48 8b 14 c5 80 97 71    mov    0xffffffff80719780(,%rax,8),%rdx
ffffffff802783c3:       80
ffffffff802783c4:       48 b8 b7 6d db b6 6d    mov    $0x6db6db6db6db6db7,%rax
ffffffff802783cb:       db b6 6d 
ffffffff802783ce:       48 2b ba 20 03 00 00    sub    0x320(%rdx),%rdi
ffffffff802783d5:       48 c1 ff 03             sar    $0x3,%rdi
ffffffff802783d9:       48 0f af f8             imul   %rax,%rdi
ffffffff802783dd:       48 03 ba 28 03 00 00    add    0x328(%rdx),%rdi
ffffffff802783e4:       48 89 f8                mov    %rdi,%rax
ffffffff802783e7:       c3                      retq   


Both look quite optimized to me. I haven't timed them but it would surprise me 
if P4 needed more than 20 cycles to crunch through each of them.
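The reason there is no idiv in page_to_pfn is presumably that gcc knows a pointer difference between two struct page pointers is an exact multiple of sizeof(struct page), so dividing the byte offset by 56 becomes a shift by 3 followed by a multiply with the inverse of 7 modulo 2^64; that is where the 0x6db6db6db6db6db7 constant above comes from. A rough sketch of the trick (div_by_56_exact is just an illustrative name, not the actual mm code):

/* 0x6db6db6db6db6db7 * 7 == 1 (mod 2^64), so for a byte offset that is
 * an exact multiple of 56, (off >> 3) * inv7 equals off / 56 without
 * any divide instruction.
 */
static inline unsigned long div_by_56_exact(unsigned long off)
{
        return (off >> 3) * 0x6db6db6db6db6db7UL;
}

e.g. div_by_56_exact(3 * 56) == 3.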


> > AFAIK mul has a latency of < 10 cycles even on P4 so I can't imagine
> > it's a real problem. Something must be wrong with your measurements.
> 
> mul isn't particularly interesting in the profiles, it's the idiv.

Where is that idiv exactly? I don't see it.


> > My guess would be that on more macro loads it would be a loss due 
> > to more cache misses.
> 
> But you get less false sharing of struct page on SMP as well.  With a 56-byte 
> page a single struct page can overlap two cachelines, and on this workload 
> the page definitely gets transferred from one CPU to the other.

Only in pathological workloads. Normally the working set is so large 
that the probability of two pages being near each other is very small.
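
For completeness, the overlap works out like this assuming 64-byte cachelines: 8 * 56 == 7 * 64, so the layout repeats every eight struct pages and six of the eight straddle a line boundary. A quick standalone check (not kernel code):

/* Assumes sizeof(struct page) == 56 and 64-byte cachelines. */
#include <stdio.h>

int main(void)
{
        int i, straddle = 0;

        for (i = 0; i < 8; i++) {
                unsigned long first = i * 56UL;
                unsigned long last  = first + 55;
                if (first / 64 != last / 64)    /* entry spans two lines */
                        straddle++;
        }
        printf("%d of every 8 struct pages straddle a cacheline\n", straddle);
        return 0;
}

which prints 6; whether that matters still depends on two CPUs actually touching neighbouring pages, as above.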

-Andi
