On Wednesday 08 March 2006 02:29, Benjamin LaHaise wrote:
> On Tue, Mar 07, 2006 at 05:27:37PM +0100, Andi Kleen wrote:
> > On Wednesday 08 March 2006 00:26, Benjamin LaHaise wrote:
> > > Hi Andi,
> > >
> > > On x86-64 one inefficiency that shows up on profiles is the handling of
> > > struct page conversion to/from idx and addresses.  This is mostly due to
> > > the fact that struct page is currently 56 bytes on x86-64, so gcc has to
> > > emit a slow division or multiplication to convert.
> >
> > Huh?
>
> You used an unsigned long, but ptrdiff_t is signed.  gcc cannot use any
> shifting tricks because they round incorrectly in the signed case.
My vmlinux has

ffffffff80278382 <pfn_to_page>:
ffffffff80278382:	8b 0d 78 ea 41 00    	mov    4319864(%rip),%ecx        # ffffffff80696e00 <memnode_shift>
ffffffff80278388:	48 89 f8             	mov    %rdi,%rax
ffffffff8027838b:	48 c1 e0 0c          	shl    $0xc,%rax
ffffffff8027838f:	48 d3 e8             	shr    %cl,%rax
ffffffff80278392:	48 0f b6 80 00 5e 69 	movzbq 0xffffffff80695e00(%rax),%rax
ffffffff80278399:	80
ffffffff8027839a:	48 8b 14 c5 40 93 71 	mov    0xffffffff80719340(,%rax,8),%rdx
ffffffff802783a1:	80
ffffffff802783a2:	48 2b ba 40 36 00 00 	sub    0x3640(%rdx),%rdi
ffffffff802783a9:	48 6b c7 38          	imul   $0x38,%rdi,%rax
ffffffff802783ad:	48 03 82 30 36 00 00 	add    0x3630(%rdx),%rax
ffffffff802783b4:	c3                   	retq

and

ffffffff802783b5 <page_to_pfn>:
ffffffff802783b5:	48 8b 07             	mov    (%rdi),%rax
ffffffff802783b8:	48 c1 e8 38          	shr    $0x38,%rax
ffffffff802783bc:	48 8b 14 c5 80 97 71 	mov    0xffffffff80719780(,%rax,8),%rdx
ffffffff802783c3:	80
ffffffff802783c4:	48 b8 b7 6d db b6 6d 	mov    $0x6db6db6db6db6db7,%rax
ffffffff802783cb:	db b6 6d
ffffffff802783ce:	48 2b ba 20 03 00 00 	sub    0x320(%rdx),%rdi
ffffffff802783d5:	48 c1 ff 03          	sar    $0x3,%rdi
ffffffff802783d9:	48 0f af f8          	imul   %rax,%rdi
ffffffff802783dd:	48 03 ba 28 03 00 00 	add    0x328(%rdx),%rdi
ffffffff802783e4:	48 89 f8             	mov    %rdi,%rax
ffffffff802783e7:	c3                   	retq

Both look quite optimized to me. I haven't timed them but it would
surprise me if P4 needed more than 20 cycles to crunch through each of
them.

> > AFAIK mul has a latency of < 10 cycles even on P4 so I can't imagine
> > it's a real problem. Something must be wrong with your measurements.
>
> mul isn't particularly interesting in the profiles, it's the idiv.

Where is that idiv exactly? I don't see it.

> > My guess would be that on more macro loads it would be a loss due
> > to more cache misses.
>
> But you get less false sharing of struct page on SMP as well.  With a
> 56 byte page a single struct page can overlap two cachelines, and on
> this workload the page definitely gets transferred from one CPU to the
> other.
Only in pathological workloads. Normally the working set is so large
that the probability that two pages are near each other is very small.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html