How did this two-year-old thread get resurrected? Anyway, it got resurrected without even answering one core question:
On Thu, Jul 20, 2017 at 4:24 AM, Peter Zijlstra <[email protected]> wrote:
> On Mon, Feb 02, 2015 at 11:13:44AM -0800, Linus Torvalds wrote:
>> On Mon, Feb 2, 2015 at 11:00 AM, Linus Torvalds
>> <[email protected]> wrote:
>> >
>> > (I'm also not entirely sure what uses int_sqrt() that ends up being so
>> > performance-critical, so it would be good to document that too, since
>> > that probably also matters for the "what's the normal argument range"
>> > question..)

This is still the case. Which of the (very few) users really _care_?
And what are the normal values for that?

For example, the 802.11 minstrel code does a "MINSTREL_TRUNC()" on an
"unsigned int" value that is in the millions. It's not even "unsigned
long", so we know it's not many thousands of millions, and
MINSTREL_TRUNC shifts it down by 12 bits. Since an "unsigned int" is
32 bits, we know we have at most a 20-bit argument.

The one case that uses an actual "unsigned long" seems to be
"slow_is_prime_number()", and honestly, the sqrt() is the *least* of
our problems there.

There are a few drivers and filesystems that use it. I do not believe
performance matters in those cases, even if you did an "int_sqrt()"
per network packet on some high-performance network (and none of them
look anything like that).

And there are a couple of VM users. They don't look particularly
critical either.

So why do you care?

Because honestly, calling int_sqrt() once in a blue moon with cold
caches and no branch-prediction information tends to have very
different performance characteristics from calling it in a loop with
very predictable input. So I think your "benchmark" is just garbage,
in that it's testing something entirely different from the actual
load.

Also, since this is a generic library routine, there is no way we can
depend on fls being fast. But we could certainly improve on the
initial value a lot. It's just that we should probably strive to
improve on it without adding extra branch mispredictions or I$ misses
- both things that your benchmark isn't actually testing at all,
since it does the exact opposite by basically preloading both.

And the *most* important question is that first one: "Why does this
matter, and what is the range it matters for?"

Linus
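
For context on the minstrel numbers above: the rate-control code keeps
its statistics in a 12-bit fixed-point format. A sketch of the relevant
macros, assuming the net/mac80211/rc80211_minstrel.h definitions of
that era (names taken from that header, not from this thread):

	/* minstrel's 12-bit fixed-point scaling (sketch of the real macros) */
	#define MINSTREL_SCALE          12
	#define MINSTREL_FRAC(val, div) (((val) << MINSTREL_SCALE) / (div))
	#define MINSTREL_TRUNC(val)     ((val) >> MINSTREL_SCALE)

	/*
	 * The value being truncated is an "unsigned int", i.e. 32 bits,
	 * so after MINSTREL_TRUNC() shifts it down by MINSTREL_SCALE
	 * bits, at most 32 - 12 = 20 bits remain - the bound on the
	 * int_sqrt() argument derived above.
	 */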
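
On the "improve on the initial value" point: the kernel's int_sqrt()
is a bit-by-bit (digit-by-digit) square root whose probe bit had
historically started at the top of the word. A minimal sketch of an
fls-seeded variant, assuming __fls() behaves like the kernel helper
and returns the 0-based index of the highest set bit:

	/*
	 * Bit-by-bit integer square root.  Instead of always starting
	 * the probe bit 'm' at the top of the word, seed it from the
	 * argument's most significant bit (rounded down to an even
	 * position), so small inputs skip the leading iterations that
	 * can never contribute to the result.
	 */
	unsigned long int_sqrt(unsigned long x)
	{
		unsigned long b, m, y = 0;

		if (x <= 1)
			return x;

		/* highest even bit position at or below the MSB of x */
		m = 1UL << (__fls(x) & ~1UL);

		while (m != 0) {
			b = y + m;
			y >>= 1;

			if (x >= b) {
				x -= b;
				y += m;
			}
			m >>= 2;
		}

		return y;
	}

For the 20-bit minstrel arguments discussed above, this cuts the loop
from BITS_PER_LONG/2 probe positions (32 on a 64-bit kernel) to at
most 10 iterations.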

