On Thu, Jul 20, 2017 at 04:24:32PM -0700, Linus Torvalds wrote:

> So mayube something like the attached?
> 
> NOTE NOTE NOTE! This may be pure garbage. It's untested. But the code
> generation looks ok even if gcc is being way too clever and turns that
> first loop into a counted loop instead. Whatever, maybe it's the right
> thing to do.

It passes a -DVALIDATE=1 run (up to a billion or so), so it seems to work.


I've made the test thing a bit more complicated in order to address your
concerns for primed branch predictors (and once the branch predictors
get wiped, the predictable input values don't matter anymore either I
think, although I could easily LFSR generate them as well of course).

Find it attached in a tarball.


The results, from running ./test.sh (left SKL, right SNB):

(values <100k)

Cycles:


EVENT=0 -DLINUS=1
event: 29.533080 +- 0.046261                    36.800100 +- 0.046067
EVENT=0 -DLINUS=1 -DWIPE_BTB=1
event: 143.513440 +- 0.152651                   153.054640 +- 0.099403
EVENT=0 -DNEW=1 -DANSHUL=1
event: 41.457300 +- 0.048409                    55.046100 +- 0.044968
EVENT=0 -DNEW=1 -DANSHUL=1 -DWIPE_BTB=1
event: 128.351400 +- 0.366873                   161.183690 +- 0.132301
EVENT=0 -DNEW=1 -DANSHUL=1 -DFLS=1
event: 21.433170 +- 0.043859                    27.672700 +- 0.063982
EVENT=0 -DNEW=1 -DANSHUL=1 -DFLS=1 -DWIPE_BTB=1
event: 79.850210 +- 0.133646                    112.422470 +- 0.101465
EVENT=0 -DNEW=1 -DANSHUL=1 -DFLS=1 -DSOFTFLS=1
event: 31.771680 +- 0.037655                    37.515160 +- 0.080305
EVENT=0 -DNEW=1 -DANSHUL=1 -DFLS=1 -DSOFTFLS=1 -DWIPE_BTB=1
event: 129.673440 +- 0.430308                   125.514390 +- 0.117590


Branches:


EVENT=4 -DLINUS=1
event: 31.507740 +- 0.010470                    31.507710 +- 0.010470
EVENT=4 -DLINUS=1 -DWIPE_BTB=1
event: 31.507720 +- 0.010470                    31.507760 +- 0.010470
EVENT=4 -DNEW=1 -DANSHUL=1
event: 45.125510 +- 0.002696                    45.125510 +- 0.002696
EVENT=4 -DNEW=1 -DANSHUL=1 -DWIPE_BTB=1
event: 45.125490 +- 0.002696                    45.125540 +- 0.002697
EVENT=4 -DNEW=1 -DANSHUL=1 -DFLS=1
event: 21.252320 +- 0.005266                    21.252330 +- 0.005266
EVENT=4 -DNEW=1 -DANSHUL=1 -DFLS=1 -DWIPE_BTB=1
event: 21.252320 +- 0.005266                    21.252340 +- 0.005266
EVENT=4 -DNEW=1 -DANSHUL=1 -DFLS=1 -DSOFTFLS=1
event: 25.727940 +- 0.005975                    25.727930 +- 0.005975
EVENT=4 -DNEW=1 -DANSHUL=1 -DFLS=1 -DSOFTFLS=1 -DWIPE_BTB=1
event: 25.727940 +- 0.005975                    25.727930 +- 0.005975


Branch-Misses:


EVENT=5 -DLINUS=1
event: 0.022670 +- 0.000789                     0.020920 +- 0.000812
EVENT=5 -DLINUS=1 -DWIPE_BTB=1
event: 5.481750 +- 0.006930                     5.955130 +- 0.005228
EVENT=5 -DNEW=1 -DANSHUL=1
event: 0.025990 +- 0.000798                     0.017170 +- 0.000690
EVENT=5 -DNEW=1 -DANSHUL=1 -DWIPE_BTB=1
event: 3.343600 +- 0.004249                     5.723570 +- 0.004876
EVENT=5 -DNEW=1 -DANSHUL=1 -DFLS=1
event: 0.019040 +- 0.000731                     0.022570 +- 0.000829
EVENT=5 -DNEW=1 -DANSHUL=1 -DFLS=1 -DWIPE_BTB=1
event: 2.156350 +- 0.004643                     4.297650 +- 0.004910
EVENT=5 -DNEW=1 -DANSHUL=1 -DFLS=1 -DSOFTFLS=1
event: 0.019800 +- 0.000747                     0.016780 +- 0.000688
EVENT=5 -DNEW=1 -DANSHUL=1 -DFLS=1 -DSOFTFLS=1 -DWIPE_BTB=1
event: 4.481360 +- 0.005505                     4.518300 +- 0.004877




Which still doesn't explain your hatred of fls(), as even with cold
branches and a software fls, its faster than your latest proposal (as
can be explained by the lower total number of branches for the software
fls variant compared to your proposal).


And when we start to look at larger values, the fls() one pulls even
further ahead.

So I'll stick with my proposal... I can write up a proper Changelog and
maybe add my test to tools/testing/sqrt/ if you think its worth it.


---
 lib/int_sqrt.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/lib/int_sqrt.c b/lib/int_sqrt.c
index 1ef4cc344977..611933760225 100644
--- a/lib/int_sqrt.c
+++ b/lib/int_sqrt.c
@@ -7,12 +7,13 @@
 
 #include <linux/kernel.h>
 #include <linux/export.h>
+#include <linux/bitops.h>
 
 /**
- * int_sqrt - rough approximation to sqrt
+ * int_sqrt - Computes the integer square root
  * @x: integer of which to calculate the sqrt
  *
- * A very rough approximation to the sqrt() function.
+ * Computes: floor(sqrt(@x))
  */
 unsigned long int_sqrt(unsigned long x)
 {
@@ -21,7 +22,11 @@ unsigned long int_sqrt(unsigned long x)
        if (x <= 1)
                return x;
 
-       m = 1UL << (BITS_PER_LONG - 2);
+       m = 1UL << (__fls(x) & ~1U);
+
+       while (m > x)
+               m >>= 2;
+
        while (m != 0) {
                b = y + m;
                y >>= 1;

Attachment: sqrt.tar
Description: Unix tar archive

Reply via email to