On 8/23/19 5:19 AM, Elen Kalda wrote:
> Hi all,
>
> Libgcc currently implements Smith's algorithm
> (https://dl.acm.org/citation.cfm?id=368661) for complex division. It is
> designed to prevent overflows and underflows and it does that well for
> around 97.7% of cases. However, there has been some interest in
> improving that result; as an example, Baudin improves the robustness to
> ~99.5% by introducing an extra underflow check and also gains in
> performance by replacing two divisions with one division and two
> multiplications (https://arxiv.org/abs/1210.4539).
>
> This patch proposes implementing Baudin's algorithm in libgcc (in case
> the increase in rounding error is acceptable).
>
> In general, when it comes to deciding what is the best way of doing
> complex divisions, there are three things to consider:
> - speed
> - accuracy
> - robustness (how often overflow/underflow happens, how many division
>   results are way off from the exact value)
>
> Improving the algorithms by looking at the failures of specific cases
> is not simple. There are 4 real numbers to be manipulated to get the
> final result and some minimum number of operations that needs to be
> done. The quite minimal Smith's algorithm currently does 9 operations,
> of which 6 are multiplications or divisions that are susceptible to
> overflow and underflow. It is not feasible to check for both underflow
> and overflow in all 6 operations without significantly impacting the
> performance. Most complex division algorithms have been created to
> solve some special difficult cases; however, because of the abundance
> of failure opportunities, an algorithm that works well for some types
> of divisions is likely to fail for other types of divisions.
>
> The simplest way to dodge overflow and underflow without impacting
> performance much is to vary the order of operations. When naively
> expanding the complex division x/y = (...)/(real(y)*real(y) +
> imag(y)*imag(y)), the squaring in the denominator is the greatest
> source of overflow and underflow. Smith's improvement was to introduce
> a scaling factor r = imag(y)/real(y), such that the denominator becomes
> real(y) + imag(y)*r. Baudin's improvement was to check explicitly for
> underflow in r and change the order of operations if necessary.
>
> One way of comparing the algorithms is to generate lots of random
> complex numbers and see how the different algorithms perform. That
> approach is closer to the "real world" situation, where the complex
> numbers are often random, not powers of two (which oddly is the
> assumption many authors make, including Baudin), and have exponents in
> less "dangerous" ranges. Smith's algorithm was compared to two versions
> of Baudin's. The difference is that in one version (b2div), the real
> and imaginary parts of the result are both divided through by the same
> denominator d, while in the other (b1div) the real and imaginary parts
> are multiplied through by the reciprocal of that d, effectively
> replacing two divisions with one division and two multiplications.
> Since division is significantly slower than multiplication on AArch64,
> that swap brings gains in performance (though at the cost of rounding
> error, unfortunately).
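For concreteness, here is a minimal C sketch of the variants being
compared, reconstructed from the description above and from Baudin &
Smith (arXiv:1210.4539). It is an illustration with invented names
(smith_div, baudin_half, baudin_b1div), not the actual libgcc patch,
and it elides the NaN/infinity special cases the real code would need:

#include <math.h>

/* Smith's algorithm for (a + b*i) / (c + d*i): pick the branch in
   which |r| <= 1, so the denominator c + d*r (resp. c*r + d) is
   formed without squaring c or d.  Both result parts are divided by
   den -- the "b2div" shape.  */
static void
smith_div (double a, double b, double c, double d, double *e, double *f)
{
  if (fabs (c) >= fabs (d))
    {
      double r = d / c;
      double den = c + d * r;
      *e = (a + b * r) / den;
      *f = (b - a * r) / den;
    }
  else
    {
      double r = c / d;
      double den = c * r + d;
      *e = (a * r + b) / den;
      *f = (b * r - a) / den;
    }
}

/* One branch of Baudin's variant, valid when |d| <= |c|.  Two changes
   relative to Smith: if r = d/c underflowed to zero, the products are
   reassociated as d * (b / c) instead of b * (d / c); and the two
   divisions by den are replaced by one reciprocal t and two
   multiplications.  The latter is what distinguishes b1div from
   b2div, which keeps the two divisions.  */
static void
baudin_half (double a, double b, double c, double d, double *e, double *f)
{
  double r = d / c;
  double t = 1.0 / (c + d * r);
  if (r != 0.0)
    {
      *e = (a + b * r) * t;
      *f = (b - a * r) * t;
    }
  else
    {
      /* r underflowed: change the order of operations.  */
      *e = (a + d * (b / c)) * t;
      *f = (b - d * (a / c)) * t;
    }
}

static void
baudin_b1div (double a, double b, double c, double d, double *e, double *f)
{
  if (fabs (d) <= fabs (c))
    baudin_half (a, b, c, d, e, f);
  else
    {
      /* (a + b*i)/(c + d*i) = conj ((b + a*i)/(d + c*i)), so swap the
         roles of the inputs and flip the sign of the imaginary part.  */
      baudin_half (b, a, d, c, e, f);
      *f = -*f;
    }
}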
> To compare the performance, I used a test that generates 1800 random
> complex double pairs (which fit nicely into the 64 kB cache) and
> divides each pair 200 000 times, effectively doing 360 000 000
> operations.
>
>        | CPU time
> ------------------
> smiths | 7 296 924
> b1div  | 6 558 148
> b2div  | 8 658 079
>
> On AArch64, b1div is 10.13% faster than Smith's, and b2div is 18.65%
> slower than Smith's.
>
> For the accuracy, 1 000 000 pairs of complex doubles were divided and
> compared to the exact results (assuming that dividing the same numbers
> as long doubles gives the exact value).
>
>        | >2ulp  | >1ulp  | >0.5ulp | <0.5ulp
> ---------------------------------------------
> smiths | 22 937 | 23 944 | 124 228 | 828 891
> b1div  |  4 450 | 59 716 | 350 792 | 585 042
> b2div  |  4 036 | 24 488 | 127 144 | 844 332
>
> Same data, but showing the percentage of divisions falling into each
> category:
>
>        | >2ulp | >1ulp | >0.5ulp | <0.5ulp
> -------------------------------------------
> smiths | 2.29% | 2.39% | 12.42%  | 82.89%
> b1div  | 0.45% | 5.97% | 35.08%  | 58.50%
> b2div  | 0.40% | 2.45% | 12.71%  | 84.43%
>
> The rightmost column (<0.5ulp) shows the number of calculations for
> which the ulp errors of both the real and imaginary parts were within
> 0.5 ulp, meaning they were correctly rounded. >0.5ulp gives the number
> of calculations for which the larger of the real and imaginary ulp
> errors was between 0.5 and 1.0 ulp, meaning the result was rounded to
> one of the two nearest doubles, but not THE nearest double. Anything
> less accurate is not great...
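As a side note on methodology, the ulp error of a double result against
a long double reference can be measured along the following lines. This
is a reconstruction of the measurement described above, not necessarily
the actual test harness; ulp_error is an invented name, and the sketch
ignores subnormals and the ulp change exactly at powers of two:

#include <float.h>
#include <math.h>

/* Error of COMPUTED in units of the last place of a double near
   EXACT.  A correctly rounded result gives at most 0.5.  */
static double
ulp_error (double computed, long double exact)
{
  if (exact == 0.0L)
    return computed == 0.0 ? 0.0 : INFINITY;

  int exp;
  frexpl (exact, &exp);  /* exact = m * 2^exp with 0.5 <= |m| < 1.  */

  /* Spacing of doubles at this magnitude: 2^(exp - 53).  */
  long double one_ulp = ldexpl (1.0L, exp - DBL_MANT_DIG);

  return (double) (fabsl ((long double) computed - exact) / one_ulp);
}

For each test division, the bucket in the tables above would then be
chosen by the larger of the real-part and imaginary-part ulp errors.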
> FMA doesn't create any problems on AArch64. Bootstrapped and tested on
> aarch64-none-linux-gnu -- no problems with bootstrap, but some
> regressions in gcc/testsuite/gcc.dg/torture/builtin-math-7.c. The
> regressions are a result of the loss in accuracy - for the division
> (3. + 4.i) / 5i = 0.8 - 0.6i, Smith's gives:
> 0.800000000000000044408920985006 - 0.599999999999999977795539507497 i
> ulp error in real part: 0.4
> ulp error in imaginary part: 0.2
> b1div gives:
> 0.800000000000000044408920985006 - 0.600000000000000088817841970013 i
> ulp error in real part: 0.4
> ulp error in imaginary part: 0.8
> That means the imaginary part of the Baudin result is rounded to one
> of the two nearest doubles, but not the one which is the nearest. A
> similar thing happens to another division in that test.
>
> Some questions:
> - Is the 10.13% improvement in performance with b1div worth the loss
>   in accuracy?
> - In the case of b1div, most of the results that ceased to be
>   correctly rounded as a result of replacing the two divisions with
>   multiplications ended up being within 1.0 ulp. Maybe that is
>   acceptable?
> - The improved algorithm reduces the number of bad fails from 2.3% to
>   0.5%. Is that a significant/useful improvement?
>
> Best wishes,
> Elen
>
> libgcc/ChangeLog:
>
> 2019-07-31  Elen Kalda  <elen.ka...@arm.com>
>
>     * libgcc2.c (CONCAT3(__div,MODE,3)): Implement Baudin's algorithm.

I'd really like to hear from Joseph on this. He's got *far* more
experience evaluating stuff in this space than I do and is almost
certainly going to be able to give a more informed opinion than I can.

I believe he's digging out from some time off, so it might be a little
while before he can respond.

jeff