On 2002.04.10 17:42 Brian Paul wrote:
>
> José,
>
> I've checked in the code after testing with Glean and the OpenGL
> conformance
> tests.
>
Great.
> Was I supposed to change something in the C code? It passes the
> conformance tests as-is.
>
I was surprised that the C code passed the conformance tests because, due to
the signed arithmetic, it doesn't give the same results as before. So I've
made a small comparison of the several methods (test program attached):
// Nathan's method - unsigned 24bit arithmetic
// NOTE: this was the original Mesa code
t1 = p*a + q*(255 - a);
s1 = (t1 + (t1 << 8) + 256) >> 16;
// Nathan's method - signed 24bit arithmetic (one multiply fewer)
// NOTE: this is what I changed it to, and what the code does now
t2 = (p - q)*a;
s2 = (t2 + (t2 << 8) + 256) >> 16;
s2 += q;
s2 &= 0xff;
// Blinn's method - unsigned 16bit arithmetic
// NOTE: exact
t3 = p*a + q*(255-a) + 128;
s3 = (t3 + (t3 >> 8)) >> 8;
// Blinn's method - signed 16bit arithmetic (one multiply fewer)
// NOTE: exact because the sign is taken into account in the rounding term
t4 = ((p - q)*a + (p > q ? 128 : -128)) & 0xffff;
s4 = (t4 + (t4 >> 8)) >> 8;
s4 += q;
s4 &= 0xff;
When one compares with the exact result
// exact result - rounded
s = (unsigned) (((double)p)*(((double)a)/255.0) +
((double)q)*(1.0-((double)a)/255.0) + 0.5);
one gets:
1: 8164890 differences in 16777216
2: 8148697 differences in 16777216
3: 0 differences in 16777216
4: 0 differences in 16777216
So, in spite of the different results between methods 1 and 2, method 2
actually gives better results overall!
What happens is that method 1 is designed to follow the truncated result
rather than the rounded one. If one compares with the truncated result
// truncated result
s = (unsigned) (((double)p)*(((double)a)/255.0) +
((double)q)*(1.0-((double)a)/255.0));
one gets:
1: 15467 differences in 16777216
2: 31660 differences in 16777216
3: 8180357 differences in 16777216
4: 8180357 differences in 16777216
Notice that from this point of view method 2 is indeed worse, but that
doesn't really matter because it is the wrong point of view.
This explains why the current C code passes the conformance tests.
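The relationship between methods 1 and 2 can be made exact. Nathan's formula uses the geometric series for 1/255 and keeps its first two terms (t + (t << 8) is 257t), plus the rounding constant 256. Substituting t_1 = t_2 + 255q into method 1 and using 255 * 257 = 2^16 - 1 gives the following, where the floor is the >> 16 and, for method 2, the result is taken modulo 256 as the & 0xff does:

```latex
\begin{aligned}
\frac{1}{255} &= \frac{1}{256}\cdot\frac{1}{1 - 2^{-8}} = \sum_{k \ge 1} 2^{-8k} \approx \frac{257}{2^{16}} \\[4pt]
t_2 &= (p - q)\,a, \qquad t_1 = p\,a + q\,(255 - a) = t_2 + 255\,q \\
s_1 &= \left\lfloor \frac{257\,t_1 + 256}{2^{16}} \right\rfloor
     = \left\lfloor \frac{257\,t_2 + 256 + (2^{16} - 1)\,q}{2^{16}} \right\rfloor
     = \left\lfloor \frac{257\,t_2 + 256 - q}{2^{16}} \right\rfloor + q \\
s_2 &= \left\lfloor \frac{257\,t_2 + 256}{2^{16}} \right\rfloor + q
\end{aligned}
```

So the two methods differ only in the rounding constant: 256 for method 2 versus 256 - q for method 1. Since q >= 0, method 1 can never return a larger value than method 2, which is consistent with it sitting closer to the truncated result in the counts above.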
At the moment the MMX code implements method 4, which is very fast. There
is no point in implementing method 2, despite it being a little faster than
method 4 (because of its simpler rounding), because it would require 24-bit
arithmetic instead of 16-bit, so fewer numbers could be multiplied at the
same time.
So, contrary to what I thought, there is no need to switch to method 1.
When I implement the double blend trick I will also have to use method 4,
for the same reasons as above.
But since the specs allow some tolerance, it would be nice to run the
conformance tests with different settings in mmx_blend.S, especially the
"single multiply w/o rounding" one, which would give at least a 5%
improvement (probably a little more, because it would free some registers,
allowing some necessary constants to be kept there).
For that it is only necessary to change
#define GMBT_ROUNDOFF 0
leaving the rest as before
#define GMBT_ALPHA_PLUS_ONE 0
#define GMBT_GEOMETRIC_SERIES 1
#define GMBT_SIGNED_ARITHMETIC 1
Using the alpha+1 method without the geometric series would be even
faster, but it is already marked in the C code as rejected by Glean...
> Thanks for you work!
>
> -Brian
>
Regards,
José Fonseca
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    unsigned short p, q, a;
    unsigned c1 = 0, c2 = 0, c3 = 0, c4 = 0;

    for (p = 0; p <= 255; ++p)
        for (q = 0; q <= 255; ++q)
            for (a = 0; a <= 255; ++a)
            {
                unsigned s;
                unsigned s1, s2, s3, s4;
                unsigned t1, t2, t3, t4;

#if 1
                // exact result - rounded
                s = (unsigned) (((double)p)*(((double)a)/255.0) + ((double)q)*(1.0-((double)a)/255.0) + 0.5);
#else
                // truncated result
                s = (unsigned) (((double)p)*(((double)a)/255.0) + ((double)q)*(1.0-((double)a)/255.0));
#endif

                // Nathan's method - unsigned 24bit arithmetic
                t1 = p*a + q*(255 - a);
                s1 = (t1 + (t1 << 8) + 256) >> 16;

                // Nathan's method - signed 24bit arithmetic
                t2 = (p - q)*a;
                s2 = (t2 + (t2 << 8) + 256) >> 16;
                s2 += q;
                s2 &= 0xff;

                // Blinn's method - unsigned 16bit arithmetic
                // NOTE: exact
                t3 = p*a + q*(255-a) + 128;
                s3 = (t3 + (t3 >> 8)) >> 8;

                // Blinn's method - signed 16bit arithmetic
                // NOTE: exact because the sign is taken into account in the rounding term
                t4 = ((p - q)*a + (p > q ? 128 : -128)) & 0xffff;
                s4 = (t4 + (t4 >> 8)) >> 8;
                s4 += q;
                s4 &= 0xff;

                if (s1 != s) ++c1;
                if (s2 != s) ++c2;
                if (s3 != s) ++c3;
                if (s4 != s) ++c4;

                if (s1 != s || s2 != s || s3 != s || s4 != s)
                {
                    // printf("%3ux%3ux%3u:\t(%3u)\t%3u\t%3u\t%3u\t%3u\n", p, a, q, s, s1, s2, s3, s4);
                }
            }

    // 256u keeps the argument unsigned to match the %u conversion
    printf("1: %u differences in %u\n", c1, 256u*256*256);
    printf("2: %u differences in %u\n", c2, 256u*256*256);
    printf("3: %u differences in %u\n", c3, 256u*256*256);
    printf("4: %u differences in %u\n", c4, 256u*256*256);
    return 0;
}