Keith Whitwell wrote:
Ian Romanick wrote:
One thing about Jakub's patch is that, on x86, it eliminates the need for the specialized _ts_* versions of the dispatch functions. It basically converts the DISPATCH macro (as used in src/mesa/main/dispatch.c) from:
#define DISPATCH(FUNC, ARGS, MESSAGE) \
    (_glapi_Dispatch->FUNC) ARGS
to:
#define DISPATCH(FUNC, ARGS, MESSAGE) \
    const struct _glapi_table * d = _glapi_Dispatch; \
    if ( __builtin_expect( d == NULL, 0 ) ) \
        d = get_dispatch(); \
    (d->FUNC) ARGS
There is some extra cost in the non-threaded case, but it seems very minimal. In the x86 assembly case, it's only a test and a conditional branch that is usually not taken. Does this seem like a reasonable change to make across the board?
Hmm. The _ts_* macros were introduced to eliminate exactly that sort of test - though we probably coded it up in a less optimal way than that. Are you saying that the dispatch tables would really become compiled 'C'? At the moment they are typically generated as assembly and use a jmp rather than calling a new function as in either of the examples above.
My feeling is that the non-threaded case should run as fast as possible, being the normal usage. Maybe some timings would make things clearer.
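To make the two DISPATCH variants quoted above concrete, here is a minimal sketch in plain C. It is not the Mesa code: `demo_table`, `demo_Dispatch`, `demo_get_dispatch` and `slow_path_hits` are hypothetical stand-ins for `struct _glapi_table`, `_glapi_Dispatch` and `get_dispatch()`, and the per-thread lookup is faked with a static table. `__builtin_expect` is a GCC extension (also accepted by Clang).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical one-entry stand-in for struct _glapi_table. */
struct demo_table {
    int (*Add)(int a, int b);
};

static int real_add(int a, int b) { return a + b; }

static struct demo_table single_thread_table = { real_add };
static struct demo_table per_thread_table    = { real_add };

/* Non-NULL in the single-threaded case; NULL forces the slow path. */
static struct demo_table *demo_Dispatch = &single_thread_table;

/* Counts slow-path lookups so the behaviour can be observed. */
static unsigned slow_path_hits = 0;

/* Stand-in for get_dispatch(); real code would consult per-thread data. */
static struct demo_table *demo_get_dispatch(void)
{
    slow_path_hits++;
    return &per_thread_table;
}

/* Old style: unconditional indirect call through the global pointer. */
static int add_old(int a, int b)
{
    return (demo_Dispatch->Add)(a, b);
}

/* New style: one test and a rarely taken branch before the indirect call. */
static int add_new(int a, int b)
{
    const struct demo_table *d = demo_Dispatch;

    if (__builtin_expect(d == NULL, 0))
        d = demo_get_dispatch();
    assert(d != NULL);
    return (d->Add)(a, b);
}
```

With `demo_Dispatch` left non-NULL, `add_new(2, 3)` takes the fast path and never calls `demo_get_dispatch()`. After setting `demo_Dispatch = NULL`, each call goes through the lookup instead. The only extra work on the fast path is the NULL test, which is the cost being debated here.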
Attached is the test program I used. It calls each of a few API functions in turn, 1,000,000 times each by default (the count can be changed on the command line). I tried it on a 2.4GHz Pentium 4 and a 400MHz K6-3. Both systems run Red Hat 7.3 + patches (and are in need of upgrades, I know). All code was compiled with gcc 2.96-113.
On the K6-3, the results were within the measurable margin of error for the two x86 assembly dispatch methods.
On the P4, the old-style dispatch was between 5 and 20 clock cycles faster per call. In other words, the new dispatch costs between 5% and 38% more on each call. The worst case was glTexCoord3fv, which went from ~52 cycles to ~72 cycles. The two exceptions were glMultiTexCoord2fv and glMultiTexCoord2f, whose timings were virtually identical under both methods.
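As a quick check on the worst-case figure, the relative overhead is just the cycle difference divided by the old cost. The helper below is a hypothetical illustration (the name `overhead_percent` is not part of the test program). For the glTexCoord3fv numbers, (72 - 52) / 52 comes out at roughly 38%.

```c
#include <assert.h>

/* Relative increase, in percent, going from old_cycles to new_cycles. */
static double overhead_percent(double old_cycles, double new_cycles)
{
    assert(old_cycles > 0.0);
    return (new_cycles - old_cycles) * 100.0 / old_cycles;
}
```

Calling `overhead_percent(52.0, 72.0)` gives a value between 38 and 39. Note that the same absolute difference in cycles yields a smaller percentage on a function whose baseline cost is higher.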
I'm a bit confused as to why the overhead isn't constant from function to function. The difference per-call should be identical. I suspect there is some other difference in my build. :( I'll keep looking into it...
#include <stdio.h>
#include <stdlib.h>
#define GL_GLEXT_PROTOTYPES
#include <GL/gl.h>
#include <GL/glext.h>
#include <GL/glut.h>
#include <asm/timex.h>
static float Width = 400.0;
static float Height = 400.0;
static unsigned count = 1000000;
static void Idle( void )
{
    glutPostRedisplay();
}
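/* Time `count` back-to-back calls of f with argument list p.
 * Relies on the variables i, t0, and t1 of the enclosing function.
 */
#define DO_FUNC(f,p) \
    do { \
        t0 = get_cycles(); \
        for ( i = 0 ; i < count ; i++ ) { \
            f p ; \
        } \
        t1 = get_cycles(); \
        printf("%u calls to %20s required %llu cycles.\n", \
               count, # f, (unsigned long long) (t1 - t0)); \
    } while( 0 )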
static void Display( void )
{
    unsigned i;    /* unsigned to match count in the DO_FUNC loop */
    const float v[3] = { 1.0, 0.0, 0.0 };
    cycles_t t0;
    cycles_t t1;

    glBegin(GL_TRIANGLE_STRIP);
    DO_FUNC( glColor3fv, (v) );
    DO_FUNC( glNormal3fv, (v) );
    DO_FUNC( glTexCoord2fv, (v) );
    DO_FUNC( glTexCoord3fv, (v) );
    DO_FUNC( glMultiTexCoord2fv, (GL_TEXTURE0, v) );
    DO_FUNC( glMultiTexCoord2f, (GL_TEXTURE0, 0.0, 0.0) );
    DO_FUNC( glFogCoordfv, (v) );
    DO_FUNC( glFogCoordf, (0.5) );
    glEnd();

    /* One timing pass is enough; exit instead of redrawing. */
    exit(0);
}
static void Reshape( int width, int height )
{
    Width = width;
    Height = height;
    glViewport( 0, 0, width, height );
    glMatrixMode( GL_PROJECTION );
    glLoadIdentity();
    glOrtho(0.0, width, 0.0, height, -1.0, 1.0);
    glMatrixMode( GL_MODELVIEW );
    glLoadIdentity();
}
static void Key( unsigned char key, int x, int y )
{
    (void) x;
    (void) y;
    switch (key) {
    case 27:    /* Escape */
        exit(0);
        break;
    }
    glutPostRedisplay();
}
int main( int argc, char *argv[] )
{
    glutInit( &argc, argv );
    glutInitWindowSize( (int) Width, (int) Height );
    glutInitWindowPosition( 0, 0 );
    glutInitDisplayMode( GLUT_RGB );
    glutCreateWindow( argv[0] );

    /* Optional first argument overrides the number of calls per function. */
    if ( argc > 1 ) {
        count = strtoul( argv[1], NULL, 0 );
    }

    glutReshapeFunc( Reshape );
    glutKeyboardFunc( Key );
    glutDisplayFunc( Display );
    glutIdleFunc( Idle );
    glutMainLoop();
    return 0;
}
