The following patches aim to reduce useless work done by xserver in frequently executed code paths. I also fixed a few unrelated bugs that I spotted in the code while doing the optimizations.

The main optimization is registering block handlers only when there is work for them. It is very common for block handlers to be called even when there is no work to do. On ARM, doing nothing in these block handlers takes 20-40us when the data isn't in the CPU caches (a very common case on ARM).

The next optimization is using NoopDDA everywhere a handler that does nothing needs to be registered. This eliminates about 800ns (on ARM) for each registered handler that does nothing.

The last change caches the result of LocalClient, which is called for each DRI2 request. Caching eliminates about 9us for each DRI2 call. The largest win comes from avoiding the malloc/free done for _XSERVTransGetPeerAddr.

On-demand registered block handlers make 2 assumptions about how xserver handles BlockHandlers:

1. BlockHandlers that are registered on demand are always registered after init. This makes it safe to remove the on-demand handler later on, because all on-demand handlers implement the unwrap/call/wrap sequence correctly.

2. CloseScreen is only called before the screen structure is freed. A CloseScreen handler is traditionally used to remove BlockHandlers, but that assumes CloseScreen is called in the same order in which the function pointer wrapping happened. IMO it is better to trust that the function pointer is never called after CloseScreen. That makes unwrapping the handlers in CloseScreen pointless, which solves the problem that on-demand wrapped handlers are in random order.

Then x11perf, showing the difference in the "real world" on ARM:

x11perf -prop

Without patches:

  Sync time adjustment is 0.1290 msecs.
   60000 reps @ 0.0983 msec ( 10200.0/sec): GetProperty
   60000 reps @ 0.0981 msec ( 10200.0/sec): GetProperty
   60000 reps @ 0.0982 msec ( 10200.0/sec): GetProperty
   60000 reps @ 0.0982 msec ( 10200.0/sec): GetProperty
   60000 reps @ 0.0981 msec ( 10200.0/sec): GetProperty
  300000 trep @ 0.0982 msec ( 10200.0/sec): GetProperty

With patches:

  Sync time adjustment is 0.1232 msecs.
   60000 reps @ 0.0903 msec ( 11100.0/sec): GetProperty
   60000 reps @ 0.0903 msec ( 11100.0/sec): GetProperty
   60000 reps @ 0.0904 msec ( 11100.0/sec): GetProperty
   60000 reps @ 0.0903 msec ( 11100.0/sec): GetProperty
   60000 reps @ 0.0904 msec ( 11100.0/sec): GetProperty
  300000 trep @ 0.0904 msec ( 11100.0/sec): GetProperty

That is roughly 8% less time per request, which does not look like as much as detailed profiling traces from real-world applications would suggest. But it is clearly visible in profiles that execution time depends heavily on the cache. A round trip of 90-100us here is a lot less than what real-world profiles show, where WaitForSomething alone takes 100-200us.
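
For readers unfamiliar with the wrapping idiom, here is a minimal sketch of what registering a block handler on demand could look like. It is not taken from the patches: the Screen structure, its fields, and the functions queue_work, base_block_handler and ondemand_block_handler are all hypothetical stand-ins, and it models only a single on-demand handler per screen. It illustrates the unwrap/call/wrap sequence, with the handler leaving itself out of the chain once it has no work left.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the per-screen structure. */
typedef struct Screen Screen;
typedef void (*BlockHandlerFn)(Screen *screen);

struct Screen {
    BlockHandlerFn block_handler;  /* head of the wrap chain */
    BlockHandlerFn saved_handler;  /* wrapped handler, NULL when not wrapped */
    int pending_work;              /* units of work left for the on-demand handler */
    int base_calls;                /* how often the base handler ran */
    int ondemand_calls;            /* how often the on-demand handler ran */
};

/* Handler that is always installed (assumed to be set up during init). */
void base_block_handler(Screen *s)
{
    s->base_calls++;
}

/* On-demand handler: unwrap, call down the chain, wrap again only if needed. */
void ondemand_block_handler(Screen *s)
{
    assert(s->saved_handler != NULL);

    /* unwrap */
    s->block_handler = s->saved_handler;
    /* call */
    s->block_handler(s);

    /* do one unit of our own work */
    s->ondemand_calls++;
    if (s->pending_work > 0)
        s->pending_work--;

    if (s->pending_work > 0) {
        /* wrap: more work is pending, stay in the chain */
        s->saved_handler = s->block_handler;
        s->block_handler = ondemand_block_handler;
    } else {
        /* no work left: stay unwrapped so idle wakeups skip this handler */
        s->saved_handler = NULL;
    }
}

/* Queue work and register the handler only if it is not already wrapped. */
void queue_work(Screen *s, int units)
{
    s->pending_work += units;
    if (s->saved_handler == NULL) {
        s->saved_handler = s->block_handler;
        s->block_handler = ondemand_block_handler;
    }
}
```

In this sketch the cost of an idle wakeup is just the base handler, since the on-demand handler only sits in the chain between a queue_work call and the point where its work runs out.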
