Hello,

I want to share the ideas, code and some benchmarks for the new Servo WebGL architecture and refactor I'm working on. I'd love to hear some feedback about the big picture. Smaller things and nits can be handled in the PR review (which I plan to start at the beginning of next week if there aren't strong opinions about the overall architecture).

First, I'll mention some of the problems of the current Servo/WebRender WebGL implementation and source code organization:

* Part of the WebGL implementation is included in WebRender. This makes WebGL contributions very slow. For example, in order to add a new WebGL method I usually need to push a PR to gleam (and wait until it's merged and the crate is published), then submit another PR to WebRender (and wait for the review and merge), and then wait again until the new WR is merged into Servo (with little control over it, because WR can have unrelated in-progress features that are not ready to merge).
* WebGL-related code is not tested when it lands in WR. It's only tested when a WR update lands in Servo. When a WebGL-related test fails in a WR update, it's usually fixed by reverting the commit that caused it, to avoid delaying the WR update. This makes the testing process difficult.
* The current implementation creates a new WebGL renderer thread (with its ipc-channel) for each canvas created in JavaScript. This can quickly exhaust the file descriptors available in the system for web pages that create a lot of canvases (e.g. Shadertoy) or for games that require many canvases. It also makes creating a canvas instance in JavaScript slower than it should be.
* WebGL commands are currently run in the WR backend thread, which could be bad for UI latency (see https://github.com/servo/webrender/issues/607).
* Currently there are two ipc-channel steps for each WebGL call to hit the driver (JS ==> WebGL Render Thread ==> WebRender). This adds a lot of overhead.
* Slow compilation times: modifying the code of a WebGL command recompiles WebRender_traits, which causes additional component recompilations in Servo.
* Flickering and synchronization problems with the WR compositor (I created this issue some time ago: https://github.com/servo/servo/issues/14235).

Goals that I want to achieve in the new WebGL architecture design:

* Make WebGL development cycle easier.
* Minimize the render path overhead for WebGL commands.
* Flexible architecture for multiprocess and inprocess scenarios.
* A special mode for packaging WebGL applications (not enabled by default, more details later).
* Faster compilation times.
* Proper synchronization to avoid flickering issues.
* Compatibility with the WebVR render path.

Now I'll go into some detail on how I tried to achieve each stated goal. You can check the "almost ready for PR" source code here: https://github.com/MortimerGoro/servo/commit/c2db42bbbae6d6cb612d0aa2580552a2ff3b0acf (I still have to add more docs, do a rebase and clean up some tests).

Make WebGL development cycle easier
-----------------------------------

This is fixed by splitting the WebGL code into its own component and removing all WebGL code from WebRender. Additional benefits for WebRender (see https://github.com/servo/webrender/issues/1353):

* Simplify the internals of the resource cache by removing the GL context management code.
* Remove the WebGL cargo feature, since not all clients use it.

I opted to move the WebGL code into a Servo component. It eases development and testability, and I don't think there are use cases for the component outside Servo. Anyway, it's a loosely coupled component, so it will be easy to move if required, but I don't think it's worth it.

The connection between a WebGL canvas and WR is implemented via the WR ExternalImage APIs, as Glennw recommended. When the new architecture lands in Servo, I'll make a separate PR that removes all the unneeded WebGL management code in WR.

Minimize the render path overhead for WebGL commands
----------------------------------------------------

The new architecture creates a single thread/process to manage all WebGL commands from multiple canvas sources.
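
To illustrate the single-thread idea, here is a minimal sketch (it is not the code in the linked commit). One thread owns the state of every context and receives commands tagged with a context id over a std mpsc channel. The names ContextId, Command, ContextStats, Msg and spawn_webgl_thread are made up for this example, and the GL calls are replaced by simple counters.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Sender};
use std::thread;

/// Hypothetical identifier of a canvas rendering context.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct ContextId(pub usize);

/// Hypothetical, heavily reduced command set.
#[derive(Debug)]
pub enum Command {
    Clear,
    DrawArrays { count: u32 },
}

/// Stand-in for real GL state: it only counts what was executed.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct ContextStats {
    pub commands: usize,
    pub vertices: u32,
}

/// Messages handled by the single WebGL thread.
pub enum Msg {
    /// Creates a context and replies with its id.
    CreateContext(Sender<ContextId>),
    /// Runs a command on an existing context.
    Command(ContextId, Command),
    /// Stops the thread and replies with the per-context stats.
    Exit(Sender<HashMap<ContextId, ContextStats>>),
}

/// Spawns the one thread that serves every canvas.
pub fn spawn_webgl_thread() -> Sender<Msg> {
    let (sender, receiver) = channel::<Msg>();
    thread::spawn(move || {
        let mut contexts: HashMap<ContextId, ContextStats> = HashMap::new();
        for msg in receiver {
            match msg {
                Msg::CreateContext(reply) => {
                    // No new thread or channel per canvas, just a map entry.
                    let id = ContextId(contexts.len());
                    contexts.insert(id, ContextStats::default());
                    let _ = reply.send(id);
                }
                Msg::Command(id, command) => {
                    // A real renderer would make the context current and
                    // issue the GL call here.
                    if let Some(stats) = contexts.get_mut(&id) {
                        stats.commands += 1;
                        if let Command::DrawArrays { count } = command {
                            stats.vertices += count;
                        }
                    }
                }
                Msg::Exit(reply) => {
                    let _ = reply.send(contexts);
                    return;
                }
            }
        }
    });
    sender
}

/// Two canvases sending commands through the same channel.
pub fn demo() -> HashMap<ContextId, ContextStats> {
    let webgl = spawn_webgl_thread();
    let (reply_tx, reply_rx) = channel();
    webgl.send(Msg::CreateContext(reply_tx.clone())).unwrap();
    let a = reply_rx.recv().unwrap();
    webgl.send(Msg::CreateContext(reply_tx)).unwrap();
    let b = reply_rx.recv().unwrap();

    webgl.send(Msg::Command(a, Command::Clear)).unwrap();
    webgl.send(Msg::Command(a, Command::DrawArrays { count: 3 })).unwrap();
    webgl.send(Msg::Command(b, Command::DrawArrays { count: 6 })).unwrap();

    let (stats_tx, stats_rx) = channel();
    webgl.send(Msg::Exit(stats_tx)).unwrap();
    stats_rx.recv().unwrap()
}

fn main() {
    for (id, stats) in demo() {
        println!("{:?}: {:?}", id, stats);
    }
}
```

In this sketch, creating a context costs one message and one map entry, and every command crosses exactly one channel before it would reach the driver.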
Using a single WebGL thread/process reduces the footprint of creating a new canvas (no dedicated thread and ipc-channel per canvas) and shortens the render path of each WebGL command to a single "channel step".

Flexible architecture for multiprocess and inprocess scenarios
--------------------------------------------------------------

One of the things I don't like about the current implementation is that many parts of the source code use an IpcSender<WebGLCommand> instance as the entry point to send WebGL commands from the Script Thread to the renderer. IMO this is not very flexible because we may be interested in batching WebGLCommands, using shared memory, or other channel approaches to improve performance. A WebGL demo can easily send e.g. 200 channel messages per frame (doubled in VR mode) and we need to achieve steady framerates (e.g. 90 fps on desktop VR).

I decided to create custom/opaque types for all WebGL API traits: WebGLSender, WebGLReceiver, WebGLChan, etc., instead of using fixed IpcSender<T> types. These types are defined at compile time with no runtime overhead (e.g. we can easily do pub type WebGLSender<T> = IpcSender<T>).

Additionally, I tried to make the threading implementation flexible using some templating in the WebGLThread struct implementation, which uses types that can be changed at compile time based on a cargo feature (or at runtime based on a preference, just by adding an enum wrapper). The thing I like best is that we can easily change the threading and channel models by modifying only one or two small files, while the rest of the WebGL code stays untouched.

I created 2 WebGL threading/channel models based on this idea (some interesting performance benchmarks at the end of the post!):

* Multiprocess. This model creates a unique process/thread to handle all WebGL commands from all canvases from all script threads. It uses ipc-channels for communication.
  Servo is not ready yet for a real multiprocess mode, and we'll need to add a lot of platform-dependent code in WebRender and rust-offscreen-context (e.g. IOSurface/GLXPixmap/etc.). But I wanted to add this model now so we can start testing or taking some steps towards a real multiprocess mode.
* Inprocess. This mode lazily creates a WebGL renderer thread for each Servo ScriptThread/tab and uses pure std mpsc channels. Using mpsc channels has a big impact on performance because the channel is faster and it avoids serialization overhead (serialization still happens when enabling the force-inprocess mode in the servo/ipc-channel crate). The threading model could easily be changed to create a single thread for all ScriptThreads/tabs. Using different threads per tab was a bit more difficult to implement (and to sync with the WR ExternalImage callbacks), but I opted for it because:
  -- This way only two threads are involved at the same time in sending WebGL commands from script to the renderer. This allows using a spsc channel instead of a mpsc channel, which allows a faster synchronization algorithm. AFAIK a Rust std mpsc channel uses a faster spsc algorithm until it is cloned. We could add other kinds of ad-hoc spsc channel implementations for the fastest performance (have you used, or do you recommend, any of them?).
  -- Having different threads per tab will not exhaust the CPU because, per spec, requestAnimationFrame only runs in the currently active tab, so the other threads will be paused.

A special mode for packaging WebGL applications
-----------------------------------------------

In addition to the multiprocess/inprocess modes, I also added a "packaging mode" to improve performance in this scenario.

Disclaimer: before joining Mozilla I was a core developer at Cocoon.io. We created a custom "browser" from scratch called canvas+ (using standalone V8 and JavaScriptCore VM engines) to run WebGL/Canvas2D games.
We were able to achieve better WebGL/Canvas performance than the Chromium/Safari webviews available on Android/iOS. Some of the optimizations we did are not allowed by the spec or by the security rules of a multiparadigm browser that loads arbitrary content. There are other engines/companies that opted for this approach too instead of using multiparadigm webviews (e.g. Impact Engine, Chukong and cocos2dx-html, etc.).

My idea is to use this special mode (really a cargo feature) in order to include some of the optimizations that this kind of standalone VM engine does. It will only be enabled for packaging trusted source code (e.g. Android APKs, Electron, and more). When you package tested and trusted source code you don't really need a lot of the validations, error checking or security rules that the spec enforces or that a full browser requires. Some examples of forbidden things that could be allowed in this mode:

* Immediate WebGL calls to a GL context living in the Script thread. I did some benchmarks with this enabled: sometimes running the GL call is faster than sending it. I'd still like to focus on a thread-based render path, though, in order to parallelize the JS code better.
* Rendering a full-screen WebGL scene to the main window context (i.e. bypassing the whole compositor).
* Disabling some validations and error checking enforced by the spec.

Faster compilation times
------------------------

WebGL code is now split into the canvas_traits and canvas components. Changing the DOM code or the traits is still costly to compile, but now we can change the implementation of the WebGL commands, or do things like adding logs for each WebGL call (which I tend to use for testing), with much shorter compilation times (no need to recompile WebRender_traits anymore!).

Proper synchronization to avoid flickering issues
-------------------------------------------------

Flickering issues have improved a lot in the three.js demos I have tested.
The synchronization is done using the ExternalImageHandler lock and unlock calls that Glennw recommended, combined with OpenGL Sync Objects. There are still some flickering issues when moving the mouse on Linux. I think that's probably a bug in Servo: mouse moves seem to trigger extra composites that aren't synced with the Servo animation loop. IMO this could be fixed when we finish the WIP Glutin unfork (https://github.com/servo/servo/pull/17311).

Benchmarks
----------

Here are the benchmark results for the old WebGL implementation, the three modes in the new architecture (multiprocess, inprocess and packaging) and Chrome/Firefox. I ran the benchmarks on a Linux desktop (Ubuntu 16.04, GTX 1060, Skylake i7 4GHz). I used the Three.js performance test with a fixed camera, and with 5K and 10K 3D objects: https://threejs.org/examples/webgl_performance.html

Browser                     Average FPS    Average frame time
Servo old WebGL 5K          26 fps         37.31 ms
Servo old WebGL 10K         13 fps         71.4 ms
Servo "multiprocess" 5K     43 fps         22.61 ms
Servo "multiprocess" 10K    21 fps         46.27 ms
Servo "inprocess" 5K        56-60 fps      17.13 ms
Servo "inprocess" 10K       28-30 fps      33.5 ms
Servo "packaging" 5K        60 fps         15.83 ms
Servo "packaging" 10K       32 fps         30.92 ms
Firefox 5K                  42-45 fps      16.98 ms
Firefox 10K                 25 fps         34.01 ms
Chrome 5K                   57-60 fps      11.5 ms
Chrome 10K                  42-46 fps      22.5 ms

Future Steps
------------

There are other ideas that I wanted to implement, but I'll do that in later PRs (contributions in these areas are welcome too!):

* WebGL double buffering: it may be a good idea for improving performance and synchronization (e.g. ping-pong the WebGL render textures each frame, so WR uses one texture to composite the main context while WebGL is rendering the next frame to a different texture). This may add some memory overhead. We can implement some texture pooling (I think that Firefox does that).
* Test/benchmark different spsc channels, message batching or shared memory in order to improve frame times. We want to be the fastest ones in the benchmarks!
* WebGL 2.0 branch (my idea is to create the layer that reuses some code from the WebGL 1.0 implementation and open an issue with a list of TODOs for WebGL 2.0, so contributors can start implementing features step by step).

I have some additional questions about Servo that I think may affect the benchmark results:

* How does the SpiderMonkey version used in Servo compare to V8?
* Is the cost of a JS-to-Rust call measured for normal functions and for functions with many arguments? Is it compared to the bindings implemented in Firefox/Chrome? When I worked on standalone VM engines, the JS/C++ binding code had a lot of impact.

Long post... Thanks for reading! Looking forward to hearing your feedback about the architecture and benchmarks ;)

Cheers,
Imanol Fernandez
_______________________________________________
dev-servo mailing list
dev-servo@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-servo