On 09/17/2013 05:13 AM, Rogovin, Kevin wrote:
> Hello,
>
> Thank you for the very fast answers, some more questions:
>
>
>> It's not a preference question. The registers are 8 floats wide.
>> Vertex shaders get invoked 2 vertices at a time, with a register containing
>> these values:
>>
>> . +------+------+------+------+------+------+------+------+
>> . | v0.x | v0.y | v0.z | v0.w | v1.x | v1.y | v1.z | v1.w |
>> . +------+------+------+------+------+------+------+------+
>
> This seems best to me: run two vertices in each invocation in the hope that
> the shader compiler will merge (multiple) float, vec2 and maybe even vec3
> operations into vec4 operations. Does it?
Not as well as it should. There's a lot of room for improvement in our
SIMD4x2/vector backend. We haven't spent a ton of effort optimizing it
since vertex shaders have rarely been the bottleneck in application
performance.
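
To make the cost of unmerged operations concrete, here is a rough toy model
of the SIMD4x2 layout from the diagram above. It is only a sketch: the helper
names (pack_simd4x2, mul_simd4x2, lanes_used) are made up for illustration
and are not anything in the driver.

```python
# Toy model of one SIMD4x2 register: 8 float lanes holding the vec4s of
# two vertices side by side (v0.xyzw in lanes 0-3, v1.xyzw in lanes 4-7).
# All names here are hypothetical, chosen only for this illustration.

def pack_simd4x2(v0, v1):
    """Place two 4-component vectors into one 8-lane register."""
    assert len(v0) == 4 and len(v1) == 4
    return list(v0) + list(v1)

def mul_simd4x2(a, b):
    """One 8-wide MUL multiplies both vertices' vec4s at once."""
    return [x * y for x, y in zip(a, b)]

def lanes_used(components):
    """Lanes doing useful work for an op on a float (1), vec2, vec3 or vec4."""
    return 2 * components

reg_a = pack_simd4x2([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0])
reg_b = pack_simd4x2([2.0, 2.0, 2.0, 2.0], [0.5, 0.5, 0.5, 0.5])
product = mul_simd4x2(reg_a, reg_b)
```

In this model a lone float operation keeps only 2 of the 8 lanes busy, which
is why merging scalar and vec2 operations into vec4 operations would help.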
>> while these 8 pixels in screen space:
>>
>> . +----+----+----+----+
>> . | p0 | p1 | p2 | p3 |
>> . +----+----+----+----+
>> . | p4 | p5 | p6 | p7 |
>> . +----+----+----+----+
>>
>> are loaded in fragment shader registers as:
>>
>> . +------+------+------+------+------+------+------+------+
>> . | p0.x | p1.x | p4.x | p5.x | p2.x | p3.x | p6.x | p7.x |
>> . +------+------+------+------+------+------+------+------+
>>
>> Note how one register just holds a single channel ('.x' here) of a vector.
>> A vec4 would take up 4 registers, and to do value0.xyzw * value1.xyzw, you'd
>> emit 4 MULs.
>
> This is exactly what I was trying to ask/say about how the fragment shader
> runs, i.e. n fragments are processed with one n-wide SIMD instruction (for
> i965, n=8). Sigh, my e-mail communications leave something to be desired.
> Some questions:
> 1) do the fragments need to be in a 4x2 block, or can it be two separate 2x2
> blocks?
The GPU processes two separate 2x2 blocks of pixels, which may actually
not be anywhere near each other.
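
As a rough sketch of that layout (a toy model only; subspan_pixels,
simd8_pixels, load_channel and vec4_mul are hypothetical names, not driver
functions), lanes 0-3 hold one 2x2 block and lanes 4-7 hold the other, each
register holds a single channel, and a vec4 multiply becomes four 8-wide MULs:

```python
# Toy model of SIMD8 fragment dispatch. Names are hypothetical.

def subspan_pixels(x, y):
    """Pixels of the 2x2 block whose top-left corner is (x, y)."""
    return [(x, y), (x + 1, y), (x, y + 1), (x + 1, y + 1)]

def simd8_pixels(origin0, origin1):
    """Lane order: first 2x2 block in lanes 0-3, second block in lanes 4-7."""
    return subspan_pixels(*origin0) + subspan_pixels(*origin1)

def load_channel(value_at, pixels, channel):
    """One register = one channel of a vector, for all 8 pixels."""
    return [value_at(px)[channel] for px in pixels]

def vec4_mul(a_regs, b_regs):
    """value0.xyzw * value1.xyzw: one 8-wide MUL per channel, 4 in total."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(a_regs, b_regs)]

def color(px):
    # Made-up per-pixel input: (x, y, 0.5, 1.0)
    return (float(px[0]), float(px[1]), 0.5, 1.0)

# The two blocks need not be neighbours on screen.
pixels = simd8_pixels((0, 0), (10, 6))
color_regs = [load_channel(color, pixels, c) for c in range(4)]
scale_regs = [[2.0] * 8 for _ in range(4)]
result_regs = vec4_mul(color_regs, scale_regs)
```

With two adjacent blocks, simd8_pixels((0, 0), (2, 0)) gives the same lane
order as the p0 p1 p4 p5 p2 p3 p6 p7 diagram above.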
> 2) for tiny triangles for fragment shaders that do not require dFdx, dFdy or
> fwidth, can the fragments be totally scattered?
Nope, the pixel shader always works on 2x2 blocks.
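
For context, 2x2 blocks are what make dFdx/dFdy cheap: a common approach is
to take differences between neighbouring pixels of the block. The sketch
below shows that general idea only; it is an assumption for illustration, not
a description of what the i965 hardware or driver actually does, and
quad_derivatives is a made-up name.

```python
# Coarse finite-difference derivatives over one 2x2 block.
# Illustrative assumption only; not taken from the i965 driver.

def quad_derivatives(tl, tr, bl, br):
    """Return (dFdx, dFdy) from the values at the four pixels of a 2x2 block.

    tl/tr/bl/br are the values at top-left, top-right, bottom-left and
    bottom-right. This coarse variant does not use br.
    """
    dfdx = tr - tl  # horizontal neighbour difference
    dfdy = bl - tl  # vertical neighbour difference
    return dfdx, dfdy

def u(x, y):
    # Made-up texture coordinate that varies linearly across the screen.
    return 0.25 * x + 0.5 * y

# 2x2 block with top-left corner at pixel (4, 2)
dudx, dudy = quad_derivatives(u(4, 2), u(5, 2), u(4, 3), u(5, 3))
```

A lone scattered fragment would have no neighbours to difference against.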
> Along similar lines, for non-dependent texture lookups, are there code paths
> where the derivatives are computed analytically, so that selecting the
> correct LOD does not require processing fragments in 2x2 (or larger) blocks?
> Or does the i965 hardware sampler interface not allow this kind of madness?
>
>>> On a related note, where are the beans about the dispatch table?
>> I don't know this one (or particularly what you're asking, I guess).
>
> Viewing docs/index.html, on the side panel "Developer Topics --> GL
> Dispatch" there is text (broken into sections "1. Complexity of GL
> Dispatch", "2. Overview of Mesa's Implementation" and "3. Optimizations")
> describing how different GL contexts for the same hardware can do
> different things for the same GL function and that Mesa has stubs which
> in turn call the "real" function. The documents go on to talk about
> various ways the function tables are filled and accessed across separate
> threads. My questions are:
> 0) is that text still accurate? In particular, the directory src/glapi is
> gone from Mesa (at least from what I git cloned) and I thought that was
> where it lived.
> 1) where/how does the i965 driver fill that table, if it exists?
>
> Along similar lines, I see that some of the code in src/mesa/main performs
> various checks on various API calls and at times has conditions that depend
> on the context type, which kind of contradicts the idea of different
> contexts having different dispatch tables [sort of, since the functions
> might just be the driver magick, whereas the stub validates and then calls
> the driver magick].
>
> -Kevin
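
For readers following along, the mechanism described in that documentation
(per-context function tables, with stubs that forward to the "real"
function for whichever context is current on the calling thread) can be
sketched roughly as below. This is a toy model under stated assumptions:
Context, make_current, enable_stub and the table contents are all
hypothetical names, and it says nothing about how Mesa or i965 fills its
tables today.

```python
import threading

# Each thread has its own notion of the current context.
_current = threading.local()

class Context:
    """Hypothetical GL context carrying its own dispatch table."""
    def __init__(self, name, dispatch):
        self.name = name
        self.dispatch = dispatch  # maps entry-point name -> real function

def make_current(ctx):
    """Bind a context to the calling thread."""
    _current.ctx = ctx

def enable_stub(cap):
    """Public entry point: look up the current context's table and forward."""
    return _current.ctx.dispatch["Enable"](cap)

# Two contexts on the same hardware can route the same call differently.
ctx_a = Context("a", {"Enable": lambda cap: "a enables " + cap})
ctx_b = Context("b", {"Enable": lambda cap: "b validates then enables " + cap})

make_current(ctx_a)
result_a = enable_stub("blend")
make_current(ctx_b)
result_b = enable_stub("blend")
```

The caller always goes through the same stub; only the table behind it
changes with the context.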
_______________________________________________
mesa-dev mailing list
http://lists.freedesktop.org/mailman/listinfo/mesa-dev