Re: [Pixman] [PATCH 2/2] sse2, mmx: Remove initial unaligned loops in fetchers

Bill Spitzak Wed, 04 Sep 2013 12:37:03 -0700

Søren Sandmann wrote:

Here is another proposal, but I'm not sure it's really better:


- The combiners are made to return a buffer. The returned buffer is
  expected to contain the combined result and may be any of the passed
  src/mask/dest buffers. Almost all combiners will continue to combine
  into the dest buffer, which they will also return. But the SRC
  combiner will simply return the source buffer without any memcpy()ing.

This sounds exactly like how Nuke (compositing software I wrote that is used in special effects) works, which is doing compositing of often hundreds of steps on the CPU quite fast.

Work is done per scanline. The final destination first allocates or locates a buffer to write the scanline to. It then calls the last step of the compositing, passing it the buffer. This last step then returns a pointer to the resulting scanline. This may be in the passed buffer, or other memory (typically a pointer to a source buffer). As you point out, if the final step wants to put the result in the buffer it must do a memcpy if the returned pointer is not to the buffer:


  float* buffer = framebuf + y*w;
  float* result = final_step(buffer, w);
  if (result != buffer) memcpy(buffer, result, w * sizeof(*buffer));

The compositing step recursively calls input steps. In most cases the output buffer it got is passed to the input, so the final step does not need to allocate any temporary buffer. The input may write its result to the output buffer, and then the final step replaces it with its own calculation. In most cases this requires no actual thought and works perfectly. Here is a pseudo code version of a step that adds 1 to every pixel:


    float* add1(float* buffer, int w) {
      float* source = input_step(buffer, w);
      for (int i = 0; i < w; i++)
        buffer[i] = source[i] + 1;
      return buffer;
    }

Note that a no-op in the middle does not need to do a memcpy. Instead it just returns the buffer returned from the input:


    float* noop(float* buffer, int w) {
      return input_step(buffer, w);
    }
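
To make the pieces above concrete, here is a small self-contained sketch of a whole chain: a source step that hands back its own memory, the no-op, the add-1 step, and the final destination doing the conditional memcpy. This is not Nuke or pixman code. The names source_row, read_source, step_fn and render are made up for illustration, the chain is hard-wired, and the source row is only 4 pixels wide. Returning a const pointer is my assumption too: it just records that the caller must not write through a result that may point into source memory.

```c
#include <string.h>

/* Hypothetical source image row (assumption: 4 pixels wide, so w <= 4). */
static const float source_row[4] = {0.0f, 1.0f, 2.0f, 3.0f};

/* A step gets a scratch/output buffer and returns where the result lives. */
typedef const float *(*step_fn)(float *buffer, int w);

/* Source step: ignores the buffer and returns its own memory, no copy. */
static const float *read_source(float *buffer, int w) {
    (void)buffer;
    (void)w;
    return source_row;
}

/* No-op step: forwards whatever pointer its input returned. */
static const float *noop(float *buffer, int w) {
    return read_source(buffer, w);
}

/* Add-1 step: reads from wherever the input put its result and writes
   into the buffer it was given, which it then returns. */
static const float *add1(float *buffer, int w) {
    const float *source = noop(buffer, w);
    for (int i = 0; i < w; i++)
        buffer[i] = source[i] + 1.0f;
    return buffer;
}

/* Final destination: copy only if the result is not already in place. */
static void render(step_fn final_step, float *dest, int w) {
    const float *result = final_step(dest, w);
    if (result != dest)
        memcpy(dest, result, (size_t)w * sizeof *dest);
}
```

Calling render(add1, dest, 4) does no memcpy at all, because add1 wrote straight into dest. Calling render(noop, dest, 4) does exactly one, at the very end of the chain.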

Sometimes a step needs to use a second buffer. These are allocated on the stack using alloca (except on Windows, sigh...). The most common reason is that a step merges the output of two or more input steps:


    float* add_two_inputs(float* buffer, int w) {
      float* a = input_a(buffer, w);
      float temp[w];
      float* b = input_b(temp, w);
      for (int i = 0; i < w; i++)
        buffer[i] = a[i] + b[i];
      return buffer;
    }
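
Where neither alloca nor a variable-length array is available, one possible substitute is a fixed-size stack array with a heap fallback for unusually wide scanlines. The sketch below is only that: a sketch under my own assumptions, not what Nuke does. STACK_PIXELS is an arbitrary threshold, input_a and input_b are stand-in inputs that fill the buffer they are given, and returning NULL on allocation failure is a convention invented for this example.

```c
#include <stdlib.h>

#define STACK_PIXELS 1024 /* hypothetical threshold, tune for real use */

/* Stand-in input: writes pixel i = i into the buffer it was given. */
static const float *input_a(float *buffer, int w) {
    for (int i = 0; i < w; i++)
        buffer[i] = (float)i;
    return buffer;
}

/* Stand-in input: writes pixel i = 10 * i into the buffer it was given. */
static const float *input_b(float *buffer, int w) {
    for (int i = 0; i < w; i++)
        buffer[i] = 10.0f * (float)i;
    return buffer;
}

/* Merge two inputs. The first input reuses the output buffer; the second
   needs scratch space, taken from the stack when the scanline is short
   and from the heap otherwise. */
static const float *add_two_inputs(float *buffer, int w) {
    float stack_temp[STACK_PIXELS];
    float *temp = stack_temp;
    if (w > STACK_PIXELS) {
        temp = malloc((size_t)w * sizeof *temp);
        if (temp == NULL)
            return NULL; /* assumed error convention for this sketch */
    }
    const float *a = input_a(buffer, w);
    const float *b = input_b(temp, w);
    for (int i = 0; i < w; i++)
        buffer[i] = a[i] + b[i];
    if (temp != stack_temp)
        free(temp);
    return buffer;
}
```

The common case still costs no allocation, which is the point of the scheme. Only scanlines wider than the threshold pay for a malloc/free pair.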

I think some steps that you might expect to need a separate buffer can reuse the one they were given:

    int* convert_565_to_888(int* buffer, int w) {
      short* a = input((short*)buffer, w);
      for (int i = w; i--; ) // note it must go backwards!
        buffer[i] = pixel_565_to_888(a[i]);
      return buffer;
    }
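
Why the loop must go backwards: output pixel i occupies the bytes that held input pixels 2i and 2i+1, so walking from the end never overwrites a 16-bit pixel that has not been read yet. Here is a runnable sketch of just the in-place widening. widen_in_place is a made-up name, and this pixel_565_to_888 (bit replication into x8r8g8b8) is my own assumed version, not pixman's. The 16-bit pixels are read with memcpy so the example does not depend on pointer-aliasing behaviour.

```c
#include <stdint.h>
#include <string.h>

/* Assumed conversion: expand r5g6b5 to x8r8g8b8 by replicating the
   high bits of each channel into its low bits. */
static uint32_t pixel_565_to_888(uint16_t p) {
    uint32_t r = (p >> 11) & 0x1f;
    uint32_t g = (p >> 5) & 0x3f;
    uint32_t b = p & 0x1f;
    r = (r << 3) | (r >> 2);
    g = (g << 2) | (g >> 4);
    b = (b << 3) | (b >> 2);
    return (r << 16) | (g << 8) | b;
}

/* On entry, buffer holds w 16-bit pixels packed at its start and has
   room for w 32-bit pixels. On return it holds the w widened pixels. */
static void widen_in_place(uint32_t *buffer, int w) {
    const unsigned char *bytes = (const unsigned char *)buffer;
    for (int i = w; i--; ) { /* backwards, so unread input survives */
        uint16_t p;
        memcpy(&p, bytes + (size_t)i * sizeof p, sizeof p);
        buffer[i] = pixel_565_to_888(p);
    }
}
```

A forward loop would break at the first step for any w > 1: writing buffer[0] overwrites input pixel 1 before it has been read.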

In any case this scheme greatly reduces the amount of copying, and (probably more important) the amount of memory allocation/free being done. I would recommend it for cairo.

