On 31.12.23 at 13:56, Tomas Härdin wrote:
>> +    for (int y = 0; y < height; y++) {
>> +        const uint8_t *src1 = src1_data[0] + y * src1_linesize[0];
>> +        const uint8_t *src2 = src2_data[0] + (y + pos_y) * src2_linesize[0] + pos_x * src2_step[0];
>> +        uint8_t *dest = dest_data[0] + (y + pos_y) * dest_linesize[0] + pos_x * sizeof(uint32_t);
>> +        for (int x = 0; x < width; x++) {
>> +            int src1_alpha = src1[0];
>> +            int src2_alpha = src2[0];
>> +
>> +            if (src1_alpha == 255) {
>> +                memcpy(dest, src1, sizeof(uint32_t));
>> +            } else if (src1_alpha + src2_alpha == 0) {
>> +                memset(dest, 0, sizeof(uint32_t));
>> +            } else {
>> +                int tmp_alpha = src2_alpha - ROUNDED_DIV(src1_alpha * src2_alpha, 255);
>> +                int blend_alpha = src1_alpha + tmp_alpha;
>> +
>> +                dest[0] = blend_alpha;
>> +                dest[1] = ROUNDED_DIV(src1[1] * src1_alpha + src2[1] * tmp_alpha, blend_alpha);
>> +                dest[2] = ROUNDED_DIV(src1[2] * src1_alpha + src2[2] * tmp_alpha, blend_alpha);
>> +                dest[3] = ROUNDED_DIV(src1[3] * src1_alpha + src2[3] * tmp_alpha, blend_alpha);
>> +            }
>
> Is branching and a bunch of function calls (which I hope get optimized
> out) really faster than just always doing the blending?

If I trust my START_TIMER/STOP_TIMER interpretation, I'd say so:

With branches:
253315 UNITS in blend_alpha_yuva, 128 runs, 0 skips

Always blending:
351104 UNITS in blend_alpha_yuva, 128 runs, 0 skips
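
For readers following along, the per-pixel logic from the quoted hunk can be sketched as a standalone function. This is only an illustration, not the patch itself: `blend_pixel_argb` is a hypothetical name, and the local `ROUNDED_DIV` is a stand-in that matches FFmpeg's macro only for non-negative operands. The sketch assumes packed ARGB with the alpha value in byte 0, as in the hunk.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local stand-in for FFmpeg's ROUNDED_DIV; valid for non-negative a, b > 0. */
#define ROUNDED_DIV(a, b) (((a) + ((b) >> 1)) / (b))

/* Hypothetical helper: blend one ARGB pixel of src1 over src2 into dest.
 * Byte 0 is alpha, bytes 1..3 are the colour components. */
static void blend_pixel_argb(uint8_t *dest, const uint8_t *src1,
                             const uint8_t *src2)
{
    int src1_alpha = src1[0];
    int src2_alpha = src2[0];

    if (src1_alpha == 255) {
        /* Fast path: fully opaque foreground replaces the background. */
        memcpy(dest, src1, sizeof(uint32_t));
    } else if (src1_alpha + src2_alpha == 0) {
        /* Fast path: both transparent. This also avoids dividing by zero. */
        memset(dest, 0, sizeof(uint32_t));
    } else {
        /* Share of the background that remains visible. */
        int tmp_alpha   = src2_alpha - ROUNDED_DIV(src1_alpha * src2_alpha, 255);
        int blend_alpha = src1_alpha + tmp_alpha;

        assert(blend_alpha > 0);
        dest[0] = (uint8_t)blend_alpha;
        for (int c = 1; c < 4; c++)
            dest[c] = (uint8_t)ROUNDED_DIV(src1[c] * src1_alpha +
                                           src2[c] * tmp_alpha, blend_alpha);
    }
}
```

Note that an "always blending" variant would still need some guard for the case where both alpha values are 0, because `blend_alpha` is then 0 and the final divisions are undefined.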

> It might also be worthwhile to check 8 bytes at a time against
> UINT64_MAX and 0. That doesn't need to hold up this patch though.
> Same with the YUVA version.
>
>> +static int blend_frame_into_canvas(WebPContext *s)
>> +{
>> +    AVFrame *canvas = s->canvas_frame.f;
>> +    AVFrame *frame = s->frame;
>> +    int width, height;
>> +    int pos_x, pos_y;
>> +
>> +    if ((s->anmf_flags & ANMF_BLENDING_METHOD) == ANMF_BLENDING_METHOD_OVERWRITE
>> +        || frame->format == AV_PIX_FMT_YUV420P) {
>> +        // do not blend, overwrite
>> +
>> +        if (canvas->format == AV_PIX_FMT_ARGB) {
>> +            width = s->width;
>> +            height = s->height;
>> +            pos_x = s->pos_x;
>> +            pos_y = s->pos_y;
>> +
>> +            for (int y = 0; y < height; y++) {
>> +                const uint32_t *src = (uint32_t *) (frame->data[0] + y * frame->linesize[0]);
>> +                uint32_t *dst = (uint32_t *) (canvas->data[0] + (y + pos_y) * canvas->linesize[0]) + pos_x;
>> +                memcpy(dst, src, width * sizeof(uint32_t));
>> +            }
>
> This could be reduced to a single memcpy() when linesizes are equal.
> Same for the other memcpy()s

It's a subimage copied into a canvas (see pos_x and pos_y). It has to be
copied line by line. Same for the other loops.

-Thilo
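
To illustrate why the copy is done per row, here is a minimal sketch under stated assumptions: `copy_subimage` is a hypothetical helper, pixels are 32 bits wide, and linesizes are given in bytes. Each source row is contiguous, but consecutive destination rows are separated by the canvas linesize and shifted by `pos_x`, so the rows of the subimage do not form one contiguous block inside the canvas unless it spans full canvas rows.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy a width x height block of 32-bit pixels from
 * frame into canvas at pixel offset (pos_x, pos_y). Linesizes are in bytes.
 * The caller must ensure the block fits inside the canvas. */
static void copy_subimage(uint8_t *canvas, ptrdiff_t canvas_linesize,
                          const uint8_t *frame, ptrdiff_t frame_linesize,
                          int width, int height, int pos_x, int pos_y)
{
    for (int y = 0; y < height; y++) {
        const uint8_t *src = frame + y * frame_linesize;
        uint8_t *dst = canvas + (y + pos_y) * canvas_linesize
                              + (size_t)pos_x * sizeof(uint32_t);

        /* Only width pixels per row belong to the subimage; the rest of
         * the canvas row must stay untouched. */
        memcpy(dst, src, (size_t)width * sizeof(uint32_t));
    }
}
```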

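
The suggestion to check 8 bytes at a time against UINT64_MAX and 0 could look roughly like the sketch below. It is an assumption about how such a check might be written, not code from the patch: `classify_alpha8` and the `alpha_run` enum are hypothetical names, and the sketch assumes a planar alpha plane (as in the YUVA case) where 8 consecutive bytes are 8 alpha values. For packed ARGB, 8 bytes cover two whole pixels including colour bytes, so the comparison would need a mask first.

```c
#include <stdint.h>
#include <string.h>

enum alpha_run { ALPHA_MIXED, ALPHA_ALL_OPAQUE, ALPHA_ALL_TRANSPARENT };

/* Hypothetical helper: classify 8 consecutive alpha bytes at once.
 * The caller must guarantee that 8 bytes are readable at alpha. */
static enum alpha_run classify_alpha8(const uint8_t *alpha)
{
    uint64_t v;

    /* memcpy gives a well-defined load regardless of alignment. */
    memcpy(&v, alpha, sizeof(v));

    if (v == UINT64_MAX)
        return ALPHA_ALL_OPAQUE;      /* all 8 values are 255 */
    if (v == 0)
        return ALPHA_ALL_TRANSPARENT; /* all 8 values are 0 */
    return ALPHA_MIXED;               /* fall back to per-pixel handling */
}
```

Because both comparisons test for all bytes being equal, the result does not depend on the byte order of the host.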