Re: [FFmpeg-devel] [PATCH] Boost FPS and performance: Optimize vertical loop for cache-friendly access [libavcodec/jpeg2000dwt.c:dwt_decode97_float]

Michael Niedermayer Fri, 16 May 2025 02:43:51 -0700

On Thu, May 15, 2025 at 10:19:57PM +0200, Michael Niedermayer wrote:
> Hi
> 
> On Wed, May 14, 2025 at 06:40:03PM +0200, Michael Niedermayer wrote:
> > Hi Chitra
> > 
> > On Wed, May 14, 2025 at 03:55:59AM +0000, Chitra Dey Sarkar via 
> > ffmpeg-devel wrote:
> > > Original Implementation:
> > > ---------------------------------
> > > In the original implementation, the "VER_SD" section processes image data 
> > > stored in *data using strided memory access in a vertical fashion This 
> > > leads to inefficient memory access patterns and cache thrashing due to 
> > > non-sequential data access across multiple inner loops.
> > > 
> > > Proposed Refactor:
> > > ---------------------------------
> > > The proposed refactor replaces this  by allocating a cache-friendly 2D 
> > > array buffer. This change eliminates strided memory access across the 
> > > three inner loops, significantly improving cache locality and reducing 
> > > cache thrashing.
> > > 
> > > Additionally, the data is transposed outside the lp loop, which allows 
> > > for efficient per-line access and write-back to the l buffer, further 
> > > optimizing performance.
> > > 
> > > Performance improvements
> > > -------------------------------------------------------
> > > This change results in a substantial performance improvement  Sharing the 
> > > FPS data benchmarked on our end for the file 'Tears of Steel' using 
> > > HandBrake
> > > 
> > > Device / CPU Model                                    Official FPS        
> > >    Optimized FPS   % Improvement
> > > Surface Laptop 11 (10-core X1P64100, L2: 36MB)                  3.18      
> > >      6.15                      +93%
> > > Surface Laptop 11(10-core X1P64100, L2: 36MB)     5.16           7.31     
> > >                  +41%
> > > Surface Laptop 11 (10-core X1P64100, L2: 36MB)                  5.57      
> > >      9.21                      +65%
> > > AMD Ryzen + NVIDIA RTX 4060 Laptop (12C/24T)                9.97          
> > >    11.22                   +12%
> > > Mac Mini Apple M4         Chip                           9.00          
> > > 12.00                   +30%
> > > 
> > > -------------------------------------------------------------------------------------------------------
> > > ---
> > >  libavcodec/jpeg2000dwt.c | 72 +++++++++++++++++++++++++++++++---------
> > >  1 file changed, 57 insertions(+), 15 deletions(-)
> > > 
> > > diff --git a/libavcodec/jpeg2000dwt.c b/libavcodec/jpeg2000dwt.c index 
> > > 9ee8122658..45d7897893 100644
> > > --- a/libavcodec/jpeg2000dwt.c
> > > +++ b/libavcodec/jpeg2000dwt.c
> > > @@ -409,6 +409,15 @@ static void dwt_decode97_float(DWTContext *s, float 
> > > *t)
> > >      /* position at index O of line range [0-5,w+5] cf. extend function */
> > >      line += 5;
> > > 
> > 
> > > +    /* Find the largest lv and lv to allocate a 2D Array*/
> > 
> > lv and lv ?
> > you mean lv anf lh ?
> > 
> > 
> > > +    int max_dim = 0;
> > > +    for (lev = 0; lev < s->ndeclevels; lev++) {
> > > +        if (s->linelen[lev][0]  > max_dim) max_dim = s->linelen[lev][0];
> > > +        if (s->linelen[lev][1] > max_dim) max_dim = s->linelen[lev][1];
> > 
> > FFMAX()
> > 
> > 
> > > +    }
> > > +    float *array2DBlock = av_malloc(max_dim * max_dim * sizeof(float));
> > > +    int useFallback = !array2DBlock;
> > 
> > also is this supposed to be max_dim_h * max_dim_v ?
> > 
> > 
> > 
> > > +
> > >      for (lev = 0; lev < s->ndeclevels; lev++) {
> > >          int lh = s->linelen[lev][0],
> > >              lv = s->linelen[lev][1],
> > > @@ -431,23 +440,56 @@ static void dwt_decode97_float(DWTContext *s, float 
> > > *t)
> > >              for (i = 0; i < lh; i++)
> > >                  data[w * lp + i] = l[i];
> > >          }
> > > -
> > > -        // VER_SD
> > > -        l = line + mv;
> > > -        for (lp = 0; lp < lh; lp++) {
> > > -            int i, j = 0;
> > > -            // copy with interleaving
> > > -            for (i = mv; i < lv; i += 2, j++)
> > > -                l[i] = data[w * j + lp];
> > > -            for (i = 1 - mv; i < lv; i += 2, j++)
> > > -                l[i] = data[w * j + lp];
> > > -
> > 
> > > -            sr_1d97_float(line, mv, mv + lv);
> > 
> > this should be run linewise not columnwise
> > if you dont understand what i mean here, please say so and ill elaborate
> 
> For the record: (may be interresting for others, or others may have comments 
> too)
> 
> <michaelni> The 1D transform is made of 4 passes of lifting transforms, 
> currently all pixels of a column are run through the first lifting step 
> before anything runs through the 2nd. One could try to run the first through 
> the 4th lifting step after the 2nd finishes the 3rd lifting step and the 3rd 
> pixel of the column finishes the 2nd lifting step
> <michaelni> and then do this for all pixels in a row so that the whole 
> transform is finished for the whole first row before more than 5 rows or so 
> have been touched
> <michaelni> this _COULD_ be faster, but it needs to be tried to be sure
> 
> basically, there would be a area covering s small number of whole rows and
> this area would move down by 1 or 2 rows in each iteration
> above it both horizontal and vertical transforms are finished below it
> its the untouched input.
> as long as this sliding window fits in the cache this should outperform
> anything that copyies the data around
> 
> I think its important to look into this before optimizing the code for
> CPU or GPU because the structure of this is different than the transpose
> based code. In fact the horizontal transform is quite unfriendly for SIMD
> 
> so the code as is in C might be to worst possible starting point for SIMD
> it transforms the easy vertical transform with a transpose into a horizontal
> one ...


More elaboration, i think what i wrote is still unclear

If one looks at spatial_decompose97i() in libavcodec/snow_dwt.c
thats approximately what i had in mind.
Its just one page of code

[...]

-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

The difference between a dictatorship and a democracy is that every 4 years
the population together is allowed to provide 1 bit of input to the government.

signature.asc
Description: PGP signature

_______________________________________________
ffmpeg-devel mailing list
[email protected]
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
[email protected] with subject "unsubscribe".

Re: [FFmpeg-devel] [PATCH] Boost FPS and performance: Optimize vertical loop for cache-friendly access [libavcodec/jpeg2000dwt.c:dwt_decode97_float]

Reply via email to