[FFmpeg-devel] [PR] avfilter: add CUDA accelerated HDR-to-SDR tonemapping filter (PR #22360)

f1k2faeez via ffmpeg-devel Tue, 03 Mar 2026 07:23:07 -0800

PR #22360 opened by f1k2faeez
URL: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/22360
Patch URL: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/22360.patch


Add `tonemap_cuda`, a hardware-accelerated HDR-to-SDR tonemapping filter that 
runs natively on NVIDIA GPUs using CUDA, avoiding the need for OpenCL interop 
and host memory round-trips.

## Features

- Six tonemapping algorithms: none, linear, gamma, clip, reinhard, hable, 
mobius (matching `tonemap_opencl`)
- PQ (SMPTE ST.2084) and HLG (ARIB STD-B67) input transfer functions
- BT.2020-to-BT.709 gamut conversion with automatic primaries detection
- Configurable highlight desaturation
- P010 input, NV12 or P010 output
- Signal peak from HDR metadata (MaxCLL / mastering display) or manual override
- Chroma-location-aware 4:2:0 downsampling

## Implementation

Follows the established CUDA filter pattern (PTX module loading, 
`cuLaunchKernel`, proper CUDA context management). Each thread block processes 
a 2x2 luma quad with its shared chroma sample through the full pipeline: EOTF 
linearization, OOTF (HLG), primaries conversion, tonemapping with desaturation, 
delinearization, and YUV re-encoding.

Kernel uses CUDA fast-math intrinsics (`__powf`, `__expf`, `__fmaf_rn`), 
read-only texture cache (`__ldg`), and fused multiply-add for matrix operations.

## Performance

Validated with 4K60 HDR10+ content on consumer NVIDIA hardware:

| Pipeline | FPS |
|---|---|
| Tonemap only (null output) | 272 fps |
| Full decode → tonemap → h264_nvenc | 118 fps |
| Full-duration transcode (59s) | 135 fps |

## Usage Examples

```bash
# HDR to SDR with hable, all on GPU
ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i hdr_input.mkv \
  -vf "tonemap_cuda=tonemap=hable:desat=0.5:format=nv12" \
  -c:v h264_nvenc output.mp4

# With explicit BT.709 output properties
ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i hdr_input.mkv \
  -vf "tonemap_cuda=tonemap=mobius:t=bt709:m=bt709:p=bt709:range=tv" \
  -c:v hevc_nvenc output.mkv
```

Also submitted to ffmpeg-devel mailing list:
https://patchwork.ffmpeg.org/project/ffmpeg/patch/[email protected]/

Signed-off-by: Faeez Kadiri <[email protected]>


From 5b0b6a975c336a02ef9724544754dba69db68d1b Mon Sep 17 00:00:00 2001
From: Faeez Kadiri <[email protected]>
Date: Tue, 3 Mar 2026 20:15:39 +0530
Subject: [PATCH] avfilter: add CUDA accelerated HDR-to-SDR tonemapping filter

Add tonemap_cuda, a hardware-accelerated HDR-to-SDR tonemapping filter
that runs natively on NVIDIA GPUs using CUDA, avoiding the need for
OpenCL interop and host memory round-trips.

Features:
- Six tonemapping algorithms: none, linear, gamma, clip, reinhard,
  hable, mobius (matching tonemap_opencl)
- PQ (SMPTE ST.2084) and HLG (ARIB STD-B67) input transfer functions
- BT.2020-to-BT.709 gamut conversion with automatic primaries detection
- Configurable highlight desaturation
- P010 input, NV12 or P010 output
- Signal peak from HDR metadata (MaxCLL / mastering display) or manual
  override
- Chroma-location-aware 4:2:0 downsampling

The implementation follows the established CUDA filter pattern using PTX
modules for kernel execution and proper CUDA context management. Each
thread block processes a 2x2 luma quad with its shared chroma sample
through the full pipeline: EOTF linearization, OOTF (HLG), primaries
conversion, tonemapping with desaturation, delinearization, and YUV
re-encoding.

Performance validated at 272 fps for 4K60 HDR tonemapping (null output)
and 118 fps for full decode-tonemap-encode pipeline with h264_nvenc on
consumer NVIDIA hardware.

Signed-off-by: Faeez Kadiri <[email protected]>
---
 Changelog                      |   1 +
 configure                      |   2 +
 doc/filters.texi               |  67 ++++
 libavfilter/Makefile           |   1 +
 libavfilter/allfilters.c       |   1 +
 libavfilter/version.h          |   2 +-
 libavfilter/vf_tonemap_cuda.c  | 641 +++++++++++++++++++++++++++++++++
 libavfilter/vf_tonemap_cuda.cu | 420 +++++++++++++++++++++
 libavfilter/vf_tonemap_cuda.h  |  92 +++++
 9 files changed, 1226 insertions(+), 1 deletion(-)
 create mode 100644 libavfilter/vf_tonemap_cuda.c
 create mode 100644 libavfilter/vf_tonemap_cuda.cu
 create mode 100644 libavfilter/vf_tonemap_cuda.h

diff --git a/Changelog b/Changelog
index ce49a5fff0..d410e7a6ca 100644
--- a/Changelog
+++ b/Changelog
@@ -26,6 +26,7 @@ version <next>:
 - swscale Vulkan support
 - LCEVC metadata bitstream filter
 - Add vf_deinterlace_d3d12 filter
+- tonemap_cuda filter
 
 
 version 8.0:
diff --git a/configure b/configure
index 5ad2e6787d..a1d83f7fc1 100755
--- a/configure
+++ b/configure
@@ -4212,6 +4212,8 @@ tinterlace_merge_test_deps="tinterlace_filter"
 tinterlace_pad_test_deps="tinterlace_filter"
 tonemap_filter_deps="const_nan"
 tonemap_vaapi_filter_deps="vaapi VAProcFilterParameterBufferHDRToneMapping"
+tonemap_cuda_filter_deps="ffnvcodec const_nan"
+tonemap_cuda_filter_deps_any="cuda_nvcc cuda_llvm"
 tonemap_opencl_filter_deps="opencl const_nan"
 transpose_opencl_filter_deps="opencl"
 transpose_vaapi_filter_deps="vaapi VAProcPipelineCaps_rotation_flags"
diff --git a/doc/filters.texi b/doc/filters.texi
index e49dd9ef0d..82027d80cd 100644
--- a/doc/filters.texi
+++ b/doc/filters.texi
@@ -27291,6 +27291,73 @@ Thumbnails are extracted from every @var{n}=150-frame 
batch, selecting one per b
 
 @end itemize
 
+@anchor{tonemap_cuda}
+@subsection tonemap_cuda
+Perform HDR to SDR conversion with tonemapping using CUDA.
+
+This is a CUDA-accelerated HDR-to-SDR tonemapping filter similar to
+@code{tonemap_opencl}.  It runs natively on NVIDIA GPUs using CUDA,
+avoiding the need for OpenCL interop and host memory round-trips.
+
+Unlike @code{tonemap_opencl}, this filter does not currently support
+the @option{threshold} option for scene-change detection; signal peak
+is evaluated per-frame from HDR metadata or from the @option{peak}
+override.
+
+It accepts the following options:
+
+@table @option
+@item tonemap
+Specify the tone-mapping operator. Same as @ref{tonemap}.
+
+@item param
+Tune the tone-mapping algorithm. Same as @ref{tonemap}.
+
+@item desat
+Apply desaturation for highlights. 0.0 disables. Default is 0.5.
+
+@item format
+Specify the output pixel format. Supported values are @var{p010} and 
@var{nv12}.
+Default is @var{nv12}.
+
+@item range, r
+Set the output color range. Accepted values are @var{tv}/@var{limited} and
+@var{pc}/@var{full}.
+
+@item primaries, p
+Set the output color primaries. Accepted values include @var{bt709} and
+@var{bt2020}.
+
+@item transfer, t
+Set the output transfer characteristics. Accepted values include @var{bt709}
+and @var{bt2020}.
+
+@item matrix, m
+Set the output colorspace matrix. Accepted values include @var{bt709} and
+@var{bt2020}.
+
+@item peak
+Override the signal peak value (in multiples of reference white, 100 cd/m2).
+When set to 0 (default), the peak is read from the input HDR metadata
+(Content Light Level or Mastering Display).
+@end table
+
+@subsubsection Examples
+
+@itemize
+@item
+Convert an HDR video to SDR using hable tonemapping, fully on GPU:
+@example
+ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i hdr_input.mkv -vf 
"tonemap_cuda=tonemap=hable:desat=0.5:format=nv12" -c:v h264_nvenc output.mp4
+@end example
+
+@item
+Convert HDR to SDR with explicit BT.709 output properties:
+@example
+ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i hdr_input.mkv -vf 
"tonemap_cuda=tonemap=mobius:t=bt709:m=bt709:p=bt709:range=tv" -c:v hevc_nvenc 
output.mkv
+@end example
+@end itemize
+
 @subsection yadif_cuda
 
 Deinterlace the input video using the @ref{yadif} algorithm, but implemented
diff --git a/libavfilter/Makefile b/libavfilter/Makefile
index a530cfae29..a9a41b34fd 100644
--- a/libavfilter/Makefile
+++ b/libavfilter/Makefile
@@ -541,6 +541,7 @@ OBJS-$(CONFIG_TMEDIAN_FILTER)                += 
vf_xmedian.o framesync.o
 OBJS-$(CONFIG_TMIDEQUALIZER_FILTER)          += vf_tmidequalizer.o
 OBJS-$(CONFIG_TMIX_FILTER)                   += vf_mix.o framesync.o
 OBJS-$(CONFIG_TONEMAP_FILTER)                += vf_tonemap.o
+OBJS-$(CONFIG_TONEMAP_CUDA_FILTER)           += vf_tonemap_cuda.o 
vf_tonemap_cuda.ptx.o cuda/load_helper.o colorspace.o
 OBJS-$(CONFIG_TONEMAP_OPENCL_FILTER)         += vf_tonemap_opencl.o opencl.o \
                                                 opencl/tonemap.o 
opencl/colorspace_common.o
 OBJS-$(CONFIG_TONEMAP_VAAPI_FILTER)          += vf_tonemap_vaapi.o vaapi_vpp.o
diff --git a/libavfilter/allfilters.c b/libavfilter/allfilters.c
index e26859e159..a670d9a26d 100644
--- a/libavfilter/allfilters.c
+++ b/libavfilter/allfilters.c
@@ -509,6 +509,7 @@ extern const FFFilter ff_vf_tmedian;
 extern const FFFilter ff_vf_tmidequalizer;
 extern const FFFilter ff_vf_tmix;
 extern const FFFilter ff_vf_tonemap;
+extern const FFFilter ff_vf_tonemap_cuda;
 extern const FFFilter ff_vf_tonemap_opencl;
 extern const FFFilter ff_vf_tonemap_vaapi;
 extern const FFFilter ff_vf_tpad;
diff --git a/libavfilter/version.h b/libavfilter/version.h
index 537df129cd..7642b670d1 100644
--- a/libavfilter/version.h
+++ b/libavfilter/version.h
@@ -31,7 +31,7 @@
 
 #include "version_major.h"
 
-#define LIBAVFILTER_VERSION_MINOR  13
+#define LIBAVFILTER_VERSION_MINOR  14
 #define LIBAVFILTER_VERSION_MICRO 100
 
 
diff --git a/libavfilter/vf_tonemap_cuda.c b/libavfilter/vf_tonemap_cuda.c
new file mode 100644
index 0000000000..2311c51c1d
--- /dev/null
+++ b/libavfilter/vf_tonemap_cuda.c
@@ -0,0 +1,641 @@
+/*
+ * Copyright (c) 2026, Faeez Kadiri < f1k2faeez at gmail dot com>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+/**
+ * @file
+ * CUDA accelerated HDR to SDR tonemapping filter
+ */
+
+#include <float.h>
+#include <math.h>
+#include <string.h>
+
+#include "filters.h"
+#include "libavutil/cuda_check.h"
+#include "libavutil/csp.h"
+#include "libavutil/hwcontext.h"
+#include "libavutil/hwcontext_cuda_internal.h"
+#include "libavutil/opt.h"
+#include "libavutil/pixdesc.h"
+
+#include "colorspace.h"
+#include "cuda/load_helper.h"
+#include "vf_tonemap_cuda.h"
+
+#define CHECK_CU(x) FF_CUDA_CHECK_DL(ctx, device_hwctx->internal->cuda_dl, x)
+#define DIV_UP(a, b) (((a) + (b) - 1) / (b))
+#define BLOCKX 32
+#define BLOCKY 16
+
+/** Private context for the tonemap_cuda filter. */
+typedef struct TonemapCUDAContext {
+    const AVClass *class;
+
+    /** @name User-facing options (set via AVOption) */
+    /**@{*/
+    enum AVColorSpace colorspace;
+    enum AVColorTransferCharacteristic trc;
+    enum AVColorPrimaries primaries;
+    enum AVColorRange range;
+    int tonemap;                   ///< selected algorithm
+    enum AVPixelFormat format;     ///< output pixel format
+    double peak;                   ///< signal peak override (0 = auto)
+    double param;                  ///< algorithm tuning knob
+    double desat_param;            ///< highlight desaturation strength
+    /**@}*/
+
+    /** @name Resolved per-stream color properties */
+    /**@{*/
+    enum AVColorSpace colorspace_in, colorspace_out;
+    enum AVColorTransferCharacteristic trc_in, trc_out;
+    enum AVColorPrimaries primaries_in, primaries_out;
+    enum AVColorRange range_in, range_out;
+    enum AVChromaLocation chroma_loc;
+    /**@}*/
+
+    double target_peak;            ///< SDR target peak (normally 1.0)
+    int initialised;               ///< set once CUDA module is loaded
+
+    AVCUDADeviceContext *hwctx;    ///< CUDA device context
+    CUmodule cu_module;            ///< loaded PTX module
+    CUfunction cu_func;            ///< tonemap kernel entry point
+    AVBufferRef *frames_ctx;       ///< output hw frames context
+
+    /** @name Precomputed matrices uploaded to the GPU each frame */
+    /**@{*/
+    float rgb_matrix[9];           ///< YUV-to-RGB (source)
+    float yuv_matrix[9];           ///< RGB-to-YUV (destination)
+    float rgb2rgb_matrix[9];       ///< gamut conversion
+    float luma_src[3];             ///< source luma coefficients
+    float luma_dst[3];             ///< destination luma coefficients
+    int rgb2rgb_passthrough;       ///< 1 if primaries match
+    int src_trc;                   ///< TransferFuncCUDA value
+    int dst_trc;                   ///< DelinearizeFuncCUDA value
+    /**@}*/
+} TonemapCUDAContext;
+
+/**
+ * Compute a 3x3 RGB-to-RGB colour-primary conversion matrix.
+ *
+ * @param in     source colour primaries
+ * @param out    destination colour primaries
+ * @param rgb2rgb output 3x3 matrix
+ * @return 0 on success, negative AVERROR on failure
+ */
+static int get_rgb2rgb_matrix(enum AVColorPrimaries in,
+                              enum AVColorPrimaries out,
+                              double rgb2rgb[3][3])
+{
+    double rgb2xyz[3][3], xyz2rgb[3][3];
+    const AVColorPrimariesDesc *in_primaries =
+        av_csp_primaries_desc_from_id(in);
+    const AVColorPrimariesDesc *out_primaries =
+        av_csp_primaries_desc_from_id(out);
+
+    if (!in_primaries || !out_primaries)
+        return AVERROR(EINVAL);
+
+    ff_fill_rgb2xyz_table(&out_primaries->prim, &out_primaries->wp, rgb2xyz);
+    ff_matrix_invert_3x3(rgb2xyz, xyz2rgb);
+    ff_fill_rgb2xyz_table(&in_primaries->prim, &in_primaries->wp, rgb2xyz);
+    ff_matrix_mul_3x3(rgb2rgb, rgb2xyz, xyz2rgb);
+
+    return 0;
+}
+
+/** Flatten a 3x3 double matrix into a row-major float[9] array. */
+static void double_matrix_to_float9(const double m[3][3], float out[9])
+{
+    for (int i = 0; i < 3; i++)
+        for (int j = 0; j < 3; j++)
+            out[i * 3 + j] = (float)m[i][j];
+}
+
+/**
+ * Compute colour-space conversion matrices and transfer-function
+ * mappings from the current stream colour properties.
+ *
+ * Called once on the first frame and again whenever input colour
+ * metadata changes mid-stream.
+ *
+ * @return 0 on success, negative AVERROR on failure
+ */
+static int tonemap_cuda_setup(AVFilterContext *ctx)
+{
+    TonemapCUDAContext *s = ctx->priv;
+    double rgb2yuv_src[3][3], yuv2rgb_src[3][3];
+    double rgb2yuv_dst[3][3];
+    double rgb2rgb[3][3];
+    const AVLumaCoefficients *luma_src, *luma_dst;
+
+    /* Tonemap param defaults (matching tonemap_opencl) */
+    switch (s->tonemap) {
+    case TONEMAP_GAMMA_CUDA:
+        if (isnan(s->param))
+            s->param = 1.8;
+        break;
+    case TONEMAP_REINHARD_CUDA:
+        if (!isnan(s->param))
+            s->param = (1.0 - s->param) / s->param;
+        break;
+    case TONEMAP_MOBIUS_CUDA:
+        if (isnan(s->param))
+            s->param = 0.3;
+        break;
+    }
+    if (isnan(s->param))
+        s->param = 1.0;
+
+    s->target_peak = 1.0;
+
+    av_log(ctx, AV_LOG_DEBUG, "tonemap transfer from %s to %s\n",
+           av_color_transfer_name(s->trc_in),
+           av_color_transfer_name(s->trc_out));
+    av_log(ctx, AV_LOG_DEBUG, "mapping colorspace from %s to %s\n",
+           av_color_space_name(s->colorspace_in),
+           av_color_space_name(s->colorspace_out));
+    av_log(ctx, AV_LOG_DEBUG, "mapping primaries from %s to %s\n",
+           av_color_primaries_name(s->primaries_in),
+           av_color_primaries_name(s->primaries_out));
+    av_log(ctx, AV_LOG_DEBUG, "mapping range from %s to %s\n",
+           av_color_range_name(s->range_in),
+           av_color_range_name(s->range_out));
+
+    if (s->trc_in != AVCOL_TRC_SMPTE2084 &&
+        s->trc_in != AVCOL_TRC_ARIB_STD_B67) {
+        av_log(ctx, AV_LOG_ERROR,
+               "unsupported input transfer %s, expected PQ or HLG\n",
+               av_color_transfer_name(s->trc_in));
+        return AVERROR(EINVAL);
+    }
+
+    /* Map source HDR transfer to kernel enum */
+    s->src_trc = (s->trc_in == AVCOL_TRC_ARIB_STD_B67) ? TRC_HLG_CUDA
+                                                         : TRC_ST2084_CUDA;
+    /*
+     * Output delinearization: BT.2020-10 uses the BT.709 OETF
+     * (they share the same curve).  For the default BT.709 SDR
+     * target we use the BT.1886 inverse EOTF (pure gamma 2.4)
+     * which produces display-referred output better suited to
+     * tonemapped content than the BT.709 OETF linear toe.
+     */
+    s->dst_trc = (s->trc_out == AVCOL_TRC_BT2020_10) ? DELIN_BT709_CUDA
+                                                       : DELIN_BT1886_CUDA;
+
+    /* Compute YUV-to-RGB matrix (input colorspace) */
+    luma_src = av_csp_luma_coeffs_from_avcsp(s->colorspace_in);
+    if (!luma_src) {
+        av_log(ctx, AV_LOG_ERROR, "unsupported input colorspace %d (%s)\n",
+               s->colorspace_in, av_color_space_name(s->colorspace_in));
+        return AVERROR(EINVAL);
+    }
+    ff_fill_rgb2yuv_table(luma_src, rgb2yuv_src);
+    ff_matrix_invert_3x3(rgb2yuv_src, yuv2rgb_src);
+    double_matrix_to_float9(yuv2rgb_src, s->rgb_matrix);
+
+    /* Compute RGB-to-YUV matrix (output colorspace) */
+    luma_dst = av_csp_luma_coeffs_from_avcsp(s->colorspace_out);
+    if (!luma_dst) {
+        av_log(ctx, AV_LOG_ERROR, "unsupported output colorspace %d (%s)\n",
+               s->colorspace_out, av_color_space_name(s->colorspace_out));
+        return AVERROR(EINVAL);
+    }
+    ff_fill_rgb2yuv_table(luma_dst, rgb2yuv_dst);
+    double_matrix_to_float9(rgb2yuv_dst, s->yuv_matrix);
+
+    /* Luma coefficients */
+    s->luma_src[0] = (float)av_q2d(luma_src->cr);
+    s->luma_src[1] = (float)av_q2d(luma_src->cg);
+    s->luma_src[2] = (float)av_q2d(luma_src->cb);
+    s->luma_dst[0] = (float)av_q2d(luma_dst->cr);
+    s->luma_dst[1] = (float)av_q2d(luma_dst->cg);
+    s->luma_dst[2] = (float)av_q2d(luma_dst->cb);
+
+    /* Primaries conversion matrix */
+    s->rgb2rgb_passthrough = 1;
+    if (s->primaries_out != s->primaries_in) {
+        int ret = get_rgb2rgb_matrix(s->primaries_in,
+                                     s->primaries_out,
+                                     rgb2rgb);
+        if (ret < 0) {
+            av_log(ctx, AV_LOG_ERROR, "failed to compute primaries matrix\n");
+            return ret;
+        }
+        double_matrix_to_float9(rgb2rgb, s->rgb2rgb_matrix);
+        s->rgb2rgb_passthrough = 0;
+    }
+
+    return 0;
+}
+
+/**
+ * Load the compiled PTX module and resolve the tonemap kernel.
+ *
+ * @return 0 on success, negative AVERROR on failure
+ */
+static av_cold int tonemap_cuda_load_functions(AVFilterContext *ctx)
+{
+    TonemapCUDAContext *s = ctx->priv;
+    CudaFunctions *cu = s->hwctx->internal->cuda_dl;
+    AVCUDADeviceContext *device_hwctx = s->hwctx;
+    CUcontext dummy;
+    int ret;
+
+    extern const unsigned char ff_vf_tonemap_cuda_ptx_data[];
+    extern const unsigned int  ff_vf_tonemap_cuda_ptx_len;
+
+    ret = CHECK_CU(cu->cuCtxPushCurrent(device_hwctx->cuda_ctx));
+    if (ret < 0)
+        return ret;
+
+    ret = ff_cuda_load_module(ctx, device_hwctx, &s->cu_module,
+                              ff_vf_tonemap_cuda_ptx_data,
+                              ff_vf_tonemap_cuda_ptx_len);
+    if (ret < 0) {
+        av_log(ctx, AV_LOG_ERROR, "Failed to load CUDA module\n");
+        goto fail;
+    }
+
+    ret = CHECK_CU(cu->cuModuleGetFunction(&s->cu_func,
+                                           s->cu_module,
+                                           "tonemap"));
+    if (ret < 0) {
+        av_log(ctx, AV_LOG_ERROR, "Failed to load tonemap kernel\n");
+        goto fail;
+    }
+
+fail:
+    CHECK_CU(cu->cuCtxPopCurrent(&dummy));
+    return ret;
+}
+
+/**
+ * Validate input format, allocate output CUDA hw-frames context,
+ * and propagate link properties.
+ */
+static int tonemap_cuda_config_output(AVFilterLink *outlink)
+{
+    AVFilterContext *ctx = outlink->src;
+    TonemapCUDAContext *s = ctx->priv;
+    AVFilterLink *inlink = ctx->inputs[0];
+    FilterLink *inl = ff_filter_link(inlink);
+    FilterLink *ol  = ff_filter_link(outlink);
+    AVHWFramesContext *in_frames_ctx;
+    enum AVPixelFormat out_sw_format;
+    int ret;
+
+    if (!inl->hw_frames_ctx) {
+        av_log(ctx, AV_LOG_ERROR, "No hw context provided on input\n");
+        return AVERROR(EINVAL);
+    }
+
+    in_frames_ctx = (AVHWFramesContext *)inl->hw_frames_ctx->data;
+    s->hwctx = in_frames_ctx->device_ctx->hwctx;
+
+    if (in_frames_ctx->sw_format != AV_PIX_FMT_P010) {
+        av_log(ctx, AV_LOG_ERROR,
+               "Unsupported input format %s, only p010 is supported\n",
+               av_get_pix_fmt_name(in_frames_ctx->sw_format));
+        return AVERROR(EINVAL);
+    }
+
+    if (s->format == AV_PIX_FMT_NONE) {
+        av_log(ctx, AV_LOG_WARNING,
+               "Output format not set, defaulting to nv12\n");
+        out_sw_format = AV_PIX_FMT_NV12;
+    } else if (s->format != AV_PIX_FMT_NV12 && s->format != AV_PIX_FMT_P010) {
+        av_log(ctx, AV_LOG_ERROR,
+               "Unsupported output format %s, only nv12 and p010 are 
supported\n",
+               av_get_pix_fmt_name(s->format));
+        return AVERROR(EINVAL);
+    } else {
+        out_sw_format = s->format;
+    }
+
+    s->frames_ctx = av_hwframe_ctx_alloc(in_frames_ctx->device_ref);
+    if (!s->frames_ctx)
+        return AVERROR(ENOMEM);
+
+    AVHWFramesContext *out_frames = (AVHWFramesContext *)s->frames_ctx->data;
+    out_frames->format    = AV_PIX_FMT_CUDA;
+    out_frames->sw_format = out_sw_format;
+    out_frames->width     = FFALIGN(inlink->w, 32);
+    out_frames->height    = FFALIGN(inlink->h, 32);
+
+    ret = av_hwframe_ctx_init(s->frames_ctx);
+    if (ret < 0) {
+        av_log(ctx, AV_LOG_ERROR, "Failed to init output hw frames ctx\n");
+        av_buffer_unref(&s->frames_ctx);
+        return ret;
+    }
+
+    ol->hw_frames_ctx = av_buffer_ref(s->frames_ctx);
+    if (!ol->hw_frames_ctx)
+        return AVERROR(ENOMEM);
+
+    outlink->w = inlink->w;
+    outlink->h = inlink->h;
+    outlink->time_base = inlink->time_base;
+    outlink->format = AV_PIX_FMT_CUDA;
+
+    return 0;
+}
+
+/**
+ * Process one input frame: resolve colour metadata, (re-)initialise
+ * matrices if needed, launch the CUDA tonemap kernel, and forward
+ * the resulting SDR frame downstream.
+ */
+static int tonemap_cuda_filter_frame(AVFilterLink *inlink, AVFrame *input)
+{
+    AVFilterContext *ctx = inlink->dst;
+    TonemapCUDAContext *s = ctx->priv;
+    AVFilterLink *outlink = ctx->outputs[0];
+    FilterLink *ol = ff_filter_link(outlink);
+    AVHWFramesContext *out_frames_ctx =
+        (AVHWFramesContext *)ol->hw_frames_ctx->data;
+    AVCUDADeviceContext *device_hwctx = out_frames_ctx->device_ctx->hwctx;
+    CudaFunctions *cu = device_hwctx->internal->cuda_dl;
+    CUcontext dummy;
+    AVFrame *output = NULL;
+    double peak;
+    int ret;
+
+    if (!input->hw_frames_ctx)
+        return AVERROR(EINVAL);
+
+    if (input->width % 2 || input->height % 2) {
+        av_log(ctx, AV_LOG_ERROR,
+               "Input dimensions %dx%d must be even for 4:2:0\n",
+               input->width, input->height);
+        av_frame_free(&input);
+        return AVERROR(EINVAL);
+    }
+
+    output = av_frame_alloc();
+    if (!output) {
+        ret = AVERROR(ENOMEM);
+        goto fail;
+    }
+
+    ret = av_hwframe_get_buffer(ol->hw_frames_ctx, output, 0);
+    if (ret < 0) {
+        av_log(ctx, AV_LOG_ERROR, "Failed to get output buffer: %s\n",
+               av_err2str(ret));
+        goto fail;
+    }
+
+    ret = av_frame_copy_props(output, input);
+    if (ret < 0)
+        goto fail;
+
+    /* Determine signal peak */
+    peak = s->peak;
+    if (peak <= 0.0)
+        peak = ff_determine_signal_peak(input);
+
+    /* Set output color properties */
+    if (s->trc != -1)
+        output->color_trc = s->trc;
+    if (s->primaries != -1)
+        output->color_primaries = s->primaries;
+    if (s->colorspace != -1)
+        output->colorspace = s->colorspace;
+    if (s->range != -1)
+        output->color_range = s->range;
+
+    {
+        int props_changed =
+            s->trc_in        != input->color_trc       ||
+            s->trc_out       != output->color_trc      ||
+            s->colorspace_in != input->colorspace      ||
+            s->colorspace_out != output->colorspace    ||
+            s->primaries_in  != input->color_primaries ||
+            s->primaries_out != output->color_primaries;
+
+        s->trc_in        = input->color_trc;
+        s->trc_out       = output->color_trc;
+        s->colorspace_in = input->colorspace;
+        s->colorspace_out = output->colorspace;
+        s->primaries_in  = input->color_primaries;
+        s->primaries_out = output->color_primaries;
+        s->range_in      = input->color_range;
+        s->range_out     = output->color_range;
+        s->chroma_loc    = output->chroma_location;
+
+        if (!s->initialised || props_changed) {
+            if (s->initialised)
+                av_log(ctx, AV_LOG_INFO,
+                       "Color properties changed, "
+                       "recomputing matrices\n");
+
+            ret = tonemap_cuda_setup(ctx);
+            if (ret < 0)
+                goto fail;
+
+            if (!s->initialised) {
+                ret = tonemap_cuda_load_functions(ctx);
+                if (ret < 0)
+                    goto fail;
+            }
+
+            s->initialised = 1;
+        }
+    }
+
+    /* Push CUDA context */
+    ret = CHECK_CU(cu->cuCtxPushCurrent(device_hwctx->cuda_ctx));
+    if (ret < 0)
+        goto fail;
+
+    {
+        CUDATonemapParams params = {0};
+
+        params.dst_y  = (CUdeviceptr)output->data[0];
+        params.dst_uv = (CUdeviceptr)output->data[1];
+        params.src_y  = (CUdeviceptr)input->data[0];
+        params.src_uv = (CUdeviceptr)input->data[1];
+
+        params.width     = input->width;
+        params.height    = input->height;
+        params.src_pitch = input->linesize[0];
+        params.dst_pitch = output->linesize[0];
+
+        memcpy(params.rgb_matrix,     s->rgb_matrix,     
sizeof(s->rgb_matrix));
+        memcpy(params.yuv_matrix,     s->yuv_matrix,     
sizeof(s->yuv_matrix));
+        memcpy(params.rgb2rgb_matrix, s->rgb2rgb_matrix,
+               sizeof(s->rgb2rgb_matrix));
+        memcpy(params.luma_src,       s->luma_src,       sizeof(s->luma_src));
+        memcpy(params.luma_dst,       s->luma_dst,       sizeof(s->luma_dst));
+
+        params.tonemap_func       = s->tonemap;
+        params.param              = (float)s->param;
+        params.desat_param        = (float)s->desat_param;
+        params.signal_peak        = (float)peak;
+        params.target_peak        = (float)s->target_peak;
+
+        params.src_trc            = s->src_trc;
+        params.dst_trc            = s->dst_trc;
+        params.src_range_full     = (s->range_in == AVCOL_RANGE_JPEG);
+        params.dst_range_full     = (s->range_out == AVCOL_RANGE_JPEG);
+        params.rgb2rgb_passthrough = s->rgb2rgb_passthrough;
+        params.chroma_loc         = (int)s->chroma_loc;
+        params.out_depth =
+            (out_frames_ctx->sw_format == AV_PIX_FMT_P010)
+            ? 10 : 8;
+
+        void *args[] = { &params };
+        ret = CHECK_CU(cu->cuLaunchKernel(s->cu_func,
+                                           DIV_UP(input->width / 2, BLOCKX),
+                                           DIV_UP(input->height / 2, BLOCKY), 
1,
+                                           BLOCKX, BLOCKY, 1,
+                                           0, device_hwctx->stream, args, 
NULL));
+        if (ret < 0)
+            goto pop_ctx;
+    }
+
+pop_ctx:
+    CHECK_CU(cu->cuCtxPopCurrent(&dummy));
+    if (ret < 0)
+        goto fail;
+
+    output->width  = input->width;
+    output->height = input->height;
+
+    ff_update_hdr_metadata(output, s->target_peak);
+
+    av_frame_free(&input);
+
+    av_log(ctx, AV_LOG_DEBUG,
+           "Tonemap output: %s, %ux%u (%"PRId64")\n",
+           av_get_pix_fmt_name(output->format),
+           output->width, output->height, output->pts);
+
+    return ff_filter_frame(outlink, output);
+
+fail:
+    av_frame_free(&input);
+    av_frame_free(&output);
+    return ret;
+}
+
+/** Release the CUDA module and output frames context. */
+static av_cold void tonemap_cuda_uninit(AVFilterContext *ctx)
+{
+    TonemapCUDAContext *s = ctx->priv;
+    CUcontext dummy;
+
+    av_buffer_unref(&s->frames_ctx);
+
+    if (s->hwctx && s->cu_module) {
+        CudaFunctions *cu = s->hwctx->internal->cuda_dl;
+        AVCUDADeviceContext *device_hwctx = s->hwctx;
+        CHECK_CU(cu->cuCtxPushCurrent(s->hwctx->cuda_ctx));
+        CHECK_CU(cu->cuModuleUnload(s->cu_module));
+        CHECK_CU(cu->cuCtxPopCurrent(&dummy));
+    }
+
+    s->cu_module = NULL;
+    s->hwctx = NULL;
+}
+
+#define OFFSET(x) offsetof(TonemapCUDAContext, x)
+#define FLAGS (AV_OPT_FLAG_FILTERING_PARAM | AV_OPT_FLAG_VIDEO_PARAM)
+
+static const AVOption tonemap_cuda_options[] = {
+    { "tonemap", "tonemap algorithm selection",
+      OFFSET(tonemap), AV_OPT_TYPE_INT,
+      {.i64 = TONEMAP_NONE_CUDA},
+      TONEMAP_NONE_CUDA, TONEMAP_MAX_CUDA - 1,
+      FLAGS, .unit = "tonemap" },
+    {     "none",     0, 0, AV_OPT_TYPE_CONST, {.i64 = TONEMAP_NONE_CUDA},     
0, 0, FLAGS, .unit = "tonemap" },
+    {     "linear",   0, 0, AV_OPT_TYPE_CONST, {.i64 = TONEMAP_LINEAR_CUDA},   
0, 0, FLAGS, .unit = "tonemap" },
+    {     "gamma",    0, 0, AV_OPT_TYPE_CONST, {.i64 = TONEMAP_GAMMA_CUDA},    
0, 0, FLAGS, .unit = "tonemap" },
+    {     "clip",     0, 0, AV_OPT_TYPE_CONST, {.i64 = TONEMAP_CLIP_CUDA},     
0, 0, FLAGS, .unit = "tonemap" },
+    {     "reinhard", 0, 0, AV_OPT_TYPE_CONST, {.i64 = TONEMAP_REINHARD_CUDA}, 
0, 0, FLAGS, .unit = "tonemap" },
+    {     "hable",    0, 0, AV_OPT_TYPE_CONST, {.i64 = TONEMAP_HABLE_CUDA},    
0, 0, FLAGS, .unit = "tonemap" },
+    {     "mobius",   0, 0, AV_OPT_TYPE_CONST, {.i64 = TONEMAP_MOBIUS_CUDA},   
 0, 0, FLAGS, .unit = "tonemap" },
+    { "transfer", "set transfer characteristic",
+      OFFSET(trc), AV_OPT_TYPE_INT, {.i64 = AVCOL_TRC_BT709}, -1, INT_MAX, 
FLAGS, .unit = "transfer" },
+    { "t",        "set transfer characteristic",
+      OFFSET(trc), AV_OPT_TYPE_INT, {.i64 = AVCOL_TRC_BT709}, -1, INT_MAX, 
FLAGS, .unit = "transfer" },
+    {     "bt709",  0, 0, AV_OPT_TYPE_CONST, {.i64 = AVCOL_TRC_BT709},     0, 
0, FLAGS, .unit = "transfer" },
+    {     "bt2020", 0, 0, AV_OPT_TYPE_CONST, {.i64 = AVCOL_TRC_BT2020_10}, 0, 
0, FLAGS, .unit = "transfer" },
+    { "matrix", "set colorspace matrix",
+      OFFSET(colorspace), AV_OPT_TYPE_INT, {.i64 = -1}, -1, INT_MAX, FLAGS, 
.unit = "matrix" },
+    { "m",      "set colorspace matrix",
+      OFFSET(colorspace), AV_OPT_TYPE_INT, {.i64 = -1}, -1, INT_MAX, FLAGS, 
.unit = "matrix" },
+    {     "bt709",  0, 0, AV_OPT_TYPE_CONST, {.i64 = AVCOL_SPC_BT709},      0, 
0, FLAGS, .unit = "matrix" },
+    {     "bt2020", 0, 0, AV_OPT_TYPE_CONST, {.i64 = AVCOL_SPC_BT2020_NCL}, 0, 
0, FLAGS, .unit = "matrix" },
+    { "primaries", "set color primaries",
+      OFFSET(primaries), AV_OPT_TYPE_INT, {.i64 = -1}, -1, INT_MAX, FLAGS, 
.unit = "primaries" },
+    { "p",         "set color primaries",
+      OFFSET(primaries), AV_OPT_TYPE_INT, {.i64 = -1}, -1, INT_MAX, FLAGS, 
.unit = "primaries" },
+    {     "bt709",  0, 0, AV_OPT_TYPE_CONST, {.i64 = AVCOL_PRI_BT709},  0, 0, 
FLAGS, .unit = "primaries" },
+    {     "bt2020", 0, 0, AV_OPT_TYPE_CONST, {.i64 = AVCOL_PRI_BT2020}, 0, 0, 
FLAGS, .unit = "primaries" },
+    { "range", "set color range",
+      OFFSET(range), AV_OPT_TYPE_INT, {.i64 = -1}, -1, INT_MAX, FLAGS, .unit = 
"range" },
+    { "r",     "set color range",
+      OFFSET(range), AV_OPT_TYPE_INT, {.i64 = -1}, -1, INT_MAX, FLAGS, .unit = 
"range" },
+    {     "tv",      0, 0, AV_OPT_TYPE_CONST, {.i64 = AVCOL_RANGE_MPEG}, 0, 0, 
FLAGS, .unit = "range" },
+    {     "pc",      0, 0, AV_OPT_TYPE_CONST, {.i64 = AVCOL_RANGE_JPEG}, 0, 0, 
FLAGS, .unit = "range" },
+    {     "limited", 0, 0, AV_OPT_TYPE_CONST, {.i64 = AVCOL_RANGE_MPEG}, 0, 0, 
FLAGS, .unit = "range" },
+    {     "full",    0, 0, AV_OPT_TYPE_CONST, {.i64 = AVCOL_RANGE_JPEG}, 0, 0, 
FLAGS, .unit = "range" },
+    { "format", "output pixel format",
+      OFFSET(format), AV_OPT_TYPE_PIXEL_FMT, {.i64 = AV_PIX_FMT_NONE},
+      AV_PIX_FMT_NONE, INT_MAX, FLAGS, .unit = "fmt" },
+    { "peak", "signal peak override",
+      OFFSET(peak), AV_OPT_TYPE_DOUBLE, {.dbl = 0}, 0, DBL_MAX, FLAGS },
+    { "param", "tonemap parameter",
+      OFFSET(param), AV_OPT_TYPE_DOUBLE, {.dbl = NAN}, DBL_MIN, DBL_MAX, FLAGS 
},
+    { "desat", "desaturation parameter",
+      OFFSET(desat_param), AV_OPT_TYPE_DOUBLE, {.dbl = 0.5}, 0, DBL_MAX, FLAGS 
},
+    { NULL }
+};
+
+AVFILTER_DEFINE_CLASS(tonemap_cuda);
+
+static const AVFilterPad tonemap_cuda_inputs[] = {
+    {
+        .name         = "default",
+        .type         = AVMEDIA_TYPE_VIDEO,
+        .filter_frame = tonemap_cuda_filter_frame,
+    },
+};
+
+static const AVFilterPad tonemap_cuda_outputs[] = {
+    {
+        .name         = "default",
+        .type         = AVMEDIA_TYPE_VIDEO,
+        .config_props = tonemap_cuda_config_output,
+    },
+};
+
+const FFFilter ff_vf_tonemap_cuda = {
+    .p.name         = "tonemap_cuda",
+    .p.description  = NULL_IF_CONFIG_SMALL("CUDA accelerated HDR to SDR 
tonemapping"),
+    .p.priv_class   = &tonemap_cuda_class,
+    .priv_size      = sizeof(TonemapCUDAContext),
+    .uninit         = tonemap_cuda_uninit,
+    FILTER_INPUTS(tonemap_cuda_inputs),
+    FILTER_OUTPUTS(tonemap_cuda_outputs),
+    FILTER_SINGLE_PIXFMT(AV_PIX_FMT_CUDA),
+    .flags_internal = FF_FILTER_FLAG_HWFRAME_AWARE,
+};
diff --git a/libavfilter/vf_tonemap_cuda.cu b/libavfilter/vf_tonemap_cuda.cu
new file mode 100644
index 0000000000..285b1c0f09
--- /dev/null
+++ b/libavfilter/vf_tonemap_cuda.cu
@@ -0,0 +1,420 @@
+/*
+ * Copyright (c) 2026, Faeez Kadiri < f1k2faeez at gmail dot com>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "vf_tonemap_cuda.h"
+
+#define ST2084_MAX_LUMINANCE 10000.0f
+#define REFERENCE_WHITE      100.0f
+#define ST2084_M1            0.1593017578125f
+#define ST2084_M2            78.84375f
+#define ST2084_C1            0.8359375f
+#define ST2084_C2            18.8515625f
+#define ST2084_C3            18.6875f
+#define HLG_A                0.17883277f
+#define HLG_B                0.28466892f
+#define HLG_C                0.55991073f
+#define SDR_AVG              0.25f
+
+/* --- Transfer functions ------------------------------------------ */
+
+__device__ static inline float eotf_st2084(float x)
+{
+    float p = __powf(x, 1.0f / ST2084_M2);
+    float a = fmaxf(p - ST2084_C1, 0.0f);
+    float b = fmaxf(ST2084_C2 - ST2084_C3 * p, 1e-6f);
+    float c = __powf(a / b, 1.0f / ST2084_M1);
+    return x > 0.0f ? c * (ST2084_MAX_LUMINANCE / REFERENCE_WHITE) : 0.0f;
+}
+
+__device__ static inline float inverse_oetf_hlg(float x)
+{
+    float a = 4.0f * x * x;
+    float b = __expf((x - HLG_C) / HLG_A) + HLG_B;
+    return x < 0.5f ? a : b;
+}
+
+__device__ static inline float inverse_eotf_bt1886(float c)
+{
+    return c < 0.0f ? 0.0f : __powf(c, 1.0f / 2.4f);
+}
+
+__device__ static inline float oetf_bt709(float c)
+{
+    c = fmaxf(c, 0.0f);
+    float r1 = 4.5f * c;
+    float r2 = 1.099f * __powf(c, 0.45f) - 0.099f;
+    return c < 0.018f ? r1 : r2;
+}
+
+__device__ static inline float linearize(float x, int trc)
+{
+    if (trc == TRC_HLG_CUDA)
+        return inverse_oetf_hlg(x);
+    return eotf_st2084(x);
+}
+
+__device__ static inline float delinearize(float x, int trc)
+{
+    if (trc == DELIN_BT709_CUDA)
+        return oetf_bt709(x);
+    return inverse_eotf_bt1886(x);
+}
+
+/* --- OOTF (HLG) ------------------------------------------------- */
+
+__device__ static inline void ootf_hlg(float *r, float *g, float *b,
+                                        const float *luma_src, float peak)
+{
+    float luma = luma_src[0] * (*r) + luma_src[1] * (*g) + luma_src[2] * (*b);
+    float gamma = 1.2f + 0.42f * __log10f(peak * REFERENCE_WHITE / 1000.0f);
+    gamma = fmaxf(1.0f, gamma);
+    float factor = peak * __powf(luma, gamma - 1.0f) / __powf(12.0f, gamma);
+    *r *= factor;
+    *g *= factor;
+    *b *= factor;
+}
+
+/* --- Tonemap algorithms ------------------------------------------ */
+
+__device__ static inline float hable_f(float in)
+{
+    float a = 0.15f, b = 0.50f, c = 0.10f;
+    float d = 0.20f, e = 0.02f, f = 0.30f;
+    float num = in * (in * a + b * c) + d * e;
+    float den = in * (in * a + b) + d * f;
+    return num / den - e / f;
+}
+
+__device__ static inline float apply_tonemap(float sig, float peak,
+                                              int algo, float param)
+{
+    switch (algo) {
+    case TONEMAP_LINEAR_CUDA:
+        return sig * param / peak;
+    case TONEMAP_GAMMA_CUDA: {
+        float p = sig > 0.05f ? sig / peak : 0.05f / peak;
+        float v = __powf(p, 1.0f / param);
+        return sig > 0.05f ? v : (sig * v / 0.05f);
+    }
+    case TONEMAP_CLIP_CUDA:
+        return fminf(fmaxf(sig * param, 0.0f), 1.0f);
+    case TONEMAP_REINHARD_CUDA:
+        return sig / (sig + param) * (peak + param) / peak;
+    case TONEMAP_HABLE_CUDA:
+        return hable_f(sig) / hable_f(peak);
+    case TONEMAP_MOBIUS_CUDA: {
+        float j = param;
+        if (sig <= j)
+            return sig;
+        float a = -j * j * (peak - 1.0f) / (j * j - 2.0f * j + peak);
+        float b = (j * j - 2.0f * j * peak + peak) / fmaxf(peak - 1.0f, 1e-6f);
+        return (b * b + 2.0f * b * j + j * j) / (b - a) * (sig + a) / (sig + 
b);
+    }
+    default: /* TONEMAP_NONE_CUDA */
+        return sig;
+    }
+}
+
+/* --- Color helpers ----------------------------------------------- */
+
+__device__ static inline void mat3x3_mul(const float *m,
+                                          float r, float g, float b,
+                                          float *or_, float *og, float *ob)
+{
+    *or_ = __fmaf_rn(m[0], r, __fmaf_rn(m[1], g, m[2] * b));
+    *og  = __fmaf_rn(m[3], r, __fmaf_rn(m[4], g, m[5] * b));
+    *ob  = __fmaf_rn(m[6], r, __fmaf_rn(m[7], g, m[8] * b));
+}
+
+__device__ static inline void yuv2rgb(float y, float u, float v,
+                                       const CUDATonemapParams &p,
+                                       float *r, float *g, float *b)
+{
+    if (p.src_range_full) {
+        u -= 0.5f;
+        v -= 0.5f;
+    } else {
+        y = (y * 255.0f -  16.0f) / 219.0f;
+        u = (u * 255.0f - 128.0f) / 224.0f;
+        v = (v * 255.0f - 128.0f) / 224.0f;
+    }
+    mat3x3_mul(p.rgb_matrix, y, u, v, r, g, b);
+}
+
+__device__ static inline void rgb2yuv(float r, float g, float b,
+                                       const CUDATonemapParams &p,
+                                       float *y, float *u, float *v)
+{
+    mat3x3_mul(p.yuv_matrix, r, g, b, y, u, v);
+    if (p.dst_range_full) {
+        *u += 0.5f;
+        *v += 0.5f;
+    } else {
+        *y = (219.0f * (*y) + 16.0f) / 255.0f;
+        *u = (224.0f * (*u) + 128.0f) / 255.0f;
+        *v = (224.0f * (*v) + 128.0f) / 255.0f;
+    }
+}
+
+__device__ static inline float rgb2y(float r, float g, float b,
+                                      const CUDATonemapParams &p)
+{
+    float y = p.yuv_matrix[0] * r + p.yuv_matrix[1] * g + p.yuv_matrix[2] * b;
+    if (p.dst_range_full)
+        return y;
+    return (219.0f * y + 16.0f) / 255.0f;
+}
+
+__device__ static inline void chroma_sample(float r0, float g0, float b0,
+                                             float r1, float g1, float b1,
+                                             float r2, float g2, float b2,
+                                             float r3, float g3, float b3,
+                                             int loc,
+                                             float *cr, float *cg, float *cb)
+{
+    switch (loc) {
+    case 1: /* AVCHROMA_LOC_LEFT */
+        *cr = (r0 + r2) * 0.5f;
+        *cg = (g0 + g2) * 0.5f;
+        *cb = (b0 + b2) * 0.5f;
+        break;
+    case 3: /* AVCHROMA_LOC_TOPLEFT */
+        *cr = r0; *cg = g0; *cb = b0;
+        break;
+    case 4: /* AVCHROMA_LOC_TOP */
+        *cr = (r0 + r1) * 0.5f;
+        *cg = (g0 + g1) * 0.5f;
+        *cb = (b0 + b1) * 0.5f;
+        break;
+    case 5: /* AVCHROMA_LOC_BOTTOMLEFT */
+        *cr = r2; *cg = g2; *cb = b2;
+        break;
+    case 6: /* AVCHROMA_LOC_BOTTOM */
+        *cr = (r2 + r3) * 0.5f;
+        *cg = (g2 + g3) * 0.5f;
+        *cb = (b2 + b3) * 0.5f;
+        break;
+    default: /* CENTER / UNSPECIFIED */
+        *cr = (r0 + r1 + r2 + r3) * 0.25f;
+        *cg = (g0 + g1 + g2 + g3) * 0.25f;
+        *cb = (b0 + b1 + b2 + b3) * 0.25f;
+        break;
+    }
+}
+
+/* --- Per-pixel tonemap pipeline ---------------------------------- */
+
+__device__ static inline void map_one_pixel_rgb(float *r, float *g, float *b,
+                                                 const CUDATonemapParams &p,
+                                                 float peak)
+{
+    float sig = fmaxf(fmaxf(*r, fmaxf(*g, *b)), 1e-6f);
+
+    if (p.target_peak > 1.0f) {
+        sig  *= 1.0f / p.target_peak;
+        peak *= 1.0f / p.target_peak;
+    }
+
+    float sig_old = sig;
+
+    if (p.desat_param > 0.0f) {
+        float luma = p.luma_dst[0] * (*r) +
+                     p.luma_dst[1] * (*g) +
+                     p.luma_dst[2] * (*b);
+        float coeff = fmaxf(sig - 0.18f, 1e-6f) / fmaxf(sig, 1e-6f);
+        coeff = __powf(coeff, 10.0f / p.desat_param);
+        *r = *r * (1.0f - coeff) + luma * coeff;
+        *g = *g * (1.0f - coeff) + luma * coeff;
+        *b = *b * (1.0f - coeff) + luma * coeff;
+        sig = sig * (1.0f - coeff) + luma * coeff;
+    }
+
+    sig = apply_tonemap(sig, peak, p.tonemap_func, p.param);
+    sig = fminf(sig, 1.0f);
+
+    float ratio = sig / sig_old;
+    *r *= ratio;
+    *g *= ratio;
+    *b *= ratio;
+}
+
+/* --- Quantization helpers ---------------------------------------- */
+
+__device__ static inline float read_p010(const unsigned short *p,
+                                         int idx)
+{
+    return (float)(__ldg(&p[idx]) >> 6) / 1023.0f;
+}
+
+__device__ static inline float saturate(float x)
+{
+    return fminf(fmaxf(x, 0.0f), 1.0f);
+}
+
+__device__ static inline unsigned short quant_p010(float v)
+{
+    return (unsigned short)(saturate(v) * 1023.0f + 0.5f) << 6;
+}
+
+__device__ static inline unsigned char quant_nv12(float v)
+{
+    return (unsigned char)(saturate(v) * 255.0f + 0.5f);
+}
+
+/* --- Main kernel ------------------------------------------------- */
+
+extern "C"
+__global__ void tonemap(CUDATonemapParams p)
+{
+    int xi = blockIdx.x * blockDim.x + threadIdx.x;
+    int yi = blockIdx.y * blockDim.y + threadIdx.y;
+
+    int x = 2 * xi;
+    int y = 2 * yi;
+
+    if (x + 1 >= p.width || y + 1 >= p.height)
+        return;
+
+    int src_pitch_y  = p.src_pitch / sizeof(unsigned short);
+    int src_pitch_uv = p.src_pitch / sizeof(unsigned short);
+    int dst_pitch_y, dst_pitch_uv;
+
+    const unsigned short *src_y  = (const unsigned short *)p.src_y;
+    const unsigned short *src_uv = (const unsigned short *)p.src_uv;
+
+    /* Read 4 Y samples and 1 UV pair from P010 */
+    float y0 = read_p010(src_y, y       * src_pitch_y + x);
+    float y1 = read_p010(src_y, y       * src_pitch_y + x + 1);
+    float y2 = read_p010(src_y, (y + 1) * src_pitch_y + x);
+    float y3 = read_p010(src_y, (y + 1) * src_pitch_y + x + 1);
+
+    float u_val = read_p010(src_uv, yi * src_pitch_uv + 2 * xi);
+    float v_val = read_p010(src_uv, yi * src_pitch_uv + 2 * xi + 1);
+
+    /* YUV to linear RGB for each of 4 pixels */
+    float r0, g0, b0, r1, g1, b1, r2, g2, b2, r3, g3, b3;
+
+    yuv2rgb(y0, u_val, v_val, p, &r0, &g0, &b0);
+    yuv2rgb(y1, u_val, v_val, p, &r1, &g1, &b1);
+    yuv2rgb(y2, u_val, v_val, p, &r2, &g2, &b2);
+    yuv2rgb(y3, u_val, v_val, p, &r3, &g3, &b3);
+
+    /* Linearize (EOTF) */
+    r0 = linearize(r0, p.src_trc);
+    g0 = linearize(g0, p.src_trc);
+    b0 = linearize(b0, p.src_trc);
+    r1 = linearize(r1, p.src_trc);
+    g1 = linearize(g1, p.src_trc);
+    b1 = linearize(b1, p.src_trc);
+    r2 = linearize(r2, p.src_trc);
+    g2 = linearize(g2, p.src_trc);
+    b2 = linearize(b2, p.src_trc);
+    r3 = linearize(r3, p.src_trc);
+    g3 = linearize(g3, p.src_trc);
+    b3 = linearize(b3, p.src_trc);
+
+    /* OOTF (HLG only) */
+    if (p.src_trc == TRC_HLG_CUDA) {
+        ootf_hlg(&r0, &g0, &b0, p.luma_src, p.signal_peak);
+        ootf_hlg(&r1, &g1, &b1, p.luma_src, p.signal_peak);
+        ootf_hlg(&r2, &g2, &b2, p.luma_src, p.signal_peak);
+        ootf_hlg(&r3, &g3, &b3, p.luma_src, p.signal_peak);
+    }
+
+    /* Primaries conversion (e.g. BT.2020 to BT.709) */
+    if (!p.rgb2rgb_passthrough) {
+        float tr, tg, tb;
+        mat3x3_mul(p.rgb2rgb_matrix, r0, g0, b0,
+                   &tr, &tg, &tb);
+        r0 = tr; g0 = tg; b0 = tb;
+        mat3x3_mul(p.rgb2rgb_matrix, r1, g1, b1,
+                   &tr, &tg, &tb);
+        r1 = tr; g1 = tg; b1 = tb;
+        mat3x3_mul(p.rgb2rgb_matrix, r2, g2, b2,
+                   &tr, &tg, &tb);
+        r2 = tr; g2 = tg; b2 = tb;
+        mat3x3_mul(p.rgb2rgb_matrix, r3, g3, b3,
+                   &tr, &tg, &tb);
+        r3 = tr; g3 = tg; b3 = tb;
+    }
+
+    /* Tonemap each pixel */
+    float peak = p.signal_peak;
+    map_one_pixel_rgb(&r0, &g0, &b0, p, peak);
+    map_one_pixel_rgb(&r1, &g1, &b1, p, peak);
+    map_one_pixel_rgb(&r2, &g2, &b2, p, peak);
+    map_one_pixel_rgb(&r3, &g3, &b3, p, peak);
+
+    /* Delinearize */
+    r0 = delinearize(r0, p.dst_trc);
+    g0 = delinearize(g0, p.dst_trc);
+    b0 = delinearize(b0, p.dst_trc);
+    r1 = delinearize(r1, p.dst_trc);
+    g1 = delinearize(g1, p.dst_trc);
+    b1 = delinearize(b1, p.dst_trc);
+    r2 = delinearize(r2, p.dst_trc);
+    g2 = delinearize(g2, p.dst_trc);
+    b2 = delinearize(b2, p.dst_trc);
+    r3 = delinearize(r3, p.dst_trc);
+    g3 = delinearize(g3, p.dst_trc);
+    b3 = delinearize(b3, p.dst_trc);
+
+    /* Compute output luma (Y) for each pixel */
+    float out_y0 = rgb2y(r0, g0, b0, p);
+    float out_y1 = rgb2y(r1, g1, b1, p);
+    float out_y2 = rgb2y(r2, g2, b2, p);
+    float out_y3 = rgb2y(r3, g3, b3, p);
+
+    /* Compute chroma from the 4 delinearized RGB values */
+    float cr, cg, cb;
+    chroma_sample(r0, g0, b0, r1, g1, b1, r2, g2, b2, r3, g3, b3,
+                  p.chroma_loc, &cr, &cg, &cb);
+    float out_u, out_v, dummy_y;
+    rgb2yuv(cr, cg, cb, p, &dummy_y, &out_u, &out_v);
+
+    /* Write output */
+    if (p.out_depth == 10) {
+        unsigned short *dy = (unsigned short *)p.dst_y;
+        unsigned short *duv = (unsigned short *)p.dst_uv;
+        dst_pitch_y  = p.dst_pitch / sizeof(unsigned short);
+        dst_pitch_uv = dst_pitch_y;
+
+        dy[y       * dst_pitch_y + x]     = quant_p010(out_y0);
+        dy[y       * dst_pitch_y + x + 1] = quant_p010(out_y1);
+        dy[(y + 1) * dst_pitch_y + x]     = quant_p010(out_y2);
+        dy[(y + 1) * dst_pitch_y + x + 1] = quant_p010(out_y3);
+
+        duv[yi * dst_pitch_uv + 2 * xi]     = quant_p010(out_u);
+        duv[yi * dst_pitch_uv + 2 * xi + 1] = quant_p010(out_v);
+    } else {
+        unsigned char *dy = (unsigned char *)p.dst_y;
+        unsigned char *duv = (unsigned char *)p.dst_uv;
+        dst_pitch_y  = p.dst_pitch;
+        dst_pitch_uv = dst_pitch_y;
+
+        dy[y       * dst_pitch_y + x]     = quant_nv12(out_y0);
+        dy[y       * dst_pitch_y + x + 1] = quant_nv12(out_y1);
+        dy[(y + 1) * dst_pitch_y + x]     = quant_nv12(out_y2);
+        dy[(y + 1) * dst_pitch_y + x + 1] = quant_nv12(out_y3);
+
+        duv[yi * dst_pitch_uv + 2 * xi]     = quant_nv12(out_u);
+        duv[yi * dst_pitch_uv + 2 * xi + 1] = quant_nv12(out_v);
+    }
+}
diff --git a/libavfilter/vf_tonemap_cuda.h b/libavfilter/vf_tonemap_cuda.h
new file mode 100644
index 0000000000..f8d44aca35
--- /dev/null
+++ b/libavfilter/vf_tonemap_cuda.h
@@ -0,0 +1,92 @@
+/*
+ * Copyright (c) 2026, Faeez Kadiri < f1k2faeez at gmail dot com>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef AVFILTER_VF_TONEMAP_CUDA_H
+#define AVFILTER_VF_TONEMAP_CUDA_H
+
+#if defined(__CUDACC__) || defined(__CUDA__)
+#include <stdint.h>
+typedef uint8_t* CUdeviceptr_t;
+#else
+#include <ffnvcodec/dynlink_cuda.h>
+typedef CUdeviceptr CUdeviceptr_t;
+#endif
+
+/** Tonemap algorithm selection, mirrored in the host-side enum. */
+enum TonemapAlgoCUDA {
+    TONEMAP_NONE_CUDA,
+    TONEMAP_LINEAR_CUDA,
+    TONEMAP_GAMMA_CUDA,
+    TONEMAP_CLIP_CUDA,
+    TONEMAP_REINHARD_CUDA,
+    TONEMAP_HABLE_CUDA,
+    TONEMAP_MOBIUS_CUDA,
+    TONEMAP_MAX_CUDA,
+};
+
+/** Source HDR transfer function for the EOTF linearization step. */
+enum TransferFuncCUDA {
+    TRC_ST2084_CUDA = 0,
+    TRC_HLG_CUDA    = 1,
+};
+
+/** Output SDR delinearization curve selection. */
+enum DelinearizeFuncCUDA {
+    DELIN_BT1886_CUDA = 0, ///< inverse EOTF, gamma 2.4
+    DELIN_BT709_CUDA  = 1, ///< BT.709 OETF with linear segment
+};
+
+/**
+ * Kernel parameter block passed by value to the CUDA tonemap kernel.
+ * Shared between the host C code and the device .cu code.
+ */
+typedef struct CUDATonemapParams {
+    CUdeviceptr_t dst_y;          ///< output luma plane
+    CUdeviceptr_t dst_uv;         ///< output chroma plane (interleaved UV)
+    CUdeviceptr_t src_y;          ///< input luma plane (P010)
+    CUdeviceptr_t src_uv;         ///< input chroma plane (P010, interleaved 
UV)
+
+    int width;                    ///< frame width in pixels
+    int height;                   ///< frame height in pixels
+    int src_pitch;                ///< input plane pitch in bytes
+    int dst_pitch;                ///< output plane pitch in bytes
+
+    float rgb_matrix[9];          ///< YUV-to-RGB matrix (source colorspace)
+    float yuv_matrix[9];          ///< RGB-to-YUV matrix (output colorspace)
+    float rgb2rgb_matrix[9];      ///< gamut conversion (e.g. BT.2020 to 
BT.709)
+    float luma_src[3];            ///< source luma coefficients (cr, cg, cb)
+    float luma_dst[3];            ///< destination luma coefficients
+
+    int tonemap_func;             ///< algorithm, one of TonemapAlgoCUDA
+    float param;                  ///< algorithm-specific tuning parameter
+    float desat_param;            ///< highlight desaturation strength
+    float signal_peak;            ///< HDR signal peak (multiples of ref white)
+    float target_peak;            ///< SDR target peak (normally 1.0)
+
+    int src_trc;                  ///< source transfer, one of TransferFuncCUDA
+    int dst_trc;                  ///< dest transfer, one of 
DelinearizeFuncCUDA
+    int src_range_full;           ///< 1 if source is full-range (JPEG)
+    int dst_range_full;           ///< 1 if output is full-range (JPEG)
+    int rgb2rgb_passthrough;      ///< 1 if source and dest primaries match
+    int chroma_loc;               ///< chroma sample location for downsampling
+    int out_depth;                ///< output bit depth (8 or 10)
+} CUDATonemapParams;
+
+#endif /* AVFILTER_VF_TONEMAP_CUDA_H */
-- 
2.52.0

_______________________________________________
ffmpeg-devel mailing list -- [email protected]
To unsubscribe send an email to [email protected]

[FFmpeg-devel] [PR] avfilter: add CUDA accelerated HDR-to-SDR tonemapping filter (PR #22360)

Reply via email to