PR #22360 opened by f1k2faeez URL: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/22360 Patch URL: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/22360.patch
Add `tonemap_cuda`, a hardware-accelerated HDR-to-SDR tonemapping filter that runs natively on NVIDIA GPUs using CUDA, avoiding the need for OpenCL interop and host memory round-trips. ## Features - Six tonemapping algorithms: none, linear, gamma, clip, reinhard, hable, mobius (matching `tonemap_opencl`) - PQ (SMPTE ST.2084) and HLG (ARIB STD-B67) input transfer functions - BT.2020-to-BT.709 gamut conversion with automatic primaries detection - Configurable highlight desaturation - P010 input, NV12 or P010 output - Signal peak from HDR metadata (MaxCLL / mastering display) or manual override - Chroma-location-aware 4:2:0 downsampling ## Implementation Follows the established CUDA filter pattern (PTX module loading, `cuLaunchKernel`, proper CUDA context management). Each thread block processes a 2x2 luma quad with its shared chroma sample through the full pipeline: EOTF linearization, OOTF (HLG), primaries conversion, tonemapping with desaturation, delinearization, and YUV re-encoding. Kernel uses CUDA fast-math intrinsics (`__powf`, `__expf`, `__fmaf_rn`), read-only texture cache (`__ldg`), and fused multiply-add for matrix operations. ## Performance Validated with 4K60 HDR10+ content on consumer NVIDIA hardware: | Pipeline | FPS | |---|---| | Tonemap only (null output) | 272 fps | | Full decode → tonemap → h264_nvenc | 118 fps | | Full-duration transcode (59s) | 135 fps | ## Usage Examples ```bash # HDR to SDR with hable, all on GPU ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i hdr_input.mkv \ -vf "tonemap_cuda=tonemap=hable:desat=0.5:format=nv12" \ -c:v h264_nvenc output.mp4 # With explicit BT.709 output properties ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i hdr_input.mkv \ -vf "tonemap_cuda=tonemap=mobius:t=bt709:m=bt709:p=bt709:range=tv" \ -c:v hevc_nvenc output.mkv ``` Also submitted to ffmpeg-devel mailing list: https://patchwork.ffmpeg.org/project/ffmpeg/patch/[email protected]/ Signed-off-by: Faeez Kadiri <[email protected]> From 5b0b6a975c336a02ef9724544754dba69db68d1b Mon Sep 17 00:00:00 2001 From: Faeez Kadiri <[email protected]> Date: Tue, 3 Mar 2026 20:15:39 +0530 Subject: [PATCH] avfilter: add CUDA accelerated HDR-to-SDR tonemapping filter Add tonemap_cuda, a hardware-accelerated HDR-to-SDR tonemapping filter that runs natively on NVIDIA GPUs using CUDA, avoiding the need for OpenCL interop and host memory round-trips. Features: - Six tonemapping algorithms: none, linear, gamma, clip, reinhard, hable, mobius (matching tonemap_opencl) - PQ (SMPTE ST.2084) and HLG (ARIB STD-B67) input transfer functions - BT.2020-to-BT.709 gamut conversion with automatic primaries detection - Configurable highlight desaturation - P010 input, NV12 or P010 output - Signal peak from HDR metadata (MaxCLL / mastering display) or manual override - Chroma-location-aware 4:2:0 downsampling The implementation follows the established CUDA filter pattern using PTX modules for kernel execution and proper CUDA context management. Each thread block processes a 2x2 luma quad with its shared chroma sample through the full pipeline: EOTF linearization, OOTF (HLG), primaries conversion, tonemapping with desaturation, delinearization, and YUV re-encoding. Performance validated at 272 fps for 4K60 HDR tonemapping (null output) and 118 fps for full decode-tonemap-encode pipeline with h264_nvenc on consumer NVIDIA hardware. Signed-off-by: Faeez Kadiri <[email protected]> --- Changelog | 1 + configure | 2 + doc/filters.texi | 67 ++++ libavfilter/Makefile | 1 + libavfilter/allfilters.c | 1 + libavfilter/version.h | 2 +- libavfilter/vf_tonemap_cuda.c | 641 +++++++++++++++++++++++++++++++++ libavfilter/vf_tonemap_cuda.cu | 420 +++++++++++++++++++++ libavfilter/vf_tonemap_cuda.h | 92 +++++ 9 files changed, 1226 insertions(+), 1 deletion(-) create mode 100644 libavfilter/vf_tonemap_cuda.c create mode 100644 libavfilter/vf_tonemap_cuda.cu create mode 100644 libavfilter/vf_tonemap_cuda.h diff --git a/Changelog b/Changelog index ce49a5fff0..d410e7a6ca 100644 --- a/Changelog +++ b/Changelog @@ -26,6 +26,7 @@ version <next>: - swscale Vulkan support - LCEVC metadata bitstream filter - Add vf_deinterlace_d3d12 filter +- tonemap_cuda filter version 8.0: diff --git a/configure b/configure index 5ad2e6787d..a1d83f7fc1 100755 --- a/configure +++ b/configure @@ -4212,6 +4212,8 @@ tinterlace_merge_test_deps="tinterlace_filter" tinterlace_pad_test_deps="tinterlace_filter" tonemap_filter_deps="const_nan" tonemap_vaapi_filter_deps="vaapi VAProcFilterParameterBufferHDRToneMapping" +tonemap_cuda_filter_deps="ffnvcodec const_nan" +tonemap_cuda_filter_deps_any="cuda_nvcc cuda_llvm" tonemap_opencl_filter_deps="opencl const_nan" transpose_opencl_filter_deps="opencl" transpose_vaapi_filter_deps="vaapi VAProcPipelineCaps_rotation_flags" diff --git a/doc/filters.texi b/doc/filters.texi index e49dd9ef0d..82027d80cd 100644 --- a/doc/filters.texi +++ b/doc/filters.texi @@ -27291,6 +27291,73 @@ Thumbnails are extracted from every @var{n}=150-frame batch, selecting one per b @end itemize +@anchor{tonemap_cuda} +@subsection tonemap_cuda +Perform HDR to SDR conversion with tonemapping using CUDA. + +This is a CUDA-accelerated HDR-to-SDR tonemapping filter similar to +@code{tonemap_opencl}. It runs natively on NVIDIA GPUs using CUDA, +avoiding the need for OpenCL interop and host memory round-trips. + +Unlike @code{tonemap_opencl}, this filter does not currently support +the @option{threshold} option for scene-change detection; signal peak +is evaluated per-frame from HDR metadata or from the @option{peak} +override. + +It accepts the following options: + +@table @option +@item tonemap +Specify the tone-mapping operator. Same as @ref{tonemap}. + +@item param +Tune the tone-mapping algorithm. Same as @ref{tonemap}. + +@item desat +Apply desaturation for highlights. 0.0 disables. Default is 0.5. + +@item format +Specify the output pixel format. Supported values are @var{p010} and @var{nv12}. +Default is @var{nv12}. + +@item range, r +Set the output color range. Accepted values are @var{tv}/@var{limited} and +@var{pc}/@var{full}. + +@item primaries, p +Set the output color primaries. Accepted values include @var{bt709} and +@var{bt2020}. + +@item transfer, t +Set the output transfer characteristics. Accepted values include @var{bt709} +and @var{bt2020}. + +@item matrix, m +Set the output colorspace matrix. Accepted values include @var{bt709} and +@var{bt2020}. + +@item peak +Override the signal peak value (in multiples of reference white, 100 cd/m2). +When set to 0 (default), the peak is read from the input HDR metadata +(Content Light Level or Mastering Display). +@end table + +@subsubsection Examples + +@itemize +@item +Convert an HDR video to SDR using hable tonemapping, fully on GPU: +@example +ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i hdr_input.mkv -vf "tonemap_cuda=tonemap=hable:desat=0.5:format=nv12" -c:v h264_nvenc output.mp4 +@end example + +@item +Convert HDR to SDR with explicit BT.709 output properties: +@example +ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i hdr_input.mkv -vf "tonemap_cuda=tonemap=mobius:t=bt709:m=bt709:p=bt709:range=tv" -c:v hevc_nvenc output.mkv +@end example +@end itemize + @subsection yadif_cuda Deinterlace the input video using the @ref{yadif} algorithm, but implemented diff --git a/libavfilter/Makefile b/libavfilter/Makefile index a530cfae29..a9a41b34fd 100644 --- a/libavfilter/Makefile +++ b/libavfilter/Makefile @@ -541,6 +541,7 @@ OBJS-$(CONFIG_TMEDIAN_FILTER) += vf_xmedian.o framesync.o OBJS-$(CONFIG_TMIDEQUALIZER_FILTER) += vf_tmidequalizer.o OBJS-$(CONFIG_TMIX_FILTER) += vf_mix.o framesync.o OBJS-$(CONFIG_TONEMAP_FILTER) += vf_tonemap.o +OBJS-$(CONFIG_TONEMAP_CUDA_FILTER) += vf_tonemap_cuda.o vf_tonemap_cuda.ptx.o cuda/load_helper.o colorspace.o OBJS-$(CONFIG_TONEMAP_OPENCL_FILTER) += vf_tonemap_opencl.o opencl.o \ opencl/tonemap.o opencl/colorspace_common.o OBJS-$(CONFIG_TONEMAP_VAAPI_FILTER) += vf_tonemap_vaapi.o vaapi_vpp.o diff --git a/libavfilter/allfilters.c b/libavfilter/allfilters.c index e26859e159..a670d9a26d 100644 --- a/libavfilter/allfilters.c +++ b/libavfilter/allfilters.c @@ -509,6 +509,7 @@ extern const FFFilter ff_vf_tmedian; extern const FFFilter ff_vf_tmidequalizer; extern const FFFilter ff_vf_tmix; extern const FFFilter ff_vf_tonemap; +extern const FFFilter ff_vf_tonemap_cuda; extern const FFFilter ff_vf_tonemap_opencl; extern const FFFilter ff_vf_tonemap_vaapi; extern const FFFilter ff_vf_tpad; diff --git a/libavfilter/version.h b/libavfilter/version.h index 537df129cd..7642b670d1 100644 --- a/libavfilter/version.h +++ b/libavfilter/version.h @@ -31,7 +31,7 @@ #include "version_major.h" -#define LIBAVFILTER_VERSION_MINOR 13 +#define LIBAVFILTER_VERSION_MINOR 14 #define LIBAVFILTER_VERSION_MICRO 100 diff --git a/libavfilter/vf_tonemap_cuda.c b/libavfilter/vf_tonemap_cuda.c new file mode 100644 index 0000000000..2311c51c1d --- /dev/null +++ b/libavfilter/vf_tonemap_cuda.c @@ -0,0 +1,641 @@ +/* + * Copyright (c) 2026, Faeez Kadiri < f1k2faeez at gmail dot com> + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +/** + * @file + * CUDA accelerated HDR to SDR tonemapping filter + */ + +#include <float.h> +#include <math.h> +#include <string.h> + +#include "filters.h" +#include "libavutil/cuda_check.h" +#include "libavutil/csp.h" +#include "libavutil/hwcontext.h" +#include "libavutil/hwcontext_cuda_internal.h" +#include "libavutil/opt.h" +#include "libavutil/pixdesc.h" + +#include "colorspace.h" +#include "cuda/load_helper.h" +#include "vf_tonemap_cuda.h" + +#define CHECK_CU(x) FF_CUDA_CHECK_DL(ctx, device_hwctx->internal->cuda_dl, x) +#define DIV_UP(a, b) (((a) + (b) - 1) / (b)) +#define BLOCKX 32 +#define BLOCKY 16 + +/** Private context for the tonemap_cuda filter. */ +typedef struct TonemapCUDAContext { + const AVClass *class; + + /** @name User-facing options (set via AVOption) */ + /**@{*/ + enum AVColorSpace colorspace; + enum AVColorTransferCharacteristic trc; + enum AVColorPrimaries primaries; + enum AVColorRange range; + int tonemap; ///< selected algorithm + enum AVPixelFormat format; ///< output pixel format + double peak; ///< signal peak override (0 = auto) + double param; ///< algorithm tuning knob + double desat_param; ///< highlight desaturation strength + /**@}*/ + + /** @name Resolved per-stream color properties */ + /**@{*/ + enum AVColorSpace colorspace_in, colorspace_out; + enum AVColorTransferCharacteristic trc_in, trc_out; + enum AVColorPrimaries primaries_in, primaries_out; + enum AVColorRange range_in, range_out; + enum AVChromaLocation chroma_loc; + /**@}*/ + + double target_peak; ///< SDR target peak (normally 1.0) + int initialised; ///< set once CUDA module is loaded + + AVCUDADeviceContext *hwctx; ///< CUDA device context + CUmodule cu_module; ///< loaded PTX module + CUfunction cu_func; ///< tonemap kernel entry point + AVBufferRef *frames_ctx; ///< output hw frames context + + /** @name Precomputed matrices uploaded to the GPU each frame */ + /**@{*/ + float rgb_matrix[9]; ///< YUV-to-RGB (source) + float yuv_matrix[9]; ///< RGB-to-YUV (destination) + float rgb2rgb_matrix[9]; ///< gamut conversion + float luma_src[3]; ///< source luma coefficients + float luma_dst[3]; ///< destination luma coefficients + int rgb2rgb_passthrough; ///< 1 if primaries match + int src_trc; ///< TransferFuncCUDA value + int dst_trc; ///< DelinearizeFuncCUDA value + /**@}*/ +} TonemapCUDAContext; + +/** + * Compute a 3x3 RGB-to-RGB colour-primary conversion matrix. + * + * @param in source colour primaries + * @param out destination colour primaries + * @param rgb2rgb output 3x3 matrix + * @return 0 on success, negative AVERROR on failure + */ +static int get_rgb2rgb_matrix(enum AVColorPrimaries in, + enum AVColorPrimaries out, + double rgb2rgb[3][3]) +{ + double rgb2xyz[3][3], xyz2rgb[3][3]; + const AVColorPrimariesDesc *in_primaries = + av_csp_primaries_desc_from_id(in); + const AVColorPrimariesDesc *out_primaries = + av_csp_primaries_desc_from_id(out); + + if (!in_primaries || !out_primaries) + return AVERROR(EINVAL); + + ff_fill_rgb2xyz_table(&out_primaries->prim, &out_primaries->wp, rgb2xyz); + ff_matrix_invert_3x3(rgb2xyz, xyz2rgb); + ff_fill_rgb2xyz_table(&in_primaries->prim, &in_primaries->wp, rgb2xyz); + ff_matrix_mul_3x3(rgb2rgb, rgb2xyz, xyz2rgb); + + return 0; +} + +/** Flatten a 3x3 double matrix into a row-major float[9] array. */ +static void double_matrix_to_float9(const double m[3][3], float out[9]) +{ + for (int i = 0; i < 3; i++) + for (int j = 0; j < 3; j++) + out[i * 3 + j] = (float)m[i][j]; +} + +/** + * Compute colour-space conversion matrices and transfer-function + * mappings from the current stream colour properties. + * + * Called once on the first frame and again whenever input colour + * metadata changes mid-stream. + * + * @return 0 on success, negative AVERROR on failure + */ +static int tonemap_cuda_setup(AVFilterContext *ctx) +{ + TonemapCUDAContext *s = ctx->priv; + double rgb2yuv_src[3][3], yuv2rgb_src[3][3]; + double rgb2yuv_dst[3][3]; + double rgb2rgb[3][3]; + const AVLumaCoefficients *luma_src, *luma_dst; + + /* Tonemap param defaults (matching tonemap_opencl) */ + switch (s->tonemap) { + case TONEMAP_GAMMA_CUDA: + if (isnan(s->param)) + s->param = 1.8; + break; + case TONEMAP_REINHARD_CUDA: + if (!isnan(s->param)) + s->param = (1.0 - s->param) / s->param; + break; + case TONEMAP_MOBIUS_CUDA: + if (isnan(s->param)) + s->param = 0.3; + break; + } + if (isnan(s->param)) + s->param = 1.0; + + s->target_peak = 1.0; + + av_log(ctx, AV_LOG_DEBUG, "tonemap transfer from %s to %s\n", + av_color_transfer_name(s->trc_in), + av_color_transfer_name(s->trc_out)); + av_log(ctx, AV_LOG_DEBUG, "mapping colorspace from %s to %s\n", + av_color_space_name(s->colorspace_in), + av_color_space_name(s->colorspace_out)); + av_log(ctx, AV_LOG_DEBUG, "mapping primaries from %s to %s\n", + av_color_primaries_name(s->primaries_in), + av_color_primaries_name(s->primaries_out)); + av_log(ctx, AV_LOG_DEBUG, "mapping range from %s to %s\n", + av_color_range_name(s->range_in), + av_color_range_name(s->range_out)); + + if (s->trc_in != AVCOL_TRC_SMPTE2084 && + s->trc_in != AVCOL_TRC_ARIB_STD_B67) { + av_log(ctx, AV_LOG_ERROR, + "unsupported input transfer %s, expected PQ or HLG\n", + av_color_transfer_name(s->trc_in)); + return AVERROR(EINVAL); + } + + /* Map source HDR transfer to kernel enum */ + s->src_trc = (s->trc_in == AVCOL_TRC_ARIB_STD_B67) ? TRC_HLG_CUDA + : TRC_ST2084_CUDA; + /* + * Output delinearization: BT.2020-10 uses the BT.709 OETF + * (they share the same curve). For the default BT.709 SDR + * target we use the BT.1886 inverse EOTF (pure gamma 2.4) + * which produces display-referred output better suited to + * tonemapped content than the BT.709 OETF linear toe. + */ + s->dst_trc = (s->trc_out == AVCOL_TRC_BT2020_10) ? DELIN_BT709_CUDA + : DELIN_BT1886_CUDA; + + /* Compute YUV-to-RGB matrix (input colorspace) */ + luma_src = av_csp_luma_coeffs_from_avcsp(s->colorspace_in); + if (!luma_src) { + av_log(ctx, AV_LOG_ERROR, "unsupported input colorspace %d (%s)\n", + s->colorspace_in, av_color_space_name(s->colorspace_in)); + return AVERROR(EINVAL); + } + ff_fill_rgb2yuv_table(luma_src, rgb2yuv_src); + ff_matrix_invert_3x3(rgb2yuv_src, yuv2rgb_src); + double_matrix_to_float9(yuv2rgb_src, s->rgb_matrix); + + /* Compute RGB-to-YUV matrix (output colorspace) */ + luma_dst = av_csp_luma_coeffs_from_avcsp(s->colorspace_out); + if (!luma_dst) { + av_log(ctx, AV_LOG_ERROR, "unsupported output colorspace %d (%s)\n", + s->colorspace_out, av_color_space_name(s->colorspace_out)); + return AVERROR(EINVAL); + } + ff_fill_rgb2yuv_table(luma_dst, rgb2yuv_dst); + double_matrix_to_float9(rgb2yuv_dst, s->yuv_matrix); + + /* Luma coefficients */ + s->luma_src[0] = (float)av_q2d(luma_src->cr); + s->luma_src[1] = (float)av_q2d(luma_src->cg); + s->luma_src[2] = (float)av_q2d(luma_src->cb); + s->luma_dst[0] = (float)av_q2d(luma_dst->cr); + s->luma_dst[1] = (float)av_q2d(luma_dst->cg); + s->luma_dst[2] = (float)av_q2d(luma_dst->cb); + + /* Primaries conversion matrix */ + s->rgb2rgb_passthrough = 1; + if (s->primaries_out != s->primaries_in) { + int ret = get_rgb2rgb_matrix(s->primaries_in, + s->primaries_out, + rgb2rgb); + if (ret < 0) { + av_log(ctx, AV_LOG_ERROR, "failed to compute primaries matrix\n"); + return ret; + } + double_matrix_to_float9(rgb2rgb, s->rgb2rgb_matrix); + s->rgb2rgb_passthrough = 0; + } + + return 0; +} + +/** + * Load the compiled PTX module and resolve the tonemap kernel. + * + * @return 0 on success, negative AVERROR on failure + */ +static av_cold int tonemap_cuda_load_functions(AVFilterContext *ctx) +{ + TonemapCUDAContext *s = ctx->priv; + CudaFunctions *cu = s->hwctx->internal->cuda_dl; + AVCUDADeviceContext *device_hwctx = s->hwctx; + CUcontext dummy; + int ret; + + extern const unsigned char ff_vf_tonemap_cuda_ptx_data[]; + extern const unsigned int ff_vf_tonemap_cuda_ptx_len; + + ret = CHECK_CU(cu->cuCtxPushCurrent(device_hwctx->cuda_ctx)); + if (ret < 0) + return ret; + + ret = ff_cuda_load_module(ctx, device_hwctx, &s->cu_module, + ff_vf_tonemap_cuda_ptx_data, + ff_vf_tonemap_cuda_ptx_len); + if (ret < 0) { + av_log(ctx, AV_LOG_ERROR, "Failed to load CUDA module\n"); + goto fail; + } + + ret = CHECK_CU(cu->cuModuleGetFunction(&s->cu_func, + s->cu_module, + "tonemap")); + if (ret < 0) { + av_log(ctx, AV_LOG_ERROR, "Failed to load tonemap kernel\n"); + goto fail; + } + +fail: + CHECK_CU(cu->cuCtxPopCurrent(&dummy)); + return ret; +} + +/** + * Validate input format, allocate output CUDA hw-frames context, + * and propagate link properties. + */ +static int tonemap_cuda_config_output(AVFilterLink *outlink) +{ + AVFilterContext *ctx = outlink->src; + TonemapCUDAContext *s = ctx->priv; + AVFilterLink *inlink = ctx->inputs[0]; + FilterLink *inl = ff_filter_link(inlink); + FilterLink *ol = ff_filter_link(outlink); + AVHWFramesContext *in_frames_ctx; + enum AVPixelFormat out_sw_format; + int ret; + + if (!inl->hw_frames_ctx) { + av_log(ctx, AV_LOG_ERROR, "No hw context provided on input\n"); + return AVERROR(EINVAL); + } + + in_frames_ctx = (AVHWFramesContext *)inl->hw_frames_ctx->data; + s->hwctx = in_frames_ctx->device_ctx->hwctx; + + if (in_frames_ctx->sw_format != AV_PIX_FMT_P010) { + av_log(ctx, AV_LOG_ERROR, + "Unsupported input format %s, only p010 is supported\n", + av_get_pix_fmt_name(in_frames_ctx->sw_format)); + return AVERROR(EINVAL); + } + + if (s->format == AV_PIX_FMT_NONE) { + av_log(ctx, AV_LOG_WARNING, + "Output format not set, defaulting to nv12\n"); + out_sw_format = AV_PIX_FMT_NV12; + } else if (s->format != AV_PIX_FMT_NV12 && s->format != AV_PIX_FMT_P010) { + av_log(ctx, AV_LOG_ERROR, + "Unsupported output format %s, only nv12 and p010 are supported\n", + av_get_pix_fmt_name(s->format)); + return AVERROR(EINVAL); + } else { + out_sw_format = s->format; + } + + s->frames_ctx = av_hwframe_ctx_alloc(in_frames_ctx->device_ref); + if (!s->frames_ctx) + return AVERROR(ENOMEM); + + AVHWFramesContext *out_frames = (AVHWFramesContext *)s->frames_ctx->data; + out_frames->format = AV_PIX_FMT_CUDA; + out_frames->sw_format = out_sw_format; + out_frames->width = FFALIGN(inlink->w, 32); + out_frames->height = FFALIGN(inlink->h, 32); + + ret = av_hwframe_ctx_init(s->frames_ctx); + if (ret < 0) { + av_log(ctx, AV_LOG_ERROR, "Failed to init output hw frames ctx\n"); + av_buffer_unref(&s->frames_ctx); + return ret; + } + + ol->hw_frames_ctx = av_buffer_ref(s->frames_ctx); + if (!ol->hw_frames_ctx) + return AVERROR(ENOMEM); + + outlink->w = inlink->w; + outlink->h = inlink->h; + outlink->time_base = inlink->time_base; + outlink->format = AV_PIX_FMT_CUDA; + + return 0; +} + +/** + * Process one input frame: resolve colour metadata, (re-)initialise + * matrices if needed, launch the CUDA tonemap kernel, and forward + * the resulting SDR frame downstream. + */ +static int tonemap_cuda_filter_frame(AVFilterLink *inlink, AVFrame *input) +{ + AVFilterContext *ctx = inlink->dst; + TonemapCUDAContext *s = ctx->priv; + AVFilterLink *outlink = ctx->outputs[0]; + FilterLink *ol = ff_filter_link(outlink); + AVHWFramesContext *out_frames_ctx = + (AVHWFramesContext *)ol->hw_frames_ctx->data; + AVCUDADeviceContext *device_hwctx = out_frames_ctx->device_ctx->hwctx; + CudaFunctions *cu = device_hwctx->internal->cuda_dl; + CUcontext dummy; + AVFrame *output = NULL; + double peak; + int ret; + + if (!input->hw_frames_ctx) + return AVERROR(EINVAL); + + if (input->width % 2 || input->height % 2) { + av_log(ctx, AV_LOG_ERROR, + "Input dimensions %dx%d must be even for 4:2:0\n", + input->width, input->height); + av_frame_free(&input); + return AVERROR(EINVAL); + } + + output = av_frame_alloc(); + if (!output) { + ret = AVERROR(ENOMEM); + goto fail; + } + + ret = av_hwframe_get_buffer(ol->hw_frames_ctx, output, 0); + if (ret < 0) { + av_log(ctx, AV_LOG_ERROR, "Failed to get output buffer: %s\n", + av_err2str(ret)); + goto fail; + } + + ret = av_frame_copy_props(output, input); + if (ret < 0) + goto fail; + + /* Determine signal peak */ + peak = s->peak; + if (peak <= 0.0) + peak = ff_determine_signal_peak(input); + + /* Set output color properties */ + if (s->trc != -1) + output->color_trc = s->trc; + if (s->primaries != -1) + output->color_primaries = s->primaries; + if (s->colorspace != -1) + output->colorspace = s->colorspace; + if (s->range != -1) + output->color_range = s->range; + + { + int props_changed = + s->trc_in != input->color_trc || + s->trc_out != output->color_trc || + s->colorspace_in != input->colorspace || + s->colorspace_out != output->colorspace || + s->primaries_in != input->color_primaries || + s->primaries_out != output->color_primaries; + + s->trc_in = input->color_trc; + s->trc_out = output->color_trc; + s->colorspace_in = input->colorspace; + s->colorspace_out = output->colorspace; + s->primaries_in = input->color_primaries; + s->primaries_out = output->color_primaries; + s->range_in = input->color_range; + s->range_out = output->color_range; + s->chroma_loc = output->chroma_location; + + if (!s->initialised || props_changed) { + if (s->initialised) + av_log(ctx, AV_LOG_INFO, + "Color properties changed, " + "recomputing matrices\n"); + + ret = tonemap_cuda_setup(ctx); + if (ret < 0) + goto fail; + + if (!s->initialised) { + ret = tonemap_cuda_load_functions(ctx); + if (ret < 0) + goto fail; + } + + s->initialised = 1; + } + } + + /* Push CUDA context */ + ret = CHECK_CU(cu->cuCtxPushCurrent(device_hwctx->cuda_ctx)); + if (ret < 0) + goto fail; + + { + CUDATonemapParams params = {0}; + + params.dst_y = (CUdeviceptr)output->data[0]; + params.dst_uv = (CUdeviceptr)output->data[1]; + params.src_y = (CUdeviceptr)input->data[0]; + params.src_uv = (CUdeviceptr)input->data[1]; + + params.width = input->width; + params.height = input->height; + params.src_pitch = input->linesize[0]; + params.dst_pitch = output->linesize[0]; + + memcpy(params.rgb_matrix, s->rgb_matrix, sizeof(s->rgb_matrix)); + memcpy(params.yuv_matrix, s->yuv_matrix, sizeof(s->yuv_matrix)); + memcpy(params.rgb2rgb_matrix, s->rgb2rgb_matrix, + sizeof(s->rgb2rgb_matrix)); + memcpy(params.luma_src, s->luma_src, sizeof(s->luma_src)); + memcpy(params.luma_dst, s->luma_dst, sizeof(s->luma_dst)); + + params.tonemap_func = s->tonemap; + params.param = (float)s->param; + params.desat_param = (float)s->desat_param; + params.signal_peak = (float)peak; + params.target_peak = (float)s->target_peak; + + params.src_trc = s->src_trc; + params.dst_trc = s->dst_trc; + params.src_range_full = (s->range_in == AVCOL_RANGE_JPEG); + params.dst_range_full = (s->range_out == AVCOL_RANGE_JPEG); + params.rgb2rgb_passthrough = s->rgb2rgb_passthrough; + params.chroma_loc = (int)s->chroma_loc; + params.out_depth = + (out_frames_ctx->sw_format == AV_PIX_FMT_P010) + ? 10 : 8; + + void *args[] = { ¶ms }; + ret = CHECK_CU(cu->cuLaunchKernel(s->cu_func, + DIV_UP(input->width / 2, BLOCKX), + DIV_UP(input->height / 2, BLOCKY), 1, + BLOCKX, BLOCKY, 1, + 0, device_hwctx->stream, args, NULL)); + if (ret < 0) + goto pop_ctx; + } + +pop_ctx: + CHECK_CU(cu->cuCtxPopCurrent(&dummy)); + if (ret < 0) + goto fail; + + output->width = input->width; + output->height = input->height; + + ff_update_hdr_metadata(output, s->target_peak); + + av_frame_free(&input); + + av_log(ctx, AV_LOG_DEBUG, + "Tonemap output: %s, %ux%u (%"PRId64")\n", + av_get_pix_fmt_name(output->format), + output->width, output->height, output->pts); + + return ff_filter_frame(outlink, output); + +fail: + av_frame_free(&input); + av_frame_free(&output); + return ret; +} + +/** Release the CUDA module and output frames context. */ +static av_cold void tonemap_cuda_uninit(AVFilterContext *ctx) +{ + TonemapCUDAContext *s = ctx->priv; + CUcontext dummy; + + av_buffer_unref(&s->frames_ctx); + + if (s->hwctx && s->cu_module) { + CudaFunctions *cu = s->hwctx->internal->cuda_dl; + AVCUDADeviceContext *device_hwctx = s->hwctx; + CHECK_CU(cu->cuCtxPushCurrent(s->hwctx->cuda_ctx)); + CHECK_CU(cu->cuModuleUnload(s->cu_module)); + CHECK_CU(cu->cuCtxPopCurrent(&dummy)); + } + + s->cu_module = NULL; + s->hwctx = NULL; +} + +#define OFFSET(x) offsetof(TonemapCUDAContext, x) +#define FLAGS (AV_OPT_FLAG_FILTERING_PARAM | AV_OPT_FLAG_VIDEO_PARAM) + +static const AVOption tonemap_cuda_options[] = { + { "tonemap", "tonemap algorithm selection", + OFFSET(tonemap), AV_OPT_TYPE_INT, + {.i64 = TONEMAP_NONE_CUDA}, + TONEMAP_NONE_CUDA, TONEMAP_MAX_CUDA - 1, + FLAGS, .unit = "tonemap" }, + { "none", 0, 0, AV_OPT_TYPE_CONST, {.i64 = TONEMAP_NONE_CUDA}, 0, 0, FLAGS, .unit = "tonemap" }, + { "linear", 0, 0, AV_OPT_TYPE_CONST, {.i64 = TONEMAP_LINEAR_CUDA}, 0, 0, FLAGS, .unit = "tonemap" }, + { "gamma", 0, 0, AV_OPT_TYPE_CONST, {.i64 = TONEMAP_GAMMA_CUDA}, 0, 0, FLAGS, .unit = "tonemap" }, + { "clip", 0, 0, AV_OPT_TYPE_CONST, {.i64 = TONEMAP_CLIP_CUDA}, 0, 0, FLAGS, .unit = "tonemap" }, + { "reinhard", 0, 0, AV_OPT_TYPE_CONST, {.i64 = TONEMAP_REINHARD_CUDA}, 0, 0, FLAGS, .unit = "tonemap" }, + { "hable", 0, 0, AV_OPT_TYPE_CONST, {.i64 = TONEMAP_HABLE_CUDA}, 0, 0, FLAGS, .unit = "tonemap" }, + { "mobius", 0, 0, AV_OPT_TYPE_CONST, {.i64 = TONEMAP_MOBIUS_CUDA}, 0, 0, FLAGS, .unit = "tonemap" }, + { "transfer", "set transfer characteristic", + OFFSET(trc), AV_OPT_TYPE_INT, {.i64 = AVCOL_TRC_BT709}, -1, INT_MAX, FLAGS, .unit = "transfer" }, + { "t", "set transfer characteristic", + OFFSET(trc), AV_OPT_TYPE_INT, {.i64 = AVCOL_TRC_BT709}, -1, INT_MAX, FLAGS, .unit = "transfer" }, + { "bt709", 0, 0, AV_OPT_TYPE_CONST, {.i64 = AVCOL_TRC_BT709}, 0, 0, FLAGS, .unit = "transfer" }, + { "bt2020", 0, 0, AV_OPT_TYPE_CONST, {.i64 = AVCOL_TRC_BT2020_10}, 0, 0, FLAGS, .unit = "transfer" }, + { "matrix", "set colorspace matrix", + OFFSET(colorspace), AV_OPT_TYPE_INT, {.i64 = -1}, -1, INT_MAX, FLAGS, .unit = "matrix" }, + { "m", "set colorspace matrix", + OFFSET(colorspace), AV_OPT_TYPE_INT, {.i64 = -1}, -1, INT_MAX, FLAGS, .unit = "matrix" }, + { "bt709", 0, 0, AV_OPT_TYPE_CONST, {.i64 = AVCOL_SPC_BT709}, 0, 0, FLAGS, .unit = "matrix" }, + { "bt2020", 0, 0, AV_OPT_TYPE_CONST, {.i64 = AVCOL_SPC_BT2020_NCL}, 0, 0, FLAGS, .unit = "matrix" }, + { "primaries", "set color primaries", + OFFSET(primaries), AV_OPT_TYPE_INT, {.i64 = -1}, -1, INT_MAX, FLAGS, .unit = "primaries" }, + { "p", "set color primaries", + OFFSET(primaries), AV_OPT_TYPE_INT, {.i64 = -1}, -1, INT_MAX, FLAGS, .unit = "primaries" }, + { "bt709", 0, 0, AV_OPT_TYPE_CONST, {.i64 = AVCOL_PRI_BT709}, 0, 0, FLAGS, .unit = "primaries" }, + { "bt2020", 0, 0, AV_OPT_TYPE_CONST, {.i64 = AVCOL_PRI_BT2020}, 0, 0, FLAGS, .unit = "primaries" }, + { "range", "set color range", + OFFSET(range), AV_OPT_TYPE_INT, {.i64 = -1}, -1, INT_MAX, FLAGS, .unit = "range" }, + { "r", "set color range", + OFFSET(range), AV_OPT_TYPE_INT, {.i64 = -1}, -1, INT_MAX, FLAGS, .unit = "range" }, + { "tv", 0, 0, AV_OPT_TYPE_CONST, {.i64 = AVCOL_RANGE_MPEG}, 0, 0, FLAGS, .unit = "range" }, + { "pc", 0, 0, AV_OPT_TYPE_CONST, {.i64 = AVCOL_RANGE_JPEG}, 0, 0, FLAGS, .unit = "range" }, + { "limited", 0, 0, AV_OPT_TYPE_CONST, {.i64 = AVCOL_RANGE_MPEG}, 0, 0, FLAGS, .unit = "range" }, + { "full", 0, 0, AV_OPT_TYPE_CONST, {.i64 = AVCOL_RANGE_JPEG}, 0, 0, FLAGS, .unit = "range" }, + { "format", "output pixel format", + OFFSET(format), AV_OPT_TYPE_PIXEL_FMT, {.i64 = AV_PIX_FMT_NONE}, + AV_PIX_FMT_NONE, INT_MAX, FLAGS, .unit = "fmt" }, + { "peak", "signal peak override", + OFFSET(peak), AV_OPT_TYPE_DOUBLE, {.dbl = 0}, 0, DBL_MAX, FLAGS }, + { "param", "tonemap parameter", + OFFSET(param), AV_OPT_TYPE_DOUBLE, {.dbl = NAN}, DBL_MIN, DBL_MAX, FLAGS }, + { "desat", "desaturation parameter", + OFFSET(desat_param), AV_OPT_TYPE_DOUBLE, {.dbl = 0.5}, 0, DBL_MAX, FLAGS }, + { NULL } +}; + +AVFILTER_DEFINE_CLASS(tonemap_cuda); + +static const AVFilterPad tonemap_cuda_inputs[] = { + { + .name = "default", + .type = AVMEDIA_TYPE_VIDEO, + .filter_frame = tonemap_cuda_filter_frame, + }, +}; + +static const AVFilterPad tonemap_cuda_outputs[] = { + { + .name = "default", + .type = AVMEDIA_TYPE_VIDEO, + .config_props = tonemap_cuda_config_output, + }, +}; + +const FFFilter ff_vf_tonemap_cuda = { + .p.name = "tonemap_cuda", + .p.description = NULL_IF_CONFIG_SMALL("CUDA accelerated HDR to SDR tonemapping"), + .p.priv_class = &tonemap_cuda_class, + .priv_size = sizeof(TonemapCUDAContext), + .uninit = tonemap_cuda_uninit, + FILTER_INPUTS(tonemap_cuda_inputs), + FILTER_OUTPUTS(tonemap_cuda_outputs), + FILTER_SINGLE_PIXFMT(AV_PIX_FMT_CUDA), + .flags_internal = FF_FILTER_FLAG_HWFRAME_AWARE, +}; diff --git a/libavfilter/vf_tonemap_cuda.cu b/libavfilter/vf_tonemap_cuda.cu new file mode 100644 index 0000000000..285b1c0f09 --- /dev/null +++ b/libavfilter/vf_tonemap_cuda.cu @@ -0,0 +1,420 @@ +/* + * Copyright (c) 2026, Faeez Kadiri < f1k2faeez at gmail dot com> + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "vf_tonemap_cuda.h" + +#define ST2084_MAX_LUMINANCE 10000.0f +#define REFERENCE_WHITE 100.0f +#define ST2084_M1 0.1593017578125f +#define ST2084_M2 78.84375f +#define ST2084_C1 0.8359375f +#define ST2084_C2 18.8515625f +#define ST2084_C3 18.6875f +#define HLG_A 0.17883277f +#define HLG_B 0.28466892f +#define HLG_C 0.55991073f +#define SDR_AVG 0.25f + +/* --- Transfer functions ------------------------------------------ */ + +__device__ static inline float eotf_st2084(float x) +{ + float p = __powf(x, 1.0f / ST2084_M2); + float a = fmaxf(p - ST2084_C1, 0.0f); + float b = fmaxf(ST2084_C2 - ST2084_C3 * p, 1e-6f); + float c = __powf(a / b, 1.0f / ST2084_M1); + return x > 0.0f ? c * (ST2084_MAX_LUMINANCE / REFERENCE_WHITE) : 0.0f; +} + +__device__ static inline float inverse_oetf_hlg(float x) +{ + float a = 4.0f * x * x; + float b = __expf((x - HLG_C) / HLG_A) + HLG_B; + return x < 0.5f ? a : b; +} + +__device__ static inline float inverse_eotf_bt1886(float c) +{ + return c < 0.0f ? 0.0f : __powf(c, 1.0f / 2.4f); +} + +__device__ static inline float oetf_bt709(float c) +{ + c = fmaxf(c, 0.0f); + float r1 = 4.5f * c; + float r2 = 1.099f * __powf(c, 0.45f) - 0.099f; + return c < 0.018f ? r1 : r2; +} + +__device__ static inline float linearize(float x, int trc) +{ + if (trc == TRC_HLG_CUDA) + return inverse_oetf_hlg(x); + return eotf_st2084(x); +} + +__device__ static inline float delinearize(float x, int trc) +{ + if (trc == DELIN_BT709_CUDA) + return oetf_bt709(x); + return inverse_eotf_bt1886(x); +} + +/* --- OOTF (HLG) ------------------------------------------------- */ + +__device__ static inline void ootf_hlg(float *r, float *g, float *b, + const float *luma_src, float peak) +{ + float luma = luma_src[0] * (*r) + luma_src[1] * (*g) + luma_src[2] * (*b); + float gamma = 1.2f + 0.42f * __log10f(peak * REFERENCE_WHITE / 1000.0f); + gamma = fmaxf(1.0f, gamma); + float factor = peak * __powf(luma, gamma - 1.0f) / __powf(12.0f, gamma); + *r *= factor; + *g *= factor; + *b *= factor; +} + +/* --- Tonemap algorithms ------------------------------------------ */ + +__device__ static inline float hable_f(float in) +{ + float a = 0.15f, b = 0.50f, c = 0.10f; + float d = 0.20f, e = 0.02f, f = 0.30f; + float num = in * (in * a + b * c) + d * e; + float den = in * (in * a + b) + d * f; + return num / den - e / f; +} + +__device__ static inline float apply_tonemap(float sig, float peak, + int algo, float param) +{ + switch (algo) { + case TONEMAP_LINEAR_CUDA: + return sig * param / peak; + case TONEMAP_GAMMA_CUDA: { + float p = sig > 0.05f ? sig / peak : 0.05f / peak; + float v = __powf(p, 1.0f / param); + return sig > 0.05f ? v : (sig * v / 0.05f); + } + case TONEMAP_CLIP_CUDA: + return fminf(fmaxf(sig * param, 0.0f), 1.0f); + case TONEMAP_REINHARD_CUDA: + return sig / (sig + param) * (peak + param) / peak; + case TONEMAP_HABLE_CUDA: + return hable_f(sig) / hable_f(peak); + case TONEMAP_MOBIUS_CUDA: { + float j = param; + if (sig <= j) + return sig; + float a = -j * j * (peak - 1.0f) / (j * j - 2.0f * j + peak); + float b = (j * j - 2.0f * j * peak + peak) / fmaxf(peak - 1.0f, 1e-6f); + return (b * b + 2.0f * b * j + j * j) / (b - a) * (sig + a) / (sig + b); + } + default: /* TONEMAP_NONE_CUDA */ + return sig; + } +} + +/* --- Color helpers ----------------------------------------------- */ + +__device__ static inline void mat3x3_mul(const float *m, + float r, float g, float b, + float *or_, float *og, float *ob) +{ + *or_ = __fmaf_rn(m[0], r, __fmaf_rn(m[1], g, m[2] * b)); + *og = __fmaf_rn(m[3], r, __fmaf_rn(m[4], g, m[5] * b)); + *ob = __fmaf_rn(m[6], r, __fmaf_rn(m[7], g, m[8] * b)); +} + +__device__ static inline void yuv2rgb(float y, float u, float v, + const CUDATonemapParams &p, + float *r, float *g, float *b) +{ + if (p.src_range_full) { + u -= 0.5f; + v -= 0.5f; + } else { + y = (y * 255.0f - 16.0f) / 219.0f; + u = (u * 255.0f - 128.0f) / 224.0f; + v = (v * 255.0f - 128.0f) / 224.0f; + } + mat3x3_mul(p.rgb_matrix, y, u, v, r, g, b); +} + +__device__ static inline void rgb2yuv(float r, float g, float b, + const CUDATonemapParams &p, + float *y, float *u, float *v) +{ + mat3x3_mul(p.yuv_matrix, r, g, b, y, u, v); + if (p.dst_range_full) { + *u += 0.5f; + *v += 0.5f; + } else { + *y = (219.0f * (*y) + 16.0f) / 255.0f; + *u = (224.0f * (*u) + 128.0f) / 255.0f; + *v = (224.0f * (*v) + 128.0f) / 255.0f; + } +} + +__device__ static inline float rgb2y(float r, float g, float b, + const CUDATonemapParams &p) +{ + float y = p.yuv_matrix[0] * r + p.yuv_matrix[1] * g + p.yuv_matrix[2] * b; + if (p.dst_range_full) + return y; + return (219.0f * y + 16.0f) / 255.0f; +} + +__device__ static inline void chroma_sample(float r0, float g0, float b0, + float r1, float g1, float b1, + float r2, float g2, float b2, + float r3, float g3, float b3, + int loc, + float *cr, float *cg, float *cb) +{ + switch (loc) { + case 1: /* AVCHROMA_LOC_LEFT */ + *cr = (r0 + r2) * 0.5f; + *cg = (g0 + g2) * 0.5f; + *cb = (b0 + b2) * 0.5f; + break; + case 3: /* AVCHROMA_LOC_TOPLEFT */ + *cr = r0; *cg = g0; *cb = b0; + break; + case 4: /* AVCHROMA_LOC_TOP */ + *cr = (r0 + r1) * 0.5f; + *cg = (g0 + g1) * 0.5f; + *cb = (b0 + b1) * 0.5f; + break; + case 5: /* AVCHROMA_LOC_BOTTOMLEFT */ + *cr = r2; *cg = g2; *cb = b2; + break; + case 6: /* AVCHROMA_LOC_BOTTOM */ + *cr = (r2 + r3) * 0.5f; + *cg = (g2 + g3) * 0.5f; + *cb = (b2 + b3) * 0.5f; + break; + default: /* CENTER / UNSPECIFIED */ + *cr = (r0 + r1 + r2 + r3) * 0.25f; + *cg = (g0 + g1 + g2 + g3) * 0.25f; + *cb = (b0 + b1 + b2 + b3) * 0.25f; + break; + } +} + +/* --- Per-pixel tonemap pipeline ---------------------------------- */ + +__device__ static inline void map_one_pixel_rgb(float *r, float *g, float *b, + const CUDATonemapParams &p, + float peak) +{ + float sig = fmaxf(fmaxf(*r, fmaxf(*g, *b)), 1e-6f); + + if (p.target_peak > 1.0f) { + sig *= 1.0f / p.target_peak; + peak *= 1.0f / p.target_peak; + } + + float sig_old = sig; + + if (p.desat_param > 0.0f) { + float luma = p.luma_dst[0] * (*r) + + p.luma_dst[1] * (*g) + + p.luma_dst[2] * (*b); + float coeff = fmaxf(sig - 0.18f, 1e-6f) / fmaxf(sig, 1e-6f); + coeff = __powf(coeff, 10.0f / p.desat_param); + *r = *r * (1.0f - coeff) + luma * coeff; + *g = *g * (1.0f - coeff) + luma * coeff; + *b = *b * (1.0f - coeff) + luma * coeff; + sig = sig * (1.0f - coeff) + luma * coeff; + } + + sig = apply_tonemap(sig, peak, p.tonemap_func, p.param); + sig = fminf(sig, 1.0f); + + float ratio = sig / sig_old; + *r *= ratio; + *g *= ratio; + *b *= ratio; +} + +/* --- Quantization helpers ---------------------------------------- */ + +__device__ static inline float read_p010(const unsigned short *p, + int idx) +{ + return (float)(__ldg(&p[idx]) >> 6) / 1023.0f; +} + +__device__ static inline float saturate(float x) +{ + return fminf(fmaxf(x, 0.0f), 1.0f); +} + +__device__ static inline unsigned short quant_p010(float v) +{ + return (unsigned short)(saturate(v) * 1023.0f + 0.5f) << 6; +} + +__device__ static inline unsigned char quant_nv12(float v) +{ + return (unsigned char)(saturate(v) * 255.0f + 0.5f); +} + +/* --- Main kernel ------------------------------------------------- */ + +extern "C" +__global__ void tonemap(CUDATonemapParams p) +{ + int xi = blockIdx.x * blockDim.x + threadIdx.x; + int yi = blockIdx.y * blockDim.y + threadIdx.y; + + int x = 2 * xi; + int y = 2 * yi; + + if (x + 1 >= p.width || y + 1 >= p.height) + return; + + int src_pitch_y = p.src_pitch / sizeof(unsigned short); + int src_pitch_uv = p.src_pitch / sizeof(unsigned short); + int dst_pitch_y, dst_pitch_uv; + + const unsigned short *src_y = (const unsigned short *)p.src_y; + const unsigned short *src_uv = (const unsigned short *)p.src_uv; + + /* Read 4 Y samples and 1 UV pair from P010 */ + float y0 = read_p010(src_y, y * src_pitch_y + x); + float y1 = read_p010(src_y, y * src_pitch_y + x + 1); + float y2 = read_p010(src_y, (y + 1) * src_pitch_y + x); + float y3 = read_p010(src_y, (y + 1) * src_pitch_y + x + 1); + + float u_val = read_p010(src_uv, yi * src_pitch_uv + 2 * xi); + float v_val = read_p010(src_uv, yi * src_pitch_uv + 2 * xi + 1); + + /* YUV to linear RGB for each of 4 pixels */ + float r0, g0, b0, r1, g1, b1, r2, g2, b2, r3, g3, b3; + + yuv2rgb(y0, u_val, v_val, p, &r0, &g0, &b0); + yuv2rgb(y1, u_val, v_val, p, &r1, &g1, &b1); + yuv2rgb(y2, u_val, v_val, p, &r2, &g2, &b2); + yuv2rgb(y3, u_val, v_val, p, &r3, &g3, &b3); + + /* Linearize (EOTF) */ + r0 = linearize(r0, p.src_trc); + g0 = linearize(g0, p.src_trc); + b0 = linearize(b0, p.src_trc); + r1 = linearize(r1, p.src_trc); + g1 = linearize(g1, p.src_trc); + b1 = linearize(b1, p.src_trc); + r2 = linearize(r2, p.src_trc); + g2 = linearize(g2, p.src_trc); + b2 = linearize(b2, p.src_trc); + r3 = linearize(r3, p.src_trc); + g3 = linearize(g3, p.src_trc); + b3 = linearize(b3, p.src_trc); + + /* OOTF (HLG only) */ + if (p.src_trc == TRC_HLG_CUDA) { + ootf_hlg(&r0, &g0, &b0, p.luma_src, p.signal_peak); + ootf_hlg(&r1, &g1, &b1, p.luma_src, p.signal_peak); + ootf_hlg(&r2, &g2, &b2, p.luma_src, p.signal_peak); + ootf_hlg(&r3, &g3, &b3, p.luma_src, p.signal_peak); + } + + /* Primaries conversion (e.g. BT.2020 to BT.709) */ + if (!p.rgb2rgb_passthrough) { + float tr, tg, tb; + mat3x3_mul(p.rgb2rgb_matrix, r0, g0, b0, + &tr, &tg, &tb); + r0 = tr; g0 = tg; b0 = tb; + mat3x3_mul(p.rgb2rgb_matrix, r1, g1, b1, + &tr, &tg, &tb); + r1 = tr; g1 = tg; b1 = tb; + mat3x3_mul(p.rgb2rgb_matrix, r2, g2, b2, + &tr, &tg, &tb); + r2 = tr; g2 = tg; b2 = tb; + mat3x3_mul(p.rgb2rgb_matrix, r3, g3, b3, + &tr, &tg, &tb); + r3 = tr; g3 = tg; b3 = tb; + } + + /* Tonemap each pixel */ + float peak = p.signal_peak; + map_one_pixel_rgb(&r0, &g0, &b0, p, peak); + map_one_pixel_rgb(&r1, &g1, &b1, p, peak); + map_one_pixel_rgb(&r2, &g2, &b2, p, peak); + map_one_pixel_rgb(&r3, &g3, &b3, p, peak); + + /* Delinearize */ + r0 = delinearize(r0, p.dst_trc); + g0 = delinearize(g0, p.dst_trc); + b0 = delinearize(b0, p.dst_trc); + r1 = delinearize(r1, p.dst_trc); + g1 = delinearize(g1, p.dst_trc); + b1 = delinearize(b1, p.dst_trc); + r2 = delinearize(r2, p.dst_trc); + g2 = delinearize(g2, p.dst_trc); + b2 = delinearize(b2, p.dst_trc); + r3 = delinearize(r3, p.dst_trc); + g3 = delinearize(g3, p.dst_trc); + b3 = delinearize(b3, p.dst_trc); + + /* Compute output luma (Y) for each pixel */ + float out_y0 = rgb2y(r0, g0, b0, p); + float out_y1 = rgb2y(r1, g1, b1, p); + float out_y2 = rgb2y(r2, g2, b2, p); + float out_y3 = rgb2y(r3, g3, b3, p); + + /* Compute chroma from the 4 delinearized RGB values */ + float cr, cg, cb; + chroma_sample(r0, g0, b0, r1, g1, b1, r2, g2, b2, r3, g3, b3, + p.chroma_loc, &cr, &cg, &cb); + float out_u, out_v, dummy_y; + rgb2yuv(cr, cg, cb, p, &dummy_y, &out_u, &out_v); + + /* Write output */ + if (p.out_depth == 10) { + unsigned short *dy = (unsigned short *)p.dst_y; + unsigned short *duv = (unsigned short *)p.dst_uv; + dst_pitch_y = p.dst_pitch / sizeof(unsigned short); + dst_pitch_uv = dst_pitch_y; + + dy[y * dst_pitch_y + x] = quant_p010(out_y0); + dy[y * dst_pitch_y + x + 1] = quant_p010(out_y1); + dy[(y + 1) * dst_pitch_y + x] = quant_p010(out_y2); + dy[(y + 1) * dst_pitch_y + x + 1] = quant_p010(out_y3); + + duv[yi * dst_pitch_uv + 2 * xi] = quant_p010(out_u); + duv[yi * dst_pitch_uv + 2 * xi + 1] = quant_p010(out_v); + } else { + unsigned char *dy = (unsigned char *)p.dst_y; + unsigned char *duv = (unsigned char *)p.dst_uv; + dst_pitch_y = p.dst_pitch; + dst_pitch_uv = dst_pitch_y; + + dy[y * dst_pitch_y + x] = quant_nv12(out_y0); + dy[y * dst_pitch_y + x + 1] = quant_nv12(out_y1); + dy[(y + 1) * dst_pitch_y + x] = quant_nv12(out_y2); + dy[(y + 1) * dst_pitch_y + x + 1] = quant_nv12(out_y3); + + duv[yi * dst_pitch_uv + 2 * xi] = quant_nv12(out_u); + duv[yi * dst_pitch_uv + 2 * xi + 1] = quant_nv12(out_v); + } +} diff --git a/libavfilter/vf_tonemap_cuda.h b/libavfilter/vf_tonemap_cuda.h new file mode 100644 index 0000000000..f8d44aca35 --- /dev/null +++ b/libavfilter/vf_tonemap_cuda.h @@ -0,0 +1,92 @@ +/* + * Copyright (c) 2026, Faeez Kadiri < f1k2faeez at gmail dot com> + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#ifndef AVFILTER_VF_TONEMAP_CUDA_H +#define AVFILTER_VF_TONEMAP_CUDA_H + +#if defined(__CUDACC__) || defined(__CUDA__) +#include <stdint.h> +typedef uint8_t* CUdeviceptr_t; +#else +#include <ffnvcodec/dynlink_cuda.h> +typedef CUdeviceptr CUdeviceptr_t; +#endif + +/** Tonemap algorithm selection, mirrored in the host-side enum. */ +enum TonemapAlgoCUDA { + TONEMAP_NONE_CUDA, + TONEMAP_LINEAR_CUDA, + TONEMAP_GAMMA_CUDA, + TONEMAP_CLIP_CUDA, + TONEMAP_REINHARD_CUDA, + TONEMAP_HABLE_CUDA, + TONEMAP_MOBIUS_CUDA, + TONEMAP_MAX_CUDA, +}; + +/** Source HDR transfer function for the EOTF linearization step. */ +enum TransferFuncCUDA { + TRC_ST2084_CUDA = 0, + TRC_HLG_CUDA = 1, +}; + +/** Output SDR delinearization curve selection. */ +enum DelinearizeFuncCUDA { + DELIN_BT1886_CUDA = 0, ///< inverse EOTF, gamma 2.4 + DELIN_BT709_CUDA = 1, ///< BT.709 OETF with linear segment +}; + +/** + * Kernel parameter block passed by value to the CUDA tonemap kernel. + * Shared between the host C code and the device .cu code. + */ +typedef struct CUDATonemapParams { + CUdeviceptr_t dst_y; ///< output luma plane + CUdeviceptr_t dst_uv; ///< output chroma plane (interleaved UV) + CUdeviceptr_t src_y; ///< input luma plane (P010) + CUdeviceptr_t src_uv; ///< input chroma plane (P010, interleaved UV) + + int width; ///< frame width in pixels + int height; ///< frame height in pixels + int src_pitch; ///< input plane pitch in bytes + int dst_pitch; ///< output plane pitch in bytes + + float rgb_matrix[9]; ///< YUV-to-RGB matrix (source colorspace) + float yuv_matrix[9]; ///< RGB-to-YUV matrix (output colorspace) + float rgb2rgb_matrix[9]; ///< gamut conversion (e.g. BT.2020 to BT.709) + float luma_src[3]; ///< source luma coefficients (cr, cg, cb) + float luma_dst[3]; ///< destination luma coefficients + + int tonemap_func; ///< algorithm, one of TonemapAlgoCUDA + float param; ///< algorithm-specific tuning parameter + float desat_param; ///< highlight desaturation strength + float signal_peak; ///< HDR signal peak (multiples of ref white) + float target_peak; ///< SDR target peak (normally 1.0) + + int src_trc; ///< source transfer, one of TransferFuncCUDA + int dst_trc; ///< dest transfer, one of DelinearizeFuncCUDA + int src_range_full; ///< 1 if source is full-range (JPEG) + int dst_range_full; ///< 1 if output is full-range (JPEG) + int rgb2rgb_passthrough; ///< 1 if source and dest primaries match + int chroma_loc; ///< chroma sample location for downsampling + int out_depth; ///< output bit depth (8 or 10) +} CUDATonemapParams; + +#endif /* AVFILTER_VF_TONEMAP_CUDA_H */ -- 2.52.0 _______________________________________________ ffmpeg-devel mailing list -- [email protected] To unsubscribe send an email to [email protected]
