filters.texi: add classify documentation

m.kaindl0208 Tue, 11 Mar 2025 10:16:27 -0700

Signed-off-by: MaximilianKaindl <[email protected]>
---
 doc/filters.texi | 124 ++++++++++++++++++++++++++++++++---------------
 1 file changed, 85 insertions(+), 39 deletions(-)


diff --git a/doc/filters.texi b/doc/filters.texi
index 0ba7d3035f..a7046e0f4e 100644
--- a/doc/filters.texi
+++ b/doc/filters.texi
@@ -11970,45 +11970,6 @@ ffmpeg -i INPUT -f lavfi -i nullsrc=hd720,geq='r=128+80*(sin(sqrt((X-W/2)*(X-W/2
 @end example
 @end itemize

-@section dnn_classify
-
-Do classification with deep neural networks based on bounding boxes.
-
-The filter accepts the following options:
-
-@table @option
-@item dnn_backend
-Specify which DNN backend to use for model loading and execution. This option accepts
-only openvino now, tensorflow backends will be added.
-
-@item model
-Set path to model file specifying network architecture and its parameters.
-Note that different backends use different file formats.
-
-@item input
-Set the input name of the dnn network.
-
-@item output
-Set the output name of the dnn network.
-
-@item confidence
-Set the confidence threshold (default: 0.5).
-
-@item labels
-Set path to label file specifying the mapping between label id and name.
-Each label name is written in one line, tailing spaces and empty lines are skipped.
-The first line is the name of label id 0,
-and the second line is the name of label id 1, etc.
-The label id is considered as name if the label file is not provided.
-
-@item backend_configs
-Set the configs to be passed into backend
-
-For tensorflow backend, you can set its configs with @option{sess_config} options,
-please use tools/python/tf_sess_config.py to get the configs for your system.
-
-@end table
-
 @section dnn_detect

 Do object detection with deep neural networks.
@@ -31982,6 +31943,91 @@ settb=AVTB
 @end example
 @end itemize

+@section dnn_classify
+Analyze media (video frames or audio) using deep neural networks to apply classifications based on the content.
+This filter supports three classification modes:
+
+@itemize @bullet
+@item Standard image classification (OpenVINO backend)
+@item CLIP (Contrastive Language-Image Pre-training) classification (Torch backend)
+@item CLAP (Contrastive Language-Audio Pre-training) classification (Torch backend)
+@end itemize
+
+The filter accepts the following options:
+@table @option
+@item dnn_backend
+Specify which DNN backend to use for model loading and execution. Currently supports:
+@table @samp
+@item openvino
+Use OpenVINO backend (standard image classification only).
+@item torch
+Use LibTorch backend (supports CLIP for images and CLAP for audio).
+@end table
+@item confidence
+Set the confidence threshold (default: 0.5). Classifications with confidence below this value will be filtered out.
+@item labels
+Set path to a label file specifying classification labels. This is required for standard classification and can be used for CLIP/CLAP classification.
+Each label is written on a separate line in the file. Trailing spaces and empty lines are skipped.
+@item categories
+Path to a categories file for hierarchical classification (CLIP/CLAP only). This allows classification to be organized into multiple category units with individual categories containing related labels.
+@item tokenizer
+Path to the text tokenizer.json file (CLIP/CLAP only). Required for text embedding generation.
+@item target
+Specify which objects to classify. When omitted, the entire frame is classified. When specified, only bounding boxes with detection labels matching this value are classified.
+@item is_audio
+Enable audio processing mode for CLAP models (default: 0). Set to 1 to process audio input instead of video frames.
+@item logit_scale
+Logit scale for similarity calculation in CLIP/CLAP (default: 4.6052 for CLIP, 33.37 for CLAP). Values below 0 use the default.
+@item temperature
+Softmax temperature for CLIP/CLAP models (default: 1.0). Lower values make the output more peaked, higher values make it smoother.
+@item forward_order
+Order of forward output for CLIP/CLAP: 0 for media-text order, 1 for text-media order (default depends on model type).
+@item normalize
+Whether to normalize the input tensor for CLIP/CLAP (default depends on model type). Some scripted models already normalize inside their forward pass, in which case this option is not needed.
+@item input_res
+Expected input resolution for video processing models (default: automatically detected).
+@item sample_rate
+Expected sample rate for audio processing models (default: 44100).
+@item sample_duration
+Expected sample duration in seconds for audio processing models (default: 7).
+@item token_dimension
+Dimension of token vector for text embeddings (default: 77).
+@item optimize
+Enable graph executor optimization (0: disabled, 1: enabled).
+@end table
+@subsection Categories File Format
+For CLIP/CLAP models, a hierarchical categories file can be provided with the following format:
+@example
+[RecordingSystem]
+(Professional)
+a photo with high level of detail
+a professionally recorded sound
+(HomeRecording)
+a photo with low level of detail
+an amateur recording
+[ContentType]
+(Nature)
+trees
+mountains
+birds singing
+(Urban)
+buildings
+street noise
+traffic sounds
+@end example
+Each unit enclosed in square brackets [] creates a classification group. Within each group, categories are defined with parentheses () and the labels under each category are used to classify the input.
+@subsection Examples
+@example
+# Classify video using OpenVINO
+ffmpeg -i input.mp4 -vf "dnn_classify=dnn_backend=openvino:model=model.xml:labels=labels.txt" output.mp4
+# Classify video using CLIP
+ffmpeg -i input.mp4 -vf "dnn_classify=dnn_backend=torch:model=clip_model.pt:categories=categories.txt:tokenizer=tokenizer.json" output.mp4
+# Classify only person objects in a video
+ffmpeg -i input.mp4 -vf "dnn_detect=model=detection.xml:input=data:output=detection_out:confidence=0.5,dnn_classify=model=clip_model.pt:dnn_backend=torch:tokenizer=tokenizer.json:labels=labels.txt:target=person" output.mp4
+# Classify audio using CLAP
+ffmpeg -i input.mp3 -af "dnn_classify=dnn_backend=torch:model=clap_model.pt:categories=audio_categories.txt:tokenizer=tokenizer.json:is_audio=1:sample_rate=44100:sample_duration=7" output.mp3
+@end example
+
 @section showcqt
 Convert input audio to a video output representing frequency spectrum
 logarithmically using Brown-Puckette constant Q transform algorithm with
--
2.34.1
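
Note on the categories file format documented in the patch: the layout ([unit] headers, (category) headers, then one label per line) may be easier to follow with a small parser. The following is only an illustrative Python sketch, not the filter's C implementation. The function name parse_categories is hypothetical, and skipping blank lines is an assumption borrowed from the behaviour described for the labels file.

```python
def parse_categories(text):
    """Parse the [Unit] / (Category) / label layout into nested dicts."""
    units = {}
    unit = None      # dict of categories for the current [unit]
    category = None  # list of labels for the current (category)
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        if line.startswith("[") and line.endswith("]"):
            unit = units.setdefault(line[1:-1], {})
            category = None
        elif line.startswith("(") and line.endswith(")"):
            if unit is None:
                raise ValueError("category before any [unit]: " + line)
            category = unit.setdefault(line[1:-1], [])
        else:
            if category is None:
                raise ValueError("label before any (category): " + line)
            category.append(line)
    return units


# The sample file from the documentation in the patch.
SAMPLE = """\
[RecordingSystem]
(Professional)
a photo with high level of detail
a professionally recorded sound
(HomeRecording)
a photo with low level of detail
an amateur recording
[ContentType]
(Nature)
trees
mountains
birds singing
(Urban)
buildings
street noise
traffic sounds
"""

categories = parse_categories(SAMPLE)
```

Here categories maps each unit name to its categories, and each category name to its list of label prompts, which mirrors the unit/category/label hierarchy the documentation describes.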

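Note on the temperature option: the documentation says lower values make the output more peaked and higher values make it smoother. A minimal Python sketch of that effect follows, assuming the conventional definition in which scores are divided by the temperature before the softmax. The patch does not state how the filter combines temperature with logit_scale, so the function name and the input scores below are purely hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature (assumed convention)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical similarity scores for three labels
sharp = softmax_with_temperature(logits, 0.5)
default = softmax_with_temperature(logits, 1.0)
smooth = softmax_with_temperature(logits, 2.0)
```

With these inputs the top label receives the largest share of probability at temperature 0.5 and the smallest at 2.0, which is the peaked versus smooth behaviour the option text describes.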

[FFmpeg-devel] [PATCH v2 FFmpeg 18/20] doc/filters.texi: add classify documentation