Liu created FLINK-39083:
---------------------------

             Summary: Support field-level error tolerance for CSV format deserialization
                 Key: FLINK-39083
                 URL: https://issues.apache.org/jira/browse/FLINK-39083
             Project: Flink
          Issue Type: Improvement
          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
            Reporter: Liu
         Attachments: image-2026-02-13-10-08-41-570.png

h1. Motivation

Currently, the csv.ignore-parse-errors option in CSV format provides only two 
behaviors:
 * false (default): Any field-level parse error causes the entire job to fail.
 * true: Any field-level parse error causes the entire row to be discarded 
(returns null).

This "all-or-nothing" approach is problematic in production ETL scenarios. For 
example, consider a CSV table with 50 columns where only 1 TIMESTAMP column 
occasionally has malformed values. With ignore-parse-errors=true, the entire 
row—including the 49 correctly parsed fields—is silently dropped. This leads to 
significant and unnecessary data loss.

h1. Proposal

Introduce a new configuration option csv.ignore-single-field-parse-error 
(boolean, default false) that provides field-level error tolerance:
 * When enabled, if a single field fails type conversion (e.g., "abc" for an 
INT column), only that field is set to null, and the rest of the row is 
preserved.
 * Jackson-level parsing errors (e.g., malformed CSV structure) are not 
affected by this option and continue to be governed by ignore-parse-errors.
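
To make the intended semantics concrete, here is a minimal, self-contained sketch of the per-field handling. It is not the actual CsvToRowDataConverters code: the class FieldTolerantCsvSketch, the method convertRow, and the INT/STRING converters are hypothetical stand-ins. It also assumes that the new flag takes effect regardless of the value of ignore-parse-errors, which is still an open question (see Discussion Points).

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Illustrative sketch only; names are hypothetical and not part of Flink. */
public class FieldTolerantCsvSketch {

    // Hypothetical per-field converters standing in for Flink's runtime converters.
    static final Function<String, Object> INT = Integer::parseInt;
    static final Function<String, Object> STRING = s -> s;

    /**
     * Converts one row field by field.
     * Assumption: ignoreSingleFieldParseError is checked first and applies
     * even when ignoreParseErrors is false.
     */
    static Object[] convertRow(
            String[] rawFields,
            List<Function<String, Object>> converters,
            boolean ignoreParseErrors,
            boolean ignoreSingleFieldParseError) {
        Object[] row = new Object[rawFields.length];
        for (int i = 0; i < rawFields.length; i++) {
            try {
                row[i] = converters.get(i).apply(rawFields[i]);
            } catch (RuntimeException e) {
                if (ignoreSingleFieldParseError) {
                    // Proposed behavior: null out only the failed field.
                    row[i] = null;
                } else if (ignoreParseErrors) {
                    // Behavior described in Motivation: drop the whole row.
                    return null;
                } else {
                    // Default behavior: fail.
                    throw new RuntimeException("Failed to deserialize field " + i, e);
                }
            }
        }
        return row;
    }

    public static void main(String[] args) {
        String[] raw = {"1", "ok", "abc"}; // "abc" is invalid for the third (INT) column
        List<Function<String, Object>> converters = Arrays.asList(INT, STRING, INT);

        // Field-level tolerance: prints [1, ok, null]
        System.out.println(Arrays.toString(convertRow(raw, converters, false, true)));

        // Row-level tolerance only: prints null (row discarded)
        System.out.println(Arrays.toString(convertRow(raw, converters, true, false)));
    }
}
```

In this sketch the two correctly parsed fields survive when the new flag is on, whereas the row-level flag alone discards the whole row.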

Behavior matrix:

!image-2026-02-13-10-08-41-570.png|width=532,height=149!

*Scope of Changes*
 # CsvFormatOptions — Add new ConfigOption<Boolean> for 
ignore-single-field-parse-error.
 # CsvCommons — Register the new option in optionalOptions() and 
forwardOptions().
 # CsvToRowDataConverters — Core change: add ignoreSingleFieldParseErrors flag; 
modify the catch block in createRowConverter() to set the failed field to null 
instead of re-throwing.
 # CsvRowDataDeserializationSchema — Add builder setter; pass flag through to 
CsvToRowDataConverters.
 # CsvFormatFactory — Wire the new config option to the deserialization schema 
builder.
 # CsvFileFormatFactory — Wire the new config option in the bulk decoding path.
 # Tests — Add unit tests covering all four combinations in the behavior matrix.
 # Documentation — Update English and Chinese CSV format docs.

h1. Compatibility
 * Fully backward compatible: The new option defaults to false, preserving 
existing behavior.
 * No changes to serialization path: This option only affects deserialization.
 * No public API changes: Only a new optional configuration option is added.

h1. Discussion Points

I'd like to get community feedback on the following before proceeding with 
implementation:
 # Option naming: Is csv.ignore-single-field-parse-error clear enough? 
Alternatives considered: csv.field-error-as-null, csv.partial-parse-errors.
 # Interaction with ignore-parse-errors: Should 
ignore-single-field-parse-error=true implicitly suppress field-level errors 
even when ignore-parse-errors=false? Or should it only take effect when 
ignore-parse-errors is also true?
 # Cross-format consistency: Should we consider a similar option for the JSON 
format (JsonToRowDataConverters) in a follow-up JIRA? The JSON format has a 
similar "all-or-nothing" behavior today.
 # Logging: The proposed implementation logs a WARN for each field-level error. 
Should this be configurable or use a different log level (e.g., DEBUG) to avoid 
log flooding?

h1. Related

Depends on / follows: https://issues.apache.org/jira/browse/FLINK-39065



