ShivamS136 opened a new issue, #15163: URL: https://github.com/apache/pinot/issues/15163
## Issue Description

There appears to be a significant difference in deduplication behavior between Pinot v1.2.0 and v1.3.0. The behavior change affects how records are deduplicated based on the `dedupTimeColumn` and `metadataTTL` settings.

## Environment

- **Affected Pinot Versions**:
  - v1.3.0 (new behavior)
  - v1.2.0 (previous behavior)

## Deduplication Behavior Differences

### In v1.3.0:

- Records only get deduped if at least one insertion record's `dedupTimeColumn` value is at most `metadataTTL` older than current time
- If a record within TTL is inserted, then deduping works
- Records outside TTL are successfully inserted even if the data is the same (potential duplicates)
- If one record is encountered within TTL value, then the primary key is created and all future records with the same primary key value get deduped

### In v1.2.0:

- The `dedupTimeColumn` doesn't seem to affect deduplication
- Any record inserted into Pinot gets the primary key generated irrespective of time column value
- Future records with the same primary key value get deduped consistently

## Expected Behavior

Deduplication should work consistently across versions and should properly deduplicate records based on the primary key, regardless of the time column values.
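To make the difference concrete, here is a minimal Python sketch of the two behaviors as they are described in the lists above. It is a toy model of the observations in this report, not Pinot's implementation: `DedupModel`, `ingest`, and the sample key and timestamps are hypothetical names and values, and passing `metadata_ttl_ms=None` is only an assumption used to stand in for v1.2.0, where the TTL appears to be ignored. The model also does not expire keys once they are tracked.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DedupModel:
    """Toy model of the dedup behavior reported in this issue (not Pinot code)."""

    # None stands in for v1.2.0, where the TTL appears to be ignored (assumption).
    metadata_ttl_ms: Optional[int]
    seen_keys: set = field(default_factory=set)

    def ingest(self, primary_key, dedup_time_ms, now_ms):
        """Return True if the record is inserted, False if it is dropped as a duplicate."""
        if primary_key in self.seen_keys:
            return False  # key is already tracked, so the record is deduped
        within_ttl = (
            self.metadata_ttl_ms is None
            or now_ms - dedup_time_ms <= self.metadata_ttl_ms
        )
        if within_ttl:
            # Only records inside the TTL window register their primary key.
            self.seen_keys.add(primary_key)
        return True


NOW = 1_740_000_000_000  # arbitrary "current time" in epoch millis
TTL = 600_000            # same value as metadataTTL in the table config
key = (42, "alice", 1)   # (leaderboard_id, participant_id, attempt_number)
old = NOW - 2 * TTL      # a timestamp outside the TTL window

v13 = DedupModel(metadata_ttl_ms=TTL)
v13_results = [
    v13.ingest(key, old, NOW),  # outside TTL: inserted, key not tracked
    v13.ingest(key, old, NOW),  # outside TTL again: inserted as a duplicate
    v13.ingest(key, NOW, NOW),  # within TTL: inserted, key is now tracked
    v13.ingest(key, old, NOW),  # key tracked: deduped
]

v12 = DedupModel(metadata_ttl_ms=None)
v12_results = [v12.ingest(key, old, NOW) for _ in range(3)]

print(v13_results)  # [True, True, True, False]
print(v12_results)  # [True, False, False]
```

Under this model, the v1.3.0 run accepts the same primary key three times before deduplication kicks in, while the v1.2.0 run accepts it only once. That is the inconsistency this issue is reporting.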
## Table Configuration

<details>
<summary>Table Schema</summary>

```json
{
  "schemaName": "leaderboard_entries",
  "dimensionFieldSpecs": [
    { "name": "leaderboard_id", "dataType": "LONG" },
    { "name": "participant_id", "dataType": "STRING" },
    { "name": "attempt_number", "dataType": "INT", "defaultNullValue": 1 },
    { "name": "entry_meta", "dataType": "JSON", "defaultNullValue": "{}" }
  ],
  "metricFieldSpecs": [
    { "name": "score", "dataType": "INT", "defaultNullValue": 0 }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "insertion_time",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    },
    {
      "name": "attempt_time",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ],
  "primaryKeyColumns": ["leaderboard_id", "participant_id", "attempt_number"]
}
```

</details>

<details>
<summary>Table Config</summary>

```json
{
  "tableName": "leaderboard_entries",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "insertion_time",
    "replication": "2",
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "90",
    "timeType": "MILLISECONDS"
  },
  "query": { "timeoutMs": "5000" },
  "tenants": {},
  "tableIndexConfig": { "sortedColumn": ["score"] },
  "fieldConfigList": [
    { "name": "leaderboard_id", "indexes": { "inverted": {} } },
    { "name": "participant_id", "indexes": { "bloom": {} } }
  ],
  "ingestionConfig": {
    "streamIngestionConfig": {
      "streamConfigMaps": [
        {
          "streamType": "kafka",
          "stream.kafka.consumer.type": "lowlevel",
          "stream.kafka.topic.name": "leaderboard-entry",
          "stream.kafka.broker.list": "kafka:9092",
          "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
          "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory",
          "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
          "stream.kafka.consumer.prop.format": "JSON",
          "realtime.segment.flush.threshold.time": "4h",
          "realtime.segment.flush.threshold.rows": "0",
          "realtime.segment.flush.threshold.segment.rows": "0",
          "realtime.segment.flush.threshold.segment.size": "20M"
        }
      ]
    }
  },
  "metadata": { "customConfigs": {} },
  "routing": { "instanceSelectorType": "strictReplicaGroup" },
  "dedupConfig": {
    "dedupEnabled": true,
    "hashFunction": "NONE",
    "dedupTimeColumn": "insertion_time",
    "metadataTTL": 600000,
    "enablePreload": true
  }
}
```

</details>

## Observations

When using v1.2.0, the following warning appears during table addition, suggesting that the `dedupTimeColumn` and `metadataTTL` properties might not be recognized or used in this version:

```json
{
  "unrecognizedProperties": {
    "/dedupConfig/dedupTimeColumn": "insertion_time",
    "/dedupConfig/metadataTTL": 600000
  },
  "status": "Table leaderboard_entries_REALTIME successfully added"
}
```

## Impact

This behavior change can lead to:

1. Unexpected duplicates in v1.3.0 when records are outside the TTL window
2. Inconsistent deduplication behavior when migrating from v1.2.0 to v1.3.0
3. Potential data integrity issues if applications rely on the previous deduplication behavior

## Proposed Solution

Either:

1. Restore the v1.2.0 behavior where deduplication works consistently regardless of time column values, or
2. Clearly document this behavior change and provide configuration options to maintain backward compatibility

## Additional Information

Related Slack thread with more info: https://apache-pinot.slack.com/archives/C011C9JHN7R/p1740757158048619

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

---------------------------------------------------------------------

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org