ShivamS136 opened a new issue, #15163:
URL: https://github.com/apache/pinot/issues/15163

   ## Issue Description
   
   There appears to be a significant difference in deduplication behavior 
between Pinot v1.2.0 and v1.3.0. The behavior change affects how records are 
deduplicated based on the `dedupTimeColumn` and `metadataTTL` settings.
   
   ## Environment
   
   - **Affected Pinot Versions**: 
     - v1.3.0 (new behavior)
     - v1.2.0 (previous behavior)
   
   ## Deduplication Behavior Differences
   
   ### In v1.3.0:
   
   - Records are deduplicated only if at least one ingested record has a 
`dedupTimeColumn` value that is at most `metadataTTL` older than the current 
time
   - Records outside the TTL are ingested successfully even when their data is 
identical to an existing record (potential duplicates)
   - Once a record within the TTL is ingested, its primary key is tracked and 
all later records with the same primary key are deduplicated
   
   ### In v1.2.0:
   
   - The `dedupTimeColumn` doesn't seem to affect deduplication
   - Every record ingested into Pinot has its primary key tracked, 
irrespective of its time column value
   - Future records with the same primary key value get deduped consistently
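
The two behaviors above can be sketched as a toy model. This is only an illustration of what is described in this report, not Pinot's actual implementation; the class `DedupModel`, its `ingest` method, and the sample key and timestamps are hypothetical names and values chosen for the example.

```python
class DedupModel:
    """Toy model of the reported dedup behavior (hypothetical, not Pinot code)."""

    def __init__(self, metadata_ttl_ms=None):
        # None models v1.2.0 as reported: the time column is ignored.
        # A number models v1.3.0 as reported: keys are tracked only within the TTL.
        self.metadata_ttl_ms = metadata_ttl_ms
        self.tracked_keys = set()

    def ingest(self, primary_key, time_ms, now_ms):
        """Return True if the record is stored, False if dropped as a duplicate."""
        if primary_key in self.tracked_keys:
            return False
        within_ttl = (
            self.metadata_ttl_ms is None
            or now_ms - time_ms <= self.metadata_ttl_ms
        )
        if within_ttl:
            # Only records inside the TTL window register their primary key.
            self.tracked_keys.add(primary_key)
        return True


now = 1_700_000_000_000          # arbitrary "current time" in epoch millis
ttl = 600_000                    # metadataTTL from the table config
key = (1, "participant-1", 1)    # leaderboard_id, participant_id, attempt_number
stale = now - 2 * ttl            # time column value outside the TTL window

v13 = DedupModel(metadata_ttl_ms=ttl)
v13_results = [
    v13.ingest(key, stale, now),
    v13.ingest(key, stale, now),
    v13.ingest(key, now, now),
    v13.ingest(key, stale, now),
]

v12 = DedupModel(metadata_ttl_ms=None)
v12_results = [
    v12.ingest(key, stale, now),
    v12.ingest(key, stale, now),
    v12.ingest(key, now, now),
    v12.ingest(key, stale, now),
]

print(v13_results)  # [True, True, True, False]
print(v12_results)  # [True, False, False, False]
```

In this model the v1.3.0 variant stores both stale records and the first fresh one, while the v1.2.0 variant stores only the first record.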
   
   ## Expected Behavior
   
   Deduplication should work consistently across versions and should properly 
deduplicate records based on the primary key, regardless of the time column 
values.
   
   ## Table Configuration
   
   <details>
   <summary>Table Schema</summary>
   
   ```json
   {
        "schemaName": "leaderboard_entries",
        "dimensionFieldSpecs": [
                {
                        "name": "leaderboard_id",
                        "dataType": "LONG"
                },
                {
                        "name": "participant_id",
                        "dataType": "STRING"
                },
                {
                        "name": "attempt_number",
                        "dataType": "INT",
                        "defaultNullValue": 1
                },
                {
                        "name": "entry_meta",
                        "dataType": "JSON",
                        "defaultNullValue": "{}"
                }
        ],
        "metricFieldSpecs": [
                {
                        "name": "score",
                        "dataType": "INT",
                        "defaultNullValue": 0
                }
        ],
        "dateTimeFieldSpecs": [
                {
                        "name": "insertion_time",
                        "dataType": "LONG",
                        "format": "1:MILLISECONDS:EPOCH",
                        "granularity": "1:MILLISECONDS"
                },
                {
                        "name": "attempt_time",
                        "dataType": "LONG",
                        "format": "1:MILLISECONDS:EPOCH",
                        "granularity": "1:MILLISECONDS"
                }
        ],
        "primaryKeyColumns": ["leaderboard_id", "participant_id", "attempt_number"]
   }
   ```
   </details>
   
   <details>
   <summary>Table Config</summary>
   
   ```json
   {
        "tableName": "leaderboard_entries",
        "tableType": "REALTIME",
        "segmentsConfig": {
                "timeColumnName": "insertion_time",
                "replication": "2",
                "retentionTimeUnit": "DAYS",
                "retentionTimeValue": "90",
                "timeType": "MILLISECONDS"
        },
        "query": {
                "timeoutMs": "5000"
        },
        "tenants": {},
        "tableIndexConfig": {
                "sortedColumn": ["score"]
        },
        "fieldConfigList": [
                {
                        "name": "leaderboard_id",
                        "indexes": {
                                "inverted": {}
                        }
                },
                {
                        "name": "participant_id",
                        "indexes": {
                                "bloom": {}
                        }
                }
        ],
        "ingestionConfig": {
                "streamIngestionConfig": {
                        "streamConfigMaps": [
                                {
                                        "streamType": "kafka",
                                        "stream.kafka.consumer.type": "lowlevel",
                                        "stream.kafka.topic.name": "leaderboard-entry",
                                        "stream.kafka.broker.list": "kafka:9092",
                                        "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
                                        "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory",
                                        "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
                                        "stream.kafka.consumer.prop.format": "JSON",
                                        "realtime.segment.flush.threshold.time": "4h",
                                        "realtime.segment.flush.threshold.rows": "0",
                                        "realtime.segment.flush.threshold.segment.rows": "0",
                                        "realtime.segment.flush.threshold.segment.size": "20M"
                                }
                        ]
                }
        },
        "metadata": {
                "customConfigs": {}
        },
        "routing": {
                "instanceSelectorType": "strictReplicaGroup"
        },
        "dedupConfig": {
                "dedupEnabled": true,
                "hashFunction": "NONE",
                "dedupTimeColumn": "insertion_time",
                "metadataTTL": 600000,
                "enablePreload": true
        }
   }
   ```
   </details>
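
A possible way to build test messages for reproducing this with the schema and config above is sketched below. It only constructs JSON payloads; publishing them to the `leaderboard-entry` topic is left out. The helper `make_entry` and all field values are hypothetical, and the stale record is simply placed two TTLs in the past so that it falls outside the 10-minute (`600000` ms) `metadataTTL` window.

```python
import json
import time

METADATA_TTL_MS = 600_000  # dedupConfig.metadataTTL from the table config (10 minutes)


def make_entry(insertion_time_ms, score):
    """Build one record matching the leaderboard_entries schema (sample values)."""
    return {
        "leaderboard_id": 1,
        "participant_id": "participant-1",
        "attempt_number": 1,
        "entry_meta": "{}",
        "score": score,
        "insertion_time": insertion_time_ms,
        "attempt_time": insertion_time_ms,
    }


now_ms = int(time.time() * 1000)
stale = make_entry(now_ms - 2 * METADATA_TTL_MS, score=10)  # outside the TTL window
fresh = make_entry(now_ms, score=10)                        # inside the TTL window

# Same primary key every time: two stale records, one fresh, then one more stale.
messages = [json.dumps(entry) for entry in (stale, stale, fresh, stale)]
for message in messages:
    print(message)
```

Based on the behavior reported here, v1.3.0 would be expected to keep the first three of these records and v1.2.0 only the first.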
   
   ## Observations
   
   When using v1.2.0, the following warning appears during table addition, 
suggesting that the `dedupTimeColumn` and `metadataTTL` properties might not be 
recognized or used in this version:
   
   ```json
   {
     "unrecognizedProperties": {
       "/dedupConfig/dedupTimeColumn": "insertion_time",
       "/dedupConfig/metadataTTL": 600000
     },
     "status": "Table leaderboard_entries_REALTIME successfully added"
   }
   ```
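
A small check like the following could flag this situation when scripting table creation. It is a sketch that only parses a response body shaped like the one above; the function name `unrecognized_dedup_properties` is hypothetical and no Pinot client API is assumed.

```python
import json


def unrecognized_dedup_properties(response_text):
    """Return the dedupConfig properties that the controller reported as unrecognized."""
    response = json.loads(response_text)
    unrecognized = response.get("unrecognizedProperties", {})
    return {
        path: value
        for path, value in unrecognized.items()
        if path.startswith("/dedupConfig/")
    }


# Response body copied from the v1.2.0 observation in this report.
response_text = """
{
  "unrecognizedProperties": {
    "/dedupConfig/dedupTimeColumn": "insertion_time",
    "/dedupConfig/metadataTTL": 600000
  },
  "status": "Table leaderboard_entries_REALTIME successfully added"
}
"""

ignored = unrecognized_dedup_properties(response_text)
for path, value in ignored.items():
    print(f"not recognized by this Pinot version: {path} = {value}")
```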
   
   ## Impact
   
   This behavior change can lead to:
   1. Unexpected duplicates in v1.3.0 when records are outside the TTL window
   2. Inconsistent deduplication behavior when migrating from v1.2.0 to v1.3.0
   3. Potential data integrity issues if applications rely on the previous 
deduplication behavior
   
   ## Proposed Solution
   
   Either:
   1. Restore the v1.2.0 behavior where deduplication works consistently 
regardless of time column values, or
   2. Clearly document this behavior change and provide configuration options 
to maintain backward compatibility
   
   ## Additional Information
   
   Related Slack thread with more info: 
https://apache-pinot.slack.com/archives/C011C9JHN7R/p1740757158048619


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

