krishan1390 opened a new issue, #15608: URL: https://github.com/apache/pinot/issues/15608
Kinesis can split a shard into two shards and merge two shards into one. Both operations create new shards, and the older shards (called parent shards) stop being active. To maintain ingestion ordering, Pinot needs to consume the older shards completely before it starts processing the newer shards. To process the new shards, Pinot needs to commit the segments of the older shards and start new consumers. RealtimeSegmentValidationManager takes care of this when it runs periodically, with a few caveats:

1. Currently, Pinot creates segments for the older shards every time RealtimeSegmentValidationManager runs. These segments have 0 docs, so they don't affect query results, but they are unnecessary additional metadata.
2. Depending on certain race conditions, Pinot can go into a loop of creating multiple 0-doc segments for these older shards in a single RealtimeSegmentValidationManager run, which makes the problem worse.
3. After a shard split, if table consumption is paused and then resumed with the "smallest" offset before RealtimeSegmentValidationManager runs, consumption doesn't resume and stops completely. The problem persists even if RealtimeSegmentValidationManager runs later or if we try to resume with the "lastConsumed" offset.
4. Creating a new table on a stream after a shard has been split on that stream also leads to all of the above issues.

Expected behaviour

1. After a shard is split, consumption should continue if a table is paused and resumed, independent of whether RealtimeSegmentValidationManager has run.
2. If we resume with the "smallest" offset, we expect the older shards to be consumed first. This will create duplicate data. The new shards will be consumed if we later resume with the "lastConsumed" or "largest" offset, or after RealtimeSegmentValidationManager runs.
3. If we resume with the "lastConsumed" offset, we expect consumption to start where it stopped, without any data loss.
4. If we resume with the "largest" offset, we expect consumption to start, with some data loss.
5. If RealtimeSegmentValidationManager runs, we expect consumption to start where it stopped, without any data loss.
6. There shouldn't be any new segments for an older shard once it has been completely processed.
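
The ordering rule and expected behaviour 6 can be illustrated with a small, language-neutral sketch. This is not Pinot code: `Shard`, `shards_to_consume`, and the `fully_consumed` set are hypothetical names used only for illustration. `parent_ids` stands in for the parent references Kinesis reports on a shard (one parent after a split, two after a merge), and `fully_consumed` is assumed to hold the closed shards that have already been read to their end.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Shard:
    # Hypothetical model of a stream shard, not a Pinot or Kinesis class.
    shard_id: str
    parent_ids: Tuple[str, ...] = ()  # 1 parent after a split, 2 after a merge


def shards_to_consume(shards: List[Shard], fully_consumed: Set[str]) -> List[str]:
    """Return the shard ids that should have an active consumer.

    A shard is eligible only when every parent that is still listed has been
    fully consumed; shards that are themselves fully consumed are skipped, so
    no further (empty) segments would be created for them.
    """
    known = {s.shard_id for s in shards}
    eligible = []
    for shard in shards:
        if shard.shard_id in fully_consumed:
            continue  # drained parent shard: nothing left to ingest
        # A parent that is no longer listed is treated as already done
        # (an assumption of this sketch).
        parents_done = all(
            p not in known or p in fully_consumed for p in shard.parent_ids
        )
        if parents_done:
            eligible.append(shard.shard_id)
    return eligible


# shard-0 was split into shard-1 and shard-2, which were later merged into shard-3.
stream = [
    Shard("shard-0"),
    Shard("shard-1", ("shard-0",)),
    Shard("shard-2", ("shard-0",)),
    Shard("shard-3", ("shard-1", "shard-2")),
]

print(shards_to_consume(stream, fully_consumed=set()))
print(shards_to_consume(stream, fully_consumed={"shard-0"}))
```

With nothing consumed yet, only the parent shard is eligible. Once `shard-0` is marked as fully consumed it drops out and its two children become eligible, which is the behaviour requested above regardless of whether the trigger is a pause/resume or a RealtimeSegmentValidationManager run.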