krishan1390 opened a new issue, #15608:
URL: https://github.com/apache/pinot/issues/15608

   Kinesis provides capabilities to split a shard into 2 shards and merge 2 
shards into 1 shard. 
   
   Both these operations create new shards and the older shards (called parent 
shards) stop being active. 
   
   To maintain ingestion ordering, Pinot needs to consume completely from older 
shards before it can start processing newer shards. To process new shards, 
Pinot needs to commit older segments and start new consumers. 
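
   The ordering rule above can be sketched as follows. This is only an illustration under stated assumptions, not Pinot's implementation: `ShardOrdering`, `ShardInfo` and `consumable` are hypothetical names, and the two parent fields simply mirror the `ParentShardId` and `AdjacentParentShardId` attributes that Kinesis reports for a shard. The sketch also assumes that a parent which is no longer listed can be treated as fully consumed.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; none of these names come from the Pinot codebase.
final class ShardOrdering {

  // Minimal shard model. The two parent fields mirror the ParentShardId and
  // AdjacentParentShardId attributes that Kinesis reports for a shard.
  static final class ShardInfo {
    final String shardId;
    final String parentShardId;          // null for a shard with no parent
    final String adjacentParentShardId;  // set only when the shard came from a merge

    ShardInfo(String shardId, String parentShardId, String adjacentParentShardId) {
      this.shardId = shardId;
      this.parentShardId = parentShardId;
      this.adjacentParentShardId = adjacentParentShardId;
    }
  }

  // Returns, in input order, the shards that are safe to consume now: not yet
  // fully consumed themselves, and with every listed parent fully consumed.
  static List<String> consumable(List<ShardInfo> shards, Set<String> fullyConsumed) {
    Set<String> listed = new HashSet<>();
    for (ShardInfo s : shards) {
      listed.add(s.shardId);
    }
    List<String> result = new ArrayList<>();
    for (ShardInfo s : shards) {
      if (fullyConsumed.contains(s.shardId)) {
        continue;
      }
      if (parentDone(s.parentShardId, listed, fullyConsumed)
          && parentDone(s.adjacentParentShardId, listed, fullyConsumed)) {
        result.add(s.shardId);
      }
    }
    return result;
  }

  // Assumption: a parent that is no longer listed is treated as done.
  private static boolean parentDone(String parentId, Set<String> listed, Set<String> fullyConsumed) {
    return parentId == null || !listed.contains(parentId) || fullyConsumed.contains(parentId);
  }
}
```

   For a shard `shard-0` split into `shard-1` and `shard-2`, only `shard-0` is returned until it is marked fully consumed; after that, both children become consumable.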
   
   This functionality is taken care of by RealtimeSegmentValidationManager when 
it runs periodically, with a few caveats:
   1. Currently, Pinot creates segments for older shards whenever 
RealtimeSegmentValidationManager runs. These segments have 0 docs, so they 
don't impact query results, but they are unnecessary additional metadata.
   2. Depending on certain race conditions, Pinot can go into a loop of 
creating multiple 0-doc segments for these older shards in a single 
RealtimeSegmentValidationManager run, which further complicates the problem.
   3. After a shard split, if the table consumption is paused and then 
resumed with "smallest" offset before RealtimeSegmentValidationManager runs, 
consumption doesn't resume and stops completely. The problem persists even 
if RealtimeSegmentValidationManager runs later or if we try to resume with 
"lastConsumed" offset. 
   4. Creating a new table that consumes from a stream after a shard has been 
split on that stream also leads to all of the above issues. 
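
   One possible guard against the 0-doc segments described in caveats 1 and 2 is sketched below. It is a hedged illustration, not the actual RealtimeSegmentValidationManager logic: `SegmentPlanner`, `ShardState` and `shardsNeedingNewSegment` are hypothetical names, and the sketch assumes the controller can tell whether a closed shard has already been read to its end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only; none of these names come from the Pinot codebase.
final class SegmentPlanner {

  // Hypothetical per-shard state as seen by a periodic validation run.
  enum ShardState {
    OPEN,                        // still receiving records
    CLOSED_WITH_DATA_REMAINING,  // split or merged, but not yet read to the end
    CLOSED_FULLY_CONSUMED        // split or merged, and read to the end
  }

  // Returns the shard ids (sorted) that should get a new consuming segment.
  // A closed shard that was read to its end is skipped, so repeated runs
  // never produce an empty segment for it.
  static List<String> shardsNeedingNewSegment(Map<String, ShardState> states) {
    List<String> result = new ArrayList<>();
    for (Map.Entry<String, ShardState> e : new TreeMap<>(states).entrySet()) {
      if (e.getValue() != ShardState.CLOSED_FULLY_CONSUMED) {
        result.add(e.getKey());
      }
    }
    return result;
  }
}
```

   Because the decision depends only on the shard state and not on how many times the check has run, calling it repeatedly for the same state yields the same answer.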
   
   Expected behaviour:
   1. After a shard is split, consumption should continue when a table is 
paused and resumed (regardless of whether RealtimeSegmentValidationManager 
has run).
   2. If we resume with "smallest" offset, we expect older shards to be 
consumed first. This will create duplicate data. New shards will be consumed if 
we resume later with "lastConsumed" or "largest" offset or after 
RealtimeSegmentValidationManager is run
   3. If we resume with "lastConsumed" offset, we expect consumption to start 
where it stopped without any data loss. 
   4. If we resume with "largest" offset, we expect consumption to start with 
some data loss. 
   5. If RealtimeSegmentValidationManager is run, we expect consumption to 
start where it stopped without any data loss. 
   6. There shouldn't be any new segments for older shards once they are 
completely processed.
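
   The expected effect of the three resume offsets in points 2 to 4 can be sketched for a single shard as follows. This is an illustration only: `ResumePlanner` and its methods are hypothetical, and offsets are modeled as `long` values for simplicity, whereas real Kinesis sequence numbers are large numeric strings. Here `lastConsumed` is assumed to mean the next offset that would have been read when consumption stopped.

```java
// Illustrative sketch only; none of these names come from the Pinot codebase.
final class ResumePlanner {

  enum OffsetCriteria { SMALLEST, LAST_CONSUMED, LARGEST }

  // Picks the offset at which consumption of one shard restarts.
  static long startOffset(OffsetCriteria criteria, long earliest, long lastConsumed, long latest) {
    switch (criteria) {
      case SMALLEST:
        return earliest;      // re-reads everything still retained
      case LARGEST:
        return latest;        // jumps to the tip of the shard
      default:
        return lastConsumed;  // continues exactly where consumption stopped
    }
  }

  // Number of already-consumed records that will be read again (duplicates).
  static long duplicated(long start, long lastConsumed) {
    return Math.max(0, lastConsumed - start);
  }

  // Number of unconsumed records that will never be read (data loss).
  static long skipped(long start, long lastConsumed) {
    return Math.max(0, start - lastConsumed);
  }
}
```

   With `earliest = 0`, `lastConsumed = 40` and `latest = 100`, "smallest" re-reads 40 records, "lastConsumed" neither duplicates nor skips anything, and "largest" skips 60 records.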
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


