klsince commented on issue #12667:
URL: https://github.com/apache/pinot/issues/12667#issuecomment-2008121069

   Linking this to a related issue: 
https://github.com/apache/pinot/issues/11948. We used to miss data while 
starting a new consuming segment, but after fixing that, we found under-count 
or over-count issues like the one reported here.
   
   There is another cause of this issue: queries can miss or double-count some 
PKs while data is being ingested. A query processes segments (mutable and many 
immutable ones) in a non-deterministic order:
   1. if the mutable segment is processed first, the query might see fewer 
valid docs (under-count), because newly ingested data can invalidate docs in 
immutable segments before those segments are processed
   2. if the mutable segment is processed last, the query might see more valid 
docs (over-count), because it may count docs in immutable segments and then 
count newly ingested docs for the same PKs in the mutable segment.
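
   The two orderings above can be reproduced with a small deterministic 
sketch. It is a toy model, not Pinot code: validDocIds are plain Python sets, 
and the function name and segment labels are made up for illustration. One 
upsert for an existing PK lands between the two segment reads.

```python
def count_with_interleaved_upsert(order):
    """Count valid docs segment by segment while one upsert lands mid-query.

    Hypothetical model: each segment's validDocIds bitmap is a set of doc ids.
    The table holds 2 distinct PKs, so the correct count is always 2.
    """
    valid = {"immutable": {0, 1}, "mutable": set()}

    def upsert():
        # A new record for the PK of doc 0 arrives: the old doc in the
        # immutable segment is invalidated, the new doc becomes valid
        # in the mutable segment.
        valid["immutable"].discard(0)
        valid["mutable"].add(0)

    total = 0
    for i, segment in enumerate(order):
        total += len(valid[segment])
        if i == 0:
            upsert()  # ingestion happens between the two segment reads
    return total


under = count_with_interleaved_upsert(["mutable", "immutable"])
over = count_with_interleaved_upsert(["immutable", "mutable"])
print(under, over)  # 1 3
```

   With the mutable segment first the query returns 1 (under-count); with the 
mutable segment last it returns 3 (over-count).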
   
   To solve this, we're thinking about providing a consistent table view for 
queries on upsert tables. In general, we can take copies of the validDocId 
bitmaps to form a consistent view for queries to use, while ingestion continues.
   
   We can take the copy on the read path or the write path:
   1. Query threads lock out the consuming thread (and the helix threads that 
are adding/replacing segments), take copies of the validDocId bitmaps, then 
release the lock and process the segments with the bitmap copies. Pros: fresh 
view; Cons: costly, and blocks ingestion more as the query rate increases.
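
   A minimal sketch of the read-path option, under the same toy model (sets 
standing in for validDocId bitmaps). The class and method names are 
hypothetical and do not correspond to Pinot's implementation; a single lock 
stands in for whatever coordination the consuming and helix threads would use.

```python
import threading


class ReadPathTable:
    """Toy upsert table where queries copy all bitmaps under a shared lock."""

    def __init__(self, valid_doc_ids):
        self._lock = threading.Lock()
        self._valid = {seg: set(ids) for seg, ids in valid_doc_ids.items()}

    def upsert(self, old_seg, old_doc, new_seg, new_doc):
        # Ingestion path: invalidate the old doc and validate the new one
        # as one step with respect to snapshot().
        with self._lock:
            if old_seg is not None:
                self._valid[old_seg].discard(old_doc)
            self._valid.setdefault(new_seg, set()).add(new_doc)

    def snapshot(self):
        # Query path: ingestion is blocked while every bitmap is copied,
        # which is the cost that grows with the query rate.
        with self._lock:
            return {seg: frozenset(ids) for seg, ids in self._valid.items()}


def count_valid_docs(view, order):
    # Segment order no longer matters: the view cannot change mid-query.
    return sum(len(view[seg]) for seg in order)


table = ReadPathTable({"immutable": {0, 1}, "mutable": set()})
view = table.snapshot()
table.upsert("immutable", 0, "mutable", 0)  # lands after the snapshot
print(count_valid_docs(view, ["mutable", "immutable"]))  # 2
print(count_valid_docs(view, ["immutable", "mutable"]))  # 2
```

   Both orderings give 2 because the query only ever reads its own copies.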
   
   2. The consuming thread (and also the helix threads) takes copies of the 
validDocId bitmaps, then atomically replaces the old set of bitmaps with the 
new set. Queries use the bitmap copies to process segments. We can take copies 
periodically to amortize the cost at the expense of some data freshness, or 
take a copy while handling each individual record if the copy can be made very 
cheap (only swapping the 1 or 2 updated bitmaps into the consistent view 
atomically).
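
   The per-record variant of the write-path option could look roughly like 
the sketch below. Again this is a toy model with hypothetical names, not 
Pinot's implementation: bitmaps are immutable frozensets, the writer copies 
only the 1 or 2 bitmaps it changes, and the new view is published with a 
single reference assignment (which this sketch assumes is atomic, as it is in 
CPython). Queries never take a lock.

```python
import threading


class WritePathTable:
    """Toy upsert table where the writer publishes copy-on-write views."""

    def __init__(self, valid_doc_ids):
        self._write_lock = threading.Lock()  # serializes writers only
        self._view = {seg: frozenset(ids) for seg, ids in valid_doc_ids.items()}

    def upsert(self, old_seg, old_doc, new_seg, new_doc):
        with self._write_lock:
            # Shallow copy: bitmaps of untouched segments are shared.
            new_view = dict(self._view)
            if old_seg is not None:
                new_view[old_seg] = new_view[old_seg] - {old_doc}
            new_view[new_seg] = new_view.get(new_seg, frozenset()) | {new_doc}
            # Publish both bitmap changes together by swapping one reference.
            self._view = new_view

    def view(self):
        # Queries read the reference once and use that view throughout.
        return self._view


table = WritePathTable({"immutable": {0, 1}, "mutable": set()})
before = table.view()
table.upsert("immutable", 0, "mutable", 0)
after = table.view()
print(sum(len(ids) for ids in before.values()))  # 2
print(sum(len(ids) for ids in after.values()))   # 2
```

   A query holding the `before` view still sees the old doc as valid and no 
new doc, while a query starting later sees the opposite; either way each PK is 
counted once.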
   
   These options are not mutually exclusive. We may add all of them and choose 
among them as needed (query-heavy, ingestion-heavy, strict freshness, etc.)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

