ianvkoeppe opened a new issue #5751:
URL: https://github.com/apache/incubator-pinot/issues/5751


   ## Overview
   
   In hybrid setups, the Pinot broker splits each query between the offline 
and realtime tables. The split is based on a `time value`, which represents the 
time of the latest available offline segment. Pinot assumes the consumer is 
uploading partial or intermediate segments throughout the current day, so it 
only serves data up to `time value - 1` from the offline table and the rest 
from realtime.
   
   As an explicit example, if I upload a segment with the date 1/2/2000, the 
time boundary for the table will be 1/1/2000 (`time value - 1`). This means a 
query for all time will be rewritten to send (*, 1/1/2000] to the offline table 
and (1/1/2000, *) to the realtime table. In this scenario, the 1/2/2000 offline 
segment is not serving queries. Once another segment is uploaded for 1/3/2000, 
the data from the 1/2/2000 segment will be served.
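   
   The split described above can be sketched roughly as follows. This is an 
illustration only, not Pinot's actual code: the function names, the `ts` column 
name, and the daily granularity are assumptions, and the dates are read as 
month/day/year.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)  # assumption: the table uses daily granularity


def time_boundary(offline_segment_dates):
    """Latest offline segment date minus one day (the `time value - 1` rule)."""
    return max(offline_segment_dates) - ONE_DAY


def split_filters(boundary, time_column="ts"):
    """Return (offline_filter, realtime_filter) for a query over all time."""
    cutoff = boundary.isoformat()
    return (f"{time_column} <= '{cutoff}'", f"{time_column} > '{cutoff}'")


# Offline segments exist for 1/1/2000 and 1/2/2000.
boundary = time_boundary([date(2000, 1, 1), date(2000, 1, 2)])
offline_filter, realtime_filter = split_filters(boundary)

# After a 1/3/2000 segment is uploaded, the 1/2/2000 data becomes servable.
next_boundary = time_boundary([date(2000, 1, 1), date(2000, 1, 2), date(2000, 1, 3)])
```

   With only the 1/1 and 1/2 segments present, the offline filter stops at 
2000-01-01, so the 1/2/2000 segment receives no queries until a later segment 
arrives.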
   
   ## Problem Statement
   
   In hybrid table setups where offline segments are uploaded only once per 
day, those offline segments should start serving queries immediately, in place 
of the realtime segments covering the same period. Today, this has to be 
achieved via a "hack": an empty segment is pushed for a future date to trick 
Pinot into serving the latest offline segment that contains actual data.
   
   ## Requirements
   
   1. This should be configurable per table.
   
   ## Proposed Solutions
   
   ### Update BrokerRequestHandler to Partition Queries Differently
   
   We could use a table-level config in the request handler to modify the time 
filtering behavior so that it includes the latest offline segment, based on the 
time boundary (already available in the request handler).
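   
   A minimal sketch of this option, assuming daily granularity. The flag 
`include_latest_offline` stands in for the proposed table-level config and is a 
hypothetical name, as are the function and the `ts` column; none of this is 
existing Pinot API.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)  # assumption: daily granularity


def partition_filters(time_boundary, include_latest_offline=False, time_column="ts"):
    """Build the offline and realtime time filters from the time boundary.

    time_boundary is the value the handler already receives (`time value - 1`).
    When the hypothetical flag is set, the cutoff moves forward by one day so
    that the latest offline segment is served from the offline table.
    """
    cutoff = time_boundary + ONE_DAY if include_latest_offline else time_boundary
    text = cutoff.isoformat()
    return (f"{time_column} <= '{text}'", f"{time_column} > '{text}'")


default_split = partition_filters(date(2000, 1, 1))
proposed_split = partition_filters(date(2000, 1, 1), include_latest_offline=True)
```

   The time boundary itself is unchanged here; only the handler's 
interpretation of it differs per table.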
   
   ### Update TimeBoundaryManager
   
   Using a table-level config, we could modify the time boundary manager to 
return the current time value when asked for the time boundary, rather than the 
current behavior of returning `time value - 1`.
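   
   A rough sketch of this option. The class below is a stand-in and does not 
mirror the real TimeBoundaryManager; `use_latest_time_value` is a hypothetical 
name for the proposed table-level config, and daily granularity is assumed.

```python
from datetime import date, timedelta


class TimeBoundaryManagerSketch:
    """Tracks the latest offline segment time and derives the time boundary."""

    def __init__(self, use_latest_time_value=False):
        # Hypothetical per-table config flag.
        self.use_latest_time_value = use_latest_time_value
        self._time_value = None

    def on_segment_uploaded(self, segment_date):
        # Keep the maximum segment date seen so far.
        if self._time_value is None or segment_date > self._time_value:
            self._time_value = segment_date

    def get_time_boundary(self):
        if self._time_value is None:
            return None
        if self.use_latest_time_value:
            return self._time_value  # proposed behavior
        return self._time_value - timedelta(days=1)  # current behavior


current = TimeBoundaryManagerSketch()
proposed = TimeBoundaryManagerSketch(use_latest_time_value=True)
for manager in (current, proposed):
    manager.on_segment_uploaded(date(2000, 1, 2))
```

   With this approach the request handler would stay unchanged, since it would 
simply receive a later boundary for tables that opt in.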
   
   ## Open Questions
   
   1. At what point is the time value refreshed for a table? A segment upload 
may contain more than one segment file. Does Pinot wait until all files are 
uploaded before advancing the time value, or does the refresh happen 
asynchronously, so that the time value may advance while other files are still 
uploading? If the latter, we risk serving incomplete data: only some segments 
have been uploaded, but queries are already being diverted to the offline 
table.
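
   The risk can be illustrated with a small simulation. It assumes the "latter" 
case, that the time value refreshes after each individual file, which is 
exactly what this question asks and is not confirmed behavior. The row counts 
and function name are made up for illustration.

```python
from datetime import date


def simulate_push(files):
    """Upload files one at a time and record what the offline table could serve.

    files is a list of (segment_date, row_count) pairs. Returns a list of
    (time_value, rows_available_for_that_date) observed after each upload.
    """
    uploaded = []
    observed = []
    for segment_date, rows in files:
        uploaded.append((segment_date, rows))
        # Assumption under test: the time value refreshes after every file.
        time_value = max(d for d, _ in uploaded)
        available = sum(r for d, r in uploaded if d == time_value)
        observed.append((time_value, available))
    return observed


# One daily push for 1/2/2000, split into three segment files of 100 rows each.
push = [(date(2000, 1, 2), 100)] * 3
observed = simulate_push(push)
```

   Under that assumption, if the proposed config served the latest time value 
directly, queries for 1/2/2000 would see 100 of 300 rows after the first file 
lands.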




