[GitHub] [pinot] gortiz commented on pull request #10372: Let applyAnd to be applied using different window sizes

via GitHub Mon, 06 Mar 2023 01:14:09 -0800


gortiz commented on PR #10372:
URL: https://github.com/apache/pinot/pull/10372#issuecomment-1455758280


   > IIUC, the intention is to batch more reads from ForwardIndexReader to 
reduce the total times of accessing the reader? If that is the case, I think we 
should add the batch size into the SVScanDocIdIterator constructor without 
changing the applyAnd() interface
   
   That would make the change quite more complex. There are two buffers in this 
algorithm: the one used by `WhateverMatcher` 
   and the one used by `applyAnd`. Let's call `buffer` to the one in 
`WhateverMatcher` and `docIds` to the one in `applyAnd`.
   
   The code assumes that:
   - `length` <= `docIds.length` <= `buffer.length`. The reality is that 
`docIds.length` == `buffer.length`
   - `length` entries are going to be read into `buffer`.
   
   I'm not sure on what you suggest, but I'm going to assume that the idea is 
to increase `buffer.length` and break the second invariant.
   
   I think that would make the code quite more complex. `buffer` would need to 
be an actual cache over the data that is read and must read in some batch size 
defined at construction time. Then we would need a cursor to indicate which 
entry was the last that was consumed by the caller. Finally each time we call 
`readValuesSV` (I'm assuming with MV would be the same) we would need to check 
whether there are enough elements in `buffer` which may be resolved as:
   1. There are more documents to read than `length`, so directly read from 
`buffer` and move the cursor.
   2. `buffer` is exhausted, so read more from the backend, set the cursor to 0 
and repeat.
   3. `buffer` is not exhausted but there are not enough elements to read. 
First consume the elements on `buffer`, then read more and then read from 0 to 
the remaining number of elements.
   
   We can avoid the 3rd case if we assume that:
   1. `docIds.length = k * buffer.length` where `k` is some natural positive 
number
   2. In all calls but the last, `length = docIds.length`
   3. In the last call `length` = remaining docs in the segment and therefore 
`buffer` will contain exactly `length` elements.
   
   This algorithm introduce several not so trivial to analyze costs in Pinot's 
hotpath and contrary to my proposal, these changes will affect Pinot even when 
this customization is not used. In case we add something like that, I would be 
with Richard and I would require to add some benchmarks in order to hint 
whether the change may affect the performance negatively in some cases.
   
   Contrary to this approach, mine is simpler (it doesn't change the current 
algorithm or invariants at all), it doesn't require to add the new concept of 
fetch size into the reader, it doesn't affect the performance on OSS Pinot and 
it doesn't close the door to the possibility of adding the fetch size to the 
reader in a future in case we find other interesting cases.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org

[GitHub] [pinot] gortiz commented on pull request #10372: Let applyAnd to be applied using different window sizes

Reply via email to