[PR] Enhancing the Turkish stop word list with additional common words [lucene]

2025-04-24 Thread via GitHub


HakanBayazitHabes opened a new pull request, #14549:
URL: https://github.com/apache/lucene/pull/14549

   ### Description
   
   This pull request proposes the addition of several frequently used Turkish 
stopwords to the stopwords.txt file. These words are commonly considered 
non-informative in NLP tasks and are consistent with standard Turkish stopword 
sets.
   
   - Words like şu, şöyle, şayet, and öz were added
   
   - Alphabetically sorted for consistency
   
   - Based on analysis of multiple Turkish NLP resources
   
   This enhancement improves coverage and ensures better compatibility with 
text processing tasks involving Turkish.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Make competitive iterators more robust. [lucene]

2025-04-24 Thread via GitHub


jpountz commented on PR #14532:
URL: https://github.com/apache/lucene/pull/14532#issuecomment-2827499874

   Benchmark results with #14550 applied:
   
   ```
   TaskQPS baseline  StdDevQPS 
my_modified_version  StdDevPct diff p-value
OrStopWords   33.75  (8.8%)   32.53  
(7.2%)   -3.6% ( -18% -   13%) 0.399
   IntervalsOrdered2.29  (4.1%)2.23  
(5.3%)   -2.4% ( -11% -7%) 0.334
   AndStopWords   30.50  (5.6%)   29.81  
(5.9%)   -2.3% ( -13% -9%) 0.462
 OrMany   19.52  (4.4%)   19.15  
(5.7%)   -1.9% ( -11% -8%) 0.483
 Or2Terms2StopWords  163.00  (5.5%)  159.95  
(4.7%)   -1.9% ( -11% -8%) 0.492
   Or3Terms  169.56  (5.8%)  166.70  
(4.8%)   -1.7% ( -11% -9%) 0.553
 OrHighHigh   52.71  (5.7%)   51.88  
(3.8%)   -1.6% ( -10% -8%) 0.542
  And3Terms  175.92  (4.0%)  173.27  
(4.4%)   -1.5% (  -9% -7%) 0.504
And2Terms2StopWords  165.34  (3.5%)  162.88  
(3.9%)   -1.5% (  -8% -6%) 0.456
   DismaxOrHighHigh  117.18  (2.3%)  115.70  
(1.6%)   -1.3% (  -5% -2%) 0.236
   Term  441.32  (3.6%)  435.76  
(4.9%)   -1.3% (  -9% -7%) 0.586
  OrHighMed  194.68  (4.8%)  192.28  
(3.2%)   -1.2% (  -8% -7%) 0.569
AndHighOrMedMed   47.26  (0.9%)   46.68  
(0.9%)   -1.2% (  -2% -0%) 0.011
DismaxOrHighMed  171.78  (2.7%)  170.05  
(1.8%)   -1.0% (  -5% -3%) 0.408
 CombinedOrHighHigh   19.21  (1.1%)   19.03  
(0.9%)   -0.9% (  -2% -1%) 0.074
   CombinedTerm   31.67  (4.4%)   31.40  
(3.9%)   -0.8% (  -8% -7%) 0.703
  CombinedOrHighMed   73.65  (0.7%)   73.04  
(0.9%)   -0.8% (  -2% -0%) 0.059
CombinedAndHighHigh   11.61  (0.8%)   11.51  
(1.5%)   -0.8% (  -3% -1%) 0.211
 Fuzzy1  101.38  (1.6%)  100.61  
(2.3%)   -0.8% (  -4% -3%) 0.462
 CombinedAndHighMed   39.16  (1.2%)   38.87  
(1.1%)   -0.8% (  -3% -1%) 0.236
 Fuzzy2   85.09  (1.5%)   84.46  
(2.1%)   -0.7% (  -4% -2%) 0.444
FilteredOrStopWords   47.30  (1.6%)   47.01  
(2.2%)   -0.6% (  -4% -3%) 0.550
 FilteredPhrase   33.45  (1.4%)   33.25  
(1.4%)   -0.6% (  -3% -2%) 0.439
AndHighHigh   43.56  (0.9%)   43.36  
(1.4%)   -0.5% (  -2% -1%) 0.460
   CountAndHighHigh  359.83  (1.3%)  358.29  
(3.2%)   -0.4% (  -4% -4%) 0.743
Prefix3  160.80  (2.7%)  160.13  
(5.2%)   -0.4% (  -8% -7%) 0.851
CountAndHighMed  309.90  (2.0%)  308.69  
(1.6%)   -0.4% (  -3% -3%) 0.682
  FilteredAnd3Terms  183.33  (2.0%)  182.64  
(1.6%)   -0.4% (  -3% -3%) 0.700
 DismaxTerm  486.91  (1.4%)  485.15  
(1.9%)   -0.4% (  -3% -3%) 0.688
 OrHighRare  274.71  (2.8%)  273.73  
(4.4%)   -0.4% (  -7% -7%) 0.857
   FilteredAndStopWords   46.11  (1.6%)   45.95  
(1.8%)   -0.4% (  -3% -3%) 0.701
 FilteredAndHighMed  128.57  (2.5%)  128.13  
(2.6%)   -0.3% (  -5% -4%) 0.801
   AndMedOrHighHigh   66.66  (0.9%)   66.43  
(1.8%)   -0.3% (  -3% -2%) 0.664
  TermMonthSort 3290.83  (2.4%) 3281.44  
(3.0%)   -0.3% (  -5% -5%) 0.846
CountPhrase4.21  (1.8%)4.20  
(4.4%)   -0.3% (  -6% -6%) 0.882
 CountFilteredOrHighMed  149.22  (0.5%)  148.84  
(0.7%)   -0.3% (  -1% -0%) 0.449
 FilteredOr2Terms2StopWords  149.95  (1.2%)  149.59  
(1.4%)   -0.2% (  -2% -2%) 0.729
 AndHighMed  136.50  (0.8%)  136.19  
(0.8%)   -0.2% (  -1% -1%) 0.616
   FilteredOr3Terms  168.18  (1.1%)  167.82  
(1.5%)   -0.2% (  -2% -2%) 0.751
   FilteredTerm  160.37  (1.8%)  160.11  
(1.1%)   -0.2% (  -3% -2%) 0.838
 CountOrHighMed  369.17  (1.4%)  368.63  
(2.5%)   -0.1% (  -3% -3%) 0

Re: [I] Allow Histogram Collection using PointTree when SortedNumericDocValues is absent [lucene]

2025-04-24 Thread via GitHub


jpountz commented on issue #14536:
URL: https://github.com/apache/lucene/issues/14536#issuecomment-2827511771

   In my opinion, it's fine to require doc values to be indexed for faceting to 
work. I don't think we should try to support faceting (or sorting) when the 
field has a points index but no doc values.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Remove DISIDocIdStream. [lucene]

2025-04-24 Thread via GitHub


jpountz opened a new pull request, #14550:
URL: https://github.com/apache/lucene/pull/14550

   I had initially introduced `DISIDocIdStream` to avoid introducing 
regressions when `DenseConjunctionBulkScorer` started accepting single clauses. 
However, benchmarks on #14532 suggested that going through `DISIDocIdStream` is 
slower than loading docs into a bit set first and then iterating the bit set, 
when the postings list has many of its blocks encoded as bit sets.
   
   This makes sense, the way how `BitSetDocIdStream` iterates set bits saves a 
number of operations compared with calling `FixedBitSet#nextSetBit` in a loop.
   
   So I'm suggesting removing `DISIDocIdStream` for now for simplicity.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Impl intoBitset for IndexedDISI and Docvalues [lucene]

2025-04-24 Thread via GitHub


gf2121 merged PR #14529:
URL: https://github.com/apache/lucene/pull/14529


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[I] Add `SkipIndex` in SortedNumericDocValuesSetQuery [lucene]

2025-04-24 Thread via GitHub


kkewwei opened a new issue, #14551:
URL: https://github.com/apache/lucene/issues/14551

   ### Description
   
   In #13449, SkipIndex was introduced. It is primarily utilized in range query 
like `SortedSetDocValuesRangeQuery` and `SortedNumericDocValuesRangeQuery`. I'm 
wondering if we could incorporate it into `SortedNumericDocValuesSetQuery`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Add a timeout for forceMergeDeletes in IndexWriter [lucene]

2025-04-24 Thread via GitHub


mikemccand commented on issue #14431:
URL: https://github.com/apache/lucene/issues/14431#issuecomment-2828061328

   > > I don't know if we are already doing this -- is this TieredMergePolicy's 
default behavior (1 -> 1) for forceMergeDeletes? I don't think so?
   > 
   > It's not the default indeed. TieredMergePolicy always optimizes for lower 
write amplification over latency to my knowledge. I suspect it's a better 
default, not knowing anything about the user's use-case.
   
   Yeah +1 to defaulting that way.
   
   > > Yeah we actually set natural merging to tolerate lower deletes after 
forceMergeDeletes finishes.
   > 
   > Out of curiosity, what about making natural merging the way how deletes 
are reclaimed instead of using `forceMergeDeletes`? Something like
   > 
   > * Update `TieredMergePolicy` to tolerate lower deletes.
   > * Call `IndexWriter#maybeMerge`
   > * Poll until the deletes percent is below a tolerable value or the timeout.
   > 
   > The benefit I see is that `findMerges` naturally creates more and smaller 
merges vs `findForced(Deletes)Merges`, and prioritizes merges that are cheaper 
and reclaim more deletes.
   
   Ahhh that is a brilliant idea!  It also avoids the possible discontinuity of 
the different merge selection producing "exotic" segment geometry from 
`forcMergeDeletes` since it'd always be natural merges that are selected.  I 
like this idea!  Thanks @jpountz.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Impl intoBitset for IndexedDISI and Docvalues [lucene]

2025-04-24 Thread via GitHub


jpountz commented on code in PR #14529:
URL: https://github.com/apache/lucene/pull/14529#discussion_r2058189258


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/IndexedDISI.java:
##
@@ -693,6 +723,34 @@ boolean advanceExactWithinBlock(IndexedDISI disi, int 
target) throws IOException
 disi.index = disi.numberOfOnes - Long.bitCount(leftBits);
 return (leftBits & 1L) != 0;
   }
+
+  @Override
+  boolean intoBitSetWithinBlock(IndexedDISI disi, int upTo, FixedBitSet 
bitSet, int offset)
+  throws IOException {
+if (disi.bitSet == null) {
+  disi.bitSet = new FixedBitSet(BLOCK_SIZE);
+}
+
+int sourceFrom = disi.doc & 0x;
+int sourceTo = Math.min(upTo - disi.block, BLOCK_SIZE);
+int destFrom = disi.block - offset + sourceFrom;

Review Comment:
   Is this the same as `disi.doc - offset`?



##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/IndexedDISI.java:
##
@@ -693,6 +723,34 @@ boolean advanceExactWithinBlock(IndexedDISI disi, int 
target) throws IOException
 disi.index = disi.numberOfOnes - Long.bitCount(leftBits);
 return (leftBits & 1L) != 0;
   }
+
+  @Override
+  boolean intoBitSetWithinBlock(IndexedDISI disi, int upTo, FixedBitSet 
bitSet, int offset)
+  throws IOException {
+if (disi.bitSet == null) {
+  disi.bitSet = new FixedBitSet(BLOCK_SIZE);
+}
+
+int sourceFrom = disi.doc & 0x;
+int sourceTo = Math.min(upTo - disi.block, BLOCK_SIZE);
+int destFrom = disi.block - offset + sourceFrom;
+
+long fp = disi.slice.getFilePointer();
+disi.slice.seek(fp - Long.BYTES);

Review Comment:
   nit: maybe leave a small comment explaining why we're seeking backwards, I 
believe that it's to include the word that contains the current doc, right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Encode numeric docvalue with per block gcd [lucene]

2025-04-24 Thread via GitHub


jpountz commented on issue #14545:
URL: https://github.com/apache/lucene/issues/14545#issuecomment-2827604411

   If you're interested in storage efficiency, you'd probably want to use a 
doc-values format like this one: https://github.com/apache/lucene/issues/11072. 
It stores data into blocks of 128 values like postings, where each block is 
encoded with its own GCD among other things. Elasticsearch uses something 
similar for logs and metrics.
   
   It wasn't adopted by Lucene because the way how it decodes whole blocks when 
it needs a single value in a block slowed down sparse queries but IMO it would 
be a better trade-off for a number of applications.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Ensuring skip list is read for fields indexed with only DOCS [lucene]

2025-04-24 Thread via GitHub


expani commented on code in PR #14511:
URL: https://github.com/apache/lucene/pull/14511#discussion_r2058428825


##
lucene/core/src/java/org/apache/lucene/codecs/lucene103/Lucene103PostingsReader.java:
##
@@ -282,6 +290,10 @@ public PostingsEnum postings(
   @Override
   public ImpactsEnum impacts(FieldInfo fieldInfo, BlockTermState state, int 
flags)
   throws IOException {
+if (state.docFreq <= BLOCK_SIZE) {
+  // no skip data
+  return new SlowImpactsEnum(postings(fieldInfo, state, null, flags));
+}

Review Comment:
   Sure I saw that removing this also works.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-04-24 Thread via GitHub


benwtrent commented on PR #14527:
URL: https://github.com/apache/lucene/pull/14527#issuecomment-2827970657

   @weizijun I am running some benchmarking as well. But the key thing is to 
update the parameters in `knnPerfTest.py` making sure `numMergeWorker` and 
`numMergeThread` are greater than `1` ,etc. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-04-24 Thread via GitHub


weizijun commented on PR #14527:
URL: https://github.com/apache/lucene/pull/14527#issuecomment-2827978234

   @benwtrent I see the default parameters:
   ```
 "numMergeWorker": (12,),
 "numMergeThread": (4,),
   ```
   
   Is the current merger effective with these parameters?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-04-24 Thread via GitHub


weizijun commented on PR #14527:
URL: https://github.com/apache/lucene/pull/14527#issuecomment-2827261485

   > We need to make sure that there are no significant performance or 
concurrency bugs introduced with this. Could you test with 
https://github.com/mikemccand/luceneutil to verify recall, multi-threaded 
merge, etc. ?
   
   hi, @benwtrent , could you tell me which benchmark tool can test this case?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Make task executor non-final [lucene]

2025-04-24 Thread via GitHub


jpountz commented on PR #14524:
URL: https://github.com/apache/lucene/pull/14524#issuecomment-2827538594

   IMO Lucene should own how queries execute concurrently instead of making it 
pluggable. So I'd rather not allow users to pass a custom `TaskExecutor`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Ensuring skip list is read for fields indexed with only DOCS [lucene]

2025-04-24 Thread via GitHub


jpountz commented on code in PR #14511:
URL: https://github.com/apache/lucene/pull/14511#discussion_r2058399082


##
lucene/core/src/java/org/apache/lucene/codecs/lucene103/Lucene103PostingsReader.java:
##
@@ -1286,14 +1298,11 @@ public long cost() {
 
   @Override
   public int numLevels() {
-return indexHasFreq == false || level1LastDocID == NO_MORE_DOCS ? 
1 : 2;
+return level1LastDocID == NO_MORE_DOCS ? 1 : 2;
   }
 
   @Override
   public int getDocIdUpTo(int level) {
-if (indexHasFreq == false) {
-  return NO_MORE_DOCS;
-}

Review Comment:
   Likewise, let's keep this?



##
lucene/core/src/java/org/apache/lucene/codecs/lucene103/Lucene103PostingsReader.java:
##
@@ -1286,14 +1298,11 @@ public long cost() {
 
   @Override
   public int numLevels() {
-return indexHasFreq == false || level1LastDocID == NO_MORE_DOCS ? 
1 : 2;
+return level1LastDocID == NO_MORE_DOCS ? 1 : 2;

Review Comment:
   I would keep it the way it was for now. Exposing a single level that covers 
the whole doc ID space may cause inefficiencies with some scorers, but we 
should fix these scorers instead of tuning this class. If all docs have the 
same score, it should be just fine to return a single level of impact that 
covers the whole doc ID space.



##
lucene/core/src/java/org/apache/lucene/codecs/lucene103/Lucene103PostingsReader.java:
##
@@ -282,6 +290,10 @@ public PostingsEnum postings(
   @Override
   public ImpactsEnum impacts(FieldInfo fieldInfo, BlockTermState state, int 
flags)
   throws IOException {
+if (state.docFreq <= BLOCK_SIZE) {
+  // no skip data
+  return new SlowImpactsEnum(postings(fieldInfo, state, null, flags));
+}

Review Comment:
   Let's not use `SlowImpactsEnum` again or it would cancel the speedup from 
https://github.com/apache/lucene/pull/14017 (annotation HJ at 
https://benchmarks.mikemccandless.com/OrHighHigh.html).



##
lucene/core/src/java/org/apache/lucene/codecs/lucene103/Lucene103PostingsReader.java:
##
@@ -66,13 +67,20 @@
 public final class Lucene103PostingsReader extends PostingsReaderBase {
 
   static final VectorizationProvider VECTORIZATION_PROVIDER = 
VectorizationProvider.getInstance();
+
   // Dummy impacts, composed of the maximum possible term frequency and the 
lowest possible
   // (unsigned) norm value. This is typically used on tail blocks, which don't 
actually record
-  // impacts as the storage overhead would not be worth any query evaluation 
speedup, since there's
+  // impacts as the storage overhead would not be worth any query evaluation 
speedup, since
+  // there's
   // less than 128 docs left to evaluate anyway.
   private static final List DUMMY_IMPACTS =
   Collections.singletonList(new Impact(Integer.MAX_VALUE, 1L));
 
+  // We stopped storing a placeholder impact with freq=1 for fields with 
IndexOptions.DOCS after
+  // 9.12.0
+  private static final List NON_COMPETITIVE_IMPACTS =
+  Collections.singletonList(new Impact(1, 1L));

Review Comment:
   Maybe call it `DUMMY_IMPACTS_NO_FREQS` or something like that to draw a 
parallel with `DUMMY_IMPACTS` above?



##
lucene/core/src/java/org/apache/lucene/codecs/lucene103/Lucene103PostingsReader.java:
##
@@ -1309,8 +1318,9 @@ public List getImpacts(int level) {
   if (level == 1) {
 return readImpacts(level1SerializedImpacts, level1Impacts);
   }
+  return DUMMY_IMPACTS;
 }
-return DUMMY_IMPACTS;
+return NON_COMPETITIVE_IMPACTS;

Review Comment:
   I would keep it the way it was before and just add this at the beginning of 
this method (similar to `getDocIdUpTo`):
   
   ```java
   if (indexHasFreq == false) {
 return DUMMY_IMPACTS_NO_FREQS;
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Speed up flush of softdelete by intoBitset [lucene]

2025-04-24 Thread via GitHub


gf2121 opened a new pull request, #14552:
URL: https://github.com/apache/lucene/pull/14552

   Speed up the flush of softdelete by intoBitset.
   
   relates: https://github.com/apache/lucene/issues/14521


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Create file open hints on IOContext to replace ReadAdvice [lucene]

2025-04-24 Thread via GitHub


jpountz commented on code in PR #14482:
URL: https://github.com/apache/lucene/pull/14482#discussion_r2059102197


##
lucene/core/src/java/org/apache/lucene/store/Directory.java:
##
@@ -79,6 +83,31 @@ public abstract class Directory implements Closeable {
*/
   public abstract long fileLength(String name) throws IOException;
 
+  protected void validateIOContext(IOContext context) {
+Map, List> 
hintClasses =
+
context.hints().stream().collect(Collectors.groupingBy(IOContext.FileOpenHint::getClass));
+
+// there should only be one of FileType, FileData, DataAccess
+List fileTypes =
+hintClasses.getOrDefault(FileTypeHint.class, List.of());
+if (fileTypes.size() > 1) {
+  throw new IllegalArgumentException("Multiple file type hints specified: 
" + fileTypes);
+}
+List fileData = 
hintClasses.getOrDefault(FileDataHint.class, List.of());
+if (fileData.size() > 1) {
+  throw new IllegalArgumentException("Multiple file data hints specified: 
" + fileData);
+}
+List dataAccess =
+hintClasses.getOrDefault(DataAccessHint.class, List.of());
+if (dataAccess.size() > 1) {
+  throw new IllegalArgumentException("Multiple data access hints 
specified: " + dataAccess);
+}
+  }
+
+  protected ReadAdvice toReadAdvice(IOContext context) {

Review Comment:
   > After all, you're specifying a specific read advice for a reason - if you 
encode the reason (as hints), rather than the specific behaviour that you want, 
then the directory can choose the best way to do that according to the 
directory implementation
   
   My reading of recent reports is that users want to be able to customize it 
because there isn't a one-size-fits-all. For instance, vectors work better with 
`ReadAdvice.NORMAL` when they can fit in the page cache, and better with 
`ReadAdvice.RANDOM` when they significantly exceed the size of the page cache. 
So I agree that Lucene should come with sensible defaults, but I think that we 
also need a way to diverge from Lucene's defaults?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Create file open hints on IOContext to replace ReadAdvice [lucene]

2025-04-24 Thread via GitHub


ChrisHegarty commented on code in PR #14482:
URL: https://github.com/apache/lucene/pull/14482#discussion_r2057896977


##
lucene/core/src/java/org/apache/lucene/store/Directory.java:
##
@@ -79,6 +83,31 @@ public abstract class Directory implements Closeable {
*/
   public abstract long fileLength(String name) throws IOException;
 
+  protected void validateIOContext(IOContext context) {
+Map, List> 
hintClasses =
+
context.hints().stream().collect(Collectors.groupingBy(IOContext.FileOpenHint::getClass));
+
+// there should only be one of FileType, FileData, DataAccess
+List fileTypes =
+hintClasses.getOrDefault(FileTypeHint.class, List.of());
+if (fileTypes.size() > 1) {
+  throw new IllegalArgumentException("Multiple file type hints specified: 
" + fileTypes);
+}
+List fileData = 
hintClasses.getOrDefault(FileDataHint.class, List.of());
+if (fileData.size() > 1) {
+  throw new IllegalArgumentException("Multiple file data hints specified: 
" + fileData);
+}
+List dataAccess =
+hintClasses.getOrDefault(DataAccessHint.class, List.of());
+if (dataAccess.size() > 1) {
+  throw new IllegalArgumentException("Multiple data access hints 
specified: " + dataAccess);
+}
+  }
+
+  protected ReadAdvice toReadAdvice(IOContext context) {

Review Comment:
   The final api position as I understand it, `IOContext` has no knowledge of 
`ReadAdvice` ( `IOConext::withReadAdvice` and `IOContext::readAdvice` will be 
removed ).  
   
   `ReadAdvice` is rather determined by the directory when given the 
higher-level hints from the IOContext. For example, the `MMapDirectory` 
implementation of openInput would do something like.
   
   ```java
   public IndexInput openInput(String name, IOContext context) throws 
IOException {
 ...
 return PROVIDER.openInput(..., toReadAdvice(context), preload.test(name, 
context),...);
   }
   ```
   ... so it's somewhat akin to the the current preLoad model.
   
   For a custom filter wrapper one would do:
   
   ```java
   Directory dir = new FilterDirectory(mmapDirectory) {
 @Override
 public ReadAdvice toReadAdvice(IOContext context) {
   if (someCondition) {
 // enforce specific read advice instead of relying on hints
 return ReadAdvice.NORMAL;
   }
   // hints to readAdvice
 }
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Create file open hints on IOContext to replace ReadAdvice [lucene]

2025-04-24 Thread via GitHub


thecoop commented on code in PR #14482:
URL: https://github.com/apache/lucene/pull/14482#discussion_r2057921646


##
lucene/core/src/java/org/apache/lucene/store/Directory.java:
##
@@ -79,6 +83,31 @@ public abstract class Directory implements Closeable {
*/
   public abstract long fileLength(String name) throws IOException;
 
+  protected void validateIOContext(IOContext context) {
+Map, List> 
hintClasses =
+
context.hints().stream().collect(Collectors.groupingBy(IOContext.FileOpenHint::getClass));
+
+// there should only be one of FileType, FileData, DataAccess
+List fileTypes =
+hintClasses.getOrDefault(FileTypeHint.class, List.of());
+if (fileTypes.size() > 1) {
+  throw new IllegalArgumentException("Multiple file type hints specified: 
" + fileTypes);
+}
+List fileData = 
hintClasses.getOrDefault(FileDataHint.class, List.of());
+if (fileData.size() > 1) {
+  throw new IllegalArgumentException("Multiple file data hints specified: 
" + fileData);
+}
+List dataAccess =
+hintClasses.getOrDefault(DataAccessHint.class, List.of());
+if (dataAccess.size() > 1) {
+  throw new IllegalArgumentException("Multiple data access hints 
specified: " + dataAccess);
+}
+  }
+
+  protected ReadAdvice toReadAdvice(IOContext context) {

Review Comment:
   The `MMapDirectory` wouldn't call the overridden `toReadAdvice`, as thats on 
a different class, not in the same type hierarchy.
   
   Hmm...
   
   One way to think of it might be that trying to set a specific ReadAdvice is 
wrong. You should instead be setting the correct hints _such that_ the 
directory chooses the behaviour you want. So rather than setting 
`ReadAdvice.NORMAL` directly, you would specify hints to make MMapDirectory 
select the `ReadAdvice` that you want - or whatever else MMapDirectory chooses 
to do to follow the hints you've specified. The `ReadAdvice` is entirely opaque 
within the Directory implementation, and you can _only_ set hints.
   
   After all, you're specifying a specific read advice for a reason - if you 
encode the reason (as hints), rather than the specific behaviour that you want, 
then the directory can choose the best way to do that according to the 
directory implementation, which may or may not be using certain ReadAdvices (or 
any other setting it chooses). So the code would be:
   
   ```
   Directory dir = new FilterDirectory(mmapDirectory) {
 @Override
 public IndexInput openInput(String name, IOContext context) {
   if (someCondition()) {
 // be more specific in what this file is being opened for,
 // MMapDirectory chooses what to do with this info
 context = context.withHints(FileDataHint.VECTORS, 
DataAccessHint.RANDOM);
   }
   return in.openInput(name, context);
 }
   }
   ```
   
   
   If we don't want to do that, then the hint -> ReadAdvice mapping would 
indeed need to be on the IOContext, and _not_ the responsibility of the 
`Directory`; or we use some sort of lambda passed in to the relevant Directory 
constructor to do that mapping, which feels kinda hacky



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Create file open hints on IOContext to replace ReadAdvice [lucene]

2025-04-24 Thread via GitHub


thecoop commented on code in PR #14482:
URL: https://github.com/apache/lucene/pull/14482#discussion_r2057921646


##
lucene/core/src/java/org/apache/lucene/store/Directory.java:
##
@@ -79,6 +83,31 @@ public abstract class Directory implements Closeable {
*/
   public abstract long fileLength(String name) throws IOException;
 
+  protected void validateIOContext(IOContext context) {
+Map, List> 
hintClasses =
+
context.hints().stream().collect(Collectors.groupingBy(IOContext.FileOpenHint::getClass));
+
+// there should only be one of FileType, FileData, DataAccess
+List fileTypes =
+hintClasses.getOrDefault(FileTypeHint.class, List.of());
+if (fileTypes.size() > 1) {
+  throw new IllegalArgumentException("Multiple file type hints specified: 
" + fileTypes);
+}
+List fileData = 
hintClasses.getOrDefault(FileDataHint.class, List.of());
+if (fileData.size() > 1) {
+  throw new IllegalArgumentException("Multiple file data hints specified: 
" + fileData);
+}
+List dataAccess =
+hintClasses.getOrDefault(DataAccessHint.class, List.of());
+if (dataAccess.size() > 1) {
+  throw new IllegalArgumentException("Multiple data access hints 
specified: " + dataAccess);
+}
+  }
+
+  protected ReadAdvice toReadAdvice(IOContext context) {

Review Comment:
   The `MMapDirectory` wouldn't call the overridden `toReadAdvice`, as thats on 
a different class, not in the same type hierarchy.
   
   Hmm...
   
   One way to think of it might be that trying to set a specific ReadAdvice is 
wrong. You should instead be setting the correct hints _such that_ the 
directory chooses the behaviour you want. So rather than setting 
`ReadAdvice.NORMAL` directly, you would specify hints to make MMapDirectory 
select the `ReadAdvice` that you want - or whatever else MMapDirectory chooses 
to do to follow the hints you've specified. The `ReadAdvice` is entirely opaque 
within the Directory implementation, and you can _only_ set hints.
   
   After all, you're specifying a specific read advice for a reason - if you 
encode the reason (as hints), rather than the specific behaviour that you want, 
then the directory can choose the best way to do that according to the 
directory implementation, which may or may not be using certain ReadAdvices (or 
any other setting it chooses). So the code would be:
   
   ```
   Directory dir = new FilterDirectory(mmapDirectory) {
 @Override
 public IndexInput openInput(String name, IOContext context) {
   if (someCondition()) {
 // be more specific in what this file is being opened for,
 // MMapDirectory chooses what to do with it
 context = context.withHints(FileDataHint.VECTORS, 
DataAccessHint.RANDOM);
   }
   return in.openInput(name, context);
 }
   }
   ```
   
   
   If we don't want to do that, then the hint -> ReadAdvice mapping would 
indeed need to be on the IOContext, and _not_ the responsibility of the 
`Directory`; or we use some sort of lambda passed in to the relevant Directory 
constructor to do that mapping, which feels kinda hacky



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Create file open hints on IOContext to replace ReadAdvice [lucene]

2025-04-24 Thread via GitHub


thecoop commented on code in PR #14482:
URL: https://github.com/apache/lucene/pull/14482#discussion_r2057921646


##
lucene/core/src/java/org/apache/lucene/store/Directory.java:
##
@@ -79,6 +83,31 @@ public abstract class Directory implements Closeable {
*/
   public abstract long fileLength(String name) throws IOException;
 
+  protected void validateIOContext(IOContext context) {
+Map, List> 
hintClasses =
+
context.hints().stream().collect(Collectors.groupingBy(IOContext.FileOpenHint::getClass));
+
+// there should only be one of FileType, FileData, DataAccess
+List fileTypes =
+hintClasses.getOrDefault(FileTypeHint.class, List.of());
+if (fileTypes.size() > 1) {
+  throw new IllegalArgumentException("Multiple file type hints specified: 
" + fileTypes);
+}
+List fileData = 
hintClasses.getOrDefault(FileDataHint.class, List.of());
+if (fileData.size() > 1) {
+  throw new IllegalArgumentException("Multiple file data hints specified: 
" + fileData);
+}
+List dataAccess =
+hintClasses.getOrDefault(DataAccessHint.class, List.of());
+if (dataAccess.size() > 1) {
+  throw new IllegalArgumentException("Multiple data access hints 
specified: " + dataAccess);
+}
+  }
+
+  protected ReadAdvice toReadAdvice(IOContext context) {

Review Comment:
   The `MMapDirectory` wouldn't call the overridden `toReadAdvice`, as thats on 
a different class, not in the same type hierarchy.
   
   Hmm...
   
   One way to think of it might be that trying to set a specific ReadAdvice is 
wrong. You should instead be setting the correct hints _such that_ the 
directory chooses the behaviour you want. So rather than setting 
`ReadAdvice.NORMAL` directly, you would specify hints to make MMapDirectory 
select the `ReadAdvice` that you want - or whatever else MMapDirectory chooses 
to do to follow the hints you've specified. The `ReadAdvice` is entirely opaque 
within the Directory implementation, and you can _only_ set hints.
   
   After all, you're specifying a specific read advice for a reason - if you 
encode the reason (as hints), rather than the specific behaviour that you want, 
then the directory can choose the best way to do that according to the 
directory implementation, which may or may not be using certain ReadAdvices (or 
any other setting it chooses). So the code would be:
   
   ```
   Directory dir = new FilterDirectory(mmapDirectory) {
 @Override
 public IndexInput openInput(String name, IOContext context) {
   if (someCondition()) {
 // be more specific in what this file is being opened for,
 // MMapDirectory chooses what to do with this info
 context = context.withHints(FileDataHint.VECTORS, 
DataAccessHint.RANDOM);
   }
   return in.openInput(name, context);
 }
   }
   ```
   
   
   If we don't want to do that, then the hint -> ReadAdvice mapping would 
indeed need to be on the IOContext, and _not_ the responsibility of the 
`Directory`; or we use some sort of lambda override similar to `setPreload`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Create file open hints on IOContext to replace ReadAdvice [lucene]

2025-04-24 Thread via GitHub


thecoop commented on code in PR #14482:
URL: https://github.com/apache/lucene/pull/14482#discussion_r2057921646


##
lucene/core/src/java/org/apache/lucene/store/Directory.java:
##
@@ -79,6 +83,31 @@ public abstract class Directory implements Closeable {
*/
   public abstract long fileLength(String name) throws IOException;
 
+  protected void validateIOContext(IOContext context) {
+Map, List> 
hintClasses =
+
context.hints().stream().collect(Collectors.groupingBy(IOContext.FileOpenHint::getClass));
+
+// there should only be one of FileType, FileData, DataAccess
+List fileTypes =
+hintClasses.getOrDefault(FileTypeHint.class, List.of());
+if (fileTypes.size() > 1) {
+  throw new IllegalArgumentException("Multiple file type hints specified: 
" + fileTypes);
+}
+List fileData = 
hintClasses.getOrDefault(FileDataHint.class, List.of());
+if (fileData.size() > 1) {
+  throw new IllegalArgumentException("Multiple file data hints specified: 
" + fileData);
+}
+List dataAccess =
+hintClasses.getOrDefault(DataAccessHint.class, List.of());
+if (dataAccess.size() > 1) {
+  throw new IllegalArgumentException("Multiple data access hints 
specified: " + dataAccess);
+}
+  }
+
+  protected ReadAdvice toReadAdvice(IOContext context) {

Review Comment:
   The `MMapDirectory` wouldn't call the overridden `toReadAdvice`, as thats on 
a different class, not in the same type hierarchy.
   
   Hmm...
   
   One way to think of it might be that trying to set a specific ReadAdvice is 
wrong. You should instead be setting the correct hints _such that_ the 
directory chooses the behaviour you want. So rather than setting 
`ReadAdvice.NORMAL` directly, you would specify hints to make MMapDirectory 
select the `ReadAdvice` that you want - or whatever else MMapDirectory chooses 
to do to follow the hints you've specified. The `ReadAdvice` is entirely opaque 
within the Directory implementation, and you can _only_ set hints.
   
   After all, you're specifying a specific read advice for a reason - if you 
encode the reason (as hints), rather than the specific behaviour that you want, 
then the directory can choose the best way to do that according to the 
directory implementation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Impl intoBitset for IndexedDISI and Docvalues [lucene]

2025-04-24 Thread via GitHub


gf2121 commented on code in PR #14529:
URL: https://github.com/apache/lucene/pull/14529#discussion_r2057923869


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/IndexedDISI.java:
##
@@ -693,6 +727,44 @@ boolean advanceExactWithinBlock(IndexedDISI disi, int 
target) throws IOException
 disi.index = disi.numberOfOnes - Long.bitCount(leftBits);
 return (leftBits & 1L) != 0;
   }
+
+  @Override
+  boolean intoBitSetWithinBlock(IndexedDISI disi, int upTo, FixedBitSet 
bitSet, int offset)
+  throws IOException {
+if (disi.doc >= upTo) {
+  advanceWithinBlock(disi, disi.doc);

Review Comment:
   This was useful before we advancing `disi.block`, but not now. I like the 
simplification more :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Create file open hints on IOContext to replace ReadAdvice [lucene]

2025-04-24 Thread via GitHub


thecoop commented on code in PR #14482:
URL: https://github.com/apache/lucene/pull/14482#discussion_r2057921646


##
lucene/core/src/java/org/apache/lucene/store/Directory.java:
##
@@ -79,6 +83,31 @@ public abstract class Directory implements Closeable {
*/
   public abstract long fileLength(String name) throws IOException;
 
+  protected void validateIOContext(IOContext context) {
+Map, List> 
hintClasses =
+
context.hints().stream().collect(Collectors.groupingBy(IOContext.FileOpenHint::getClass));
+
+// there should only be one of FileType, FileData, DataAccess
+List fileTypes =
+hintClasses.getOrDefault(FileTypeHint.class, List.of());
+if (fileTypes.size() > 1) {
+  throw new IllegalArgumentException("Multiple file type hints specified: 
" + fileTypes);
+}
+List fileData = 
hintClasses.getOrDefault(FileDataHint.class, List.of());
+if (fileData.size() > 1) {
+  throw new IllegalArgumentException("Multiple file data hints specified: 
" + fileData);
+}
+List dataAccess =
+hintClasses.getOrDefault(DataAccessHint.class, List.of());
+if (dataAccess.size() > 1) {
+  throw new IllegalArgumentException("Multiple data access hints 
specified: " + dataAccess);
+}
+  }
+
+  protected ReadAdvice toReadAdvice(IOContext context) {

Review Comment:
   The `MMapDirectory` wouldn't call the overridden `toReadAdvice`, as thats on 
a different class, not in the same type hierarchy.
   
   Hmm...
   
   One way to think of it might be that trying to set a specific ReadAdvice is 
wrong. You should instead be setting the correct hints _such that_ the 
directory chooses the behaviour you want. So rather than setting 
`ReadAdvice.NORMAL` directly, you would specify hints to make MMapDirectory 
select the `ReadAdvice` that you want - or whatever else MMapDirectory chooses 
to do to follow the hints you've specified. The `ReadAdvice` is entirely opaque 
within the Directory implementation, and you can _only_ set hints.
   
   After all, you're specifying a specific read advice for a reason - if you 
encode the reason (as hints), rather than the specific behaviour that you want, 
then the directory can choose the best way to do that according to the 
directory implementation, which may or may not be using certain ReadAdvices (or 
any other setting it chooses)
   
   If we don't want to do that, then the hint -> ReadAdvice mapping would 
indeed need to be on the IOContext, or we use some sort of lambda passed in to 
the relevant Directory constructor to override it, which feels kinda hacky



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Create file open hints on IOContext to replace ReadAdvice [lucene]

2025-04-24 Thread via GitHub


thecoop commented on code in PR #14482:
URL: https://github.com/apache/lucene/pull/14482#discussion_r2057921646


##
lucene/core/src/java/org/apache/lucene/store/Directory.java:
##
@@ -79,6 +83,31 @@ public abstract class Directory implements Closeable {
*/
   public abstract long fileLength(String name) throws IOException;
 
+  protected void validateIOContext(IOContext context) {
+Map, List> 
hintClasses =
+
context.hints().stream().collect(Collectors.groupingBy(IOContext.FileOpenHint::getClass));
+
+// there should only be one of FileType, FileData, DataAccess
+List fileTypes =
+hintClasses.getOrDefault(FileTypeHint.class, List.of());
+if (fileTypes.size() > 1) {
+  throw new IllegalArgumentException("Multiple file type hints specified: 
" + fileTypes);
+}
+List fileData = 
hintClasses.getOrDefault(FileDataHint.class, List.of());
+if (fileData.size() > 1) {
+  throw new IllegalArgumentException("Multiple file data hints specified: 
" + fileData);
+}
+List dataAccess =
+hintClasses.getOrDefault(DataAccessHint.class, List.of());
+if (dataAccess.size() > 1) {
+  throw new IllegalArgumentException("Multiple data access hints 
specified: " + dataAccess);
+}
+  }
+
+  protected ReadAdvice toReadAdvice(IOContext context) {

Review Comment:
   The `MMapDirectory` wouldn't call the overridden `toReadAdvice`, as thats on 
a different class, not in the same type hierarchy.
   
   Hmm...
   
   One way to think of it might be that trying to set a specific ReadAdvice is 
wrong. You should instead be setting the correct hints _such that_ the 
directory chooses the behaviour you want. So rather than setting 
`ReadAdvice.NORMAL` directly, you would specify hints to make MMapDirectory 
select the `ReadAdvice` that you want - or whatever else MMapDirectory chooses 
to do to follow the hints you've specified. The `ReadAdvice` is entirely opaque 
within the Directory implementation, and you can _only_ set hints.
   
   After all, you're specifying a specific read advice for a reason - if you 
encode the reason (as hints), rather than the specific behaviour that you want, 
then the directory can choose the best way to do that according to the 
directory implementation, which may or may not be using certain ReadAdvices (or 
any other setting it chooses)
   
   If we don't want to do that, then the hint -> ReadAdvice mapping would 
indeed need to be on the IOContext, and _not_ the responsibility of the 
`Directory`; or we use some sort of lambda passed in to the relevant Directory 
constructor to override it, which feels kinda hacky



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Create file open hints on IOContext to replace ReadAdvice [lucene]

2025-04-24 Thread via GitHub


thecoop commented on code in PR #14482:
URL: https://github.com/apache/lucene/pull/14482#discussion_r2057921646


##
lucene/core/src/java/org/apache/lucene/store/Directory.java:
##
@@ -79,6 +83,31 @@ public abstract class Directory implements Closeable {
*/
   public abstract long fileLength(String name) throws IOException;
 
+  protected void validateIOContext(IOContext context) {
+Map, List> 
hintClasses =
+
context.hints().stream().collect(Collectors.groupingBy(IOContext.FileOpenHint::getClass));
+
+// there should only be one of FileType, FileData, DataAccess
+List fileTypes =
+hintClasses.getOrDefault(FileTypeHint.class, List.of());
+if (fileTypes.size() > 1) {
+  throw new IllegalArgumentException("Multiple file type hints specified: 
" + fileTypes);
+}
+List fileData = 
hintClasses.getOrDefault(FileDataHint.class, List.of());
+if (fileData.size() > 1) {
+  throw new IllegalArgumentException("Multiple file data hints specified: 
" + fileData);
+}
+List dataAccess =
+hintClasses.getOrDefault(DataAccessHint.class, List.of());
+if (dataAccess.size() > 1) {
+  throw new IllegalArgumentException("Multiple data access hints 
specified: " + dataAccess);
+}
+  }
+
+  protected ReadAdvice toReadAdvice(IOContext context) {

Review Comment:
   The `MMapDirectory` wouldn't call the overridden `toReadAdvice`, as thats on 
a different class, not in the same type hierarchy.
   
   Hmm...
   
   One way to think of it might be that trying to set a specific ReadAdvice is 
wrong. You should instead be setting the correct hints _such that_ the 
directory chooses the behaviour you want. So rather than setting 
`ReadAdvice.NORMAL` directly, you would specify hints to make MMapDirectory 
select the `ReadAdvice` that you want - or whatever else MMapDirectory chooses 
to do to follow the hints you've specified. The `ReadAdvice` is entirely opaque 
within the Directory implementation, and you can _only_ set hints.
   
   After all, you're specifying a specific read advice for a reason - if you 
encode the reason (as hints), rather than the specific behaviour that you want, 
then the directory can choose the best way to do that according to the 
directory implementation, which may or may not be using certain ReadAdvices (or 
any other setting it chooses)
   
   If we don't want to do that, then the hint -> ReadAdvice mapping would 
indeed need to be on the IOContext, and _not_ the responsibility of the 
`Directory`; or we use some sort of lambda passed in to the relevant Directory 
constructor to do that mapping, which feels kinda hacky



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Create file open hints on IOContext to replace ReadAdvice [lucene]

2025-04-24 Thread via GitHub


ChrisHegarty commented on code in PR #14482:
URL: https://github.com/apache/lucene/pull/14482#discussion_r2057828737


##
lucene/core/src/java/org/apache/lucene/store/Directory.java:
##
@@ -79,6 +83,31 @@ public abstract class Directory implements Closeable {
*/
   public abstract long fileLength(String name) throws IOException;
 
+  protected void validateIOContext(IOContext context) {

Review Comment:
   >It does mean that no hint at all can have more than one instance as a hint 
though - but that might be ok?
   
   That limitation seems ok to me.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Make competitive iterators more robust. [lucene]

2025-04-24 Thread via GitHub


jpountz commented on PR #14532:
URL: https://github.com/apache/lucene/pull/14532#issuecomment-2826606964

   I could confirm that it's due to `#docIdEnd()` now being implemented on the 
competitive iterators. This triggers usage of `DISIDocIdStream`, which calls 
`nextDoc()` in a loop, and calling `nextDoc()` in a loop happens to be a slower 
way of iterating doc IDs than calling `intoBitSet` and then iterating set bits 
when blocks are encoded as a bit set.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Impl intoBitset for IndexedDISI and Docvalues [lucene]

2025-04-24 Thread via GitHub


gf2121 commented on code in PR #14529:
URL: https://github.com/apache/lucene/pull/14529#discussion_r2057706654


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/IndexedDISI.java:
##
@@ -491,6 +492,14 @@ public int advance(int target) throws IOException {
 return doc;
   }
 
+  @Override
+  public void intoBitSet(int upTo, FixedBitSet bitSet, int offset) throws 
IOException {
+assert doc >= offset;
+while (method.intoBitsetWithinBlock(this, upTo, bitSet, offset) == false) {
+  readBlockHeader();

Review Comment:
   I like the simplification!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Impl intoBitset for IndexedDISI and Docvalues [lucene]

2025-04-24 Thread via GitHub


gf2121 commented on code in PR #14529:
URL: https://github.com/apache/lucene/pull/14529#discussion_r2057680596


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/IndexedDISI.java:
##
@@ -625,6 +634,29 @@ boolean advanceExactWithinBlock(IndexedDISI disi, int 
target) throws IOException
 disi.exists = false;
 return false;
   }
+
+  @Override
+  boolean intoBitsetWithinBlock(IndexedDISI disi, int upTo, FixedBitSet 
bitSet, int offset)
+  throws IOException {
+if (disi.doc >= upTo) {
+  return true;
+}
+bitSet.set(disi.doc - offset);
+
+for (int i = disi.index + 1, to = disi.nextBlockIndex; i <= to; i++) {

Review Comment:
   In `advanceWithinBlock`, `disi.index++` is executed in the loop body before 
`if (doc >= targetInBlock)` so it is not actually excluded. 
`advanceWithinBlock` can return true when `disi.index` equals 
`disi.disi.nextBlockIndex`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Impl intoBitset for IndexedDISI and Docvalues [lucene]

2025-04-24 Thread via GitHub


gf2121 commented on code in PR #14529:
URL: https://github.com/apache/lucene/pull/14529#discussion_r2057720050


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/IndexedDISI.java:
##
@@ -625,6 +634,29 @@ boolean advanceExactWithinBlock(IndexedDISI disi, int 
target) throws IOException
 disi.exists = false;
 return false;
   }
+
+  @Override
+  boolean intoBitsetWithinBlock(IndexedDISI disi, int upTo, FixedBitSet 
bitSet, int offset)
+  throws IOException {
+if (disi.doc >= upTo) {
+  return true;
+}
+bitSet.set(disi.doc - offset);
+
+for (int i = disi.index + 1, to = disi.nextBlockIndex; i <= to; i++) {

Review Comment:
   I update to the pattern like `advanceWithinBlock`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Impl intoBitset for IndexedDISI and Docvalues [lucene]

2025-04-24 Thread via GitHub


jpountz commented on code in PR #14529:
URL: https://github.com/apache/lucene/pull/14529#discussion_r2057729121


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/IndexedDISI.java:
##
@@ -693,6 +727,44 @@ boolean advanceExactWithinBlock(IndexedDISI disi, int 
target) throws IOException
 disi.index = disi.numberOfOnes - Long.bitCount(leftBits);
 return (leftBits & 1L) != 0;
   }
+
+  @Override
+  boolean intoBitSetWithinBlock(IndexedDISI disi, int upTo, FixedBitSet 
bitSet, int offset)
+  throws IOException {
+if (disi.doc >= upTo) {
+  advanceWithinBlock(disi, disi.doc);

Review Comment:
   Do we really need to call `advanceWithinBlock`? If `disi.doc` is already `>= 
upTo`, there shouldn't be anything to do?



##
lucene/core/src/test/org/apache/lucene/codecs/lucene90/TestIndexedDISI.java:
##
@@ -555,6 +565,47 @@ private void assertAdvanceExactRandomized(
 }
   }
 
+  private void assertIntoBitsetRandomized(
+  IndexedDISI disi, BitSetIterator disi2, int disi2length, int step) 
throws IOException {
+int index = -1;
+FixedBitSet set1 = new FixedBitSet(step);
+FixedBitSet set2 = new FixedBitSet(step);
+
+for (int upTo = 0; upTo < disi2length; ) {
+  int lastUpTo = upTo;
+  upTo += TestUtil.nextInt(random(), 0, step);
+  int offset = TestUtil.nextInt(random(), lastUpTo, upTo);
+
+  if (disi.docID() < offset) {
+disi.advance(offset);
+  }
+  int doc = disi2.docID();
+  while (doc < offset) {
+index++;
+doc = disi2.nextDoc();
+  }
+  while (doc < upTo) {
+set2.set(doc - offset);
+index++;
+doc = disi2.nextDoc();
+  }
+
+  disi.intoBitSet(upTo, set1, offset);
+  assertEquals(index, disi.index());
+  BitSetIterator expected = new BitSetIterator(set2, set2.cardinality());
+  BitSetIterator actual = new BitSetIterator(set1, set1.cardinality());
+  for (int expectedDoc = expected.nextDoc();
+  expectedDoc != DocIdSetIterator.NO_MORE_DOCS;
+  expectedDoc = expected.nextDoc()) {
+int actualDoc = actual.nextDoc();
+assertEquals(expectedDoc + offset, actualDoc + offset); // plus offset 
for better message.
+  }
+  assertEquals(DocIdSetIterator.NO_MORE_DOCS, actual.nextDoc());
+  set1.clear();
+  set2.clear();

Review Comment:
   Can you also check that nextDoc() returns the expected value after 
`intoBitSet` returns? To make sure that all required state is set correctly?



##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/IndexedDISI.java:
##
@@ -693,6 +727,44 @@ boolean advanceExactWithinBlock(IndexedDISI disi, int 
target) throws IOException
 disi.index = disi.numberOfOnes - Long.bitCount(leftBits);
 return (leftBits & 1L) != 0;
   }
+
+  @Override
+  boolean intoBitSetWithinBlock(IndexedDISI disi, int upTo, FixedBitSet 
bitSet, int offset)
+  throws IOException {
+if (disi.doc >= upTo) {
+  advanceWithinBlock(disi, disi.doc);
+  return true;
+}
+
+int sourceFrom = disi.doc & 0x;
+int sourceStartWord = sourceFrom >>> 6;
+int sourceTo = Math.min(upTo - disi.block, BLOCK_SIZE);
+int numWords = FixedBitSet.bits2words(sourceTo) - sourceStartWord;
+
+if (disi.bitSet == null || disi.bitSet.getBits().length < numWords) {

Review Comment:
   Should we always create a bit set of size `BLOCK_SIZE` when it's null for 
simplicity? This is what users will get in the worst-case scenario anyway, and 
it's not crazy (65k bits = 8kB)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Create file open hints on IOContext to replace ReadAdvice [lucene]

2025-04-24 Thread via GitHub


ChrisHegarty commented on code in PR #14482:
URL: https://github.com/apache/lucene/pull/14482#discussion_r2057896977


##
lucene/core/src/java/org/apache/lucene/store/Directory.java:
##
@@ -79,6 +83,31 @@ public abstract class Directory implements Closeable {
*/
   public abstract long fileLength(String name) throws IOException;
 
+  protected void validateIOContext(IOContext context) {
+Map, List> 
hintClasses =
+
context.hints().stream().collect(Collectors.groupingBy(IOContext.FileOpenHint::getClass));
+
+// there should only be one of FileType, FileData, DataAccess
+List fileTypes =
+hintClasses.getOrDefault(FileTypeHint.class, List.of());
+if (fileTypes.size() > 1) {
+  throw new IllegalArgumentException("Multiple file type hints specified: 
" + fileTypes);
+}
+List fileData = 
hintClasses.getOrDefault(FileDataHint.class, List.of());
+if (fileData.size() > 1) {
+  throw new IllegalArgumentException("Multiple file data hints specified: 
" + fileData);
+}
+List dataAccess =
+hintClasses.getOrDefault(DataAccessHint.class, List.of());
+if (dataAccess.size() > 1) {
+  throw new IllegalArgumentException("Multiple data access hints 
specified: " + dataAccess);
+}
+  }
+
+  protected ReadAdvice toReadAdvice(IOContext context) {

Review Comment:
   The final api position as I understand it, `IOContext` has no knowledge of 
`ReadAdvice` ( `IOConext::withReadAdvice` and `IOContext::readAdvice` will be 
removed ).  
   
   EDIT: this is incorrect. D'oh!!
   ~`ReadAdvice` is rather determined by the directory when given the 
higher-level hints from the IOContext. For example, the `MMapDirectory` 
implementation of openInput would do something like.~
   
   ```java
   public IndexInput openInput(String name, IOContext context) throws 
IOException {
 ...
 return PROVIDER.openInput(..., toReadAdvice(context), preload.test(name, 
context),...);
   }
   ```
   ~... so it's somewhat akin to the the current preLoad model.~
   
   ~For a custom filter wrapper one would do:~
   
   ```java
   Directory dir = new FilterDirectory(mmapDirectory) {
 @Override
 public ReadAdvice toReadAdvice(IOContext context) {
   if (someCondition) {
 // enforce specific read advice instead of relying on hints
 return ReadAdvice.NORMAL;
   }
   // hints to readAdvice
 }
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Create file open hints on IOContext to replace ReadAdvice [lucene]

2025-04-24 Thread via GitHub


thecoop commented on code in PR #14482:
URL: https://github.com/apache/lucene/pull/14482#discussion_r2057921646


##
lucene/core/src/java/org/apache/lucene/store/Directory.java:
##
@@ -79,6 +83,31 @@ public abstract class Directory implements Closeable {
*/
   public abstract long fileLength(String name) throws IOException;
 
+  protected void validateIOContext(IOContext context) {
+Map, List> 
hintClasses =
+
context.hints().stream().collect(Collectors.groupingBy(IOContext.FileOpenHint::getClass));
+
+// there should only be one of FileType, FileData, DataAccess
+List fileTypes =
+hintClasses.getOrDefault(FileTypeHint.class, List.of());
+if (fileTypes.size() > 1) {
+  throw new IllegalArgumentException("Multiple file type hints specified: 
" + fileTypes);
+}
+List fileData = 
hintClasses.getOrDefault(FileDataHint.class, List.of());
+if (fileData.size() > 1) {
+  throw new IllegalArgumentException("Multiple file data hints specified: 
" + fileData);
+}
+List dataAccess =
+hintClasses.getOrDefault(DataAccessHint.class, List.of());
+if (dataAccess.size() > 1) {
+  throw new IllegalArgumentException("Multiple data access hints 
specified: " + dataAccess);
+}
+  }
+
+  protected ReadAdvice toReadAdvice(IOContext context) {

Review Comment:
   The `MMapDirectory` wouldn't call the overridden `toReadAdvice`, as thats on 
a different class, not in the same type hierarchy.
   
   Hmm...
   
   One way to think of it might be that trying to set a specific ReadAdvice is 
wrong. You should instead be setting the correct hints _such that_ the 
directory chooses the behaviour you want. So rather than setting 
`ReadAdvice.NORMAL` directly, you would specify hints to make MMapDirectory 
select the `ReadAdvice` that you want - or whatever else MMapDirectory chooses 
to do to follow the hints you've specified. The `ReadAdvice` is entirely hidden 
within the Directory implementation, and you can _only_ set hints.
   
   After all, you're specifying a specific read advice for a reason - if you 
encode the reason (as hints), rather than the specific behaviour that you want, 
then the directory can choose the best way to do that according to the 
directory implementation, which may or may not be using certain ReadAdvices (or 
any other setting it chooses). So the code would be:
   
   ```
   Directory dir = new FilterDirectory(mmapDirectory) {
 @Override
 public IndexInput openInput(String name, IOContext context) {
   if (someCondition()) {
 // be more specific in what this file is being opened for,
 // MMapDirectory chooses what to do with this info
 context = context.withHints(FileDataHint.VECTORS, 
DataAccessHint.RANDOM);
   }
   return in.openInput(name, context);
 }
   }
   ```
   
   
   If we don't want to do that, then the hint -> ReadAdvice mapping would 
indeed need to be on the IOContext, and _not_ the responsibility of the 
`Directory`; or we use some sort of lambda override similar to `setPreload`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Create file open hints on IOContext to replace ReadAdvice [lucene]

2025-04-24 Thread via GitHub


ChrisHegarty commented on code in PR #14482:
URL: https://github.com/apache/lucene/pull/14482#discussion_r2057993129


##
lucene/core/src/java/org/apache/lucene/store/Directory.java:
##
@@ -79,6 +83,31 @@ public abstract class Directory implements Closeable {
*/
   public abstract long fileLength(String name) throws IOException;
 
+  protected void validateIOContext(IOContext context) {
+Map, List> 
hintClasses =
+
context.hints().stream().collect(Collectors.groupingBy(IOContext.FileOpenHint::getClass));
+
+// there should only be one of FileType, FileData, DataAccess
+List fileTypes =
+hintClasses.getOrDefault(FileTypeHint.class, List.of());
+if (fileTypes.size() > 1) {
+  throw new IllegalArgumentException("Multiple file type hints specified: 
" + fileTypes);
+}
+List fileData = 
hintClasses.getOrDefault(FileDataHint.class, List.of());
+if (fileData.size() > 1) {
+  throw new IllegalArgumentException("Multiple file data hints specified: 
" + fileData);
+}
+List dataAccess =
+hintClasses.getOrDefault(DataAccessHint.class, List.of());
+if (dataAccess.size() > 1) {
+  throw new IllegalArgumentException("Multiple data access hints 
specified: " + dataAccess);
+}
+  }
+
+  protected ReadAdvice toReadAdvice(IOContext context) {

Review Comment:
   > or we use some sort of lambda override similar to setPreload
   
   I think that we can have both. Ultimately, as the author of a directory 
implementation or writing tests, or adapting to certain system environments, I 
want the ability to override.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Impl intoBitset for IndexedDISI and Docvalues [lucene]

2025-04-24 Thread via GitHub


gf2121 commented on code in PR #14529:
URL: https://github.com/apache/lucene/pull/14529#discussion_r2058000145


##
lucene/core/src/test/org/apache/lucene/codecs/lucene90/TestIndexedDISI.java:
##
@@ -555,6 +565,47 @@ private void assertAdvanceExactRandomized(
 }
   }
 
+  private void assertIntoBitsetRandomized(
+  IndexedDISI disi, BitSetIterator disi2, int disi2length, int step) 
throws IOException {
+int index = -1;
+FixedBitSet set1 = new FixedBitSet(step);
+FixedBitSet set2 = new FixedBitSet(step);
+
+for (int upTo = 0; upTo < disi2length; ) {
+  int lastUpTo = upTo;
+  upTo += TestUtil.nextInt(random(), 0, step);
+  int offset = TestUtil.nextInt(random(), lastUpTo, upTo);
+
+  if (disi.docID() < offset) {
+disi.advance(offset);
+  }
+  int doc = disi2.docID();
+  while (doc < offset) {
+index++;
+doc = disi2.nextDoc();
+  }
+  while (doc < upTo) {
+set2.set(doc - offset);
+index++;
+doc = disi2.nextDoc();
+  }
+
+  disi.intoBitSet(upTo, set1, offset);
+  assertEquals(index, disi.index());
+  BitSetIterator expected = new BitSetIterator(set2, set2.cardinality());
+  BitSetIterator actual = new BitSetIterator(set1, set1.cardinality());
+  for (int expectedDoc = expected.nextDoc();
+  expectedDoc != DocIdSetIterator.NO_MORE_DOCS;
+  expectedDoc = expected.nextDoc()) {
+int actualDoc = actual.nextDoc();
+assertEquals(expectedDoc + offset, actualDoc + offset); // plus offset 
for better message.
+  }
+  assertEquals(DocIdSetIterator.NO_MORE_DOCS, actual.nextDoc());
+  set1.clear();
+  set2.clear();

Review Comment:
   This helps me find a bug: `upTo` could be a number in last block when 
`RANGE` do `intoBitSetWithinBlock`, i'm missing a `doc > upTo` check there. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Make competitive iterators more robust. [lucene]

2025-04-24 Thread via GitHub


jpountz commented on PR #14532:
URL: https://github.com/apache/lucene/pull/14532#issuecomment-2826597720

   I'm seeing a reproducible slowdown with this change:
   
   ```
   TaskQPS baseline  StdDevQPS 
my_modified_version  StdDevPct diff p-value
 TermDTSort  399.37  (2.2%)  367.70  
(2.3%)   -7.9% ( -12% -   -3%) 0.000
  TermTitleSort   87.74  (4.2%)   83.02  
(5.5%)   -5.4% ( -14% -4%) 0.001
   IntervalsOrdered2.31  (2.0%)2.27  
(7.1%)   -1.6% ( -10% -7%) 0.324
CountPhrase4.18  (2.2%)4.13  
(5.6%)   -1.3% (  -8% -6%) 0.345
   PKLookup  315.43  (5.0%)  312.54  
(6.4%)   -0.9% ( -11% -   11%) 0.614
  TermDayOfYearSort  286.52  (1.8%)  284.23  
(1.7%)   -0.8% (  -4% -2%) 0.144
CombinedAndHighHigh   11.51  (3.4%)   11.44  
(5.9%)   -0.6% (  -9% -8%) 0.690
 CombinedAndHighMed   38.75  (3.1%)   38.52  
(5.2%)   -0.6% (  -8% -7%) 0.663
 Phrase   14.38  (1.5%)   14.30  
(2.3%)   -0.6% (  -4% -3%) 0.363
FilteredPrefix3  151.34  (3.7%)  150.70  
(3.9%)   -0.4% (  -7% -7%) 0.726
CountAndHighMed  308.56  (1.9%)  307.41  
(1.5%)   -0.4% (  -3% -3%) 0.498
   FilteredTerm  159.62  (2.2%)  159.03  
(2.3%)   -0.4% (  -4% -4%) 0.606
Prefix3  159.86  (3.9%)  159.40  
(4.1%)   -0.3% (  -7% -8%) 0.822
  TermMonthSort 3191.33  (3.5%) 3186.61  
(3.8%)   -0.1% (  -7% -7%) 0.899
 FilteredIntNRQ  302.80  (1.2%)  302.51  
(1.4%)   -0.1% (  -2% -2%) 0.819
   CombinedTerm   30.71  (4.7%)   30.69  
(4.7%)   -0.1% (  -9% -9%) 0.960
   FilteredOr3Terms  165.78  (1.3%)  165.83  
(1.6%)0.0% (  -2% -3%) 0.951
CountFilteredOrHighHigh  136.74  (1.0%)  136.79  
(1.1%)0.0% (  -1% -2%) 0.906
   CountAndHighHigh  358.72  (2.2%)  358.91  
(2.5%)0.1% (  -4% -4%) 0.944
 CountFilteredOrHighMed  148.20  (0.7%)  148.31  
(0.8%)0.1% (  -1% -1%) 0.760
 FilteredOrHighHigh   67.56  (1.6%)   67.61  
(1.9%)0.1% (  -3% -3%) 0.891
  FilteredOrHighMed  152.31  (1.0%)  152.49  
(1.3%)0.1% (  -2% -2%) 0.755
 DismaxTerm  477.31  (2.6%)  477.99  
(2.5%)0.1% (  -4% -5%) 0.860
 FilteredPhrase   33.05  (1.4%)   33.10  
(1.7%)0.1% (  -2% -3%) 0.769
 FilteredOr2Terms2StopWords  147.00  (1.2%)  147.26  
(1.6%)0.2% (  -2% -2%) 0.688
FilteredOrStopWords   46.26  (2.0%)   46.34  
(2.6%)0.2% (  -4% -4%) 0.799
 Fuzzy1   99.78  (2.8%)   99.99  
(3.3%)0.2% (  -5% -6%) 0.827
   FilteredAndStopWords   45.41  (2.1%)   45.50  
(2.8%)0.2% (  -4% -5%) 0.785
 CombinedOrHighHigh   18.78  (2.9%)   18.82  
(2.8%)0.2% (  -5% -6%) 0.813
 IntNRQ  309.33  (1.1%)  310.04  
(1.5%)0.2% (  -2% -2%) 0.580
  FilteredAnd3Terms  178.96  (1.7%)  179.39  
(2.0%)0.2% (  -3% -4%) 0.683
CountOrMany   30.46  (1.9%)   30.55  
(2.4%)0.3% (  -3% -4%) 0.656
 FilteredOrMany   16.44  (1.5%)   16.49  
(2.4%)0.3% (  -3% -4%) 0.628
FilteredAnd2Terms2StopWords  177.94  (1.4%)  178.49  
(1.6%)0.3% (  -2% -3%) 0.511
 Fuzzy2   83.87  (2.4%)   84.14  
(2.9%)0.3% (  -4% -5%) 0.704
  CombinedOrHighMed   71.67  (2.8%)   71.90  
(2.9%)0.3% (  -5% -6%) 0.719
CountFilteredOrMany   27.41  (1.8%)   27.51  
(1.9%)0.3% (  -3% -4%) 0.558
   Wildcard   91.57  (2.9%)   91.90  
(2.9%)0.4% (  -5% -6%) 0.700
FilteredAndHighHigh   65.89  (1.9%)   66.13  
(2.2%)0.4% (  -3% -4%) 0.566
 AndHighMed  133.85  (2.6%)  134.37  
(2.1%)0.4% (  -4% -5%) 0.608
CountOrHighHigh  346.70  (2.1%)  348.06  
(2.3%)0.4% (  

Re: [I] Examine the affects of MADV_RANDOM when MGLRU is enabled in Linux kernel [lucene]

2025-04-24 Thread via GitHub


rmuir commented on issue #14408:
URL: https://github.com/apache/lucene/issues/14408#issuecomment-2828267119

   IMO it would be better to make a 10.2.2 if you want to do that? Especially 
since I don't think it would be a trivial change: I'm concerned that changing 
just the "default" would have any benefit, since it is hardcoded in other 
places in codecs. AFAIK #14422 is working on fixing that "real problem".


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Logic for collecting Histogram efficiently using Point Trees [lucene]

2025-04-24 Thread via GitHub


jainankitk commented on code in PR #14439:
URL: https://github.com/apache/lucene/pull/14439#discussion_r2059224354


##
lucene/CHANGES.txt:
##
@@ -78,6 +78,8 @@ Optimizations
 -
 * GITHUB#14418: Quick exit on filter query matching no docs when rewriting knn 
query. (Pan Guixin)
 
+* GITHUB#14439: Efficient Histogram Collection using multi range traversal 
over PointTrees

Review Comment:
   ah, my bad. Fixed!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Move sloppySin into SloppyMath from GeoUtils [lucene]

2025-04-24 Thread via GitHub


jainankitk commented on PR #14516:
URL: https://github.com/apache/lucene/pull/14516#issuecomment-2828864383

   > thanks for adding tests and benchies
   
   Thank you for the pointers on jmh benchmark. Helped me reason about and 
demonstrate the performance improvements for Histogram Collector PR - 
https://github.com/apache/lucene/pull/14439#issuecomment-2819914834


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add AnytimeRankingSearcher for SLA-Aware Early Termination with Bin-Based Score Boosting [lucene]

2025-04-24 Thread via GitHub


atris commented on PR #14525:
URL: https://github.com/apache/lucene/pull/14525#issuecomment-2828868856

   @jpountz thanks for looking!
   
   Just for my understanding, the first PR should contain the index time 
binning logic that is currently in this PR, just with a simpler model of rank 
on bin boosting?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[I] How to control the total number of merging threads when vector data merging easily leads to memory overflow and high CPU cost [lucene]

2025-04-24 Thread via GitHub


weizijun opened a new issue, #14554:
URL: https://github.com/apache/lucene/issues/14554

   ### Description
   
   When there are many shards to merge, vector data merging can easily lead to 
memory overflow and high CPU cost.
   The index.merge.scheduler.max_thread_count parameter can't control the merge 
thread count, it only pause the writeByte by MergeRateLimiter when the merge 
thread is bigger then max_thread_count. 
   But OnHeapHnswGraph has been built during the pause phase, and it will take 
up so much memory that the Java heap is not enough.
   This problem can easily be caused when a datanode with a 32G heap size holds 
2-3TB of vector documents(with bbq, the node can contain these data).
   The PR https://github.com/apache/lucene/pull/14527 can reduce the heap size, 
but it don't solve the problem totally.
   Is there any solution to this problem?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speed up flush of softdelete by intoBitset [lucene]

2025-04-24 Thread via GitHub


gf2121 commented on PR #14552:
URL: https://github.com/apache/lucene/pull/14552#issuecomment-2829443943

   I managed to get some numbers on this change:
   
   ```
   Baseline: Soft delete flush total took: 30311ms, IndexedDISI#writeBitset 
total took: 8324ms.
   Candidate: Soft delete flush total took: 22635ms, IndexedDISI#writeBitset 
total took: 964ms.
   ```
   
   BTW, i added some docvalue fields in my benchmark. Then i'm seeing rather 
huge cost of virtual calls represented by `vtable stub` when computing 
min/max/gcd.
   
   
![image](https://github.com/user-attachments/assets/30cf58ff-a5c2-49d4-86b9-9837172c2f89)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Allow docID == NO_MORE_DOCS for asserting leaf reader [lucene]

2025-04-24 Thread via GitHub


gf2121 opened a new pull request, #14555:
URL: https://github.com/apache/lucene/pull/14555

   `exists` will be set false if `docID() == NO_MORE_DOCS`, while it is allowed 
by contract for `intoBitset`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Encode numeric docvalue with per block gcd [lucene]

2025-04-24 Thread via GitHub


HUSTERGS closed issue #14545: Encode numeric docvalue with per block gcd
URL: https://github.com/apache/lucene/issues/14545


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Encode numeric docvalue with per block gcd [lucene]

2025-04-24 Thread via GitHub


HUSTERGS commented on issue #14545:
URL: https://github.com/apache/lucene/issues/14545#issuecomment-2829240138

   Got it! Thanks for your reply, close the issue for now :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix leadCost calculation in BooleanScorerSupplier.requiredBulkScorer [lucene]

2025-04-24 Thread via GitHub


jpountz commented on PR #14543:
URL: https://github.com/apache/lucene/pull/14543#issuecomment-2827497431

   Thanks for catching this! For reference, this was introduced in 10.0 (by 
me). The change looks good, let me know if you need help writing a test.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] removing constructor with deprecated attribute 'onlyLongestMatch [lucene]

2025-04-24 Thread via GitHub


renatoh commented on PR #14356:
URL: https://github.com/apache/lucene/pull/14356#issuecomment-2828336563

   @rmuir Could you please have a look at the PR, it has been open for more 
than a month. Thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Logic for collecting Histogram efficiently using Point Trees [lucene]

2025-04-24 Thread via GitHub


stefanvodita commented on code in PR #14439:
URL: https://github.com/apache/lucene/pull/14439#discussion_r2059125314


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/facet/plain/histograms/PointTreeBulkCollector.java:
##
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.sandbox.facet.plain.histograms;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+import java.io.IOException;
+import java.util.function.Function;
+import org.apache.lucene.index.PointValues;
+import org.apache.lucene.internal.hppc.LongIntHashMap;
+import org.apache.lucene.search.CollectionTerminatedException;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.util.ArrayUtil;
+
+class PointTreeBulkCollector {
+  static boolean collect(
+  final PointValues pointValues,
+  final long bucketWidth,
+  final LongIntHashMap collectorCounts,
+  final int maxBuckets)
+  throws IOException {
+// TODO: Do we really need pointValues.getDocCount() == pointValues.size()
+if (pointValues == null
+|| pointValues.getNumDimensions() != 1
+|| pointValues.getDocCount() != pointValues.size()
+|| ArrayUtil.getValue(pointValues.getBytesPerDimension()) == null) {
+  return false;

Review Comment:
   I see your point and I'm happy to have a follow-up issue if necessary for 
this and the related comment below.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Logic for collecting Histogram efficiently using Point Trees [lucene]

2025-04-24 Thread via GitHub


stefanvodita commented on code in PR #14439:
URL: https://github.com/apache/lucene/pull/14439#discussion_r2059130450


##
lucene/CHANGES.txt:
##
@@ -78,6 +78,8 @@ Optimizations
 -
 * GITHUB#14418: Quick exit on filter query matching no docs when rewriting knn 
query. (Pan Guixin)
 
+* GITHUB#14439: Efficient Histogram Collection using multi range traversal 
over PointTrees

Review Comment:
   You should add your name too!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speed up filtered disjunctions by loading the filter into a bit set. [lucene]

2025-04-24 Thread via GitHub


github-actions[bot] commented on PR #14024:
URL: https://github.com/apache/lucene/pull/14024#issuecomment-2829119993

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add support for querying multiple fields to QueryBuilder. [lucene]

2025-04-24 Thread via GitHub


github-actions[bot] commented on PR #14262:
URL: https://github.com/apache/lucene/pull/14262#issuecomment-2829119918

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-04-24 Thread via GitHub


weizijun commented on PR #14527:
URL: https://github.com/apache/lucene/pull/14527#issuecomment-2828149196

   > I left a comment on the underlying structures used.
   > 
   > Please update CHANGES.txt under 10.3 optimizations for this nice 
optimization! It will be very nice to have better heap utilization on graph 
building!
   
   Ok, and about the OnHeapHnswGraph.ramBytesUsed, I didn't change to code. But 
it maybe need to change the logic.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-04-24 Thread via GitHub


benwtrent commented on code in PR #14527:
URL: https://github.com/apache/lucene/pull/14527#discussion_r2058763024


##
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:
##
@@ -32,13 +33,15 @@
 public class NeighborArray {
   private final boolean scoresDescOrder;
   private int size;
-  private final float[] scores;
-  private final int[] nodes;
+  private final int maxSize;
+  private final FloatArrayList scores;
+  private final IntArrayList nodes;
   private int sortedNodeSize;
 
   public NeighborArray(int maxSize, boolean descOrder) {
-nodes = new int[maxSize];
-scores = new float[maxSize];
+this.maxSize = maxSize;
+nodes = new IntArrayList();
+scores = new FloatArrayList();

Review Comment:
   two things, I think these should be initialized to something like 
`maxSize/4` or `maxSize/8`.
   
   Additionally, the array lists should enforce the max size and ensure the 
underlying buffer does not get bigger than the expected max. I think you can do 
this pretty simply by having something like `MaxSizedIntArrayList` that accepts 
an init & max size parameters in its ctor, and overrides `ensureBufferSpace` so 
that it:
   
- disallows growth past `maxSize`
- Instead of doing `ArrayUtil.grow` it does `ArrayUtil.growInRange(buffer, 
elementsCount + expectedAdditions, maxSize)`
   
   This will prevent overallocation of the underlying buffer and enforce max 
size limitations.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Make task executor non-final [lucene]

2025-04-24 Thread via GitHub


Shibi-bala commented on PR #14524:
URL: https://github.com/apache/lucene/pull/14524#issuecomment-2828849510

   @jpountz `IMO Lucene should own how queries execute concurrently instead of 
making it pluggable` then why allow an executor service to be passed in? 
Previously, lucene didn't have control of this because a user's executor 
service could do whatever it wanted with the tasks. Now, control over execution 
is split between the user's executor service and what lucene does on the main 
query thread. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-04-24 Thread via GitHub


msokolov commented on PR #14527:
URL: https://github.com/apache/lucene/pull/14527#issuecomment-2828649522

   re: concurrency; in theory it should be safe since we lock a row before 
inserting anything into it.  Consider that even with fixed-size arrays we need 
to track the occupancy (how full the array is) in a thread-safe way.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Create file open hints on IOContext to replace ReadAdvice [lucene]

2025-04-24 Thread via GitHub


jpountz commented on code in PR #14482:
URL: https://github.com/apache/lucene/pull/14482#discussion_r2059104981


##
lucene/core/src/java/org/apache/lucene/store/Directory.java:
##
@@ -79,6 +83,31 @@ public abstract class Directory implements Closeable {
*/
   public abstract long fileLength(String name) throws IOException;
 
+  protected void validateIOContext(IOContext context) {
+Map, List> 
hintClasses =
+
context.hints().stream().collect(Collectors.groupingBy(IOContext.FileOpenHint::getClass));
+
+// there should only be one of FileType, FileData, DataAccess
+List fileTypes =
+hintClasses.getOrDefault(FileTypeHint.class, List.of());
+if (fileTypes.size() > 1) {
+  throw new IllegalArgumentException("Multiple file type hints specified: 
" + fileTypes);
+}
+List fileData = 
hintClasses.getOrDefault(FileDataHint.class, List.of());
+if (fileData.size() > 1) {
+  throw new IllegalArgumentException("Multiple file data hints specified: 
" + fileData);
+}
+List dataAccess =
+hintClasses.getOrDefault(DataAccessHint.class, List.of());
+if (dataAccess.size() > 1) {
+  throw new IllegalArgumentException("Multiple data access hints 
specified: " + dataAccess);
+}
+  }
+
+  protected ReadAdvice toReadAdvice(IOContext context) {

Review Comment:
   > or we use some sort of lambda override similar to setPreload
   
   FWIW one thing I like about it is that the `ReadAdvice` then becomes an 
implementation detail of `MMapDirectory` instead of a general API on 
`Directory`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] DrillSideways does not support intra-segment concurrency [lucene]

2025-04-24 Thread via GitHub


kashkambath commented on issue #13753:
URL: https://github.com/apache/lucene/issues/13753#issuecomment-2828705042

   Hi @javanna,
   
   Your explanation makes sense that intra-segment concurrency can't be 
leveraged by the existing `DrillSidewaysScorer` since it goes through all the 
docs in a segment. However, `DrillSidewaysScorer` is only used when a 
non-concurrent execution of drill sideways occurs via 
[`DrillSideways#searchSequentially()`](https://github.com/apache/lucene/blob/e2e44bcf6baf7dc97f72b999a8d7e236700d574c/lucene/facet/src/java/org/apache/lucene/facet/DrillSideways.java#L441-L465)
 AFAIK. 
   
   If an `Executor` is passed into `DrillSideways` at its instantiation, then 
`DrillSideways` will execute the 
[`DrillSideways#searchConcurrently()`](https://github.com/apache/lucene/blob/e2e44bcf6baf7dc97f72b999a8d7e236700d574c/lucene/facet/src/java/org/apache/lucene/facet/DrillSideways.java#L483)
 codepath, which does not leverage the `DrillSidewaysScorer`. 
`searchConcurrently` instead runs in parallel N+1 queries for N facets (one 
base query + N holdout queries for facet counts.) I believe each of these N+1 
queries are traditional search queries which could leverage 
inter-segment/intra-segment parallelism.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Fail spotless check for wildcard imports [lucene]

2025-04-24 Thread via GitHub


rmuir commented on issue #14553:
URL: https://github.com/apache/lucene/issues/14553#issuecomment-2828727305

   I can help with eclipse formatter. It has a limit that you set in the 
preference files and I think the solution is to set it to an obnoxious value 
(999 or something like that)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add AnytimeRankingSearcher for SLA-Aware Early Termination with Bin-Based Score Boosting [lucene]

2025-04-24 Thread via GitHub


jpountz commented on PR #14525:
URL: https://github.com/apache/lucene/pull/14525#issuecomment-2828722257

   > the implementation is more ambitious
   
   I like ambition, but it also makes this change harder to review/integrate, 
especially with the high LOC count. I would suggest splitting this PR into 
multiple PRs, for instance:
- First PR just works with indexes created with existing recursive graph 
bisection and uses basic heuristics to determine which ranges of doc IDs to 
score first (e.g. using impacts) to hopefully increase the top-k score quickly. 
No extra data stored in the index. All code under lucene/misc rather than core.
- Another PR can introduce the SLA-based termination logic.
- Another PR can introduce topical clustering mechanism of Kulkarni and 
Callan, that the paper suggests combining with recursive graph bisection.
- Another PR can discuss augmenting index formats to enhance the range 
selection logic.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Fail spotless check for wildcard imports [lucene]

2025-04-24 Thread via GitHub


dweiss commented on issue #14553:
URL: https://github.com/apache/lucene/issues/14553#issuecomment-2828723311

   I've adopted and used what opensearch came up with. It would be ideal to 
also fix wildcard imports - not just detect them. It should be possible with 
intellij or eclipse formatter somehow but I couldn't figure it out in the 
limited time I had. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Examine the affects of MADV_RANDOM when MGLRU is enabled in Linux kernel [lucene]

2025-04-24 Thread via GitHub


vigyasharma commented on issue #14408:
URL: https://github.com/apache/lucene/issues/14408#issuecomment-2828222009

   Should we target a revert of the default ReadAdvice.RANDOM setting, for the 
10.2.1 RC that's currently being evaluated?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-04-24 Thread via GitHub


benwtrent commented on PR #14527:
URL: https://github.com/apache/lucene/pull/14527#issuecomment-2828219136

   > Ok, and about the OnHeapHnswGraph.ramBytesUsed, I didn't change to code. 
But it maybe need to change the logic.
   
   I think utilizing the underlying array estimations is what we need and the 
testing needs to be updated to assert within a valid range.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Fail spotless check for wildcard imports [lucene]

2025-04-24 Thread via GitHub


dweiss commented on issue #14553:
URL: https://github.com/apache/lucene/issues/14553#issuecomment-2828789291

   > I can help with eclipse formatter. It has a limit that you set in the 
preference files and I think the solution is to set it to an obnoxious value 
(999 or something like that)
   
   The problem I had was to only expand wildcard imports, without touching 
anything else. In theory, re-applying gjf should bring the code back to the 
same state but in practice there were subtle differences before and after. I 
don't know if you fan "only" expand wildcards, without reformatting the rest of 
the code.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org