Re: [I] Move group-varint encoding/decoding logic to DataOutput/DataInput? [lucene]

2023-11-23 Thread via GitHub


easyice commented on issue #12826:
URL: https://github.com/apache/lucene/issues/12826#issuecomment-1823961753

   The wikimediumall benchmark result looks fine on Java 21:
   
   
   Java 21
   
   ```
   
Task                       QPS baseline (StdDev)   QPS my_modified_version (StdDev)   Pct diff                p-value
IntNRQ                           7.32 (10.5%)             7.17 (10.5%)                -2.1% ( -20% -   21%)    0.528
HighTermTitleBDVSort             4.24  (6.7%)             4.18  (7.4%)                -1.4% ( -14% -   13%)    0.521
PKLookup                        97.77  (3.0%)            96.77  (4.1%)                -1.0% (  -7% -    6%)    0.372
Respell                         31.39  (2.0%)            31.25  (2.4%)                -0.5% (  -4% -    3%)    0.504
HighSpanNear                     2.06  (3.3%)             2.06  (2.9%)                 0.1% (  -5% -    6%)    0.946
AndHighHigh                     12.99  (3.0%)            13.01  (2.5%)                 0.2% (  -5% -    5%)    0.835
Fuzzy1                          37.99  (2.1%)            38.09  (2.1%)                 0.3% (  -3% -    4%)    0.674
HighSloppyPhrase                 8.97  (3.3%)             9.02  (3.4%)                 0.6% (  -5% -    7%)    0.573
HighIntervalsOrdered            13.16  (3.4%)            13.24  (3.3%)                 0.6% (  -5% -    7%)    0.544
MedSpanNear                      8.66  (2.1%)             8.72  (1.9%)                 0.7% (  -3% -    4%)    0.300
LowIntervalsOrdered             63.85  (2.2%)            64.40  (2.7%)                 0.9% (  -3% -    5%)    0.264
MedIntervalsOrdered              1.00  (2.9%)             1.01  (2.8%)                 0.9% (  -4% -    6%)    0.301
LowSpanNear                     12.79  (3.0%)            12.92  (2.2%)                 1.0% (  -4% -    6%)    0.235
AndHighMed                      26.44  (3.2%)            26.72  (3.1%)                 1.1% (  -5% -    7%)    0.283
LowSloppyPhrase                  5.84  (2.4%)             5.90  (3.1%)                 1.1% (  -4% -    6%)    0.214
HighPhrase                       7.72  (3.9%)             7.81  (4.7%)                 1.1% (  -7% -   10%)    0.421
OrHighHigh                      16.44  (4.0%)            16.63  (4.2%)                 1.1% (  -6% -    9%)    0.388
AndHighLow                     412.43  (4.0%)           417.26  (3.2%)                 1.2% (  -5% -    8%)    0.307
HighTermDayOfYearSort          142.91  (3.9%)           144.72  (3.1%)                 1.3% (  -5% -    8%)    0.256
OrHighNotMed                   309.12  (4.9%)           313.44  (3.6%)                 1.4% (  -6% -   10%)    0.307
MedSloppyPhrase                 11.26  (3.1%)            11.43  (3.6%)                 1.5% (  -5% -    8%)    0.169
OrHighLow                      177.96  (3.6%)           181.05  (3.1%)                 1.7% (  -4% -    8%)    0.102
OrHighNotLow                   214.08  (4.5%)           217.91  (4.2%)                 1.8% (  -6% -   10%)    0.194
OrNotHighHigh                  179.87  (4.5%)           183.13  (3.8%)                 1.8% (  -6% -   10%)    0.171
MedPhrase                       96.44  (4.8%)            98.24  (5.0%)                 1.9% (  -7% -   12%)    0.231
Fuzzy2                          25.19  (2.6%)            25.67  (3.1%)                 1.9% (  -3% -    7%)    0.041
HighTermTitleSort               82.34  (3.9%)            84.06  (3.6%)                 2.1% (  -5% -    9%)    0.076
LowPhrase                       44.99  (4.5%)            45.94  (4.7%)                 2.1% (  -6% -   11%)    0.144
OrHighNotHigh                  139.88  (4.8%)           142.89  (4.6%)                 2.2% (  -6% -   12%)    0.144
OrNotHighLow                   269.48  (2.8%)           275.65  (2.7%)                 2.3% (  -3% -    7%)    0.008
OrHighMed                       48.08  (9.1%)            49.22  (7.5%)                 2.4% ( -13% -   20%)    0.369
OrNotHighMed                   140.40  (6.3%)           143.99  (6.4%)                 2.6% (  -9% -   16%)    0.202
TermDTSort                      78.01  (5.5%)            80.15  (5.7%)                 2.7% (  -8% -   14%)    0.125
Wildcard                        40.16  (2.6%)            41.47  (3.0%)                 3.3% (  -2% -    9%)    0.000
HighTerm                       302.83  (6.1%)           313.49  (4.7%)                 3.5% (  -6% -   15%)    0.041
MedTerm                        235.66  (6.7%)           245.00  (7.5%)                 4.0% (  -9% -   19%)    0.078
LowTerm                        245.23  (8.6%)           255.28 (10.4%)                 4.1% ( -13% -   25%)    0.176
HighTermMonthSort             1037.99  (6.6%)          1089.92  (6.2%)                 5.0% (  -7% -   19%)    0.013
Prefix3                         56.06  (3.7%)            60.53  (3.6%)                 8.0% (   0% -   15%)    0.000
   
   ```

   
   but there is a bit of a regression on Java 17:
   
   round 1
   
   
   ```
   
Task                       QPS baseline (StdDev)   QPS my_modified_version (StdDev)   Pct diff                p-value

Re: [PR] Simplify advancing on postings/impacts enums [lucene]

2023-11-23 Thread via GitHub


jpountz commented on code in PR #12810:
URL: https://github.com/apache/lucene/pull/12810#discussion_r1403031948


##
lucene/core/src/java/org/apache/lucene/codecs/MultiLevelSkipListReader.java:
##
@@ -63,7 +63,7 @@ public abstract class MultiLevelSkipListReader implements Closeable {
   protected int[] skipDoc;
 
   /** Doc id of last read skip entry with docId <= target. */
-  private int lastDoc;
+  private int lastDoc = -1;

Review Comment:
   Another good point. This one felt a bit less obvious to me because of how 
it's used internally by unversioned codecs, e.g. SimpleText, but your point 
about external codecs makes sense, I'll rename.






Re: [PR] Simplify advancing on postings/impacts enums [lucene]

2023-11-23 Thread via GitHub


jpountz commented on code in PR #12810:
URL: https://github.com/apache/lucene/pull/12810#discussion_r1403032682


##
lucene/core/src/java/org/apache/lucene/codecs/MultiLevelSkipListReader.java:
##
@@ -63,7 +63,7 @@ public abstract class MultiLevelSkipListReader implements Closeable {
   protected int[] skipDoc;
 
   /** Doc id of last read skip entry with docId <= target. */
-  private int lastDoc;
+  private int lastDoc = -1;

Review Comment:
   I pushed a rename.






Re: [I] Move group-varint encoding/decoding logic to DataOutput/DataInput? [lucene]

2023-11-23 Thread via GitHub


jpountz commented on issue #12826:
URL: https://github.com/apache/lucene/issues/12826#issuecomment-1823968702

   Thanks for testing, I was wondering about the virtual call too. In theory, 
group-varint works well because you can start decoding the next group before 
being fully done with the current one, but this only works if reading a group 
gets inlined so that the compiler can figure it out. The annoying part is that 
reading a sequence of vints would ideally go into an int[] but in our case we 
need a long[] because this is what postings are using (for different reasons, 
it helps them better take advantage of auto vectorization).
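
   For readers following the decoding discussion, here is a minimal sketch of
the group-varint layout (illustrative only, not Lucene's actual code; the class
and method names are made up): one flag byte holds the byte lengths of the next
four values, so the position of the following group is known before the current
values are fully assembled.

   ```java
   final class GroupVarintSketch {
     /** Decodes one group of four values into dst; returns the offset of the next group. */
     static int decodeGroup(byte[] in, int pos, long[] dst, int dstOff) {
       int flags = in[pos++] & 0xFF;
       for (int i = 0; i < 4; i++) {
         int numBytes = ((flags >>> (i << 1)) & 0x03) + 1; // 2 bits per value -> 1..4 bytes
         long v = 0;
         for (int b = 0; b < numBytes; b++) {
           v |= (in[pos++] & 0xFFL) << (b << 3); // little-endian accumulation
         }
         dst[dstOff + i] = v;
       }
       return pos; // start of the next group
     }
   }
   ```

   If `decodeGroup` is not inlined into the loop that calls it once per group,
the compiler cannot overlap the decoding of consecutive groups, which matches
the virtual-call concern above.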





Re: [PR] Improve DirectReader java doc [lucene]

2023-11-23 Thread via GitHub


gf2121 merged PR #12835:
URL: https://github.com/apache/lucene/pull/12835





Re: [I] Move group-varint encoding/decoding logic to DataOutput/DataInput? [lucene]

2023-11-23 Thread via GitHub


easyice commented on issue #12826:
URL: https://github.com/apache/lucene/issues/12826#issuecomment-1824101886

   Thanks for explaining! You are right: if we call `DataInput.readVIntGroup`,
the function `DataInput.readVIntGroupLong` (the same as
`GroupVIntReader#readLong`) is not inlined, but if we switch to
`readVIntGroupFully`, `DataInput.readVIntGroupLong` can be inlined.
   
   **baseline**, use `GroupVIntReader#readValues` , the 
`GroupVIntReader#readLong` is inlined:
   
![image](https://github.com/apache/lucene/assets/23521001/66706628-c3cf-43a4-81b5-4ce7433e045b)
   
   
   **approach_1**, switch to `DataInput.readVIntGroup`, the  
`DataInput.readVIntGroupLong` is NOT inlined:
   
![image](https://github.com/apache/lucene/assets/23521001/0a37c894-0de0-4b33-b0c5-864f73e8cf9a)
   
   
   **approach_2**, switch to `DataInput.readVIntGroupFully`, the  
`DataInput.readVIntGroupLong` is inlined:
   
![image](https://github.com/apache/lucene/assets/23521001/f1ab8850-decf-42e3-8bdd-ea0bee0270c5)
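
   (For anyone wanting to reproduce this kind of check: the inlining decisions
shown in the screenshots can be printed by HotSpot with its standard diagnostic
flags; this is a generic example invocation, not the exact command used above.)

   ```
   java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining -jar benchmark.jar
   ```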
   





Re: [PR] Add support for index sorting with document blocks [lucene]

2023-11-23 Thread via GitHub


s1monw commented on code in PR #12829:
URL: https://github.com/apache/lucene/pull/12829#discussion_r1403174609


##
lucene/core/src/test/org/apache/lucene/index/TestIndexSorting.java:
##
@@ -3173,4 +3173,184 @@ public void testSortDocsAndFreqsAndPositionsAndOffsets() throws IOException {
     reader.close();
     dir.close();
   }
+
+  public void testBlockIsMissingParentField() throws IOException {
+    try (Directory dir = newDirectory()) {
+      IndexWriterConfig iwc = new IndexWriterConfig(new MockAnalyzer(random()));
+      String parentField = "parent";
+      Sort indexSort = new Sort(parentField, new SortField("foo", SortField.Type.INT));
+      iwc.setIndexSort(indexSort);
+      try (IndexWriter writer = new IndexWriter(dir, iwc)) {
+        List<Runnable> runnabels =
+            Arrays.asList(
+                () -> {
+                  IllegalArgumentException ex =
+                      expectThrows(
+                          IllegalArgumentException.class,
+                          () -> {
+                            writer.addDocuments(Arrays.asList(new Document(), new Document()));
+                          });
+                  assertEquals(
+                      "the last document in the block must contain a numeric doc values field named: parent",
+                      ex.getMessage());
+                },
+                () -> {
+                  IllegalArgumentException ex =
+                      expectThrows(
+                          IllegalArgumentException.class,
+                          () -> {
+                            Document doc = new Document();
+                            doc.add(new NumericDocValuesField("parent", 0));
+                            writer.addDocuments(Arrays.asList(doc, new Document()));
+                          });
+                  assertEquals(
+                      "only the last document in the block must contain a numeric doc values field named: parent",

Review Comment:
   I am not sure what the semantics of this would be. I think we should not
violate the semantics of the `addDocuments` API, which guarantees the insert
order being preserved across merges etc. From that perspective I think it's
actually good that we don't do that?
   Also, if you rely on a certain order, you should sort ahead of time or at
retrieval time.






Re: [PR] Add support for index sorting with document blocks [lucene]

2023-11-23 Thread via GitHub


s1monw commented on code in PR #12829:
URL: https://github.com/apache/lucene/pull/12829#discussion_r1403178782


##
lucene/core/src/test/org/apache/lucene/index/TestIndexSorting.java:
##
@@ -3173,4 +3173,184 @@ public void testSortDocsAndFreqsAndPositionsAndOffsets() throws IOException {
     reader.close();
     dir.close();
   }
+
+  public void testBlockIsMissingParentField() throws IOException {
+    try (Directory dir = newDirectory()) {
+      IndexWriterConfig iwc = new IndexWriterConfig(new MockAnalyzer(random()));
+      String parentField = "parent";
+      Sort indexSort = new Sort(parentField, new SortField("foo", SortField.Type.INT));
+      iwc.setIndexSort(indexSort);
+      try (IndexWriter writer = new IndexWriter(dir, iwc)) {
+        List<Runnable> runnabels =
+            Arrays.asList(
+                () -> {
+                  IllegalArgumentException ex =
+                      expectThrows(
+                          IllegalArgumentException.class,
+                          () -> {
+                            writer.addDocuments(Arrays.asList(new Document(), new Document()));
+                          });
+                  assertEquals(
+                      "the last document in the block must contain a numeric doc values field named: parent",
+                      ex.getMessage());
+                },
+                () -> {
+                  IllegalArgumentException ex =
+                      expectThrows(
+                          IllegalArgumentException.class,
+                          () -> {
+                            Document doc = new Document();
+                            doc.add(new NumericDocValuesField("parent", 0));
+                            writer.addDocuments(Arrays.asList(doc, new Document()));
+                          });
+                  assertEquals(
+                      "only the last document in the block must contain a numeric doc values field named: parent",
+                      ex.getMessage());
+                },
+                () -> {
+                  IllegalArgumentException ex =
+                      expectThrows(
+                          IllegalArgumentException.class,
+                          () -> {
+                            writer.addDocuments(Arrays.asList(new Document()));
+                          });
+                  assertEquals(
+                      "the last document in the block must contain a numeric doc values field named: parent",
+                      ex.getMessage());
+                });
+        Collections.shuffle(runnabels, random());
+        for (Runnable runnable : runnabels) {
+          runnable.run();
+        }
+      }
+    }
+  }
+
+  public void testIndexWithSortIsCongruent() throws IOException {
+    try (Directory dir = newDirectory()) {
+      IndexWriterConfig iwc = new IndexWriterConfig(new MockAnalyzer(random()));
+      String parentField = "parent";
+      Sort indexSort = new Sort(parentField, new SortField("foo", SortField.Type.INT));
+      iwc.setIndexSort(indexSort);
+      try (IndexWriter writer = new IndexWriter(dir, iwc)) {
+        Document child1 = new Document();
+        child1.add(new StringField("id", Integer.toString(1), Store.YES));
+        Document child2 = new Document();
+        child2.add(new StringField("id", Integer.toString(1), Store.YES));
+        Document parent = new Document();
+        parent.add(new StringField("id", Integer.toString(1), Store.YES));
+        parent.add(new NumericDocValuesField(parentField, 0));

Review Comment:
   No, it has no meaning. We could add this field automatically at index time if
a parent field is set on the IWC or the sort, and record the number of
children. If you want to model multiple levels of nesting, you can do that on
top of the API. All we guarantee is that this block of docs is unchanged across
merges; I don't think we should do more. You are free to use whatever marker
you want to identify sub-roots. Yet you can't use the parent field; I think
that is reserved for the top level.
   One upside of having a single value for all parents is that it's very
storage efficient.






Re: [PR] Add support for index sorting with document blocks [lucene]

2023-11-23 Thread via GitHub


s1monw commented on PR #12829:
URL: https://github.com/apache/lucene/pull/12829#issuecomment-1824135498

   > I'm a little worried about giving up functionality, but I think if we had
a list of parent-fields rather than a single parent-field that would cover what
we can do today? Maybe one is enough - you can still do hierarchies with one
field, but today's API actually lets you define multiple overlapping relations.
   
   I don't think we give up any functionality. Can you elaborate on what
functionality you are referring to? I don't think we should have a list of
parent fields that IW requires; what would that list be used for? I also don't
understand how the API lets you define overlapping relations that this change
would prevent.





Re: [PR] Add a merge policy wrapper that performs recursive graph bisection on merge. [lucene]

2023-11-23 Thread via GitHub


jpountz merged PR #12622:
URL: https://github.com/apache/lucene/pull/12622





Re: [PR] Simplify advancing on postings/impacts enums [lucene]

2023-11-23 Thread via GitHub


jpountz commented on PR #12810:
URL: https://github.com/apache/lucene/pull/12810#issuecomment-1824353933

   I'll merge to make sure it gets some time in CI before we cut a release. 
Feel free to raise concerns after merging if you have any, I'll happily address 
them and revert if necessary.





Re: [PR] Simplify advancing on postings/impacts enums [lucene]

2023-11-23 Thread via GitHub


jpountz merged PR #12810:
URL: https://github.com/apache/lucene/pull/12810











Re: [I] Virtual threads and Lucene (support async tasks) [lucene]

2023-11-23 Thread via GitHub


Jeevananthan-23 commented on issue #12531:
URL: https://github.com/apache/lucene/issues/12531#issuecomment-1824474577

   @uschindler, I understand the complexity of completely rewriting all Lucene
internals. However, IMO it is necessary to do so in parallel. Relying entirely
on MMAP is a bad idea; it is only good for warmed-up queries.
   
   Ref: https://db.cs.cmu.edu/mmap-cidr2022/





Re: [I] Explore a single scoring implementation in DrillSidewaysScorer [LUCENE-10037] [lucene]

2023-11-23 Thread via GitHub


slow-J commented on issue #11076:
URL: https://github.com/apache/lucene/issues/11076#issuecomment-1824516300

   Hi @gsmiller, I found a drill sideways tasks file in luceneutil
([ds.tasks](https://github.com/mikemccand/luceneutil/blob/5fae60800edd84c70492ac8765824ca2e4a6a991/tasks/ds.tasks)).
 I tried locally creating a wikimediumds (`WIKI_MEDIUM_DS =
Data('wikimediumds', constants.WIKI_MEDIUM_DOCS_LINE_FILE,
constants.WIKI_MEDIUM_DOCS_COUNT, constants.WIKI_DS)`) and ran `python3
src/python/localrun.py -source wikimediumds`.
   
   I commented out calls to `doUnionScoring` and `doDrillDownAdvanceScoring`
and left only `doQueryFirstScoring`
(https://github.com/slow-J/lucene/commit/7423ddda30a2ca5e3fd90ee9743b3c40ad6b94e0);
 it passes all unit tests.
   
   ```
Task                       QPS baseline (StdDev)   QPS my_modified_version (StdDev)   Pct diff                p-value
HighTermEasyDD2               2718.56  (6.0%)          2656.82  (4.8%)                -2.3% ( -12% -    9%)    0.189
MedTermEasyOrDD1              2454.91  (5.6%)          2417.95  (4.4%)                -1.5% ( -10% -    8%)    0.343
MedTermMixedDD2               3184.16  (5.5%)          3137.70  (7.0%)                -1.5% ( -13% -   11%)    0.465
LowTermEasyDD2                2602.71  (6.0%)          2570.49  (5.6%)                -1.2% ( -12% -   11%)    0.503
LowTermEasyDD1                2583.23  (4.7%)          2553.44  (5.2%)                -1.2% ( -10% -    9%)    0.465
HighTermMixedDD2              2673.50  (4.3%)          2643.56  (6.8%)                -1.1% ( -11% -   10%)    0.536
LowTermHardDD1                2668.88  (5.9%)          2639.66  (4.0%)                -1.1% ( -10% -    9%)    0.496
MedTermHardDD1                2320.13  (5.1%)          2298.27  (5.7%)                -0.9% ( -11% -   10%)    0.580
HighTermHardOrDD2             2477.61  (4.8%)          2454.33  (3.6%)                -0.9% (  -8% -    7%)    0.481
PKLookup                       110.75 (12.0%)           109.98 (11.3%)                -0.7% ( -21% -   25%)    0.849
HighTermEasyDD1               2777.91  (6.0%)          2762.76  (6.7%)                -0.5% ( -12% -   12%)    0.787
MedTermHardOrDD1              2998.49  (5.2%)          2983.15  (5.8%)                -0.5% ( -10% -   11%)    0.769
HighTermEasyOrDD1             2513.54  (4.9%)          2503.33  (4.4%)                -0.4% (  -9% -    9%)    0.782
LowTermHardOrDD2              2349.94  (6.2%)          2346.39  (4.9%)                -0.2% ( -10% -   11%)    0.932
LowTermHardDD2                2566.37  (4.5%)          2570.74  (5.5%)                 0.2% (  -9% -   10%)    0.914
MedTermEasyDD2                3231.73  (6.8%)          3239.67  (6.4%)                 0.2% ( -12% -   14%)    0.906
MedTermEasyDD1                3323.66  (7.4%)          3344.85  (5.4%)                 0.6% ( -11% -   14%)    0.755
LowTermEasyOrDD1              2385.39  (5.0%)          2401.56  (5.6%)                 0.7% (  -9% -   11%)    0.686
LowTermEasyOrDD2              2430.31  (4.7%)          2454.95  (5.6%)                 1.0% (  -8% -   11%)    0.534
HighTermEasyOrDD2             2394.51  (5.0%)          2424.64  (6.2%)                 1.3% (  -9% -   13%)    0.478
MedTermEasyOrDD2              2856.22  (7.0%)          2892.76  (5.1%)                 1.3% ( -10% -   14%)    0.508
LowTermMixedDD2               2672.52  (5.9%)          2716.97  (5.9%)                 1.7% (  -9% -   14%)    0.372
MedTermHardDD2                2319.03  (6.4%)          2361.41  (4.6%)                 1.8% (  -8% -   13%)    0.299
HighTermHardOrDD1             2550.89  (5.4%)          2598.09  (2.7%)                 1.9% (  -5% -   10%)    0.172
LowTermHardOrDD1              2429.07  (7.3%)          2484.87  (4.6%)                 2.3% (  -8% -   15%)    0.234
HighTermHardDD2               2627.08  (5.8%)          2697.49  (5.7%)                 2.7% (  -8% -   14%)    0.139
MedTermHardOrDD2              2846.21  (6.0%)          2956.36  (6.2%)                 3.9% (  -7% -   17%)    0.045
HighTermHardDD1               2684.90  (4.4%)          2800.76  (3.3%)                 4.3% (  -3% -   12%)    0.000
   ```
   
   I think from these results we can see that there is no performance loss from
removing `doUnionScoring` and `doDrillDownAdvanceScoring`.
   
   I am wondering: is the setup correct enough to trust the results? Does
[ds.tasks](https://github.com/mikemccand/luceneutil/blob/5fae60800edd84c70492ac8765824ca2e4a6a991/tasks/ds.tasks)
 contain enough tasks?



[PR] Move MergeState.DocMap to a FunctionalInterface [lucene]

2023-11-23 Thread via GitHub


s1monw opened a new pull request, #12836:
URL: https://github.com/apache/lucene/pull/12836

   This change converts MergeState.DocMap to a functional interface, to make use
of lambda expressions.
   





Re: [PR] Move MergeState.DocMap to a FunctionalInterface [lucene]

2023-11-23 Thread via GitHub


jpountz commented on code in PR #12836:
URL: https://github.com/apache/lucene/pull/12836#discussion_r1403533598


##
lucene/core/src/java/org/apache/lucene/index/IndexWriter.java:
##
@@ -5237,13 +5234,10 @@ public int get(int docID) {
     for (int i = 0; i < docMaps.length; ++i) {
       MergeState.DocMap reorderDocMap = reorderDocMaps[i];
       docMaps[i] =
-          new MergeState.DocMap() {
-            @Override
-            public int get(int docID) {
-              int reorderedDocId = reorderDocMap.get(docID);
-              int compactedDocId = compactionDocMap.get(reorderedDocId);
-              return compactedDocId;
-            }
+          docID -> {
+            int reorderedDocId = reorderDocMap.get(docID);
+            int compactedDocId = compactionDocMap.get(reorderedDocId);
+            return compactedDocId;

Review Comment:
   Seeing the above makes me want to write something along the lines of
`(reorderDocMap::get).andThen(compactionDocMap::get)`. (Just food for thought;
it might not actually be practical.)
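
   (For illustration, a hedged sketch of what such composition could look like
on a primitive-int functional interface, avoiding the boxing that a generic
`Function#andThen` would introduce; the `andThen` default method below is made
up, not an existing Lucene API:)

   ```java
   @FunctionalInterface
   interface DocMap {
     int get(int docID);
   
     // Composition stays on primitive ints, so no Integer boxing occurs.
     default DocMap andThen(DocMap next) {
       return docID -> next.get(get(docID));
     }
   }
   
   // Usage sketch:
   // DocMap combined = reorderDocMap.andThen(compactionDocMap);
   ```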






Re: [PR] Speedup concurrent multi-segment HNWS graph search [lucene]

2023-11-23 Thread via GitHub


mayya-sharipova commented on code in PR #12794:
URL: https://github.com/apache/lucene/pull/12794#discussion_r1403551939


##
lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java:
##
@@ -26,26 +26,71 @@
  * @lucene.experimental
  */
 public final class TopKnnCollector extends AbstractKnnCollector {
+  private static final float DEFAULT_GREEDINESS = 0.9f;

Review Comment:
   @vigyasharma Thanks for your feedback. Indeed, more documentation is needed;
I will add it after we finalize the experiments.
   
   The general idea behind introducing a second, shorter local queue is that
searches on different graphs can progress differently. We don't want to stop
searching a graph if we are just starting and still in a bad neighbourhood
where similarity can be worse than the globally collected results. We still
want to make some progress.
   
   As you correctly noticed, `greediness` is meant to express how greedy our
local segment-based search is when we are not competitive globally. A good
approach could be to be greedy and not do much exploration in this case,
keeping the size of the second queue small.
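
   (To make the idea concrete, a hedged sketch of how the two queues could
combine into the local minimum competitive similarity, using the field names
from the diff above; this illustrates the description, not necessarily the
exact logic in the patch:)

   ```java
   @Override
   public float minCompetitiveSimilarity() {
     if (kResultsCollected == false) {
       return Float.NEGATIVE_INFINITY; // keep exploring until k results are collected
     }
     // Greedy when globally non-competitive: bounded by the short local queue.
     return Math.max(queue.topScore(), Math.min(cachedGlobalMinSim, queueg.topScore()));
   }
   ```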






Re: [PR] Speedup concurrent multi-segment HNWS graph search [lucene]

2023-11-23 Thread via GitHub


mayya-sharipova commented on code in PR #12794:
URL: https://github.com/apache/lucene/pull/12794#discussion_r1403560016


##
lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:
##
@@ -79,24 +81,30 @@ public Query rewrite(IndexSearcher indexSearcher) throws IOException {
       filterWeight = null;
     }
 
+    MaxScoreAccumulator globalMinSimAcc = new MaxScoreAccumulator();
     TaskExecutor taskExecutor = indexSearcher.getTaskExecutor();
     List<LeafReaderContext> leafReaderContexts = reader.leaves();
     List<Callable<TopDocs>> tasks = new ArrayList<>(leafReaderContexts.size());
     for (LeafReaderContext context : leafReaderContexts) {
-      tasks.add(() -> searchLeaf(context, filterWeight));
+      tasks.add(() -> searchLeaf(context, filterWeight, globalMinSimAcc));

Review Comment:
   `MaxScoreAccumulator` is thread-safe; it uses a `LongAccumulator`, which is
enough for our purposes to collect the max global similarity.
   And the same instance of `MaxScoreAccumulator` is shared among all slices.
   
   This approach is similar to `TopScoreDocCollector`.
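
   (For intuition, a hedged sketch of the packing trick that makes a
`LongAccumulator` usable for tracking a (doc, score) maximum; it assumes
non-negative scores, and the class and method names are illustrative, not
Lucene's exact code:)

   ```java
   import java.util.concurrent.atomic.LongAccumulator;
   
   final class GlobalSimAccumulator {
     // Long::max on the packed value orders by score first, thread-safely.
     private final LongAccumulator acc = new LongAccumulator(Long::max, Long.MIN_VALUE);
   
     void accumulate(int doc, float score) {
       // floatToIntBits preserves numeric order for non-negative floats
       acc.accumulate(((long) Float.floatToIntBits(score) << 32) | (doc & 0xFFFFFFFFL));
     }
   
     float maxScore() {
       long v = acc.get();
       return v == Long.MIN_VALUE ? Float.NEGATIVE_INFINITY : Float.intBitsToFloat((int) (v >>> 32));
     }
   }
   ```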
   






Re: [PR] Speedup concurrent multi-segment HNWS graph search [lucene]

2023-11-23 Thread via GitHub


mayya-sharipova commented on code in PR #12794:
URL: https://github.com/apache/lucene/pull/12794#discussion_r1403563168


##
lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java:
##
@@ -26,26 +26,71 @@
  * @lucene.experimental
  */
 public final class TopKnnCollector extends AbstractKnnCollector {
+  private static final float DEFAULT_GREEDINESS = 0.9f;
 
   private final NeighborQueue queue;
+  private final float greediness;
+  private final NeighborQueue queueg;
+  private final MaxScoreAccumulator globalMinSimAcc;
+  private boolean kResultsCollected = false;
+  private float cachedGlobalMinSim = Float.NEGATIVE_INFINITY;
+
+  // greediness of globally non-competitive search: [0,1]
 
   /**
    * @param k the number of neighbors to collect
    * @param visitLimit how many vector nodes the results are allowed to visit
+   * @param globalMinSimAcc the global minimum competitive similarity tracked across all segments
    */
-  public TopKnnCollector(int k, int visitLimit) {
+  public TopKnnCollector(int k, int visitLimit, MaxScoreAccumulator globalMinSimAcc) {
+    super(k, visitLimit);
+    this.greediness = DEFAULT_GREEDINESS;
+    this.queue = new NeighborQueue(k, false);
+    int queuegSize = Math.max(1, Math.round((1 - greediness) * k));
+    this.queueg = new NeighborQueue(queuegSize, false);

Review Comment:
   Indeed, it is local.






Re: [PR] Move MergeState.DocMap to a FunctionalInterface [lucene]

2023-11-23 Thread via GitHub


s1monw commented on code in PR #12836:
URL: https://github.com/apache/lucene/pull/12836#discussion_r1403570681


##
lucene/core/src/java/org/apache/lucene/index/IndexWriter.java:
##
@@ -5237,13 +5234,10 @@ public int get(int docID) {
     for (int i = 0; i < docMaps.length; ++i) {
       MergeState.DocMap reorderDocMap = reorderDocMaps[i];
       docMaps[i] =
-          new MergeState.DocMap() {
-            @Override
-            public int get(int docID) {
-              int reorderedDocId = reorderDocMap.get(docID);
-              int compactedDocId = compactionDocMap.get(reorderedDocId);
-              return compactedDocId;
-            }
+          docID -> {
+            int reorderedDocId = reorderDocMap.get(docID);
+            int compactedDocId = compactionDocMap.get(reorderedDocId);
+            return compactedDocId;

Review Comment:
   Yeah, I am a bit afraid of the boxing that would happen here, but we can for
sure remove the variables.






[PR] Removing TermInSetQuery array ctor [lucene]

2023-11-23 Thread via GitHub


slow-J opened a new pull request, #12837:
URL: https://github.com/apache/lucene/pull/12837

   Creating a PR for the items discussed in 
https://github.com/apache/lucene/issues/12243.
   
   
   
   Currently, calling
[`KeywordField#newSetQuery`](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/document/KeywordField.java#L174-L180)
 with an array creates 2 TermInSetQueries (ctors used:
[1](https://github.com/apache/lucene/blob/569533bd76a115e239d8c621d2540462bc12d891/lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java#L89),
 
[2](https://github.com/apache/lucene/blob/569533bd76a115e239d8c621d2540462bc12d891/lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java#L102))
 with arrays; the TiSQ then immediately converts the array to a list with
`Arrays.asList(terms)`.
   
   I initially wanted to add another `KeywordField#newSetQuery` ctor which
takes a `Collection` as a param, to remove the needless conversion of the input
to an array (currently needed to call `KeywordField#newSetQuery`), which then
gets immediately changed back to a collection (when creating the TiSQ).
   
   The discussion in the issue mentioned that since TermInSetQuery has
constructors which take both array and collection, we should consolidate them
(unless I am misunderstanding).
   
   So in this PR, I am removing the TiSQ array ctor (since order does not
matter in this case) and changing the `KeywordField#newSetQuery` param to a
collection (previously an array).
   
   
   
   Closes https://github.com/apache/lucene/issues/12243.
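
   (For illustration, a hedged example of what a call site looks like with the
Collection-based ctor that this PR keeps; it assumes the change lands as
described:)

   ```java
   import java.util.List;
   import org.apache.lucene.search.Query;
   import org.apache.lucene.search.TermInSetQuery;
   import org.apache.lucene.util.BytesRef;
   
   class TermInSetExample {
     static Query colorsQuery() {
       // Build the set query from a Collection directly; no intermediate array needed.
       return new TermInSetQuery("color", List.of(new BytesRef("red"), new BytesRef("blue")));
     }
   }
   ```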





Re: [I] Add a new static method for KeywordField#newSetQuery to support collections parameter [lucene]

2023-11-23 Thread via GitHub


slow-J commented on issue #12243:
URL: https://github.com/apache/lucene/issues/12243#issuecomment-1824681333

   > > I don't think the fact that TermInSetQuery has a Collection ctor should 
impact this at all. Maybe that should be removed?
   > 
   > +1
   > 
   > It's weird to take both array and Collection when the API is just going to 
do the obvious conversion that a caller would do? We should somehow be 
consistent here about such "sugar" APIs.
   > 
   > But if we are going to pick one, shouldn't it be `Collection` for these 
two cases, since order does not matter? Taking an array implies that order 
might be significant, but (I think?) order does not matter for 
`KeywordField.newSetQuery` nor for `TermInSetQuery` ctor?
   > 
   > I don't feel strongly though (Collection vs array) ... but I do agree we 
should be consistent.
   
   Thanks Mike! To progress the discussion, I opened a PR to remove the
`TermInSetQuery` array ctor and change the `KeywordField.newSetQuery` values
param to `Collection`.




Re: [PR] Speedup concurrent multi-segment HNWS graph search [lucene]

2023-11-23 Thread via GitHub


mayya-sharipova commented on code in PR #12794:
URL: https://github.com/apache/lucene/pull/12794#discussion_r1403584908


##
lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java:
##
@@ -26,26 +26,71 @@
  * @lucene.experimental
  */
 public final class TopKnnCollector extends AbstractKnnCollector {
+  private static final float DEFAULT_GREEDINESS = 0.9f;
 
   private final NeighborQueue queue;
+  private final float greediness;
+  private final NeighborQueue queueg;
+  private final MaxScoreAccumulator globalMinSimAcc;
+  private boolean kResultsCollected = false;
+  private float cachedGlobalMinSim = Float.NEGATIVE_INFINITY;
+
+  // greediness of globally non-competitive search: [0,1]
 
   /**
    * @param k the number of neighbors to collect
    * @param visitLimit how many vector nodes the results are allowed to visit
+   * @param globalMinSimAcc the global minimum competitive similarity tracked across all segments
    */
-  public TopKnnCollector(int k, int visitLimit) {
+  public TopKnnCollector(int k, int visitLimit, MaxScoreAccumulator globalMinSimAcc) {
+    super(k, visitLimit);
+    this.greediness = DEFAULT_GREEDINESS;
+    this.queue = new NeighborQueue(k, false);
+    int queuegSize = Math.max(1, Math.round((1 - greediness) * k));
+    this.queueg = new NeighborQueue(queuegSize, false);
+    this.globalMinSimAcc = globalMinSimAcc;
+  }
+
+  public TopKnnCollector(
+      int k, int visitLimit, MaxScoreAccumulator globalMinSimAcc, float greediness) {
     super(k, visitLimit);
+    this.greediness = greediness;
     this.queue = new NeighborQueue(k, false);
+    this.queueg = new NeighborQueue(Math.round((1 - greediness) * k), false);
+    this.globalMinSimAcc = globalMinSimAcc;
   }
 
   @Override
   public boolean collect(int docId, float similarity) {
-    return queue.insertWithOverflow(docId, similarity);
+    boolean result = queue.insertWithOverflow(docId, similarity);
+    queueg.insertWithOverflow(docId, similarity);
+
+    boolean reachedKResults = (kResultsCollected == false && queue.size() == k());
+    if (reachedKResults) {
+      kResultsCollected = true;
+    }
+    if (globalMinSimAcc != null && kResultsCollected) {
+      // as we've collected k results, we can start exchanging globally
+      globalMinSimAcc.accumulate(queue.topNode(), queue.topScore());
+
+      // periodically update the local copy of global similarity
+      if (reachedKResults || (visitedCount & globalMinSimAcc.modInterval) == 0) {
+        MaxScoreAccumulator.DocAndScore docAndScore = globalMinSimAcc.get();
+        cachedGlobalMinSim = docAndScore.score;

Review Comment:
   Thanks for your idea; it looks like it should provide better speedups than
the current approach. We can check globalMinSim before starting the search and
see if it makes a difference.






Re: [I] Virtual threads and Lucene (support async tasks) [lucene]

2023-11-23 Thread via GitHub


uschindler commented on issue #12531:
URL: https://github.com/apache/lucene/issues/12531#issuecomment-1824732748

   > @uschindler, I understand the complexity of completely rewriting all
Lucene internals. However, IMO it is necessary to do so in parallel. Relying
entirely on MMAP is a bad idea; it is only good for warmed-up queries.
   > 
   > Ref: https://db.cs.cmu.edu/mmap-cidr2022/
   
   Lucene has different access patterns that are not database-like; MMAP works
perfectly here. Lucene uses WORM (write once, read many) files. This paper is
known to us, and our long-time testing showed that it does not apply to most
Lucene workloads.





Re: [PR] Speedup concurrent multi-segment HNWS graph search [lucene]

2023-11-23 Thread via GitHub


mayya-sharipova commented on PR #12794:
URL: https://github.com/apache/lucene/pull/12794#issuecomment-1824736252

   @vigyasharma Answering other questions:
   
   > We seem to consistently see an improvement in recall between single 
segment, and multi-segment runs (both seq and conc.) on baseline. Is this 
because with multiple segments, we get multiple entry points into the overall 
graph? Whereas in a single merged segment, we only have access to a sparser set 
of nodes in layer-1 while finding the single best entry point?
   
   Indeed, this is the correct observation. For multiple segments, we retrieve
k results from each segment, and then merge `k * num_of_segments` results to
get the best `k` results. As we are retrieving more results, we also get better
recall than from a single segment.
   
   
   
   
   





[PR] Simplify advancing on postings/impacts enums [lucene]

2023-11-23 Thread via GitHub


jpountz opened a new pull request, #12838:
URL: https://github.com/apache/lucene/pull/12838

   This is a new iteration on #12810, which I had to revert because of test 
failures when docFreq is a multiple of 128.
   
   We currently have a hack when the doc freq is not a multiple of 128 in order 
to not write skip data for the last block. I suspect it made sense in the past 
when postings were only about advancing postings, but storing this last skip 
data point also helps simplify advancing because postings enums can then always 
assume that the last doc ID of a full block is equal to the next skip doc. This 
also helps dynamic pruning by also storing impacts for this last block.
   
   See the second commit of this PR for more details.





Re: [PR] Move MergeState.DocMap to a FunctionalInterface [lucene]

2023-11-23 Thread via GitHub


s1monw merged PR #12836:
URL: https://github.com/apache/lucene/pull/12836





[I] Grow arrays up to a given limit to avoid overallocation where possible [lucene]

2023-11-23 Thread via GitHub


stefanvodita opened a new issue, #12839:
URL: https://github.com/apache/lucene/issues/12839

   ### Description
   
   `ArrayUtils` provides methods to grow arrays, overallocating exponentially, 
with the possibility of requesting a minimum size.
   Sometimes we have an upper limit to the number of elements that would go 
into that array. In those situations, it would be nice to have a method that 
grows up to a limit.
   For example, `DirectoryTaxonomyReader.getBulkOrdinals` dynamically grows an 
array pointing to ordinals missing from the cache 
([code](https://github.com/apache/lucene/blob/981339be0413730598631c1f578ba8a6493061b9/lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java#L354)).
 But we know ahead of time there can't be more missing ordinals than there are 
ordinals in the taxonomy. We can limit the array growth to avoid overallocating.
   This pattern might be applicable in multiple places. `TermAutomatonScorer`
seems to use a [similar
pattern](https://github.com/apache/lucene/blob/981339be0413730598631c1f578ba8a6493061b9/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/TermAutomatonScorer.java#L294).
   
   The API could look like this:
   ```
   public static int[] growInRange(int[] array, int minLength, int maxLength)
   ```
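
   (A hedged sketch of what such a helper could look like on top of the
existing `ArrayUtil.oversize`/`ArrayUtil.growExact` utilities; this only
illustrates the proposal, not a final implementation:)

   ```java
   // Grow exponentially like ArrayUtil.grow, but never past the known upper bound.
   public static int[] growInRange(int[] array, int minLength, int maxLength) {
     assert minLength >= 0 && minLength <= maxLength;
     if (array.length >= minLength) {
       return array; // already large enough
     }
     // oversize() returns at least minLength; cap it at maxLength
     int newLength = Math.min(maxLength, ArrayUtil.oversize(minLength, Integer.BYTES));
     return ArrayUtil.growExact(array, newLength);
   }
   ```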





Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-23 Thread via GitHub


dungba88 commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1402058230


##
lucene/core/src/java/org/apache/lucene/util/fst/FST.java:
##
@@ -435,6 +433,13 @@ public FST(FSTMetadata metadata, DataInput in, Outputs outputs, FSTStore f
     this.fstReader = fstReader;
   }
 
+  /**
+   * @return true if and only if this FST is readable (e.g. has a reverse BytesReader)
+   */
+  public boolean isReadable() {

Review Comment:
   Yeah, I was thinking that as we will remove it anyway, maybe we don't need
this, along with the null check (as they seem to be throw-away code). If users
try to use a non-readable FST now, an NPE will be thrown instead (of an
IllegalStateException). And then when we change FSTCompiler to return
FSTMetadata, they can't even create a non-readable FST. What do you think?
   
   Alternatively, we can double back to `NullFSTReader`, but let it throw an
exception like this:
   
   ```
   private static final class NullFSTReader implements FSTReader {

     @Override
     public long ramBytesUsed() {
       return 0;
     }

     @Override
     public FST.BytesReader getReverseBytesReader() {
       throw new UnsupportedOperationException(
           "NullFSTReader does not support getReverseBytesReader()");
     }

     @Override
     public void writeTo(DataOutput out) {
       throw new UnsupportedOperationException(
           "NullFSTReader does not support writeTo(DataOutput)");
     }
   }
   ```
   
   The benefits are:
   - We can enforce `FSTReader` to be non-null and remove all null checks, while
still throwing a meaningful exception when users try to use a non-readable FST.
   - In the next PR, `NullFSTReader` can be used for the temporary FST created
during building, so it won't be throw-away code.






Re: [PR] Random access term dictionary [lucene]

2023-11-23 Thread via GitHub


Tony-X commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1825228925

   After some tweaking and tinkering I was finally able to get the bench 
running the way I wanted in luceneutil! @mikemccand 
   
   Unfortunately luceneutil out of the box doesn't work for my case...
   
   
   ```
Task                       QPS baseline (StdDev)   QPS my_modified_version (StdDev)   Pct diff                p-value
Wildcard                        56.20  (2.0%)             7.30  (0.2%)               -87.0% ( -87% -  -86%)    0.000
Respell                         44.00  (1.7%)            14.46  (0.6%)               -67.1% ( -68% -  -65%)    0.000
Fuzzy1                          54.11  (1.2%)            20.86  (0.8%)               -61.4% ( -62% -  -60%)    0.000
Prefix3                        103.61  (0.8%)            41.37  (0.5%)               -60.1% ( -60% -  -59%)    0.000
Fuzzy2                          42.43  (1.2%)            20.26  (0.9%)               -52.3% ( -53% -  -50%)    0.000
HighTermTitleSort              127.60  (1.7%)           114.65  (1.6%)               -10.1% ( -13% -   -6%)    0.000
HighTermMonthSort             2549.22  (2.8%)          2435.74  (3.8%)                -4.5% ( -10% -    2%)    0.037
AndHighLow                     708.99  (2.3%)           678.15  (2.1%)                -4.4% (  -8% -    0%)    0.001
LowTerm                        369.16  (5.5%)           358.15  (3.1%)                -3.0% ( -10% -    5%)    0.287
OrNotHighLow                   258.51  (1.8%)           252.11  (2.2%)                -2.5% (  -6% -    1%)    0.054
OrHighLow                      348.58  (1.1%)           340.11  (2.5%)                -2.4% (  -5% -    1%)    0.046
OrNotHighHigh                  141.31  (6.7%)           138.54  (3.2%)                -2.0% ( -11% -    8%)    0.554
IntNRQ                          17.93  (1.8%)            17.62  (3.7%)                -1.7% (  -7% -    3%)    0.349
MedSloppyPhrase                 26.73  (0.8%)            26.28  (1.3%)                -1.7% (  -3% -    0%)    0.017
HighIntervalsOrdered             3.88  (2.6%)             3.82  (2.3%)                -1.6% (  -6% -    3%)    0.314
HighTerm                       290.77  (7.3%)           286.20  (6.9%)                -1.6% ( -14% -   13%)    0.727
OrHighNotHigh                  296.51  (6.9%)           291.90  (4.9%)                -1.6% ( -12% -   11%)    0.682
LowIntervalsOrdered             15.48  (1.1%)            15.26  (1.3%)                -1.4% (  -3% -    0%)    0.057
MedTerm                        405.54  (6.3%)           399.77  (5.9%)                -1.4% ( -12% -   11%)    0.713
LowSloppyPhrase                 44.64  (0.5%)            44.04  (2.1%)                -1.3% (  -3% -    1%)    0.171
HighTermDayOfYearSort          177.36  (1.6%)           175.08  (1.8%)                -1.3% (  -4% -    2%)    0.234
OrHighMed                       79.44  (3.7%)            78.48  (3.0%)                -1.2% (  -7% -    5%)    0.572
OrNotHighMed                   268.82  (4.3%)           265.74  (3.1%)                -1.1% (  -8% -    6%)    0.632
BrowseMonthTaxoFacets            3.89  (0.4%)             3.85  (1.2%)                -0.9% (  -2% -    0%)    0.134
AndHighMedDayTaxoFacets         44.33  (0.7%)            43.96  (1.1%)                -0.8% (  -2% -    0%)    0.145
MedSpanNear                     30.67  (1.1%)            30.42  (2.3%)                -0.8% (  -4% -    2%)    0.481
LowSpanNear                      4.56  (0.8%)             4.53  (2.1%)                -0.6% (  -3% -    2%)    0.576
HighSpanNear                     8.52  (1.4%)             8.47  (2.1%)                -0.5% (  -3% -    2%)    0.636
OrHighNotLow                   236.32  (6.6%)           235.28  (4.4%)                -0.4% ( -10% -   11%)    0.901
AndHighHighDayTaxoFacets         4.31  (0.6%)             4.29  (0.8%)                -0.4% (  -1% -    1%)    0.446
LowPhrase                       69.90  (1.2%)            69.67  (2.7%)                -0.3% (  -4% -    3%)    0.807
AndHighMed                      41.87  (0.6%)            41.75  (2.7%)                -0.3% (  -3% -    3%)    0.828
OrHighHigh                      45.86  (7.4%)            45.77  (7.9%)                -0.2% ( -14% -   16%)    0.969
TermDTSort                     101.90  (2.3%)           101.84  (2.0%)                -0.1% (  -4% -    4%)    0.966
HighSloppyPhrase                 0.48  (2.2%)             0.48  (2.6%)                -0.0% (  -4% -    4%)    0.995
BrowseDateSSDVFacets             1.00  (3.7%)             1.00  (5.3%)                -0.0% (  -8% -    9%)    1.000
HighTermTitleBDVSort             4.48  (5.4%)             4.49  (4.4%)                 0.0% (  -9% -   10%)    0.990
BrowseDayOfYearSSDVFacets        3.51 (13.4%)             3.52 (13.3%)                 0.1% ( -23% -   30%)    0.990
BrowseMonthSSDVFacets            3.38  (0.7%)             3.39  (0.6%)                 0.1% (  -1% -    1%)    0.740
Browse

Re: [PR] hunspell: allow in-memory entry sorting for faster dictionary loading [lucene]

2023-11-23 Thread via GitHub


donnerpeter merged PR #12834:
URL: https://github.com/apache/lucene/pull/12834





Re: [PR] Hide the internal data structure of HeapPointWriter [lucene]

2023-11-23 Thread via GitHub


iverase commented on PR #12762:
URL: https://github.com/apache/lucene/pull/12762#issuecomment-1825276830

   I didn't notice any performance change. Added  javadocs.

