[GitHub] [lucene] jpountz opened a new pull request, #12114: Use radix sort to sort postings when index sorting is enabled.

2023-01-27 Thread via GitHub


jpountz opened a new pull request, #12114:
URL: https://github.com/apache/lucene/pull/12114

   This switches to LSBRadixSorter instead of TimSorter to sort postings whose 
index options are `DOCS`. On a synthetic benchmark this yielded barely any 
difference in the case when the index order is the same as the sort order, or 
reverse, but almost a 3x speedup for writing postings in the case when the 
index order is mostly random.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on pull request #12114: Use radix sort to sort postings when index sorting is enabled.

2023-01-27 Thread via GitHub


jpountz commented on PR #12114:
URL: https://github.com/apache/lucene/pull/12114#issuecomment-1406248901

   Here is the synthetic benchmark that I used if someone is interested in 
reproducing:
   
   ```java
 enum Order {
   RANDOM,
   ASC,
   DESC;
 }
   
 public static void main(String[] args) throws IOException {
   Order order = Order.RANDOM;
   Directory dir = FSDirectory.open(Paths.get("/tmp/a"));
   IndexWriterConfig cfg = new IndexWriterConfig(null);
   cfg.setInfoStream(new PrintStreamInfoStream(System.out));
   cfg.setMaxBufferedDocs(100_000);
   cfg.setRAMBufferSizeMB(IndexWriterConfig.DISABLE_AUTO_FLUSH);
   cfg.setIndexSort(new Sort(LongField.newSortField("sort_field", false, 
SortedNumericSelector.Type.MIN)));
   IndexWriter w = new IndexWriter(dir, cfg);
   Document doc = new Document();
   LongField sortField = new LongField("sort_field", 0);
   doc.add(sortField);
   StringField stringField1 = new StringField("string_field", "", Store.NO);
   doc.add(stringField1);
   StringField stringField2 = new StringField("string_field", "", Store.NO);
   doc.add(stringField2);
   StringField stringField3 = new StringField("string_field", "", Store.NO);
   doc.add(stringField3);
   for (int i = 0; i < 5_000_000; ++i) {
 long sortValue = switch (order) {
 case RANDOM -> i % 15;
 case ASC -> i;
 case DESC -> -i;
 };
 sortField.setLongValue(sortValue);
 stringField1.setStringValue(Integer.toBinaryString(i % 10));
 stringField2.setStringValue(Integer.toBinaryString(i % 100));
 stringField3.setStringValue(Integer.toBinaryString(i % 1000));
 w.addDocument(doc);
   }
 }
   ```
   
   And flush times for postings:
   
   | | Main  | Patch |
   | -- | - | - |
   | Index sort matches indexing order | 6 | 7  |
   | Index sort is reverse indexing order | 7  | 8  |
   | Random sort | 27 | 10 |


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] cpoerschke opened a new issue, #12115: org.apache.lucene.search.uhighlight.TestUnifiedHighlighterTermVec.testFetchTermVecsOncePerDoc fails reproducibly

2023-01-27 Thread via GitHub


cpoerschke opened a new issue, #12115:
URL: https://github.com/apache/lucene/issues/12115

   ### Description
   
   currently 5 matches in the last few weeks: 
https://lists.apache.org/list?bui...@lucene.apache.org:dfr=2022-1-1|dto=2024-1-1:org.apache.lucene.search.uhighlight.TestUnifiedHighlighterTermVec.testFetchTermVecsOncePerDoc
   
   ```
   ...
   Error Message:
   java.lang.AssertionError: Should not request TVs for doc more than once.
   
   Stack Trace:
   java.lang.AssertionError: Should not request TVs for doc more than once.
at junit@4.13.1/org.junit.Assert.fail(Assert.java:89)
at junit@4.13.1/org.junit.Assert.assertTrue(Assert.java:42)
at junit@4.13.1/org.junit.Assert.assertFalse(Assert.java:65)
   ...
   ```
   
   ### Gradle command to reproduce
   
   from 
https://jenkins.thetaphi.de/job/Lucene-main-MacOSX/9238/testReport/junit/org.apache.lucene.search.uhighlight/TestUnifiedHighlighterTermVec/testFetchTermVecsOncePerDoc/
   
   > ./gradlew test --tests 
TestUnifiedHighlighterTermVec.testFetchTermVecsOncePerDoc 
-Dtests.seed=B40B463B9510DF27 -Dtests.locale=ff-Latn-SL 
-Dtests.timezone=Asia/Ulaanbaatar -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8
   
   from https://lists.apache.org/thread/np5w9kzklskjxflxmthc8r7y59x5934m
   
   > ./gradlew :lucene:highlighter:test --tests 
"org.apache.lucene.search.uhighlight.TestUnifiedHighlighterTermVec.testFetchTermVecsOncePerDoc"
 -Ptests.jvms=4 -Ptests.haltonfailure=false 
"-Ptests.jvmargs=-XX:TieredStopAtLevel=1 -XX:+UseParallelGC 
-XX:ActiveProcessorCount=1" -Ptests.seed=29E42D85ACE838EF -Ptests.multiplier=2 
-Ptests.nightly=true -Ptests.badapples=false -Ptests.gui=true 
-Ptests.file.encoding=US-ASCII # 
-Ptests.linedocsfile=/home/jenkins/jenkins-slave/workspace/Lucene/Lucene-NightlyTests-9.x/test-data/enwiki.random.lines.txt


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz opened a new pull request, #12116: Improve document API for stored fields.

2023-01-27 Thread via GitHub


jpountz opened a new pull request, #12116:
URL: https://github.com/apache/lucene/pull/12116

   Currently stored fields have to look at binaryValue(), stringValue() and 
numericValue() to guess the type of the value and then store it. This has a few 
issues:
- If there is a problem, e.g. all of these 3 methods return null, it's 
currently discovered late, when we already passed the responsibility of writing 
data from IndexingChain to the codec.
- numericValue() is used both for numeric doc values and storage. This 
makes it impossible to implement a `double` field that is stored and 
doc-valued, as numericValue() needs to return simultaneously a number that 
consists of the double for storage, and the long bits of the double for doc 
values.
- binaryValue() is used both for sorted(_set) doc values and storage. This 
makes it impossible to implement `keyword` fields that is stored and 
doc-valued, as the field returns a non-null value for both binaryValue() and 
stringValue() and stored fields no longer know which field to store.
   
   This commit introduces `IndexableField#storedValue()`, which is used only 
for stored fields. This addresses the above issues. IndexingChain passes the 
storedValue() directly to the codec, so it's impossible for a stored fields 
format to mistakenly use binaryValue()/stringValue()/numericValue() instead of 
storedValue().


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a diff in pull request #12054: Introduce a new `KeywordField`.

2023-01-27 Thread via GitHub


jpountz commented on code in PR #12054:
URL: https://github.com/apache/lucene/pull/12054#discussion_r1089154108


##
lucene/demo/src/java/org/apache/lucene/demo/IndexFiles.java:
##
@@ -234,8 +234,8 @@ void indexDoc(IndexWriter writer, Path file, long 
lastModified) throws IOExcepti
   // field that is indexed (i.e. searchable), but don't tokenize
   // the field into separate words and don't index term frequency
   // or positional information:
-  Field pathField = new StringField("path", file.toString(), 
Field.Store.YES);
-  doc.add(pathField);
+  doc.add(new KeywordField("path", file.toString()));
+  doc.add(new StoredField("path", file.toString()));

Review Comment:
   I tried to think more about this and opened 
https://github.com/apache/lucene/pull/12116 as a possible way forward.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"

2023-01-27 Thread via GitHub


gsmiller commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1406839169

   I found some time to come back to this and did some more benchmarking. I 
added a markdown file with some benchmark results in this PR for now just as a 
place to put it. It's 
[here](https://github.com/gsmiller/lucene/blob/explore/sandbox-tis-query-no-multi-dv-support/lucene/sandbox/src/java/org/apache/lucene/sandbox/queries/TiSBenchResults.md).
   
   This is all using a benchmark tool I wrote that's heavily based on something 
@rmuir did over in #12087. I'll figure how a place to upload that benchmark 
tool shortly for visibility.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #12116: Improve document API for stored fields.

2023-01-27 Thread via GitHub


rmuir commented on PR #12116:
URL: https://github.com/apache/lucene/pull/12116#issuecomment-1406967754

   This is great, thanks for looking into it. It moves "type-guessing" into the 
one place that should be doing it, which is the generic Field.java
   
   i didn't really think too deeply about implementation but I suppose there 
might be a few options:
   * "Box" the value along with its type, like what you've done here (are there 
perf implications?)
   * change api of the visitor, e.g. to provide field's type (like the enum 
inside your current "box") before the value
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"

2023-01-27 Thread via GitHub


rmuir commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1407020062

   thanks for the work benchmarking! you can rename to a .txt file and just 
attach it to a github comment, as one solution.
   
   yeah, this one looked to be trickier at a glance. simply indexing the same 
integer values as strings makes the queries quite a bit slower than what I saw 
with Points+DV on #12087. And not even using lots of query terms...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"

2023-01-27 Thread via GitHub


gsmiller commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1407307849

   > attach it to a github comment
   
   That works! Here's how I benchmarked. One note if you're interested in 
running this is to make sure to shuffle the genomes data prior to running or 
you get very skewed results. This is because the file is sorted by country 
code, so a postings-based approach is heavily favored by the natural index 
sorting if you index the lines in this order.
   [TiSBench.txt](https://github.com/apache/lucene/files/10526083/TiSBench.txt)
   
   
   I'm actually a bit surprised/impressed that our existing `IndexOrDocValues` 
functionality works as well as it does across these queries, given how rough of 
an estimate `#cost()` is on the current `TermInSetQuery`. There are some clear 
cases where term-seeking and using term-level stats in a heuristic helps make 
better decisions, but I expected the difference to be more pronounced.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org