[GitHub] [lucene] jpountz opened a new pull request, #12114: Use radix sort to sort postings when index sorting is enabled.
jpountz opened a new pull request, #12114: URL: https://github.com/apache/lucene/pull/12114 This switches to LSBRadixSorter instead of TimSorter to sort postings whose index options are `DOCS`. On a synthetic benchmark this yielded barely any difference in the case when the index order is the same as the sort order, or reverse, but almost a 3x speedup for writing postings in the case when the index order is mostly random. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #12114: Use radix sort to sort postings when index sorting is enabled.
jpountz commented on PR #12114: URL: https://github.com/apache/lucene/pull/12114#issuecomment-1406248901 Here is the synthetic benchmark that I used if someone is interested in reproducing: ```java enum Order { RANDOM, ASC, DESC; } public static void main(String[] args) throws IOException { Order order = Order.RANDOM; Directory dir = FSDirectory.open(Paths.get("/tmp/a")); IndexWriterConfig cfg = new IndexWriterConfig(null); cfg.setInfoStream(new PrintStreamInfoStream(System.out)); cfg.setMaxBufferedDocs(100_000); cfg.setRAMBufferSizeMB(IndexWriterConfig.DISABLE_AUTO_FLUSH); cfg.setIndexSort(new Sort(LongField.newSortField("sort_field", false, SortedNumericSelector.Type.MIN))); IndexWriter w = new IndexWriter(dir, cfg); Document doc = new Document(); LongField sortField = new LongField("sort_field", 0); doc.add(sortField); StringField stringField1 = new StringField("string_field", "", Store.NO); doc.add(stringField1); StringField stringField2 = new StringField("string_field", "", Store.NO); doc.add(stringField2); StringField stringField3 = new StringField("string_field", "", Store.NO); doc.add(stringField3); for (int i = 0; i < 5_000_000; ++i) { long sortValue = switch (order) { case RANDOM -> i % 15; case ASC -> i; case DESC -> -i; }; sortField.setLongValue(sortValue); stringField1.setStringValue(Integer.toBinaryString(i % 10)); stringField2.setStringValue(Integer.toBinaryString(i % 100)); stringField3.setStringValue(Integer.toBinaryString(i % 1000)); w.addDocument(doc); } } ``` And flush times for postings: | | Main | Patch | | -- | - | - | | Index sort matches indexing order | 6 | 7 | | Index sort is reverse indexing order | 7 | 8 | | Random sort | 27 | 10 | -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] cpoerschke opened a new issue, #12115: org.apache.lucene.search.uhighlight.TestUnifiedHighlighterTermVec.testFetchTermVecsOncePerDoc fails reproducibly
cpoerschke opened a new issue, #12115: URL: https://github.com/apache/lucene/issues/12115 ### Description currently 5 matches in the last few weeks: https://lists.apache.org/list?bui...@lucene.apache.org:dfr=2022-1-1|dto=2024-1-1:org.apache.lucene.search.uhighlight.TestUnifiedHighlighterTermVec.testFetchTermVecsOncePerDoc ``` ... Error Message: java.lang.AssertionError: Should not request TVs for doc more than once. Stack Trace: java.lang.AssertionError: Should not request TVs for doc more than once. at junit@4.13.1/org.junit.Assert.fail(Assert.java:89) at junit@4.13.1/org.junit.Assert.assertTrue(Assert.java:42) at junit@4.13.1/org.junit.Assert.assertFalse(Assert.java:65) ... ``` ### Gradle command to reproduce from https://jenkins.thetaphi.de/job/Lucene-main-MacOSX/9238/testReport/junit/org.apache.lucene.search.uhighlight/TestUnifiedHighlighterTermVec/testFetchTermVecsOncePerDoc/ > ./gradlew test --tests TestUnifiedHighlighterTermVec.testFetchTermVecsOncePerDoc -Dtests.seed=B40B463B9510DF27 -Dtests.locale=ff-Latn-SL -Dtests.timezone=Asia/Ulaanbaatar -Dtests.asserts=true -Dtests.file.encoding=UTF-8 from https://lists.apache.org/thread/np5w9kzklskjxflxmthc8r7y59x5934m > ./gradlew :lucene:highlighter:test --tests "org.apache.lucene.search.uhighlight.TestUnifiedHighlighterTermVec.testFetchTermVecsOncePerDoc" -Ptests.jvms=4 -Ptests.haltonfailure=false "-Ptests.jvmargs=-XX:TieredStopAtLevel=1 -XX:+UseParallelGC -XX:ActiveProcessorCount=1" -Ptests.seed=29E42D85ACE838EF -Ptests.multiplier=2 -Ptests.nightly=true -Ptests.badapples=false -Ptests.gui=true -Ptests.file.encoding=US-ASCII # -Ptests.linedocsfile=/home/jenkins/jenkins-slave/workspace/Lucene/Lucene-NightlyTests-9.x/test-data/enwiki.random.lines.txt -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz opened a new pull request, #12116: Improve document API for stored fields.
jpountz opened a new pull request, #12116: URL: https://github.com/apache/lucene/pull/12116 Currently stored fields have to look at binaryValue(), stringValue() and numericValue() to guess the type of the value and then store it. This has a few issues: - If there is a problem, e.g. all of these 3 methods return null, it's currently discovered late, when we already passed the responsibility of writing data from IndexingChain to the codec. - numericValue() is used both for numeric doc values and storage. This makes it impossible to implement a `double` field that is stored and doc-valued, as numericValue() needs to return simultaneously a number that consists of the double for storage, and the long bits of the double for doc values. - binaryValue() is used both for sorted(_set) doc values and storage. This makes it impossible to implement `keyword` fields that is stored and doc-valued, as the field returns a non-null value for both binaryValue() and stringValue() and stored fields no longer know which field to store. This commit introduces `IndexableField#storedValue()`, which is used only for stored fields. This addresses the above issues. IndexingChain passes the storedValue() directly to the codec, so it's impossible for a stored fields format to mistakenly use binaryValue()/stringValue()/numericValue() instead of storedValue(). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on a diff in pull request #12054: Introduce a new `KeywordField`.
jpountz commented on code in PR #12054: URL: https://github.com/apache/lucene/pull/12054#discussion_r1089154108 ## lucene/demo/src/java/org/apache/lucene/demo/IndexFiles.java: ## @@ -234,8 +234,8 @@ void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOExcepti // field that is indexed (i.e. searchable), but don't tokenize // the field into separate words and don't index term frequency // or positional information: - Field pathField = new StringField("path", file.toString(), Field.Store.YES); - doc.add(pathField); + doc.add(new KeywordField("path", file.toString())); + doc.add(new StoredField("path", file.toString())); Review Comment: I tried to think more about this and opened https://github.com/apache/lucene/pull/12116 as a possible way forward. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"
gsmiller commented on PR #12089: URL: https://github.com/apache/lucene/pull/12089#issuecomment-1406839169 I found some time to come back to this and did some more benchmarking. I added a markdown file with some benchmark results in this PR for now just as a place to put it. It's [here](https://github.com/gsmiller/lucene/blob/explore/sandbox-tis-query-no-multi-dv-support/lucene/sandbox/src/java/org/apache/lucene/sandbox/queries/TiSBenchResults.md). This is all using a benchmark tool I wrote that's heavily based on something @rmuir did over in #12087. I'll figure how a place to upload that benchmark tool shortly for visibility. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #12116: Improve document API for stored fields.
rmuir commented on PR #12116: URL: https://github.com/apache/lucene/pull/12116#issuecomment-1406967754 This is great, thanks for looking into it. It moves "type-guessing" into the one place that should be doing it, which is the generic Field.java i didn't really think too deeply about implementation but I suppose there might be a few options: * "Box" the value along with its type, like what you've done here (are there perf implications?) * change api of the visitor, e.g. to provide field's type (like the enum inside your current "box") before the value -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"
rmuir commented on PR #12089: URL: https://github.com/apache/lucene/pull/12089#issuecomment-1407020062 thanks for the work benchmarking! you can rename to a .txt file and just attach it to a github comment, as one solution. yeah, this one looked to be trickier at a glance. simply indexing the same integer values as strings makes the queries quite a bit slower than what I saw with Points+DV on #12087. And not even using lots of query terms... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"
gsmiller commented on PR #12089: URL: https://github.com/apache/lucene/pull/12089#issuecomment-1407307849 > attach it to a github comment That works! Here's how I benchmarked. One note if you're interested in running this is to make sure to shuffle the genomes data prior to running or you get very skewed results. This is because the file is sorted by country code, so a postings-based approach is heavily favored by the natural index sorting if you index the lines in this order. [TiSBench.txt](https://github.com/apache/lucene/files/10526083/TiSBench.txt) I'm actually a bit surprised/impressed that our existing `IndexOrDocValues` functionality works as well as it does across these queries, given how rough of an estimate `#cost()` is on the current `TermInSetQuery`. There are some clear cases where term-seeking and using term-level stats in a heuristic helps make better decisions, but I expected the difference to be more pronounced. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org