Re: [PR] Add IndexInput isLoaded [lucene]
ChrisHegarty commented on PR #13998: URL: https://github.com/apache/lucene/pull/13998#issuecomment-2482415401

> @ChrisHegarty this will be a very useful thing.

Indeed.

> Can we also figure out how much data is loaded with this API? So let's say an IndexInput is 30GB and only 10GB is loaded/mapped in memory can return that too?

While possible, it's not straightforward and would require some native access. For now, let's go with the basic loaded / not-loaded, since this is useful as is.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] KnnFloatVectorQuery#toString should show the filter [lucene]
benwtrent commented on issue #13983: URL: https://github.com/apache/lucene/issues/13983#issuecomment-2484000904 This is now fixed: https://github.com/apache/lucene/pull/13990
Re: [I] KnnFloatVectorQuery#toString should show the filter [lucene]
benwtrent closed issue #13983: KnnFloatVectorQuery#toString should show the filter URL: https://github.com/apache/lucene/issues/13983
Re: [PR] Introduces IndexInput#updateReadAdvice to change the ReadAdvice while merging vectors [lucene]
shatejas commented on PR #13985: URL: https://github.com/apache/lucene/pull/13985#issuecomment-2484035773

### Benchmarks

Setup 1 - OpenSearch cluster

Ran with [opensearch benchmarks](https://github.com/opensearch-project/opensearch-benchmark)

Total data nodes - 3
Total shards - 6 (2 per node), no replicas
Memory - 128gb
vCPU - 16
Dataset used: cohere-10m

Baseline - OS 2.18 and lucene 9.12
Candidate - OS 2.16 and lucene [9.12 with readAdvice changes](https://github.com/apache/lucene/compare/branch_9_12...shatejas:lucene:branch_9_12)

**Why was this tested with lucene 9.12?** OpenSearch does not use lucene >9.12 in any of its versions. Upgrading it to use lucene 10 requires significant changes. For the candidate, the required commits were cherry-picked.

Run 1: sequence of operations: delete-index -> create-index -> add documents -> force-merge -> search

# Results

| | Force-merge(ms) | Force-merge(hrs) | Search p50 | Search p90 | Search p99 |
|---|---|---|---|---|---|
| Baseline | 15795889.88313920 | 4hrs 23 mins | 9.6 | 10.8 | 14.7 |
| Candidate | 15204143.95724240 | 4hrs 13mins | 10.7 | 12.0 | 15.0 |

Run 2: Search performed on already indexed data from above run

| | Search p50 | Search p90 | Search p99 |
|---|---|---|---|
| Baseline | 9.7 | 10.6 | 12.1 |
| Candidate | 10.4 | 11.3 | 12.5 |

Setup 2: Used lucene-utils knnPerfTest.py

Baseline - Lucene main
Candidate - Lucene main with current commit

**Baseline**

| recall | latency (ms) | nDoc | topK | fanout | maxConn | beamWidth | quantized | index s | index docs/s | force merge s | num segments | index size (MB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.644 | 0.428 | 5 | 10 | 64 | 64 | 250 | no | 18.97 | 2635.18 | 1.89 | 1 | 20.62 |

**Candidate**

| recall | latency (ms) | nDoc | topK | fanout | maxConn | beamWidth | quantized | index s | index docs/s | force merge s | num segments | index size (MB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.644 | 0.436 | 5 | 10 | 64 | 64 | 250 | no | 20.20 | 2474.76 | 1.77 | 1 | 20.62 |

There is a small effect on search latencies; it's hard to say if it's due to the change or just a fluctuation in the runs. I couldn't think of a reason the change would affect search latencies. @jpountz @ChrisHegarty thoughts?
Re: [PR] Add IndexInput isLoaded [lucene]
ChrisHegarty commented on PR #13998: URL: https://github.com/apache/lucene/pull/13998#issuecomment-2483288393 Yeah, we can look at how to call `mincore`, and it might not be that much of a lift with the existing plumbing. Maybe something we can look at as a follow-up? I'm really trying to get to a situation where we can load (`MADV_WILLNEED`), and check even the HNSW graph. Maybe even `mlock`, as a potential follow-up. Since not having the graph in memory results in horrible perf (need to get some numbers).
Re: [PR] Add IndexInput isLoaded [lucene]
rmuir commented on code in PR #13998: URL: https://github.com/apache/lucene/pull/13998#discussion_r1846706623

## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ##
@@ -406,6 +406,14 @@ void advise(long offset, long length, IOConsumer advice) throws I
     }
   }

+  public Optional isLoaded() {
+    boolean loaded = true;
+    for (MemorySegment seg : segments) {
+      loaded = loaded && seg.isLoaded();
+    }
+    return Optional.of(loaded);
+  }
+

Review Comment: we could `return false` as soon as we see it, rather than continue to loop and call mincore?
Re: [PR] Avoid allocating liveDocs for no soft-deletes (#13895) (#13903) [lucene]
dnhatn merged PR #14001: URL: https://github.com/apache/lucene/pull/14001
Re: [PR] Update lastDoc in ScoreCachingWrappingScorer [lucene]
jpountz commented on code in PR #13987: URL: https://github.com/apache/lucene/pull/13987#discussion_r1846948010

## lucene/core/src/test/org/apache/lucene/search/TestScoreCachingWrappingScorer.java: ##
@@ -157,4 +157,40 @@ public void testGetScores() throws Exception {
     ir.close();
     directory.close();
   }
+
+  private static class CountingScorable extends FilterScorable {
+    int count = 0;
+
+    public CountingScorable(Scorable in) {
+      super(in);
+    }
+
+    @Override
+    public float score() throws IOException {
+      count++;
+      return in.score();
+    }
+  }
+
+  public void testRepeatedCollectReusesScore() throws Exception {
+    Scorer s = new SimpleScorer();
+    CountingScorable countingScorable = new CountingScorable(s);
+    ScoreCachingCollector scc = new ScoreCachingCollector(scores.length * 2);
+    LeafCollector lc = scc.getLeafCollector(null);
+    lc.setScorer(countingScorable);
+
+    // We need to iterate on the scorer so that its doc() advances.
+    int doc;
+    while ((doc = s.iterator().nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
+      lc.collect(doc);
+      lc.collect(doc);

Review Comment: Thanks for explaining, then my suggestion would be to keep the test simple and not call collect() multiple times on the same doc. Your change looks good to me otherwise, so I can merge after that.
Re: [PR] Add IndexInput isLoaded [lucene]
rmuir commented on PR #13998: URL: https://github.com/apache/lucene/pull/13998#issuecomment-2483336503

Also for debugging these issues, you can get this information at the non-Java level using `fincore` from util-linux, which is probably on any machine:

```
myindexdir$ fincore --output-all *
PAGES SIZE FILE RES DIRTY_PAGES DIRTY WRITEBACK_PAGES WRITEBACK EVICTED_PAGES EVICTED RECENTLY_EVICTED_PAGES RECENTLY_EVICTED
...
```
Re: [PR] Add IndexInput isLoaded [lucene]
rmuir commented on PR #13998: URL: https://github.com/apache/lucene/pull/13998#issuecomment-2483230262 You would need to call `mincore` or something yourself. I can't remember, but the native access may already be plumbed. For non-mmapped I/O you can do something similar with syscalls such as `cachestat`, but you need a modern Linux kernel for that.
Re: [PR] Parse escaped brackets and spaces in range queries [lucene]
benchaplin commented on PR #13887: URL: https://github.com/apache/lucene/pull/13887#issuecomment-2483785234 @dweiss you mentioned in my previous PR that I should do some randomized testing. I did, which helped me find the "Addition of "\\" in the negation set" requirement. However I just translated the key test cases to unit tests for this PR. Did you want me to commit this kind of randomized test? If so, I was thinking I'd have to introduce a snapshotting mechanism to record the output of oldParser(testTerm) as a baseline and compare it against newParser(testTerm) to catch any unintended changes.
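For illustration only, the old-versus-new comparison described above could look roughly like the sketch below. It is not code from the PR. It assumes both parser implementations can be invoked side by side (rather than through recorded snapshots), and the class name `DifferentialParserCheck`, the methods `randomTerm` and `firstMismatch`, the alphabet, and the placeholder parsers in `main` are all hypothetical.

```java
import java.util.Optional;
import java.util.Random;
import java.util.function.Function;

class DifferentialParserCheck {
  // Characters chosen to stress range-query syntax; this alphabet is an assumption of the sketch.
  private static final String ALPHABET = "ab [](){}\\\"*TO";

  /** Builds a random term of length 0..maxLength from ALPHABET. */
  static String randomTerm(Random random, int maxLength) {
    int length = random.nextInt(maxLength + 1);
    StringBuilder sb = new StringBuilder(length);
    for (int i = 0; i < length; i++) {
      sb.append(ALPHABET.charAt(random.nextInt(ALPHABET.length())));
    }
    return sb.toString();
  }

  /** Returns the first generated input on which the two parsers disagree, if any. */
  static Optional<String> firstMismatch(
      Function<String, String> oldParser,
      Function<String, String> newParser,
      long seed,
      int iterations) {
    Random random = new Random(seed); // a fixed seed keeps any failure reproducible
    for (int i = 0; i < iterations; i++) {
      String term = randomTerm(random, 12);
      if (!oldParser.apply(term).equals(newParser.apply(term))) {
        return Optional.of(term);
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    // Placeholder "parsers": the second one deliberately differs on backslashes.
    Function<String, String> oldParser = String::trim;
    Function<String, String> newParser = s -> s.trim().replace("\\", "");
    System.out.println(firstMismatch(oldParser, oldParser, 42L, 1_000));
    System.out.println(firstMismatch(oldParser, newParser, 42L, 1_000));
  }
}
```

If the old parser is no longer available at test time, the same loop could compare against stored outputs instead; that is the snapshotting variant mentioned in the comment.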
Re: [PR] Adding filter to the toString() method of KnnFloatVectorQuery [lucene]
benwtrent merged PR #13990: URL: https://github.com/apache/lucene/pull/13990
Re: [PR] Only consider clauses whose cost is less than the lead cost to compute block boundaries in WANDScorer. [lucene]
jpountz commented on PR #14000: URL: https://github.com/apache/lucene/pull/14000#issuecomment-2482613789

Here are the luceneutil results for filtering tasks:

```
Task                        QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff              p-value
FilteredAnd3Terms                 163.08  (2.1%)                   161.72  (2.3%)   -0.8% ( -5% -   3%)  0.224
FilteredAndHighHigh                44.52  (1.1%)                    44.18  (1.5%)   -0.8% ( -3% -   1%)  0.068
FilteredAndHighMed                100.39  (1.2%)                    99.64  (1.3%)   -0.7% ( -3% -   1%)  0.062
FilteredAndStopWords               26.42  (1.6%)                    26.26  (2.0%)   -0.6% ( -4% -   2%)  0.280
FilteredPhrase                     25.18  (2.8%)                    25.05  (2.2%)   -0.5% ( -5% -   4%)  0.537
FilteredAnd2Terms2StopWords        92.66  (1.7%)                    92.72  (3.2%)    0.1% ( -4% -   5%)  0.936
PKLookup                          275.92  (1.3%)                   276.36  (2.7%)    0.2% ( -3% -   4%)  0.816
FilteredTerm                      157.87  (2.6%)                   158.21  (2.3%)    0.2% ( -4% -   5%)  0.781
FilteredOrMany                      8.60  (4.2%)                     9.16  (3.6%)    6.4% ( -1% -  14%)  0.000
FilteredOr3Terms                  129.33  (2.1%)                   148.75  (2.6%)   15.0% ( 10% -  20%)  0.000
FilteredOrHighMed                 103.76  (1.8%)                   121.73  (2.3%)   17.3% ( 12% -  21%)  0.000
FilteredOr2Terms2StopWords         94.29  (1.9%)                   116.45  (2.0%)   23.5% ( 19% -  27%)  0.000
FilteredOrHighHigh                 46.91  (1.8%)                    64.91  (2.6%)   38.4% ( 33% -  43%)  0.000
FilteredOrStopWords                27.98  (1.5%)                    42.49  (2.8%)   51.9% ( 46% -  57%)  0.000
```
[PR] Only consider clauses whose cost is less than the lead cost to compute block boundaries in WANDScorer. [lucene]
jpountz opened a new pull request, #14000: URL: https://github.com/apache/lucene/pull/14000 WANDScorer implements block-max WAND and needs to recompute score upper bounds whenever it moves to a different block. Thus it's important for these blocks to be large enough to avoid re-computing score upper bounds over and over again. With this commit, WANDScorer no longer uses clauses whose cost is higher than the cost of the filter to compute block boundaries. This effectively makes blocks larger when the filter is more selective.
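As a rough illustration of the idea in this description (not the actual WANDScorer code), the block boundary can be thought of as the smallest block end among the clauses that are allowed to contribute. In the sketch below, `BlockBoundarySketch`, `Clause`, and `blockBoundary` are hypothetical names. Treating a clause as eligible when its cost is less than or equal to the lead cost, and falling back to `Integer.MAX_VALUE` when no clause qualifies, are assumptions of the sketch.

```java
import java.util.List;

class BlockBoundarySketch {
  /** Hypothetical view of a clause: its estimated cost and the last doc ID of its current block. */
  record Clause(long cost, int blockEnd) {}

  /**
   * Returns the smallest block end among clauses whose cost does not exceed leadCost.
   * Costlier clauses are skipped so that their short blocks cannot shrink the boundary.
   */
  static int blockBoundary(List<Clause> clauses, long leadCost) {
    int upTo = Integer.MAX_VALUE; // assumed fallback when no clause qualifies
    for (Clause c : clauses) {
      if (c.cost() <= leadCost) {
        upTo = Math.min(upTo, c.blockEnd());
      }
    }
    return upTo;
  }

  public static void main(String[] args) {
    List<Clause> clauses = List.of(new Clause(1_000_000, 127), new Clause(5_000, 4_095));
    // Selective filter (low lead cost): the dense clause is ignored, so the block is larger.
    System.out.println(blockBoundary(clauses, 10_000));
    // High lead cost: every clause counts, so the densest clause sets the boundary.
    System.out.println(blockBoundary(clauses, 10_000_000));
  }
}
```

With a selective filter the boundary moves from doc 127 out to doc 4095 in this toy input, which is the "larger blocks" effect the description refers to.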
Re: [I] Unable to Tessellate shape for a valid Polygon according to GDAL/OGR and PostGIS [lucene]
garaud commented on issue #13841: URL: https://github.com/apache/lucene/issues/13841#issuecomment-2482280736 Thank you very much @iverase for your work!
Re: [PR] Adding filter to the toString() method of KnnFloatVectorQuery [lucene]
viswanathk commented on PR #13990: URL: https://github.com/apache/lucene/pull/13990#issuecomment-2483545431

> I think a `CHANGES` entry is in order. This seems like a nice little bug fix to aid folks in debugging issues.
>
> @viswanathk once you add the changes entry, I can merge and backport to 10.x

@benwtrent Added in `CHANGES` for 10.1. Also corrected some git history mess up.
Re: [PR] Add IndexInput isLoaded [lucene]
rmuir commented on PR #13998: URL: https://github.com/apache/lucene/pull/13998#issuecomment-2483324513

> Yeah, we can look at how to call `mincore`, and it might not be that much of a lift with the existing plumbing. Maybe something we can look at as a follow-up? I'm really trying to get to a situation where we can load (`MADV_WILLNEED`), and check even the HNSW graph. Maybe even `mlock`, as a potential follow-up. Since not having the graph in memory results in horrible perf (need to get some numbers).

Yes, agreed about `mincore` as a follow-up. Let's use the existing JDK plumbing as a start, as done here.

I'm very much against using `mlock`; there are so many problems with this. With an out-of-the-box Linux system my ulimit for this is set to 8MB. I really don't think we should be mlocking gigabytes of vectors because the access is inefficient.

It would be better to improve documentation, so that users avoid the typical mistakes such as setting too big a Java heap (leaving no room for buffers/cache), configure swappiness if needed, etc. mlock will just make problems worse.
Re: [PR] Add IndexInput isLoaded [lucene]
ChrisHegarty commented on code in PR #13998: URL: https://github.com/apache/lucene/pull/13998#discussion_r1846722542

## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ##
@@ -406,6 +406,14 @@ void advise(long offset, long length, IOConsumer advice) throws I
     }
   }

+  public Optional isLoaded() {
+    boolean loaded = true;
+    for (MemorySegment seg : segments) {
+      loaded = loaded && seg.isLoaded();
+    }
+    return Optional.of(loaded);
+  }
+

Review Comment: yeah, I simplified to return early if false is encountered.
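For readers following the review, the early-return shape discussed here can be sketched as below. This is a stand-alone illustration, not the code that was committed: `EarlyReturnLoaded`, `allLoaded`, and the `checks` counter are hypothetical, and each `BooleanSupplier` stands in for a call to `MemorySegment#isLoaded` so that the example runs without any mapped files.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.BooleanSupplier;

class EarlyReturnLoaded {
  /** Number of residency checks performed; only here to make the early exit visible. */
  static int checks = 0;

  // Each supplier plays the role of seg.isLoaded() on one mapped segment (an assumption).
  static Optional<Boolean> allLoaded(List<BooleanSupplier> segments) {
    for (BooleanSupplier seg : segments) {
      checks++;
      if (!seg.getAsBoolean()) {
        return Optional.of(false); // stop at the first segment that is not resident
      }
    }
    return Optional.of(true);
  }

  public static void main(String[] args) {
    List<BooleanSupplier> segments = List.of(() -> true, () -> false, () -> true);
    // The third segment is never inspected because the second one reports false.
    System.out.println(allLoaded(segments) + " after " + checks + " checks");
  }
}
```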
Re: [PR] Only consider clauses whose cost is less than the lead cost to compute block boundaries in WANDScorer. [lucene]
jpountz commented on PR #14000: URL: https://github.com/apache/lucene/pull/14000#issuecomment-2483629696 Will do, thanks @benwtrent!
Re: [PR] Speed up top-k retrieval on filtered conjunctions. [lucene]
jpountz commented on PR #13994: URL: https://github.com/apache/lucene/pull/13994#issuecomment-2482217169 Thanks @benwtrent !
Re: [PR] Speed up top-k retrieval of filtered disjunctions a bit. [lucene]
jpountz commented on PR #13996: URL: https://github.com/apache/lucene/pull/13996#issuecomment-2482516359

I ran with more tasks to confirm it's generally helpful:

```
Task                        QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff              p-value
FilteredAnd2Terms2StopWords        93.33  (1.2%)                    92.88  (2.1%)   -0.5% ( -3% -   2%)  0.364
FilteredAndStopWords               26.52  (1.1%)                    26.45  (1.5%)   -0.3% ( -2% -   2%)  0.519
FilteredAndHighHigh                44.67  (0.8%)                    44.70  (1.2%)    0.1% ( -1% -   2%)  0.865
PKLookup                          277.59  (2.0%)                   277.98  (2.3%)    0.1% ( -4% -   4%)  0.839
FilteredAndHighMed                100.60  (0.8%)                   100.78  (1.5%)    0.2% ( -2% -   2%)  0.657
FilteredTerm                      160.28  (1.5%)                   160.62  (1.9%)    0.2% ( -3% -   3%)  0.696
FilteredAnd3Terms                 163.21  (1.5%)                   163.85  (2.3%)    0.4% ( -3% -   4%)  0.519
FilteredPhrase                     25.14  (2.0%)                    25.26  (2.0%)    0.5% ( -3% -   4%)  0.450
FilteredOrStopWords                28.12  (1.7%)                    29.53  (2.2%)    5.0% (  1% -   9%)  0.000
FilteredOr2Terms2StopWords         94.79  (1.9%)                   100.43  (1.9%)    6.0% (  2% -   9%)  0.000
FilteredOrHighHigh                 47.22  (1.9%)                    50.34  (2.2%)    6.6% (  2% -  10%)  0.000
FilteredOrHighMed                 104.71  (2.1%)                   112.43  (2.2%)    7.4% (  2% -  11%)  0.000
FilteredOr3Terms                  130.07  (2.5%)                   142.22  (2.4%)    9.3% (  4% -  14%)  0.000
FilteredOrMany                      8.72  (3.2%)                     9.72  (2.6%)   11.5% (  5% -  17%)  0.000
```