Re: [PR] Fix 9.12.0 backcompat break (Lucene 9.12.0 cannot read 9.11.x indices written with quantized HNSW, `Lucene99HnswScalarQuantizedVectorsFormat`) [lucene]
javanna commented on PR #13874: URL: https://github.com/apache/lucene/pull/13874#issuecomment-2412274304 Heya, I am working on generating the backwards indices after releasing Lucene 10, and here are my observations: - I think that we need to forward port this change to main as well? - We need to update the tooling to generate the backwards indices accordingly (python code and java code are no longer in sync) I am happy to work on this, I think I understand what needs to happen but I wanted to confirm first. Perhaps the forward port is just a cherry-pick to main, hopefully? And once the branches are aligned, I can adjust the tooling and test them direclty as I generate the needed bwc indices for 10.0.0. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Make dynamic range facets value collection and sorting faster [lucene]
HoustonPutman commented on issue #13760: URL: https://github.com/apache/lucene/issues/13760#issuecomment-2412577287 @timgrein , I've posted a PR for my idea that @stefanvodita mentioned. If you have the JMH benchmark, I'd love to test it out on mine as well. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix 9.12.0 backcompat break (Lucene 9.12.0 cannot read 9.11.x indices written with quantized HNSW, `Lucene99HnswScalarQuantizedVectorsFormat`) [lucene]
javanna commented on PR #13874: URL: https://github.com/apache/lucene/pull/13874#issuecomment-2412365840 I did find above a diff that is exactly the same as the PR I opened to update AddBackcompatindices :) (#13911) . I also opened a PR against main to cherry-pick this change and make the necessary adjustments: #13912 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Align TestGenerateBwcIndices.java with AddBackcompatindices.py [lucene]
javanna opened a new pull request, #13911: URL: https://github.com/apache/lucene/pull/13911 We updated TestGenerateBwcIndices to create int7 HNSW indices instead of int8 with #13874. The corresponding python code part of the release wizard needs to be updated accordingly. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Try using Murmurhash 3 for bloom filters [lucene]
vsop-479 commented on PR #12868: URL: https://github.com/apache/lucene/pull/12868#issuecomment-2412698518 BTW, why `StringHelper` is a abstract class? Can we make it final? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Add an S3-based directory. [lucene]
atris commented on issue #13868: URL: https://github.com/apache/lucene/issues/13868#issuecomment-2412914455 @jpountz Interestingly, I have been spending some time on making this work - if its ok, I will self assign this issue? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] [DISCUSS] Could we have a different ANN algorithm for Learned Sparse Vectors? [lucene]
atris commented on issue #13675: URL: https://github.com/apache/lucene/issues/13675#issuecomment-2412918581 I have recently been interested in this direction and plan on spending non trivial amount of time on this over the next few weeks. Assuming we haven't started dev on this, I am assigning it to myself. What I have been hacking on is impact sorted prioritisation of docIDs during the ranking phase (especially for custom rankers), so that would probably be the first thing that comes out of this ticket. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add BaseKnnVectorsFormatTestCase.testRecall() and fix old codecs [lucene]
msokolov commented on PR #13910: URL: https://github.com/apache/lucene/pull/13910#issuecomment-2412421807 I wonder whether we should backport the fixes to the `Lucene90HnswVectorsReader`? I tend to think we ought to, although the usage might be tiny to nonexistent -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[I] Support multi-tenant RAM buffers for IndexWriter [lucene]
mdmarshmallow opened a new issue, #13913: URL: https://github.com/apache/lucene/issues/13913 ### Description This is related to https://github.com/apache/lucene/issues/13883. The idea is to allow users to specify the RAM usage once and it will be automatically spread across N IndexWriter's so they will each have 1/N RAM allocated to them. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Use multi-select instead of a full sort for DynamicRange creation [lucene]
HoustonPutman opened a new pull request, #13914: URL: https://github.com/apache/lucene/pull/13914 Resolves #13760 ### Description This is using a similar approach to how Solr used to compute multiple percentiles at a single time. Basically utilize the quick select method, but instead of following a single path, follow a path for each of the `k`s that is requested. Multi-quickselect. That's what I originally made, until I realized that the `DynamicRangeUtil` is weighted, so I refactored it to choose by weights instead, and also capture the running-value-total and running-weight-total, because that information is used in the `DynamicRangeInfo`. My goal was to add this as a generic capability of the `Selector` (or `IntroSelector`) class, but because of the limitations above, it is currently a separate class to handle this. If there's any suggestions on how to make this generic enough to be put in the generic class, that would be great. But it might not be worth the effort if it wouldn't be used anywhere else. As for the original multi-quickSelect algorithm I mentioned, I looked for other multi-select use cases across Lucene, but I only found one instance (The quantized does two select calls in succession). If there's more instances we can find, I would be happy to add multiSelect as an option on the `Selector` class, and implement it in all provided classes. ### To-Do - The code needs to be cleaned up and better documented, this is just a POC - Benchmarks comparing this to the full-sorting implementation. ### Caveat The implement is slightly different, as it will pick the groups according to "The first value for which the running weight is <= weight-range-boundary". The old logic would start counting again after a weight range was complete, which removes information from the overflow of previous weight-ranges. I'm not sure either approach is right or wrong, but I wanted to explicitly state how the results would be different and why I had to alter a unit test to pass. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove synchronization from IndexWriter.isClosed [lucene]
github-actions[bot] commented on PR #13834: URL: https://github.com/apache/lucene/pull/13834#issuecomment-2412576598 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Use RandomAccessInput instead of seeking in Lucene90DocValuesProducer [lucene]
uschindler commented on PR #13894: URL: https://github.com/apache/lucene/pull/13894#issuecomment-2410749511 I dont think this is a real slowdown caused by the commit here. It is more caused by the Hotspot optimizer misinterpreting something. We should get the assembly code from the benchmark and compare them to figure out why it gets worse. Actually the non-random access modifies global state (moves position), so actually optimizer does a harder job, so the slowdown is very strange. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Try using Murmurhash 3 for bloom filters [lucene]
shubhamvishu commented on PR #12868: URL: https://github.com/apache/lucene/pull/12868#issuecomment-2410748497 @jpountz I simplified the expression now. Let me know if the change looks good? Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Try using Murmurhash 3 for bloom filters [lucene]
shubhamvishu commented on code in PR #12868: URL: https://github.com/apache/lucene/pull/12868#discussion_r1799172633 ## lucene/codecs/src/java/org/apache/lucene/codecs/bloom/FuzzySet.java: ## @@ -150,9 +149,10 @@ private FuzzySet(FixedBitSet filter, int bloomSize, int hashCount) { * @return NO or MAYBE */ public ContainsResult contains(BytesRef value) { -long hash = hashFunction.hash(value); -int msb = (int) (hash >>> Integer.SIZE); -int lsb = (int) hash; +long[] hash = StringHelper.murmurhash3_x64_128(value); + +int msb = (((int) hash[0] >>> Integer.SIZE) >>> 1) + (((int) hash[1] >>> Integer.SIZE) >>> 1); +int lsb = ((int) hash[0] >>> 1) + ((int) hash[1] >>> 1); Review Comment: Done! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Use RandomAccessInput instead of seeking in Lucene90DocValuesProducer [lucene]
uschindler commented on PR #13894: URL: https://github.com/apache/lucene/pull/13894#issuecomment-2410803086 I was a bit afraid that this might have been caused by missing "prefetch()" implementation, but for MemorySegmentIndexInput also the random access slice uses the correct implementation. My only idea could be that there's a difference between the inlining decisions for *interface* `RandomAccessInput` and the class `IndexInput` (both implemented by MemorySegmentIndexInput, but the first is using `invokeinterface` which the latter is using `invokevirtual` bytecode). This could make a difference here, so we need a dump of which methods were inlined before/after. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Create a bot to check if there is a CHANGES entry for new PRs [lucene]
javanna commented on issue #13898: URL: https://github.com/apache/lucene/issues/13898#issuecomment-2410385807 Yes to this! It would be great to combine this with setting the milestone :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] PR 13757 follow-up: add missing with-discountOverlaps Similarity constructor variants, CHANGES.txt entries (#13845) [lucene]
javanna commented on code in PR #13891: URL: https://github.com/apache/lucene/pull/13891#discussion_r1798961357 ## lucene/CHANGES.txt: ## @@ -47,6 +52,9 @@ API Changes the entire segment should be scored. Subclasses that override the method should instead override its replacement. (Luca Cavanna) +* GITHUB#13757: For similarities, provide default computeNorm implementation and remove remaining discountOverlaps setters. + (Christine Poerschke, Adrien Grand, Robert Muir) Review Comment: I am afraid is is now too late to modify the changelog for 9.12, as it's already been released? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Replace Map with IntObjectHashMap for KnnVectorsReader [lucene]
bugmakerr commented on PR #13763: URL: https://github.com/apache/lucene/pull/13763#issuecomment-2410922885 hi @jpountz ,since we have moved to lucene 10, should we merge this and add back #13686? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Use RandomAccessInput instead of seeking in Lucene90DocValuesProducer [lucene]
jpountz commented on PR #13894: URL: https://github.com/apache/lucene/pull/13894#issuecomment-2410306410 Looks like there was a slowdown on some taxo facets tasks (which use binary doc values to store the taxonomy). `OrHighMedDayTaxoFacets`, `AndHighHighDayTaxoFacets` and `MedTermDayTaxoFacets` had a slowdown with a low p-value. On the other hand, `BrowseMonthTaxoFacets` went a bit faster (p-value is not so low, but this is the best data point in a long time, which suggests it's not noise). https://benchmarks.mikemccandless.com/2024.10.12.18.20.19.html We should not revert, but I'll annotate. I'm curious if someone has ideas on the cause. Also cc @stefanvodita @gsmiller since you have interest in facets. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Better handle dynamic pruning when the leading clause has a single impact block. [lucene]
jpountz opened a new pull request, #13904: URL: https://github.com/apache/lucene/pull/13904 `BlockMaxConjunctionBulkScorer` only checks if it can early exit based on impacts once per window, and windows are computed using impact blocks of the leading clause. So this logic is defeated if the leading clause produces a single block (e.g. `ConstantScoreQuery`). This commit addresses this problem by artificially lowering the window size to contain ~128 docs of the leading clause. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Use RandomAccessInput instead of seeking in Lucene90DocValuesProducer [lucene]
original-brownbear commented on PR #13894: URL: https://github.com/apache/lucene/pull/13894#issuecomment-2410525472 Looks to me like a few inlining decisions changed (judging by the profiling and the relative weight of inlined versions vs non-inlined versions of some method). I'm not even entirely sure it's because of this change and not one of the other changes. The std-deviation for `BrowseMonthTaxoFacets` also dropped from 20%+ to 0.1%+, which has me wondering if maybe we have too little warmup time here before measuring? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Added support for highlighting `IndexOrDocValuesQuery` [lucene]
jpountz merged PR #13902: URL: https://github.com/apache/lucene/pull/13902 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Make MaxScoreBulkScorer repartition scorers when the min competitive increases. [lucene]
jpountz commented on PR #13800: URL: https://github.com/apache/lucene/pull/13800#issuecomment-243632 Here's the fix for conjunctions: https://github.com/apache/lucene/pull/13904. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Test changelog verifier [lucene]
stefanvodita opened a new pull request, #13908: URL: https://github.com/apache/lucene/pull/13908 test -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Add changelog verifier [lucene]
stefanvodita opened a new pull request, #13909: URL: https://github.com/apache/lucene/pull/13909 ### Description This runs along the checks we already have for PR creation/update and warns us if there is no CHANGES.txt entry. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Create a bot to check if there is a CHANGES entry for new PRs [lucene]
stefanvodita commented on issue #13898: URL: https://github.com/apache/lucene/issues/13898#issuecomment-2411705472 @prudhvigodithi - I opened #13898 inspired by your PR, tested it on my fork, and it's working correctly. @javanna - I like that idea! Maybe we can do that as a follow-up. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] [SOLR-11191] SolrIndexSplitter: Init support for routing docs by _root_ when available [lucene-solr]
dsmiley commented on code in PR #2686: URL: https://github.com/apache/lucene-solr/pull/2686#discussion_r1799803611 ## solr/core/src/java/org/apache/solr/update/SolrIndexSplitter.java: ## @@ -130,9 +131,26 @@ public SolrIndexSplitter(SplitIndexCommand cmd) { } routeFieldName = cmd.routeFieldName; if (routeFieldName == null) { - field = searcher.getSchema().getUniqueKeyField(); + // To support routing child documents, use the root field if it exists (which would be populated with unique field), + // otherwise use the unique key field + field = searcher.getSchema().getFieldOrNull(IndexSchema.ROOT_FIELD_NAME); + if(field == null) { +field = searcher.getSchema().getUniqueKeyField(); + } } else { - field = searcher.getSchema().getField(routeFieldName); + SchemaField uniqueField = searcher.getSchema().getUniqueKeyField(); + if (uniqueField.getName().equals(routeFieldName)) { +// Explicitly routing based on unique field +// To support routing child documents, use the root field if it exists (which would be populated with unique field), +// otherwise use the unique key field +field = searcher.getSchema().getFieldOrNull(IndexSchema.ROOT_FIELD_NAME); +if (field == null) { + field = searcher.getSchema().getUniqueKeyField(); +} + } else { +// Custom routing +field = searcher.getSchema().getField(routeFieldName); + } Review Comment: Notice it's conditional based on the router having a "field" in the JSON. It's rare to use that. Also if we don't support something (child docs + whatever), we ought to throw an exception instead of do the wrong thing. See `org.apache.solr.schema.IndexSchema#isUsableForChildDocs` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Avoid slicing memory segments unnecessarily [lucene]
jpountz commented on code in PR #13906: URL: https://github.com/apache/lucene/pull/13906#discussion_r1799801328 ## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ## @@ -563,6 +563,13 @@ public final MemorySegmentIndexInput slice(String sliceDescription, long offset, return buildSlice(sliceDescription, offset, length); } + public RandomAccessInput randomAccessSlice(long offset, long length) throws IOException { +if (offset == 0 && length == this.length) { + return this; Review Comment: I'm curious if this code path is properly tested, I would assume most call sites of `randomAccessSlice` to only use a subset of the file/clone. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Test changelog verifier [lucene]
stefanvodita closed pull request #13908: Test changelog verifier URL: https://github.com/apache/lucene/pull/13908 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Test changelog verifier [lucene]
stefanvodita commented on PR #13908: URL: https://github.com/apache/lucene/pull/13908#issuecomment-2411670171 Testing a Github workflow, not intended to merge. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Avoid slicing memory segments unnecessarily [lucene]
original-brownbear commented on PR #13906: URL: https://github.com/apache/lucene/pull/13906#issuecomment-2411864517 > One thing that is a bit awkward to me is that it makes clones cheaper than slices, so e.g. refactoring TermsEnum#postings to work on a slice that contains just the postings of the current term would be a regression vs. cloning the whole file like today. Not a blocker, it just gets me thinking. I had the same thought :D that's what got me here in part. Moving the offset handling into whatever object that uses an `IndexInput` now and replacing that with a stateless shared `RandomAccessInput` should allow for some performance gains here and there and more importantly, should allow for massive heap savings. It's still quite wild today for ES, I took a heap dump of a http_logs track benchmark node and somehow tens of shards today translate into MBs of `IndexInput` instances for us. I get that there's two sides to this, ES should probably have fewer segments but also still there's no need to redundantly clone (stateful) `IndexInput` when we often also track file pointers in the users of those things anyway in Lucene? > I am not sure if this really helps much See above. But also, even in the Lucene benchmarks it's about 1% of allocations in wikimedium saved for me. Not a massive win, but not entirely irrelevant either :) Also, this is somewhat helpful as far as making heap-dumps easier to interpret actually. Always easier to find redundancies if object sharing isn't super far down the reference chains :) On it, looking looking into a test :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Add BaseKnnVectorsFormatTestCase.testRecall() and fix old codecs [lucene]
msokolov opened a new pull request, #13910: URL: https://github.com/apache/lucene/pull/13910 While exploring some recall-related failures in another PR I went looking for a unit test that checks HNSW/KNN recall and couldn't find any. I think we used to have one but maybe we removed it because it was flaky? But we really do need such a test since it is possible to make changes that preserve all the formal properties of the codecs and queries yet destroy recall. I thought if we can create such a test with known data and vectors it would be more predictable than one using random data, so I made one, and it uncovered a couple of bugs: In Lucene90HnswVectorsReader we messed up (removed) ord-to-doc mappings so we were returning vector ords instead of docids in search results. I guess this would have totally borked back-compat for Lucene90 indexes. Probably there are none in the wild, and this was never noticed? In Lucene91RWFormat (used only for back-compat testing) we messed up diversity check so we were producing bad graphs. This PR fixes these things and adds the new test -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Avoid slicing memory segments unnecessarily [lucene]
uschindler commented on PR #13906: URL: https://github.com/apache/lucene/pull/13906#issuecomment-2411866037 P.P.S.: Just some background: The code was copied from ByteBufferIndexInput where the clones were necessary. With MemorySegment we no longer create `dup()`s, as MemorySegment is stateless. So the changes here look fine, but need more tests. Maybe we can refactor the method a bit, so the `clone()` does not go through `buildSlice()` at all. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Better handle dynamic pruning when the leading clause has a single impact block. [lucene]
jpountz commented on PR #13904: URL: https://github.com/apache/lucene/pull/13904#issuecomment-2410663119 I confirmed that this makes things better when all clauses are `ConstantScoreQuery`s, luceneutil doesn't report a slowdown: ``` TaskQPS baseline StdDevQPS my_modified_version StdDevPct diff p-value OrNotHighLow 585.86 (7.2%) 566.68 (5.1%) -3.3% ( -14% -9%) 0.095 HighTermDayOfYearSort 306.99 (8.9%) 301.37 (8.6%) -1.8% ( -17% - 17%) 0.506 And3Terms 226.85 (5.6%) 222.90 (6.6%) -1.7% ( -13% - 11%) 0.369 OrHighNotLow 307.70 (6.4%) 302.80 (7.2%) -1.6% ( -14% - 12%) 0.461 OrNotHighHigh 185.78 (5.2%) 183.47 (6.6%) -1.2% ( -12% - 11%) 0.510 Or3Terms 223.86 (7.1%) 221.64 (7.7%) -1.0% ( -14% - 14%) 0.674 Phrase 26.60 (2.8%) 26.37 (4.0%) -0.9% ( -7% -6%) 0.423 OrHighNotHigh 220.17 (5.3%) 218.70 (5.8%) -0.7% ( -11% - 10%) 0.702 HighTerm 294.16 (7.0%) 292.23 (6.7%) -0.7% ( -13% - 13%) 0.761 AndHighLow 595.88 (4.3%) 591.99 (4.7%) -0.7% ( -9% -8%) 0.648 OrHighHigh 117.92 (2.3%) 117.19 (5.0%) -0.6% ( -7% -6%) 0.614 AndHighMed 271.88 (7.2%) 270.24 (7.9%) -0.6% ( -14% - 15%) 0.800 AndHighHigh 119.61 (3.9%) 118.96 (4.5%) -0.5% ( -8% -8%) 0.683 OrStopWords 66.94 (7.5%) 66.58 (7.0%) -0.5% ( -13% - 15%) 0.814 OrHighLow 468.84 (3.9%) 466.32 (3.9%) -0.5% ( -7% -7%) 0.662 HighTermMonthSort 763.50 (2.3%) 761.25 (2.9%) -0.3% ( -5% -5%) 0.724 TermDTSort 216.36 (5.8%) 215.86 (4.0%) -0.2% ( -9% - 10%) 0.885 OrHighMed 197.37 (4.8%) 197.03 (7.4%) -0.2% ( -11% - 12%) 0.929 HighTermTitleBDVSort 52.15 (7.5%) 52.11 (4.7%) -0.1% ( -11% - 13%) 0.963 And2Terms2StopWords 207.53 (4.6%) 207.84 (5.7%)0.1% ( -9% - 10%) 0.927 OrHighNotMed 284.45 (5.8%) 284.99 (6.8%)0.2% ( -11% - 13%) 0.925 HighTermTitleSort 97.13 (4.9%) 97.34 (6.5%)0.2% ( -10% - 12%) 0.904 AndStopWords 61.72 (5.1%) 61.86 (5.5%)0.2% ( -9% - 11%) 0.892 CountTerm 5714.94 (3.6%) 5737.59 (5.3%)0.4% ( -8% -9%) 0.782 MedTerm 376.20 (6.7%) 377.74 (7.4%)0.4% ( -12% - 15%) 0.855 PKLookup 191.27 (3.0%) 192.19 (3.5%)0.5% ( -5% -7%) 0.647 IntNRQ 163.19 (12.4%) 164.32 (8.5%)0.7% ( -18% - 24%) 0.837 Or2Terms2StopWords 209.38 (3.9%) 210.84 (6.3%)0.7% ( -9% - 11%) 0.673 LowTerm 472.83 (6.8%) 476.30 (6.1%)0.7% ( -11% - 14%) 0.718 OrHighRare 254.55 (4.9%) 256.78 (7.7%)0.9% ( -11% - 14%) 0.667 Wildcard 69.55 (3.4%) 70.32 (5.0%)1.1% ( -7% -9%) 0.418 CountAndHighMed 233.44 (6.1%) 236.36 (5.3%)1.3% ( -9% - 13%) 0.487 CountAndHighHigh 95.44 (3.6%) 96.85 (3.7%)1.5% ( -5% -9%) 0.201 Prefix3 180.59 (5.4%) 183.71 (7.7%)1.7% ( -10% - 15%) 0.410 CountOrHighMed 177.68 (6.6%) 181.10 (7.5%)1.9% ( -11% - 17%) 0.387 OrNotHighMed 264.61 (6.9%) 270.30 (7.9%)2.1% ( -11% - 18%) 0.358 CountOrHighHigh 113.18 (7.4%) 118.20 (11.8%)4.4% ( -13% - 25%) 0.155 ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[PR] Remove redundant code in PointInSetQuery [lucene]
easyice opened a new pull request, #13905: URL: https://github.com/apache/lucene/pull/13905 Clean up unused variable `MergePointVisitor#sortedPackedPoints` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Make dynamic range facets value collection and sorting faster [lucene]
timgrein commented on issue #13760: URL: https://github.com/apache/lucene/issues/13760#issuecomment-2411631641 Cool, I've found some performance improvements (~10-15%), which can be reproduced through a new `jmh` benchmark I've added. I'll open a PR the next few days and tag you :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] [SOLR-11191] SolrIndexSplitter: Init support for routing docs by _root_ when available [lucene-solr]
zkendall commented on code in PR #2686: URL: https://github.com/apache/lucene-solr/pull/2686#discussion_r1799751372 ## solr/core/src/java/org/apache/solr/update/SolrIndexSplitter.java: ## @@ -130,9 +131,26 @@ public SolrIndexSplitter(SplitIndexCommand cmd) { } routeFieldName = cmd.routeFieldName; if (routeFieldName == null) { - field = searcher.getSchema().getUniqueKeyField(); + // To support routing child documents, use the root field if it exists (which would be populated with unique field), + // otherwise use the unique key field + field = searcher.getSchema().getFieldOrNull(IndexSchema.ROOT_FIELD_NAME); + if(field == null) { +field = searcher.getSchema().getUniqueKeyField(); + } } else { - field = searcher.getSchema().getField(routeFieldName); + SchemaField uniqueField = searcher.getSchema().getUniqueKeyField(); + if (uniqueField.getName().equals(routeFieldName)) { +// Explicitly routing based on unique field +// To support routing child documents, use the root field if it exists (which would be populated with unique field), +// otherwise use the unique key field +field = searcher.getSchema().getFieldOrNull(IndexSchema.ROOT_FIELD_NAME); +if (field == null) { + field = searcher.getSchema().getUniqueKeyField(); +} + } else { +// Custom routing +field = searcher.getSchema().getField(routeFieldName); + } Review Comment: It kinda seems like the opposite. The routeFieldName seems to be pre-populated by the zk collection state (not from split caller (api doesn't support setting afaict)). Doesn't that mean it'll usually be set? https://github.com/apache/lucene-solr/blob/2d63a9d1208bbf950135b90496268b0a40e119b5/solr/core/src/java/org/apache/solr/handler/admin/SplitOp.java#L118-L135 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] [SOLR-11191] SolrIndexSplitter: Init support for routing docs by _root_ when available [lucene-solr]
zkendall commented on code in PR #2686: URL: https://github.com/apache/lucene-solr/pull/2686#discussion_r1799751372 ## solr/core/src/java/org/apache/solr/update/SolrIndexSplitter.java: ## @@ -130,9 +131,26 @@ public SolrIndexSplitter(SplitIndexCommand cmd) { } routeFieldName = cmd.routeFieldName; if (routeFieldName == null) { - field = searcher.getSchema().getUniqueKeyField(); + // To support routing child documents, use the root field if it exists (which would be populated with unique field), + // otherwise use the unique key field + field = searcher.getSchema().getFieldOrNull(IndexSchema.ROOT_FIELD_NAME); + if(field == null) { +field = searcher.getSchema().getUniqueKeyField(); + } } else { - field = searcher.getSchema().getField(routeFieldName); + SchemaField uniqueField = searcher.getSchema().getUniqueKeyField(); + if (uniqueField.getName().equals(routeFieldName)) { +// Explicitly routing based on unique field +// To support routing child documents, use the root field if it exists (which would be populated with unique field), +// otherwise use the unique key field +field = searcher.getSchema().getFieldOrNull(IndexSchema.ROOT_FIELD_NAME); +if (field == null) { + field = searcher.getSchema().getUniqueKeyField(); +} + } else { +// Custom routing +field = searcher.getSchema().getField(routeFieldName); + } Review Comment: It kinda seems like the opposite. The routeFieldName seems to be pre-populated by the zk collection state (not from split caller (api doesn't support setting afaict)). Doesn't that mean it'll usually be set or at least could be set enough to support it? (I believe my use-case does, but I'll learn more when I setup e2e test in this thing.) https://github.com/apache/lucene-solr/blob/2d63a9d1208bbf950135b90496268b0a40e119b5/solr/core/src/java/org/apache/solr/handler/admin/SplitOp.java#L118-L135 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] [SOLR-11191] SolrIndexSplitter: Init support for routing docs by _root_ when available [lucene-solr]
zkendall commented on code in PR #2686: URL: https://github.com/apache/lucene-solr/pull/2686#discussion_r1799751372 ## solr/core/src/java/org/apache/solr/update/SolrIndexSplitter.java: ## @@ -130,9 +131,26 @@ public SolrIndexSplitter(SplitIndexCommand cmd) { } routeFieldName = cmd.routeFieldName; if (routeFieldName == null) { - field = searcher.getSchema().getUniqueKeyField(); + // To support routing child documents, use the root field if it exists (which would be populated with unique field), + // otherwise use the unique key field + field = searcher.getSchema().getFieldOrNull(IndexSchema.ROOT_FIELD_NAME); + if(field == null) { +field = searcher.getSchema().getUniqueKeyField(); + } } else { - field = searcher.getSchema().getField(routeFieldName); + SchemaField uniqueField = searcher.getSchema().getUniqueKeyField(); + if (uniqueField.getName().equals(routeFieldName)) { +// Explicitly routing based on unique field +// To support routing child documents, use the root field if it exists (which would be populated with unique field), +// otherwise use the unique key field +field = searcher.getSchema().getFieldOrNull(IndexSchema.ROOT_FIELD_NAME); +if (field == null) { + field = searcher.getSchema().getUniqueKeyField(); +} + } else { +// Custom routing +field = searcher.getSchema().getField(routeFieldName); + } Review Comment: It kinda seems like the opposite. The routeFieldName seems to be pre-populated by the zk collection state (not from split caller (api doesn't support setting afaict)). Doesn't that mean it'll usually be set or at least could be set enough to support it? (I believe my use-case does, but I'll learn more when I setup more thorough test in this thing.) https://github.com/apache/lucene-solr/blob/2d63a9d1208bbf950135b90496268b0a40e119b5/solr/core/src/java/org/apache/solr/handler/admin/SplitOp.java#L118-L135 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] [SOLR-11191] SolrIndexSplitter: Init support for routing docs by _root_ when available [lucene-solr]
zkendall commented on code in PR #2686: URL: https://github.com/apache/lucene-solr/pull/2686#discussion_r1799751372 ## solr/core/src/java/org/apache/solr/update/SolrIndexSplitter.java: ## @@ -130,9 +131,26 @@ public SolrIndexSplitter(SplitIndexCommand cmd) { } routeFieldName = cmd.routeFieldName; if (routeFieldName == null) { - field = searcher.getSchema().getUniqueKeyField(); + // To support routing child documents, use the root field if it exists (which would be populated with unique field), + // otherwise use the unique key field + field = searcher.getSchema().getFieldOrNull(IndexSchema.ROOT_FIELD_NAME); + if(field == null) { +field = searcher.getSchema().getUniqueKeyField(); + } } else { - field = searcher.getSchema().getField(routeFieldName); + SchemaField uniqueField = searcher.getSchema().getUniqueKeyField(); + if (uniqueField.getName().equals(routeFieldName)) { +// Explicitly routing based on unique field +// To support routing child documents, use the root field if it exists (which would be populated with unique field), +// otherwise use the unique key field +field = searcher.getSchema().getFieldOrNull(IndexSchema.ROOT_FIELD_NAME); +if (field == null) { + field = searcher.getSchema().getUniqueKeyField(); +} + } else { +// Custom routing +field = searcher.getSchema().getField(routeFieldName); + } Review Comment: It kinda seems like the opposite. The routeFieldName seems to be pre-populated by the zk collection state (not from split caller (api doesn't support setting afaict)). Doesn't that mean it'll usually be set or at least could be set enough to support it? My own collection has this in zk state from cursory glance. I'll learn more how this behaves when I setup a broader test-case. ``` "router":{ "field":"id", "name":"compositeId"}, ``` https://github.com/apache/lucene-solr/blob/2d63a9d1208bbf950135b90496268b0a40e119b5/solr/core/src/java/org/apache/solr/handler/admin/SplitOp.java#L118-L135 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Avoid allocating liveDocs for no soft-deletes (#13895) [lucene]
dnhatn merged PR #13903: URL: https://github.com/apache/lucene/pull/13903 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] [SOLR-11191] SolrIndexSplitter: Init support for routing docs by _root_ when available [lucene-solr]
zkendall commented on code in PR #2686: URL: https://github.com/apache/lucene-solr/pull/2686#discussion_r1799751372 ## solr/core/src/java/org/apache/solr/update/SolrIndexSplitter.java: ## @@ -130,9 +131,26 @@ public SolrIndexSplitter(SplitIndexCommand cmd) { } routeFieldName = cmd.routeFieldName; if (routeFieldName == null) { - field = searcher.getSchema().getUniqueKeyField(); + // To support routing child documents, use the root field if it exists (which would be populated with unique field), + // otherwise use the unique key field + field = searcher.getSchema().getFieldOrNull(IndexSchema.ROOT_FIELD_NAME); + if(field == null) { +field = searcher.getSchema().getUniqueKeyField(); + } } else { - field = searcher.getSchema().getField(routeFieldName); + SchemaField uniqueField = searcher.getSchema().getUniqueKeyField(); + if (uniqueField.getName().equals(routeFieldName)) { +// Explicitly routing based on unique field +// To support routing child documents, use the root field if it exists (which would be populated with unique field), +// otherwise use the unique key field +field = searcher.getSchema().getFieldOrNull(IndexSchema.ROOT_FIELD_NAME); +if (field == null) { + field = searcher.getSchema().getUniqueKeyField(); +} + } else { +// Custom routing +field = searcher.getSchema().getField(routeFieldName); + } Review Comment: It kinda seems like the opposite. The routeFieldName seems to be pre-populated by the zk collection state (not from split caller (api doesn't support setting afaict)). Doesn't that mean it'll usually be set or at least could be set enough to support it? https://github.com/apache/lucene-solr/blob/2d63a9d1208bbf950135b90496268b0a40e119b5/solr/core/src/java/org/apache/solr/handler/admin/SplitOp.java#L118-L135 My own collection has this in zk state from cursory glance. I'll learn more how this behaves when I setup a broader test-case. ``` "router":{ "field":"id", "name":"compositeId"}, ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Avoid slicing memory segments unnecessarily [lucene]
uschindler commented on code in PR #13906: URL: https://github.com/apache/lucene/pull/13906#discussion_r1799826647 ## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ## @@ -563,6 +563,13 @@ public final MemorySegmentIndexInput slice(String sliceDescription, long offset, return buildSlice(sliceDescription, offset, length); } + public RandomAccessInput randomAccessSlice(long offset, long length) throws IOException { +if (offset == 0 && length == this.length) { + return this; Review Comment: Yes, we need a test! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Avoid slicing memory segments unnecessarily [lucene]
uschindler commented on code in PR #13906: URL: https://github.com/apache/lucene/pull/13906#discussion_r1799835249 ## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ## @@ -563,6 +563,13 @@ public final MemorySegmentIndexInput slice(String sliceDescription, long offset, return buildSlice(sliceDescription, offset, length); } + public RandomAccessInput randomAccessSlice(long offset, long length) throws IOException { +if (offset == 0 && length == this.length) { + return this; Review Comment: But this code is perfectly fine as random access slices are stateless, so we can simply return the current instance. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Avoid slicing memory segments unnecessarily [lucene]
original-brownbear opened a new pull request, #13906: URL: https://github.com/apache/lucene/pull/13906 No need to slice when it's a clone that is to be used with random access, we already enforce thread access rules anyway. Also, no point in copying the memory segment instance via noop slicing. This doesn't really have much of a runtime impact (looking at wikimedium): ``` BrowseDayOfYearSSDVFacets6.18 (8.5%)6.07 (8.7%) -1.8% ( -17% - 16%) 0.347 IntNRQ 21.39 (3.3%) 21.13 (3.8%) -1.2% ( -8% -6%) 0.115 HighIntervalsOrdered9.76 (4.3%)9.69 (5.7%) -0.7% ( -10% -9%) 0.511 HighTermDayOfYearSort 156.54 (2.4%) 155.52 (2.9%) -0.7% ( -5% -4%) 0.279 LowSpanNear 61.87 (2.3%) 61.48 (2.8%) -0.6% ( -5% -4%) 0.273 TermDTSort 53.99 (1.3%) 53.71 (1.6%) -0.5% ( -3% -2%) 0.108 HighSpanNear6.69 (4.4%)6.66 (4.3%) -0.4% ( -8% -8%) 0.697 OrHighLow 479.99 (2.1%) 479.02 (2.2%) -0.2% ( -4% -4%) 0.670 Fuzzy1 62.30 (1.2%) 62.17 (1.3%) -0.2% ( -2% -2%) 0.480 HighTermMonthSort 556.56 (1.3%) 555.47 (1.6%) -0.2% ( -3% -2%) 0.546 AndHighHighDayTaxoFacets5.60 (4.7%)5.59 (3.7%) -0.2% ( -8% -8%) 0.844 Fuzzy2 61.63 (1.3%) 61.54 (1.6%) -0.1% ( -3% -2%) 0.651 Respell 34.27 (1.5%) 34.22 (1.4%) -0.1% ( -3% -2%) 0.702 LowSloppyPhrase 19.76 (3.9%) 19.74 (3.5%) -0.1% ( -7% -7%) 0.880 MedIntervalsOrdered 19.12 (2.6%) 19.10 (2.6%) -0.1% ( -5% -5%) 0.872 Wildcard 115.59 (2.8%) 115.51 (2.8%) -0.1% ( -5% -5%) 0.905 Prefix3 318.40 (1.4%) 318.82 (1.9%)0.1% ( -3% -3%) 0.727 AndHighMedDayTaxoFacets 52.44 (2.5%) 52.51 (2.3%)0.1% ( -4% -5%) 0.794 BrowseDateSSDVFacets1.32 (6.3%)1.33 (4.8%)0.2% ( -10% - 11%) 0.895 PKLookup 157.21 (3.4%) 157.48 (3.8%)0.2% ( -6% -7%) 0.829 HighPhrase 95.48 (2.3%) 95.69 (2.5%)0.2% ( -4% -5%) 0.682 BrowseMonthTaxoFacets4.89 (0.4%)4.90 (0.4%)0.2% ( 0% -1%) 0.013 MedSloppyPhrase 13.62 (2.8%) 13.66 (2.5%)0.3% ( -4% -5%) 0.615 MedPhrase 23.07 (2.1%) 23.15 (2.4%)0.3% ( -4% -4%) 0.495 HighSloppyPhrase 18.02 (5.3%) 18.09 (4.0%)0.4% ( -8% - 10%) 0.713 MedSpanNear8.17 (2.9%)8.20 (2.9%)0.4% ( -5% -6%) 0.552 LowPhrase 112.30 (2.0%) 112.75 (2.0%)0.4% ( -3% -4%) 0.373 AndHighLow 607.00 (5.8%) 609.50 (5.4%)0.4% ( -10% - 12%) 0.740 LowTerm 204.43 (2.0%) 205.31 (2.7%)0.4% ( -4% -5%) 0.425 HighTermTitleSort 17.83 (1.4%) 17.91 (1.4%)0.4% ( -2% -3%) 0.159 OrHighMedDayTaxoFacets3.09 (3.9%)3.11 (3.8%)0.5% ( -6% -8%) 0.599 LowIntervalsOrdered 20.57 (3.1%) 20.67 (3.9%)0.5% ( -6% -7%) 0.550 HighTermTitleBDVSort5.50 (3.4%)5.53 (3.7%)0.6% ( -6% -7%) 0.468 OrHighMed 89.37 (4.0%) 89.89 (5.4%)0.6% ( -8% - 10%) 0.583 AndHighMed 71.17 (4.3%) 71.61 (5.2%)0.6% ( -8% - 10%) 0.555 MedTermDayTaxoFacets 20.66 (4.0%) 20.81 (4.4%)0.7% ( -7% -9%) 0.458 OrHighHigh 32.44 (6.3%) 32.70 (8.0%)0.8% ( -12% - 16%) 0.625 BrowseMonthSSDVFacets6.29 (7.4%)6.34 (10.5%)0.8% ( -15% - 20%) 0.684 OrNotHighLow 581.89 (2.8%) 586.83 (2.6%)0.8% ( -4% -6%) 0.164 OrNotHighHigh 208.03 (5.2%) 209.97 (6.0%)0.9% ( -9% - 12%) 0.457
Re: [PR] Dry up EverythingEnum and BlockDocsEnum in Lucene912PostingsReader [lucene]
original-brownbear commented on code in PR #13901: URL: https://github.com/apache/lucene/pull/13901#discussion_r1799704584 ## lucene/core/src/java/org/apache/lucene/codecs/lucene912/Lucene912PostingsReader.java: ## @@ -429,9 +403,44 @@ public PostingsEnum reset(IntBlockTermState termState, int flags) throws IOExcep } level1DocCountUpto = 0; docBufferUpto = BLOCK_SIZE; - freqFP = -1; return this; } + } + + final class BlockDocsEnum extends AbstractPostingsEnum { + +private final long[] freqBuffer = new long[BLOCK_SIZE]; + +private boolean needsFreq; // true if the caller actually needs frequencies +private long freqFP; + +public BlockDocsEnum(FieldInfo fieldInfo) { + super(fieldInfo); +} + +public boolean canReuse(IndexInput docIn, FieldInfo fieldInfo) { + final IndexOptions options = fieldInfo.getIndexOptions(); + return docIn == Lucene912PostingsReader.this.docIn + && indexHasFreq == (options.compareTo(IndexOptions.DOCS_AND_FREQS) >= 0); +} + +public PostingsEnum reset(IntBlockTermState termState, int flags) throws IOException { + startReset(termState); + if (pforUtil == null && docFreq >= BLOCK_SIZE) { +pforUtil = new PForUtil(new ForUtil()); +forDeltaUtil = new ForDeltaUtil(); + } + totalTermFreq = indexHasFreq ? termState.totalTermFreq : docFreq; + + this.needsFreq = PostingsEnum.featureRequested(flags, PostingsEnum.FREQS); + if (indexHasFreq == false || needsFreq == false) { +// Filling this buffer may not be cheap when doing primary key lookups, so we make sure to +// not fill more than `docFreq` entries. +Arrays.fill(freqBuffer, 0, Math.min(ForUtil.BLOCK_SIZE, docFreq), 1); + } + freqFP = -1; + return finishReset(termState); Review Comment: I tried :) I found it quite helpful to dedup this, even if the reader has to jump around a little more, they also have some upside from being able to rely on the thing working the same for both versions? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Only call madvise when necessary. [lucene]
jpountz opened a new pull request, #13907: URL: https://github.com/apache/lucene/pull/13907 This commit tries to save calls to `madvise` which are not necessary, either because they map to the OS' default, or because the advice would be overridden later on anyway. I have not noticed specific problems with this, but it seems desirable to keep calls to `madvise` to a minimum. As a consequence: - Files that are open with `ReadAdvice.NORMAL` do not call `madvise` since this is the OS' default. - Compound files are always open with `ReadAdvice.NORMAL`, and the actual is only set when opening sub files of these compound files. To make the latter less trappy, the `IOContext` parameter has been removed from `CompoundFormat#getCompoundReader`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Dry up EverythingEnum and BlockDocsEnum in Lucene912PostingsReader [lucene]
jpountz commented on code in PR #13901: URL: https://github.com/apache/lucene/pull/13901#discussion_r1799409756 ## lucene/core/src/java/org/apache/lucene/codecs/lucene912/Lucene912PostingsReader.java: ## @@ -429,9 +403,44 @@ public PostingsEnum reset(IntBlockTermState termState, int flags) throws IOExcep } level1DocCountUpto = 0; docBufferUpto = BLOCK_SIZE; - freqFP = -1; return this; } + } + + final class BlockDocsEnum extends AbstractPostingsEnum { + +private final long[] freqBuffer = new long[BLOCK_SIZE]; + +private boolean needsFreq; // true if the caller actually needs frequencies +private long freqFP; + +public BlockDocsEnum(FieldInfo fieldInfo) { + super(fieldInfo); +} + +public boolean canReuse(IndexInput docIn, FieldInfo fieldInfo) { + final IndexOptions options = fieldInfo.getIndexOptions(); + return docIn == Lucene912PostingsReader.this.docIn + && indexHasFreq == (options.compareTo(IndexOptions.DOCS_AND_FREQS) >= 0); +} + +public PostingsEnum reset(IntBlockTermState termState, int flags) throws IOException { + startReset(termState); + if (pforUtil == null && docFreq >= BLOCK_SIZE) { +pforUtil = new PForUtil(new ForUtil()); +forDeltaUtil = new ForDeltaUtil(); + } + totalTermFreq = indexHasFreq ? termState.totalTermFreq : docFreq; + + this.needsFreq = PostingsEnum.featureRequested(flags, PostingsEnum.FREQS); + if (indexHasFreq == false || needsFreq == false) { +// Filling this buffer may not be cheap when doing primary key lookups, so we make sure to +// not fill more than `docFreq` entries. +Arrays.fill(freqBuffer, 0, Math.min(ForUtil.BLOCK_SIZE, docFreq), 1); + } + freqFP = -1; + return finishReset(termState); Review Comment: Now readers of this code need to look up 3 different places to understand what `reset()` does: `startReset()`, `reset()` and `finishReset()`. Can we find more meaningful names to the two helper methods or keep the code duplicated for the sake of making it easier to read? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Try using Murmurhash 3 for bloom filters [lucene]
jpountz merged PR #12868: URL: https://github.com/apache/lucene/pull/12868 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Create a community metrics dashboard [lucene]
stefanvodita commented on issue #13896: URL: https://github.com/apache/lucene/issues/13896#issuecomment-2411304207 That sounds like a plan! I'd be thrilled if you were driving this @prudhvigodithi - thank you for offering your help. It would also be great to get feedback from others on which metrics they'd find most useful. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Create a community metrics dashboard [lucene]
prudhvigodithi commented on issue #13896: URL: https://github.com/apache/lucene/issues/13896#issuecomment-2411592601 Thanks @stefanvodita, I would like to have the maintainers of the repo to take the initial stab at this metrics dashboard (come up the server, app, choosing the database and deploying the resources), and later as mentioned I will be happy to contribute to the dashboard design and features. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] `IndexOrDocValuesQuery` does not support query highlighting [lucene]
prudhvigodithi commented on issue #12686: URL: https://github.com/apache/lucene/issues/12686#issuecomment-2411600774 Hey @harshavamsi the PR to support the `IndexOrDocValuesQuery` for query highlighting is now merged. Thanks. @getsaurabh02 @msfroh -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Dry up EverythingEnum and BlockDocsEnum in Lucene912PostingsReader [lucene]
original-brownbear commented on PR #13901: URL: https://github.com/apache/lucene/pull/13901#issuecomment-241157 Thanks Adrien! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Dry up EverythingEnum and BlockDocsEnum in Lucene912PostingsReader [lucene]
original-brownbear merged PR #13901: URL: https://github.com/apache/lucene/pull/13901 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Added support for highlighting `IndexOrDocValuesQuery` [lucene]
prudhvigodithi commented on PR #13902: URL: https://github.com/apache/lucene/pull/13902#issuecomment-2411600165 Thanks for the review and approval @mkhludnev @jpountz. @getsaurabh02 @dblock -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org