Re: [PR] Upgrade spotless to 6.9.1, google java format to 1.23.0. [lucene]
dweiss commented on PR #13661: URL: https://github.com/apache/lucene/pull/13661#issuecomment-2292980814 Thanks, @rmuir -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Upgrade spotless to 6.9.1, google java format to 1.23.0. [lucene]
dweiss merged PR #13661: URL: https://github.com/apache/lucene/pull/13661 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Simplify code and reduce the number of lines of code [lucene]
mrhbj opened a new pull request, #13662: URL: https://github.com/apache/lucene/pull/13662 ### Description -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Simplify code and reduce the number of lines of code [lucene]
mrhbj closed pull request #13662: Simplify code and reduce the number of lines of code URL: https://github.com/apache/lucene/pull/13662 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Simplify code and reduce the number of lines of code [lucene]
mrhbj opened a new pull request, #13663: URL: https://github.com/apache/lucene/pull/13663 ### Description -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Optimize decoding blocks of postings using the vector API. (#13636) [lucene]
jpountz merged PR #13652: URL: https://github.com/apache/lucene/pull/13652 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] nocommit: demonstrate how a minor change in IndexSearcher can have an inexplicable performance impact [lucene]
epotyom commented on PR #13657: URL: https://github.com/apache/lucene/pull/13657#issuecomment-2293467656 I've made some temporary changes in luceneutil to be able to only run a couple of tasks that show regression and have meaningful profiler results - profiler results that we get for all tasks seems to have too many samples for other tasks e.g. faceting. Results after 20 runs: ``` TaskQPS baseline StdDevQPS my_modified_version StdDevPct diff p-value MedTerm 581.44 (3.8%) 505.29 (3.5%) -13.1% ( -19% - -5%) 0.000 HighTerm 559.77 (3.5%) 501.17 (3.5%) -10.5% ( -16% - -3%) 0.000 ``` The biggest difference in the profiler seems to be that we spend more time in `org.apache.lucene.search.similarities.BM25Similarity$BM25Scorer.score(float, long)` now? Diff image:  JFR files: [Archive.zip](https://github.com/user-attachments/files/16637306/Archive.zip) The code that was tested is slightly different from this PR, sharing branches just in case: - candidate: Branch with regression: https://github.com/epotyom/lucene/tree/IndexSearcher-search-regression - baseline: Branch with NO regression: https://github.com/epotyom/lucene/tree/IndexSearcher-search-NO-regression - Their diff: https://github.com/epotyom/lucene/pull/1/files luceneutil branch to reproduce: https://github.com/mikemccand/luceneutil/compare/main...epotyom:luceneutil:tasks_with_regression you'd need to generate task file manually as it seems to be to large for for github: ``` rm tasks/wikimedium.10M.regressed.tasks cat tasks/wikimedium.10M.nostopwords.tasks | egrep '^(MedTerm|HighTerm):' > tasks/wikimedium.10M.regressed.tasks.1 for n in {1..1}; do cat tasks/wikimedium.10M.regressed.tasks.1 >> tasks/wikimedium.10M.regressed.tasks; done ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Try applying bipartite graph reordering to KNN graph node ids [lucene]
msokolov commented on issue #13565: URL: https://github.com/apache/lucene/issues/13565#issuecomment-2293683592 I made this tool; while testing it I ran into some unexpected wrinkles relating to our vector format. I created a new index from an existing one, with a new docid order by: writer.addIndexes(SortingCodecReader.wrap((CodecReader) ctx.reader(), docMap, null)); But this recreates the HNSW graph doing unneccessary work, when all we really want to do is to *renumber* it. And the new graph it creates ignores the parameters that were used to create the original graph, substituting defaults, such as M=16. It makes me wonder if we ought to write M to the index as per-field metadata so it can be reused by tools such as this that may not have access to a schema, or in general when merging the field in the future. I guess for my preliminary tests I can simply hack the `DEFAULT_MAX_CONNECTIONS` to fit my test data, but I'd like to hear folks' opinions on how we should address this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] nocommit: demonstrate how a minor change in IndexSearcher can have an inexplicable performance impact [lucene]
gsmiller commented on PR #13657: URL: https://github.com/apache/lucene/pull/13657#issuecomment-2293708857 Found one more interesting nuance this morning. If the `collectors` ArrayList is created with no initial size, it has no impact on performance no matter where it gets created (before-or-after createWeight). But if there's an initial size, then the impact remains (where it regresses performance if done before createWeight and has no impact if done after). So I wonder if it has something to do with allocating memory on the heap before calling createWeight and less about this actual collection? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] nocommit: demonstrate how a minor change in IndexSearcher can have an inexplicable performance impact [lucene]
msokolov commented on PR #13657: URL: https://github.com/apache/lucene/pull/13657#issuecomment-2293718689 Note that setting the initial size requires calling `getSlices()` which might perhaps be put off by the compiler if its return value is not used. In turn `getSlices() does some complicated things including creating a `TaskExecutor` -- this seems to me it could possibly cause some mischief? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] nocommit: demonstrate how a minor change in IndexSearcher can have an inexplicable performance impact [lucene]
gsmiller commented on PR #13657: URL: https://github.com/apache/lucene/pull/13657#issuecomment-2293806502 Ah let me clarify a bit. I actually moved the call to getSlices back into the called private method to eliminate it from complicating things and used a constant value (randomly chose 8) to specify the initial capacity. So I think we can rule out interactions with getSlices. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Expose FlatVectorsFormat [lucene]
msokolov commented on PR #13469: URL: https://github.com/apache/lucene/pull/13469#issuecomment-2293814000 > Would like to know how you are doing it if there is some reference. Because I am also using a custom codec. The one way I figured out to use this is I create my own KNNVectorsFormat and then used the FlatWriters and Readers directly. Let me know if there is any better way. But this is best I can think of. I think we need to add the format to SPI. That way we will be able to evolve it going forward and let the index reading code pick the proper implementation corresponding to what was written to the index. Making the Codec figure this out is probably just a short-term hack that won't scale well. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Inline skip data into postings lists [lucene]
HoustonPutman commented on PR #13585: URL: https://github.com/apache/lucene/pull/13585#issuecomment-2293832375 It looks like grouping queries are really affected by this change. The throughput of each of them were halved: [100 groups](https://home.apache.org/~mikemccand/lucenebench/TermGroup100.html), [10K groups](https://home.apache.org/~mikemccand/lucenebench/TermGroup10K.html), [1M groups](https://home.apache.org/~mikemccand/lucenebench/TermGroup1M.html), [1M groups (two pass block grouping)](https://home.apache.org/~mikemccand/lucenebench/TermBGroup1M.html), [1M groups (single pass block grouping)](https://home.apache.org/~mikemccand/lucenebench/TermBGroup1M1P.html) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] nocommit: demonstrate how a minor change in IndexSearcher can have an inexplicable performance impact [lucene]
gsmiller commented on PR #13657: URL: https://github.com/apache/lucene/pull/13657#issuecomment-2293872374 OK, I inlined the `IndexSearcher#createWeight` code to further isolate things. The ordering that matters is whether-or-not the `collections` list gets created before the call to `Query#createWeight` (so it rules out anything to do with the query cache stuff). I updated the diff in this PR to show this along with comments explaining. And, as mentioned before, it matters whether-or-not an initial capacity is provided. If no initial capacity is provided then the ordering doesn't matter and no regressions are seen (so I'm standing by my guess that this has something to do with allocating heap before the call to `Query#createWeight` and not so much something specific with this `collections` list) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[I] Speed up GroupingSelectors when using a descending sort on a high cardinality field [lucene]
HoustonPutman opened a new issue, #13664: URL: https://github.com/apache/lucene/issues/13664 ### Description I've ran a benchmark (using Solr admittedly, not Lucene), that compares the speed of various sorted queries. The fields mentioned in the benchmark are the fields that were sorted on. Benchmark parameters: - 1M documents - High cardinality fields are unique values per document (an incrementing counter) - Low cardinality fields had 500 unique values - All fields were single-valued - The different fields were tested: `LongPointField`, `StrField` and `TrieLongField` (No longer in lucene) - All sort fields were tested using both DocValues and Uninversion. - Queries - 10 results were requested, for both grouped and non-grouped - Each field type, docValues/Uninversion combination were ran in 8 configurations doing a matrix of: - Grouping and Non-grouping - Sort `asc` and `desc` - High Cardinality values and Low Cardinality values - The grouping: - All queries were grouped by a the same string field (250,000 unique values, docValues enabled) - The sort field was used both for both the group sorting and the document sorting within the groups https://github.com/user-attachments/assets/75868683-3fe8-4745-b0a9-1501e7af8f28";> Overall there are some interesting findings: - **For un-grouped queries (sorting by a high cardinality field), ascending sorts were 5x faster than descending sorts. For grouped queries, they were 40x faster.** That is ultimately what this issue is meant to address, so I highlighted it in the chart above. - Note: Low cardinality fields had no difference in speed - DocValues and Uninverted fields has similar sorting performance for grouped queries. (This is likely related to https://github.com/apache/lucene/issues/10368) I know that since this benchmark is using Solr, it's only so useful here. So I can utilize lucenebench to try to recreate this as well, if that would help. I also have the flame graphs for these benchmarks, which aren't as useful as a screenshot but I will provide one anyways (LongPointField, docValues, highCardinality, grouped, sorted descending): https://github.com/user-attachments/assets/0bba8cff-1a47-436d-9a53-3e52d31376ed";> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Simplify code and reduce the number of lines of code [lucene]
gsmiller merged PR #13663: URL: https://github.com/apache/lucene/pull/13663 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Speed up prefix sums when decoding doc IDs. [lucene]
jpountz merged PR #13658: URL: https://github.com/apache/lucene/pull/13658 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Inline skip data into postings lists [lucene]
jpountz commented on PR #13585: URL: https://github.com/apache/lucene/pull/13585#issuecomment-2294115438 @HoustonPutman This is the same issue as reported above: the logic for lazily decoding blocks of freqs was broken and would decompress whole blocks of freqs on every doc ID. It is now fixed as the benchmarks that you linked confirm (requires zooming in). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Inline skip data into postings lists [lucene]
HoustonPutman commented on PR #13585: URL: https://github.com/apache/lucene/pull/13585#issuecomment-2294155051 Thanks for the correction, sorry for the noise! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add reopen method in PerThreadPKLookup [lucene]
github-actions[bot] commented on PR #13596: URL: https://github.com/apache/lucene/pull/13596#issuecomment-2294481436 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Optimize binary search call [lucene]
github-actions[bot] commented on PR #13595: URL: https://github.com/apache/lucene/pull/13595#issuecomment-2294481483 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Move synonym map off-heap for SynonymGraphFilter [lucene]
github-actions[bot] commented on PR #13054: URL: https://github.com/apache/lucene/pull/13054#issuecomment-2294482584 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] New JMH benchmark method - vdot8s that implement int8 dotProduct in C… [lucene]
goankur commented on PR #13572: URL: https://github.com/apache/lucene/pull/13572#issuecomment-2294502514 > also, would be good to compare apples-to-apples here. currently from what i see, benchmark compares `dot8s(MemorySegment..)` vs `BinaryDotProduct(byte[])`. To me this mixes up concerns about memorysegment vs java heap with C vs vector API, so it would be good to resolve for understanding purposes. As I understand, here is the call stack `PanamaVectorUtilSupport.dotProduct(MemorySegment, MemorySegment)` `PanamaVectorUtilSupport.dotProduct(byte[], byte[])` `VectorUtil.dotProduct(byte[], byte[])` `VectorUtilBenchmark.binaryDotProductVector()` So heap allocated `byte[]` gets wrapped into MemorySegment before the dotProduct computation that uses Panama APIs. For strict apples-to-apples comparison, one would wrap the heap-allocated `byte[]` into a MemorySegment and pass it to the native code -- `dot8s(MemorySegment..)` -- but this does not work and throws the following exception ``` Caused by: java.lang.IllegalArgumentException: Heap segment not allowed: MemorySegment{ heapBase: Optional[[B@53a133c5] address:0 limit: 768 } at java.base/jdk.internal.foreign.abi.SharedUtils.unboxSegment(SharedUtils.java:331) at org.apache.lucene.util.VectorUtil.dot8s(VectorUtil.java:217) ... 13 more ``` So I can't think of a good way to make the comparison fairer. Do you have any recommendations ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] New JMH benchmark method - vdot8s that implement int8 dotProduct in C… [lucene]
goankur commented on code in PR #13572: URL: https://github.com/apache/lucene/pull/13572#discussion_r1720493763 ## lucene/core/src/c/dotProduct.c: ## @@ -0,0 +1,143 @@ +// dotProduct.c + +#include +#include + +#ifdef __ARM_ACLE +#include +#endif + +#if (defined(__ARM_FEATURE_SVE) && !defined(__APPLE__)) +#include +/* + * Unrolled and vectorized int8 dotProduct implementation using SVE instructions + * NOTE: Clang 15.0 compiler on Apple M3 Max compiles the code below sucessfully + * with '-march=native+sve' option but throws "Illegal Hardware Instruction" error + * Looks like Apple M3 does not implement SVE and Apple's official documentation + * is not explicit about this or at least I could not find it. + * + */ +int32_t vdot8s_sve(int8_t *vec1, int8_t *vec2, int32_t limit) { +int32_t result = 0; +int32_t i = 0; +// Vectors of 8-bit signed integers +svint8_t va1, va2, va3, va4; +svint8_t vb1, vb2, vb3, vb4; +// Init accumulators +svint32_t acc1 = svdup_n_s32(0); +svint32_t acc2 = svdup_n_s32(0); +svint32_t acc3 = svdup_n_s32(0); +svint32_t acc4 = svdup_n_s32(0); + +// Number of 8-bits elements in the SVE vector +int32_t vec_length = svcntb(); + +// Manually unroll the loop +for (i = 0; i + 4 * vec_length <= limit; i += 4 * vec_length) { + // Load vectors into the Z registers which can range from 128-bit to 2048-bit wide + // The predicate register - P determines which bytes are active + // svptrue_b8() returns a predictae in which every element is true Review Comment: @rmuir -- I tried using `svwhilelt_b8_u32(i, limit)` intrinsic for generating the predicate in both the unrolled-loop and vector tail but the performance was actually worse :-(. To give you an idea of what I did, here is link to the ARM documentation with code sample https://developer.arm.com/documentation/102699/0100/Optimizing-with-intrinsics -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org