[jira] [Created] (LUCENE-10014) docvalue writeBlock gcd encode improve
weizijun created LUCENE-10014: - Summary: docvalue writeBlock gcd encode improve Key: LUCENE-10014 URL: https://issues.apache.org/jira/browse/LUCENE-10014 Project: Lucene - Core Issue Type: Improvement Components: core/codecs Reporter: weizijun Lucene90DocValuesConsumer.writeBlock calculates bitsPerValue as: {code:java} final int bitsPerValue = DirectWriter.unsignedBitsRequired(max - min); {code} It could use the gcd here instead, computing the bit width from: {code:java} (max - min) / gcd {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
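To make the proposal concrete, here is a minimal sketch of the bit-width arithmetic in plain Java. It assumes a hypothetical block of values and uses a raw bit count (`bitsRequired`), not Lucene's `DirectWriter.unsignedBitsRequired`, which may round up to a supported width. The class and helper names are illustrative only and are not taken from the attached patch.

```java
final class GcdBitsSketch {
    // Raw number of bits needed to store an unsigned value (at least 1).
    static int bitsRequired(long maxValue) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
    }

    // Euclid's algorithm; gcd(0, x) == x, so 0 is a valid starting accumulator.
    static long gcd(long a, long b) {
        while (b != 0) {
            long t = a % b;
            a = b;
            b = t;
        }
        return Math.abs(a);
    }

    // GCD of all offsets from the block minimum (overflow is ignored in this sketch).
    static long blockGcd(long[] values, long min) {
        long g = 0;
        for (long v : values) {
            g = gcd(g, v - min);
        }
        return g;
    }

    public static void main(String[] args) {
        long[] block = {1000, 4000, 7000, 10000}; // hypothetical block
        long min = 1000, max = 10000;
        long g = blockGcd(block, min);
        System.out.println("without gcd: " + bitsRequired(max - min));
        System.out.println("with gcd:    " + bitsRequired((max - min) / g));
    }
}
```

In this hypothetical block every offset from the minimum is a multiple of 3000, so dividing by the gcd shrinks the largest stored value from 9000 to 3. A reader would then need the gcd stored alongside the minimum to reconstruct each value as `min + gcd * stored`.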
[jira] [Updated] (LUCENE-10014) docvalue writeBlock gcd encode improve
[ https://issues.apache.org/jira/browse/LUCENE-10014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] weizijun updated LUCENE-10014: -- Status: Patch Available (was: Open) > docvalue writeBlock gcd encode improve > -- > > Key: LUCENE-10014 > URL: https://issues.apache.org/jira/browse/LUCENE-10014 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Reporter: weizijun >Priority: Major
[jira] [Updated] (LUCENE-10014) docvalue writeBlock gcd encode improve
[ https://issues.apache.org/jira/browse/LUCENE-10014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] weizijun updated LUCENE-10014: -- Attachment: LUCENE-10014.patch > docvalue writeBlock gcd encode improve > -- > > Key: LUCENE-10014 > URL: https://issues.apache.org/jira/browse/LUCENE-10014 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Reporter: weizijun >Priority: Major > Attachments: LUCENE-10014.patch
[jira] [Updated] (LUCENE-10014) docvalue writeBlock gcd encode improve
[ https://issues.apache.org/jira/browse/LUCENE-10014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] weizijun updated LUCENE-10014: -- Status: Patch Available (was: Open) > docvalue writeBlock gcd encode improve > -- > > Key: LUCENE-10014 > URL: https://issues.apache.org/jira/browse/LUCENE-10014 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Reporter: weizijun >Priority: Major > Attachments: LUCENE-10014.patch
[jira] [Updated] (LUCENE-10014) docvalue writeBlock gcd encode improve
[ https://issues.apache.org/jira/browse/LUCENE-10014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] weizijun updated LUCENE-10014: -- Status: Open (was: Patch Available) > docvalue writeBlock gcd encode improve > -- > > Key: LUCENE-10014 > URL: https://issues.apache.org/jira/browse/LUCENE-10014 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Reporter: weizijun >Priority: Major > Attachments: LUCENE-10014.patch
[GitHub] [lucene] mikemccand merged pull request #193: For stability of DisjunctionIntervalsSource.toString(), sort subSources
mikemccand merged pull request #193: URL: https://github.com/apache/lucene/pull/193 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mikemccand commented on pull request #177: Initial rewrite of MMapDirectory for JDK-17 preview (incubating) Panama APIs (>= JDK-17-ea-b25)
mikemccand commented on pull request #177: URL: https://github.com/apache/lucene/pull/177#issuecomment-866784477 > > > The problem with luceneutil is also that it respawns a JVM multiple times. > > > > > > Hmm, we added multiple JVMs long ago precisely because HotSpot was so unpredictable. I.e. we had clear examples where HotSpot would paint itself into a corner, compiling e.g. `readVInt` poorly and never re-compiling it, or something, such that no matter how long the benchmark ran, it would never reach as good performance as if you simply restarted the whole JVM and rolled the dice again. But maybe this situation has been improved and these were somehow early HotSpot bugs/issues and we could really remove multiple JVMs without harming how accurately we can extract the mean/variance performance of all our benchmark tasks? > > This is also not reality: Would you restart your Elasticsearch server from time to time because you think there might be a broken `readVInt()` optimization? Yeah, that is true! But perhaps it shouldn't be the case :) Maybe Elasticsearch/OpenSearch/Solr should spawn JVM a few times until they get a "good" `readVInt` compilation! The noisy mis-compilation was such a sizable impact (back then, hopefully not anymore?). If we only ran benchmarks in nightly runs so that we could see that noise/variance with time, maybe we could do just one JVM. But when a developer is trying to test an exciting optimization, in the privacy of their `git clone`, it really sucks to have hotspot noise completely drown out any small gains your optimization might show! Benchmarking is hard :) > Here are the berlinbuzzwords slides about this: https://2021.berlinbuzzwords.de/sites/berlinbuzzwords.de/files/2021-06/The%20future%20of%20Lucene%27s%20MMapDirectory.pdf Oooh, thanks for sharing! The talk looks AWESOME! I will watch recording when it's out :) You should share these slides on Twitter too? -- This is an automated message from the Apache Git Service. 
[GitHub] [lucene] mikemccand commented on pull request #177: Initial rewrite of MMapDirectory for JDK-17 preview (incubating) Panama APIs (>= JDK-17-ea-b25)
mikemccand commented on pull request #177: URL: https://github.com/apache/lucene/pull/177#issuecomment-866787723 > the JMH will do this too. I forget the defaults, but uses multiple jvm iterations and iterations within each jvm and warmup iterations. But it has smarts around the JIT compiler and can dump profiled assembly for its microbenchmarks. I never have noise issues with it. Excellent! > The current big "integration test" (lucene util) is useful for some things: e.g. something has to tell us there is pollution from too many java abstractions going megamorphic and so on :) +1 It really is more of an integration test, yeah. It runs many different kinds of queries/tasks, concurrently across multiple threads, trying to exercise Lucene roughly in a way that OpenSearch/Elasticsearch/Solr might. Though, it does not do concurrent indexing with searching in a single JVM, at least not with the default benchmarks. Really, distributed search engines should not do that -- they should rather use [Lucene's near-real-time segment replication](https://blog.mikemccandless.com/2017/09/lucenes-near-real-time-segment-index.html#:~:text=Lucene's%20near%2Dreal%2Dtime%20segment%20index%20replication,-%5BTL%3BDR%3A&text=Lucene%20has%20a%20unique%20write,files%20will%20never%20again%20change.), which is more efficient if you have deep replicas, and also enables strong physical isolation of indexing and searching JVMs which have very different resources requirements! OK ``! > But I think it would be improved by providing some more diagnostics (LogCompilation or whatever, maybe JIT stats in the JFR output). Let it be a "canary" to find little ways to improve. +1. I wonder if we could tap into those in real-time and get a sense of when the JVM really is roughly "warmed up", instead of the static "discard first N samples for each task" that we do now. Or maybe to detect mis-compilation of `readVInt`! 
> But we have nothing setup to do simple noise-free microbenchmarks over some specific code, e.g. like "unit tests" running different query types. And for those you don't want crazy JFR and logging and stuff as it is so targeted, you can just dump the hot assembly code instead. For now if you want to do this, you are writing one-off stuff yourself. Yeah maybe consing up a quick JMH for such cases is perfectly fine solution for we developers? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mikemccand commented on pull request #193: For stability of DisjunctionIntervalsSource.toString(), sort subSources
mikemccand commented on pull request #193: URL: https://github.com/apache/lucene/pull/193#issuecomment-866789270 Thanks @magibney -- I pushed this fix and backported to 8.x too.
[GitHub] [lucene] magibney commented on pull request #193: For stability of DisjunctionIntervalsSource.toString(), sort subSources
magibney commented on pull request #193: URL: https://github.com/apache/lucene/pull/193#issuecomment-866806262 Thanks @mikemccand !
[GitHub] [lucene] jpountz merged pull request #186: LUCENE-9613: Encode ordinals like numerics.
jpountz merged pull request #186: URL: https://github.com/apache/lucene/pull/186
[jira] [Closed] (LUCENE-9836) Fix 8.x Maven Validation and publication to work with Maven Central and HTTPS again; remove pure Maven build (did not work anymore)
[ https://issues.apache.org/jira/browse/LUCENE-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayya Sharipova closed LUCENE-9836. --- Closing after the 8.9.0 release > Fix 8.x Maven Validation and publication to work with Maven Central and HTTPS > again; remove pure Maven build (did not work anymore) > --- > > Key: LUCENE-9836 > URL: https://issues.apache.org/jira/browse/LUCENE-9836 > Project: Lucene - Core > Issue Type: Improvement > Components: general/build >Affects Versions: 8.x, 8.9 >Reporter: Uwe Schindler >Assignee: Uwe Schindler >Priority: Major > Fix For: 8.x, 8.9 > > Time Spent: 0.5h > Remaining Estimate: 0h > > Currently the Maven-related stuff in 8.x completely fails, because > Maven-Ant-Tasks is so outdated that it has hardcoded Maven Central without > HTTPS. This makes downloading fail. > You can mostly fix this with an additional remote repository, so it can > fall back to that one. > I'd like to do the following on 8.x: > - Remove the Ant support for Maven: {{ant run-maven-build}} (this no longer > bootstraps, because Maven Ant Tasks can't download Maven, as there is no way to > override the hardcoded repo; I have a workaround in forbiddenapis, but that's too > complicated, so I will simply remove that task) > - Fix the dependency checker: This works, but unfortunately there are some > artifacts which themselves have "http:" in their POM file; those fail to > download. Newer Maven versions have a hardcoded "fixer" in them, but Maven Ant > Tasks again is missing this. I have no idea how to handle that. > I already tried some heavy committing, but the only way to solve this is to > replace maven-ant-tasks with the followup ant task. I am not sure if this is > worth the trouble! > What do others think? Should we maybe simply disable the Maven Dependency > checker? 
[jira] [Closed] (LUCENE-9967) ReplicaNode.start NPE on exception with no message
[ https://issues.apache.org/jira/browse/LUCENE-9967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayya Sharipova closed LUCENE-9967. --- Closing after the 8.9.0 release > ReplicaNode.start NPE on exception with no message > -- > > Key: LUCENE-9967 > URL: https://issues.apache.org/jira/browse/LUCENE-9967 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/replicator >Affects Versions: 8.8.2 > Environment: Java 16.0.1, Fedora Linux 33 >Reporter: Steven Schlansker >Priority: Major > Labels: easyfix, patch > Fix For: 8.9 > > Attachments: LUCENE-9967.patch > > > We are starting a new project and trying to implement Lucene near real time > replication. > While stubbing out some code such that it throws an exception, we found that > Lucene's error handling itself fails when the exception has no message: > > {code:java} > } catch (Throwable t) { > if (t.getMessage().startsWith("replica cannot start") == false) {{code} > > This obscures the actual root cause exception source (you cannot see it > without a debugger) and replaces it with a useless NPE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
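The failure mode described above comes from calling `startsWith` on a `getMessage()` result that can be `null`. A minimal sketch of a null-safe version of that check follows. It is an illustration only, not the change in the attached patch, and the class and method names are hypothetical.

```java
final class ReplicaStartErrorSketch {
    // Null-safe variant of the check quoted in the issue:
    // Throwable.getMessage() returns null when no message was supplied.
    static boolean isReplicaCannotStart(Throwable t) {
        String message = t.getMessage();
        return message != null && message.startsWith("replica cannot start");
    }

    public static void main(String[] args) {
        // An exception without a message no longer triggers a NullPointerException.
        System.out.println(isReplicaCannotStart(new RuntimeException()));
        System.out.println(isReplicaCannotStart(new RuntimeException("replica cannot start: closed")));
    }
}
```

With a guard like this, an exception with no message is handled like any other unexpected failure, so the original root cause stays visible instead of being replaced by an NPE.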
[jira] [Closed] (LUCENE-9976) WANDScorer assertion error in ensureConsistent
[ https://issues.apache.org/jira/browse/LUCENE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayya Sharipova closed LUCENE-9976. --- Closing after the 8.9.0 release > WANDScorer assertion error in ensureConsistent > -- > > Key: LUCENE-9976 > URL: https://issues.apache.org/jira/browse/LUCENE-9976 > Project: Lucene - Core > Issue Type: Bug >Reporter: Dawid Weiss >Assignee: Zach Chen >Priority: Major > Fix For: 8.9 > > Time Spent: 1.5h > Remaining Estimate: 0h > > Build fails and is reproducible: > https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/283/console > {code} > ./gradlew test --tests TestExpressionSorts.testQueries > -Dtests.seed=FF571CE915A0955 -Dtests.multiplier=2 -Dtests.nightly=true > -Dtests.slow=true -Dtests.asserts=true -p lucene/expressions/ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Closed] (LUCENE-9756) Extend FieldInfosFormat tests to cover points and vectors
[ https://issues.apache.org/jira/browse/LUCENE-9756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayya Sharipova closed LUCENE-9756. --- Closing after the 8.9.0 release > Extend FieldInfosFormat tests to cover points and vectors > - > > Key: LUCENE-9756 > URL: https://issues.apache.org/jira/browse/LUCENE-9756 > Project: Lucene - Core > Issue Type: Test >Reporter: Julie Tibshirani >Priority: Major > Fix For: 8.9 > > Time Spent: 0.5h > Remaining Estimate: 0h > > Currently {{BaseFieldInfoFormatTestCase}} doesn't exercise points, vectors, > or the soft deletes field. We should make sure the test covers these options. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Closed] (LUCENE-9574) Add a token filter to drop tokens based on flags.
[ https://issues.apache.org/jira/browse/LUCENE-9574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayya Sharipova closed LUCENE-9574. --- Closing after the 8.9.0 release > Add a token filter to drop tokens based on flags. > - > > Key: LUCENE-9574 > URL: https://issues.apache.org/jira/browse/LUCENE-9574 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Reporter: Gus Heck >Assignee: Gus Heck >Priority: Major > Fix For: 8.9 > > Time Spent: 8h 50m > Remaining Estimate: 0h > > (Breaking this off of SOLR-14597 for independent review) > A filter that tests flags on tokens vs a bitmask and drops tokens that have > all specified flags. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Closed] (LUCENE-9668) Deprecate MinShouldMatchSumScorer with WANDScorer
[ https://issues.apache.org/jira/browse/LUCENE-9668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayya Sharipova closed LUCENE-9668. --- Closing after the 8.9.0 release > Deprecate MinShouldMatchSumScorer with WANDScorer > - > > Key: LUCENE-9668 > URL: https://issues.apache.org/jira/browse/LUCENE-9668 > Project: Lucene - Core > Issue Type: Improvement > Components: core/query/scoring >Reporter: Zach Chen >Priority: Minor > Fix For: 8.9 > > Time Spent: 3h 10m > Remaining Estimate: 0h > > This is a follow up issue of > https://issues.apache.org/jira/browse/LUCENE-9346, where support to > minShouldMatch has been added to WANDScorer, and thus would like to see if > MinShouldMatchSumScorer can be deprecated completely by WANDScorer, given how > similar they are. > For context, some initial discussion of this during the previous work is > available at > https://github.com/apache/lucene-solr/pull/2141#discussion_r550806711 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Closed] (LUCENE-9537) Add Indri Search Engine Functionality to Lucene
[ https://issues.apache.org/jira/browse/LUCENE-9537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayya Sharipova closed LUCENE-9537. --- Closing after the 8.9.0 release > Add Indri Search Engine Functionality to Lucene > --- > > Key: LUCENE-9537 > URL: https://issues.apache.org/jira/browse/LUCENE-9537 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search >Reporter: Cameron VandenBerg >Priority: Major > Labels: patch > Fix For: 8.9 > > Attachments: LUCENE-9537.patch, LUCENE-INDRI.patch > > Time Spent: 4h 40m > Remaining Estimate: 0h > > Indri ([http://lemurproject.org/indri.php]) is an academic search engine > developed by The University of Massachusetts and Carnegie Mellon University. > The major difference between Lucene and Indri is that Indri will give a > "smoothing score" to a document that does not contain the search > term, which has improved the search ranking accuracy in our experiments. I > have created an Indri patch, which adds the search code needed to implement > the Indri AND logic as well as Indri's implementation of Dirichlet Smoothing.
[jira] [Closed] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues
[ https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayya Sharipova closed LUCENE-9663. --- Closing after the 8.9.0 release > Adding compression to terms dict from SortedSet/Sorted DocValues > > > Key: LUCENE-9663 > URL: https://issues.apache.org/jira/browse/LUCENE-9663 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Reporter: Jaison.Bi >Priority: Trivial > Fix For: 8.9 > > Time Spent: 11h 10m > Remaining Estimate: 0h > > Elasticsearch keyword field uses SortedSet DocValues. In our applications, > “keyword” is the most frequently used field type. > LUCENE-7081 has done prefix-compression for docvalues terms dict. We can do > better by replacing prefix-compression with LZ4. In one of our applications, > the dvd files were ~41% smaller with this change (from 1.95 GB to 1.15 GB). > I've done simple tests based on the real application data, comparing the > write/merge time cost, and the on-disk *.dvd file size (after merge into 1 > segment). > || ||Before||After|| > |Write time cost(ms)|591972|618200| > |Merge time cost(ms)|270661|294663| > |*.dvd file size(GB)|1.95|1.15| > This feature is only for the high-cardinality fields. > I'm doing the benchmark test based on luceneutil. Will attach the report and > patch after the test.
[jira] [Closed] (LUCENE-9791) Monitor (aka Luwak) has concurrency issues related to BytesRefHash#find
[ https://issues.apache.org/jira/browse/LUCENE-9791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayya Sharipova closed LUCENE-9791. --- Closing after the 8.9.0 release > Monitor (aka Luwak) has concurrency issues related to BytesRefHash#find > --- > > Key: LUCENE-9791 > URL: https://issues.apache.org/jira/browse/LUCENE-9791 > Project: Lucene - Core > Issue Type: Bug > Components: core/other >Affects Versions: main (9.0), 8.7, 8.8 >Reporter: Paweł Bugalski >Priority: Major > Fix For: main (9.0), 8.9 > > Attachments: LUCENE-9791.patch, LUCENE-97910-8.x-backport.patch, > LUCENE-9791_example.patch > > Time Spent: 4h 20m > Remaining Estimate: 0h > > _org.apache.lucene.monitor.Monitor_ can sometimes *NOT* match a document that > should be matched by one of the registered queries if match operations are run > concurrently from multiple threads. > This is because sometimes in a concurrent environment > _TermFilteredPresearcher_ might not select a query that could later on match > one of the documents being matched. > Internally _TermFilteredPresearcher_ is using a term acceptor: an instance of > _org.apache.lucene.monitor.QueryIndex.QueryTermFilter_. _QueryTermFilter_ is > correctly initialized under lock and its internal state (a map of > _org.apache.lucene.util.BytesRefHash_ instances) is correctly published. > Later on, when those instances are used concurrently, a problem with > _org.apache.lucene.util.BytesRefHash#find_ is triggered since it is not > thread safe. > _org.apache.lucene.util.BytesRefHash#find_ internally is using a private > _org.apache.lucene.util.BytesRefHash#equals_ method, which is using an > instance field _scratch1_ as a temporary buffer to compare its _BytesRef_ > parameter with contents of _ByteBlockPool_. This is not thread safe and can > cause incorrect answers as well as _ArrayIndexOutOfBoundsException_. 
[jira] [Closed] (LUCENE-9385) Skip indexing facet drill down terms
[ https://issues.apache.org/jira/browse/LUCENE-9385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayya Sharipova closed LUCENE-9385. --- Closing after the 8.9.0 release > Skip indexing facet drill down terms > > > Key: LUCENE-9385 > URL: https://issues.apache.org/jira/browse/LUCENE-9385 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/facet >Affects Versions: 8.5.2 >Reporter: Ankur >Priority: Minor > Labels: easyfix > Fix For: main (9.0), 8.9 > > Time Spent: 10h 10m > Remaining Estimate: 0h > > FacetsConfig creates index terms from the Facet dimension and path > automatically for the purpose of supporting drill-down queries. > An application that does not need drill-down ends up paying the index cost of > the extra terms. > Ideally an option to skip indexing these drill down terms should be exposed > to the application. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-9613) Create blocks for ords when it helps in Lucene80DocValuesFormat
[ https://issues.apache.org/jira/browse/LUCENE-9613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-9613. -- Fix Version/s: main (9.0) Resolution: Fixed > Create blocks for ords when it helps in Lucene80DocValuesFormat > --- > > Key: LUCENE-9613 > URL: https://issues.apache.org/jira/browse/LUCENE-9613 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Fix For: main (9.0) > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Currently for sorted(-set) values, we always write ords using > log2(valueCount) bits per entry. However in several cases, like when the field > is used in the index sort, or if one value is _very_ common, splitting into > blocks like we do for numerics would help.
[jira] [Closed] (LUCENE-9877) Explore increasing the allowable exceptions in PForUtil
[ https://issues.apache.org/jira/browse/LUCENE-9877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayya Sharipova closed LUCENE-9877. --- Closing after the 8.9.0 release > Explore increasing the allowable exceptions in PForUtil > --- > > Key: LUCENE-9877 > URL: https://issues.apache.org/jira/browse/LUCENE-9877 > Project: Lucene - Core > Issue Type: Task > Components: core/codecs >Affects Versions: main (9.0) >Reporter: Greg Miller >Priority: Minor > Fix For: 8.9 > > Time Spent: 1h > Remaining Estimate: 0h > > Piggybacking a little off of the investigation I was doing over in > LUCENE-9850 I thought it might also be worthwhile exploring the impact of > increasing the number of allowable exceptions in PForUtil. The aim of this > investigation is to see if we could reduce index size by allowing for more > exceptions without significant negative impact to performance. > PForUtil currently allows for up to 3 exceptions, and it only uses 3 bits to > encode the number of exceptions (using the remaining 3 bits of the byte used > to also encode the number of bits-per-value, which requires 5 bits). Each > exception used is encoded with two full bytes, using a maximum of 6 bytes > per block. > It seems to me like 7 might be a more ideal number of exceptions if index > size is the driving motivation. My thought process is that, in the > worst-case, 7 exceptions would be used to save only a single bit-per-value in > the corresponding block. With 128 entries per block, this would save 16 > bytes. So with 14 bytes used to encode the exception values (7 x 2 bytes per > exception), we would save two bytes in total (just slightly better than > breaking even). If we need fewer than the 7 exceptions, or if we're able to > save more than 1 bit-per-value, it's all additional savings. I suppose the > question is what kind of performance hit we might observe due to decoding > more exceptions. 
> Also note that 7 exceptions is the max we can encode with the 3 bits we > currently have available for the number of exceptions. So moving to 8 > exceptions would not only take 16 bytes to encode the exceptions (if using > all of them), but we'd need one more byte per block to encode the exception > count. So in the worst case of using all 8 exceptions to save 1 bit per > value, we'd actually be worse off. > I'll post some results here for discussion or at least for public record of > my work for future reference. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
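The break-even reasoning above can be checked with a few lines of arithmetic. The sketch below assumes only the figures stated in the issue (128 entries per block, 2 bytes per exception, a 3-bit exception count that tops out at 7, and one extra byte per block beyond that). The class and method names are hypothetical and do not correspond to anything in `PForUtil`.

```java
final class PForExceptionBudgetSketch {
    static final int BLOCK_SIZE = 128;          // entries per block, per the issue
    static final int BYTES_PER_EXCEPTION = 2;   // each exception costs two full bytes
    static final int MAX_COUNT_IN_3_BITS = 7;   // largest count that fits in 3 bits

    // Net bytes saved in one block when `exceptions` exceptions allow the
    // block to drop `bitsSaved` bits per value. Negative means a net loss.
    static int netBytesSaved(int exceptions, int bitsSaved) {
        int saved = BLOCK_SIZE * bitsSaved / 8;
        int cost = exceptions * BYTES_PER_EXCEPTION;
        if (exceptions > MAX_COUNT_IN_3_BITS) {
            cost += 1; // extra byte per block to hold the larger exception count
        }
        return saved - cost;
    }

    public static void main(String[] args) {
        System.out.println("7 exceptions, 1 bit saved: " + netBytesSaved(7, 1));
        System.out.println("8 exceptions, 1 bit saved: " + netBytesSaved(8, 1));
    }
}
```

Under these assumptions the worst case with 7 exceptions still comes out 2 bytes ahead, while 8 exceptions saving a single bit per value costs 17 bytes to save 16.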
[jira] [Closed] (LUCENE-9575) Add PatternTypingFilter
[ https://issues.apache.org/jira/browse/LUCENE-9575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayya Sharipova closed LUCENE-9575. --- Closing after the 8.9.0 release > Add PatternTypingFilter > --- > > Key: LUCENE-9575 > URL: https://issues.apache.org/jira/browse/LUCENE-9575 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Reporter: Gus Heck >Assignee: Gus Heck >Priority: Major > Fix For: 8.9 > > Time Spent: 5h 40m > Remaining Estimate: 0h > > One of the key asks when the Library of Congress was asking me to develop the > Advanced Query Parser was to be able to recognize arbitrary patterns that > included punctuation such as POW/MIA or 401(k) or C++ etc. Additionally they > wanted 401k and 401(k) to match documents with either style reference, and > NOT match documents that happen to have isolated 401 or k tokens (i.e. not > documents about the http status code). And of course we wanted to give up as > little of the text analysis features they were already using. > This filter in conjunction with the filters from LUCENE-9572, LUCENE-9574 and > one solr specific filter in SOLR-14597 that re-analyzes tokens with an > arbitrary analyzer defined for a type in the solr schema, combine to achieve > this. > This filter has the job of spotting the patterns, and adding the intended > synonym as a type to the token (from which minimal punctuation has been > removed). It also sets flags on the token which are retained through the > analysis chain, and at the very end the type is converted to a synonym and > the original token(s) for that type are dropped, avoiding the match on 401 > (for example). > The pattern matching is specified in a file that looks like: > {code} > 2 (\d+)\(?([a-z])\)? ::: legal2_$1_$2 > 2 (\d+)\(?([a-z])\)?\(?(\d+)\)? 
::: legal3_$1_$2_$3 > 2 C\+\+ ::: c_plus_plus > {code} > That file would match match legal reference patterns such as 401(k), 401k, > 501(c)3 and C++ The format is: > ::: > and groups in the pattern are substituted into the replacement so the first > line above would create synonyms such as: > {code} > 401k --> legal2_401_k > 401(k) --> legal2_401_k > 503(c) --> legal2_503_c > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
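The group substitution described above can be tried out with plain java.util.regex. This is only a sketch of the first rule from the sample file, not the filter's implementation: the class and method names are hypothetical, and the leading "2" column of the rule (presumably the flag value) is ignored here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PatternTypingSketch {
  // First rule from the sample file: (\d+)\(?([a-z])\)? ::: legal2_$1_$2
  private static final Pattern LEGAL2 = Pattern.compile("(\\d+)\\(?([a-z])\\)?");
  private static final String REPLACEMENT = "legal2_$1_$2";

  /** Returns the type for a token, or null if the whole token does not match. */
  static String typeFor(String token) {
    Matcher m = LEGAL2.matcher(token);
    // matches() requires the entire token to match, so a bare "401" or "k" is rejected.
    return m.matches() ? m.replaceFirst(REPLACEMENT) : null;
  }

  public static void main(String[] args) {
    for (String token : new String[] {"401k", "401(k)", "503(c)", "401", "k"}) {
      System.out.println(token + " --> " + typeFor(token));
    }
  }
}
```

Requiring a full-token match is what keeps isolated 401 or k tokens from receiving the type.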
[jira] [Closed] (LUCENE-9680) Re-add IndexWriter.getFieldNames
[ https://issues.apache.org/jira/browse/LUCENE-9680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayya Sharipova closed LUCENE-9680. --- Closing after the 8.9.0 release > Re-add IndexWriter.getFieldNames > > > Key: LUCENE-9680 > URL: https://issues.apache.org/jira/browse/LUCENE-9680 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Oren Ovadia >Priority: Major > Fix For: 8.9 > > Time Spent: 2h 20m > Remaining Estimate: 0h > > IndexWriter.getFieldNames was deprecated in LUCENE-8909. > It is useful to have this information exposed by IW to cap (or report) when > too many fields have been created. > getFieldNames was introduced in LUCENE-7659. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Closed] (LUCENE-9887) error param use in RadixSelector
[ https://issues.apache.org/jira/browse/LUCENE-9887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mayya Sharipova closed LUCENE-9887.
-----------------------------------
Closing after the 8.9.0 release

> error param use in RadixSelector
> --------------------------------
>
> Key: LUCENE-9887
> URL: https://issues.apache.org/jira/browse/LUCENE-9887
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/other
> Affects Versions: 8.8
> Environment: None
> Reporter: liupanfeng
> Priority: Trivial
> Labels: patch
> Fix For: 8.9
>
> Attachments: LUCENE-9887.patch, LUCENE-9887.patch
>
>
> A wrong parameter is used in
> *org.apache.lucene.util.RadixSelector#select(int, int, int, int, int).*
> What we expect in this method is:
> if the range becomes narrow or the maximum level of recursion has been
> exceeded, then we switch to a fall-back selector (an IntroSelector).
> *So we should compare the recursion level (param f) to LEVEL_THRESHOLD,
> NOT the byte index of the value (param d).*
> effect:
> This bug does not affect the correctness of the program, but it affects
> performance in some bad cases. On average, RadixSelector and IntroSelector
> both run in linear time. This bug makes us choose the fall-back selector too
> early, so the constant factor of O(n) becomes bigger.
>
> other evidence:
> # The comments say we use the recursion level (f), not the byte index of the value (d).
> # If *d* were right, then *param f* could be deleted because it would not be
> used by any method.
> verification:
> # It still selects the right value if I change d -> f.
> # I did some benchmarking, but the results were unstable on random data.
>
> Thanks for reading. I'm new to Lucene, so please reply if I am wrong, or
> fix it in the future.
>
> I will do a benchmark, but I can't promise the result is better. If you need
> the result, ask me.
>
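The difference between the two conditions can be sketched in isolation. This is not the RadixSelector source: the threshold values and method names below are illustrative assumptions, and the sketch further assumes that the byte index can run ahead of the recursion depth (for example, if shared prefix bytes are skipped).

```java
final class FallbackDecisionSketch {
  // Illustrative values only; the real thresholds are not given in the report.
  static final int LEVEL_THRESHOLD = 8;
  static final int LENGTH_THRESHOLD = 100;

  /** Condition as reported: compares the byte index against the level threshold. */
  static boolean useFallbackReported(int from, int to, int byteIndex, int level) {
    return to - from <= LENGTH_THRESHOLD || byteIndex >= LEVEL_THRESHOLD;
  }

  /** Condition as proposed: compares the recursion level against the level threshold. */
  static boolean useFallbackProposed(int from, int to, int byteIndex, int level) {
    return to - from <= LENGTH_THRESHOLD || level >= LEVEL_THRESHOLD;
  }

  public static void main(String[] args) {
    // A wide range, shallow recursion, but a large byte index.
    System.out.println(useFallbackReported(0, 10_000, 12, 1));
    System.out.println(useFallbackProposed(0, 10_000, 12, 1));
  }
}
```

Both variants return a correct selection either way; only the point at which the slower fall-back takes over differs, which matches the report's claim that the bug is a performance issue rather than a correctness issue.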
[jira] [Closed] (LUCENE-9958) Performance regression when a minimum number of matching SHOULD clauses is required
[ https://issues.apache.org/jira/browse/LUCENE-9958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayya Sharipova closed LUCENE-9958. --- Closing after the 8.9.0 release > Performance regression when a minimum number of matching SHOULD clauses is > required > --- > > Key: LUCENE-9958 > URL: https://issues.apache.org/jira/browse/LUCENE-9958 > Project: Lucene - Core > Issue Type: Bug >Reporter: Adrien Grand >Priority: Minor > Fix For: 8.9 > > > Opening this issue on behalf of [~mattweber], who reported this at > https://discuss.elastic.co/t/es-7-7-1-es-7-12-0-wand-performance-issue/272854. > It looks like the fact that we introduced dynamic pruning for queries that > already have a minimum number of SHOULD clauses configured makes things > _slower_, at least in some cases. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Closed] (LUCENE-9572) Allow TypeAsSynonymFilter to propagate selected flags and Ignore some types
[ https://issues.apache.org/jira/browse/LUCENE-9572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayya Sharipova closed LUCENE-9572. --- Closing after the 8.9.0 release > Allow TypeAsSynonymFilter to propagate selected flags and Ignore some types > --- > > Key: LUCENE-9572 > URL: https://issues.apache.org/jira/browse/LUCENE-9572 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis, modules/test-framework >Reporter: Gus Heck >Assignee: Gus Heck >Priority: Major > Fix For: 8.9 > > Time Spent: 5h 50m > Remaining Estimate: 0h > > (Breaking this off of SOLR-14597 for independent review) > TypeAsSynonymFilter converts types attributes to a synonym. In some cases the > original token may have already had flags set on it and it may be useful to > propagate some or all of those flags to the synonym we are generating. This > ticket provides that ability and allows the user to specify a bitmask to > specify which flags are retained. > Additionally there may be some set of types that should not be converted to > synonyms, and this change allows the user to specify a comma separated list > of types to ignore (most common case will be to ignore a common default type > of 'word' I suspect) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
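The two options described in this ticket (a bitmask selecting which flags to carry over, and a comma separated list of types to skip) amount to the following logic. This is a minimal sketch with hypothetical names, not the TypeAsSynonymFilter code or its configuration syntax.

```java
import java.util.HashSet;
import java.util.Set;

final class TypeAsSynonymSketch {
  /** Keeps only the flag bits that are set in the user-supplied mask. */
  static int retainedFlags(int tokenFlags, int mask) {
    return tokenFlags & mask;
  }

  /** Parses a comma separated list of type names to ignore. */
  static Set<String> parseIgnoredTypes(String csv) {
    Set<String> ignored = new HashSet<>();
    for (String part : csv.split(",")) {
      String type = part.trim();
      if (!type.isEmpty()) {
        ignored.add(type);
      }
    }
    return ignored;
  }

  /** A synonym is emitted only for types that are not on the ignore list. */
  static boolean shouldEmitSynonym(String type, Set<String> ignored) {
    return !ignored.contains(type);
  }

  public static void main(String[] args) {
    Set<String> ignored = parseIgnoredTypes("word, num");
    System.out.println(shouldEmitSynonym("word", ignored));
    System.out.println(shouldEmitSynonym("legal2_401_k", ignored));
    System.out.println(Integer.toBinaryString(retainedFlags(0b1011, 0b0110)));
  }
}
```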
[jira] [Closed] (LUCENE-9953) FacetResult#value is inaccurate in LongValueFacetCounts for multi-value docs
[ https://issues.apache.org/jira/browse/LUCENE-9953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayya Sharipova closed LUCENE-9953. --- Closing after the 8.9.0 release > FacetResult#value is inaccurate in LongValueFacetCounts for multi-value docs > > > Key: LUCENE-9953 > URL: https://issues.apache.org/jira/browse/LUCENE-9953 > Project: Lucene - Core > Issue Type: Bug > Components: modules/facet >Affects Versions: 8.9 >Reporter: Greg Miller >Priority: Minor > Fix For: 8.9 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > As described in a dev@ list > [thread|http://mail-archives.apache.org/mod_mbox/lucene-dev/202105.mbox/%3CCANJ0CDo-9zt0U_pxWNOBkfiJpaAXZGGwOEJPnENAP6JzWz_t9Q%40mail.gmail.com%3E], > the value of {{FacetResult#value}} should reflect the number of docs > containing at least one value in a given facet path. LongValueFacetCounts > counts the number of values contributed by all docs. In cases where all docs > contain a single value, this is fine, but if a doc contains multiple values, > {{FacetResult#value}} will be incorrect. > This is a simple fix so I think we can include it in 8.9. > Note: Spinning this off from LUCENE-9952 since fixing this for all cases > (particularly SSDV) is trickier and may require non-backwards compatible > changes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
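The discrepancy between the two counting strategies is easy to see on a toy data set. The sketch below uses plain arrays in place of doc values, with hypothetical names; it only illustrates the distinction drawn in the description.

```java
final class FacetTotalsSketch {
  /** Counts every value contributed by every doc (the behaviour described as incorrect). */
  static int totalValues(long[][] docs) {
    int count = 0;
    for (long[] values : docs) {
      count += values.length;
    }
    return count;
  }

  /** Counts docs that contain at least one value (what the description says the total should be). */
  static int docsWithAnyValue(long[][] docs) {
    int count = 0;
    for (long[] values : docs) {
      if (values.length > 0) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    // Doc 0 has three values, doc 1 has one, doc 2 has none.
    long[][] docs = { {1L, 2L, 3L}, {4L}, {} };
    System.out.println("values: " + totalValues(docs));
    System.out.println("docs:   " + docsWithAnyValue(docs));
  }
}
```

When every doc holds exactly one value the two numbers coincide, which is why the problem only shows up for multi-valued docs.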
[jira] [Closed] (LUCENE-9950) Support both single- and multi-value string fields in facet counting (non-taxonomy based approaches)
[ https://issues.apache.org/jira/browse/LUCENE-9950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayya Sharipova closed LUCENE-9950. --- Closing after the 8.9.0 release > Support both single- and multi-value string fields in facet counting > (non-taxonomy based approaches) > > > Key: LUCENE-9950 > URL: https://issues.apache.org/jira/browse/LUCENE-9950 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Affects Versions: main (9.0) >Reporter: Greg Miller >Priority: Minor > Fix For: main (9.0), 8.9 > > Time Spent: 3h > Remaining Estimate: 0h > > Users wanting to facet count string-based fields using a non-taxonomy-based > approach can use {{SortedSetDocValueFacetCounts}}, which accumulates facet > counts based on a {{SortedSetDocValues}} field. This requires the stored doc > values to be multi-valued (i.e., {{SORTED_SET}}), and doesn't work on > single-valued fields (i.e., SORTED). In contrast, if a user wants to facet > count on a stored numeric field, they can use {{LongValueFacetCounts}}, which > supports both single- and multi-valued fields (and in LUCENE-9948, we now > auto-detect instead of asking the user to specify). > Let's update {{SortedSetDocValueFacetCounts}} to also support, and > automatically detect single- and multi-value fields. Note that this is a > spin-off issue from LUCENE-9946, where [~rcmuir] points out that this can > essentially be a one-line change, but we may want to do some class renaming > at the same time. Also note that we should do this in > {{ConcurrentSortedSetDocValuesFacetCounts}} while we're at it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Closed] (LUCENE-9694) New tool for creating a deterministic index
[ https://issues.apache.org/jira/browse/LUCENE-9694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayya Sharipova closed LUCENE-9694. --- Closing after the 8.9.0 release > New tool for creating a deterministic index > --- > > Key: LUCENE-9694 > URL: https://issues.apache.org/jira/browse/LUCENE-9694 > Project: Lucene - Core > Issue Type: New Feature > Components: general/tools >Reporter: Haoyu Zhai >Priority: Minor > Fix For: main (9.0), 8.9 > > Time Spent: 2h 40m > Remaining Estimate: 0h > > Lucene's index is segmented, and sometimes number of segments and documents > arrangement greatly impact performance. > Given a stable index sort, our team create a tool that records document > arrangement (called index map) of an index and rearrange another index > (consists of same documents) into the same structure (segment num, and > documents included in each segment). > This tool could be also used in lucene benchmarks for a faster deterministic > index construction (if I understand correctly lucene benchmark is using a > single thread manner to achieve this). > > We've already had some discussion in email > [https://markmail.org/message/lbtdntclpnocmfuf] > And I've implemented the first method, using {{IndexWriter.addIndexes}} and a > customized {{FilteredCodecReader}} to achieve the goal. The index > construction time is about 25min and time executing this tool is about 10min. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Closed] (LUCENE-9991) Fix TestStringValueFacetCounts
[ https://issues.apache.org/jira/browse/LUCENE-9991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayya Sharipova closed LUCENE-9991. --- Closing after the 8.9.0 release > Fix TestStringValueFacetCounts > -- > > Key: LUCENE-9991 > URL: https://issues.apache.org/jira/browse/LUCENE-9991 > Project: Lucene - Core > Issue Type: Bug > Components: modules/facet >Affects Versions: main (9.0) >Reporter: Greg Miller >Priority: Minor > Fix For: main (9.0), 8.9 > > Time Spent: 1h > Remaining Estimate: 0h > > As reported by [~julietibs] in LUCENE-9950, there's a randomized test failure > in {{TestStringValueFacetCounts}}. It's actually an issue with the test > itself. > Since count ties are broken in {{StringValueFacetCounts}} by ordinal, but the > test doesn't know anything about the ordinals, the test breaks ties by the > value itself before comparing results. The edge-case is if we only request a > topN of 1, but the top result ties in count with other results. In this > scenario, the result returned by the {{Facets}} might be one that sorts > higher than another when secondarily sorted by value, but the test can't > solve for this since it only sees the one result. Should be a fairly simple > fix in the test case itself. Will do so shortly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Closed] (LUCENE-9725) Allow BM25FQuery to use other similarities
[ https://issues.apache.org/jira/browse/LUCENE-9725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayya Sharipova closed LUCENE-9725. --- Closing after the 8.9.0 release > Allow BM25FQuery to use other similarities > -- > > Key: LUCENE-9725 > URL: https://issues.apache.org/jira/browse/LUCENE-9725 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Julie Tibshirani >Assignee: Julie Tibshirani >Priority: Major > Fix For: 8.9 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > From a high level, BM25FQuery works as follows: > # Given a list of fields and weights, it pretends there's a synthetic > combined field where all terms have been indexed. It computes new term and > collection statistics for this combined field. > # It uses a disjunction iterator and BM25Similarity to score the documents. > The steps are (1) compute statistics that represent the combined field > content, and (2) pass these to a similarity function. There is nothing really > specific to BM25Similarity in this approach. In step 2, we could use another > similarity, for example BooleanSimilarity or those based on language models > like LMDirichletSimilarity. The main restriction is that norms have to be > additive (the norm of the combined field must be the sum of the field norms). > Maybe we could unhardcode BM25Similarity in BM25FQuery and instead use the > one configured on IndexSearcher. We could think of this as providing a > sensible default approach to cross-field scoring for many similarities. It's > an incremental step towards LUCENE-8711, which would give similarities more > fine-grained control over how stats/ scores are combined across fields. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Closed] (LUCENE-9932) Performance improvement for BKD index building
[ https://issues.apache.org/jira/browse/LUCENE-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mayya Sharipova closed LUCENE-9932.
-----------------------------------
Closing after the 8.9.0 release

> Performance improvement for BKD index building
> ----------------------------------------------
>
> Key: LUCENE-9932
> URL: https://issues.apache.org/jira/browse/LUCENE-9932
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index
> Affects Versions: 8.8.2
> Reporter: neoremind
> Priority: Critical
> Fix For: 8.9
>
> Attachments: benchmark_data.png, flame-graph.png,
> refined-code-benchmark.png, refined-code-benchmark2.png
>
> Time Spent: 12.5h
> Remaining Estimate: 0h
>
> In BKD index building, the input bytes must be sorted before calling the BKD
> writer related API. The sorting method uses the MSB Radix Sort algorithm,
> and the comparison takes both the bytes themselves and the DocId into account,
> but in real cases, DocIds are usually monotonically increasing. This opens up
> a possible performance improvement. I found it while digging into a
> performance issue in our system, and then researched a possible solution.
> DocId is usually increased by one when building an index in a thread-safe way.
> Under that condition, the comparison can drop the unnecessary input, DocId,
> and compare only the bytes themselves.
> In order to do so, MSB radix sort and its fallback sort must be
> *stable*, so that equal elements keep the order in which they were added,
> which leaves DocIds monotonically increasing. To make MSB Radix Sort stable,
> it needs a trivial update; to make the fallback sort stable, use merge sort
> instead of quick sort. Meanwhile, a switch should be introduced to turn the
> stable option on or off.
> To measure how much performance could be gained, I ran a benchmark recording
> only the time elapsed in the _MutablePointsReaderUtils.sort_ stage.
> *Test environment:*
> MacBook Pro (Retina, 15-inch, Mid 2015), 2.2 GHz Intel Core i7, 16 GB 1600
> MHz DDR3
> *Java version:*
> java version "1.8.0_161"
> Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
> Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
> *Testcase:*
> bytesPerDim = [1, 2, 3, 4, 8, 16, 32]
> dim = 1
> doc num = 2,000,000
> warm up 5 times, run 10 times to calculate the average time used.
> *Result:*
>
> ||bytesPerDim\scenario||disable sort doc id (PR branch)||enable sort doc id (master branch)||
> |1|30989.594 us|1151149.9 us|
> |2|313469.47 us|1115595.1 us|
> |3|844617.8 us|1465465.1 us|
> |4|1350946.8 us|1465465.1 us|
> |8|1344814.6 us|1458115.5 us|
> |16|1344516.6 us|1459849.6 us|
> |32|1386847.8 us|1583097.5 us|
> !benchmark_data.png|width=580,height=283!
> The result shows that, by disabling sorting by DocId, sorting runs 1.73x to 37x
> faster when there are many duplicate bytes (bytesPerDim = 1 or 2 or 3). When data
> cardinality is high (bytesPerDim >= 4, where the test cases generate random bytes
> which are more scattered and unlikely to be duplicates), the performance does not
> regress and is still a little better.
> In conclusion, in the end-to-end process for building a BKD index, which relies
> on BKDWriter for some data types, performance could be better by ignoring
> DocId if they are already monotonically increasing.
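The stability argument can be checked on a small example. The sketch below does not use Lucene's radix sort; it relies on `Arrays.sort` for object arrays, which is documented to be stable, and the `Point` class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;

final class StableSortSketch {
  /** A (value, docId) pair standing in for one indexed point. */
  static final class Point {
    final int value;
    final int docId;

    Point(int value, int docId) {
      this.value = value;
      this.docId = docId;
    }
  }

  /** Sorts by value only; the docId is never looked at by the comparator. */
  static Point[] sortByValueOnly(Point[] points) {
    Point[] copy = points.clone();
    // Arrays.sort on objects is a stable merge sort, so ties keep insertion order.
    Arrays.sort(copy, Comparator.comparingInt((Point p) -> p.value));
    return copy;
  }

  /** Verifies that the result is ordered by (value, docId) even though docId was ignored. */
  static boolean orderedByValueThenDocId(Point[] sorted) {
    for (int i = 1; i < sorted.length; i++) {
      Point a = sorted[i - 1];
      Point b = sorted[i];
      if (a.value > b.value || (a.value == b.value && a.docId > b.docId)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // docIds increase monotonically in insertion order; values contain duplicates.
    int[] values = {3, 1, 3, 2, 1, 3};
    Point[] points = new Point[values.length];
    for (int docId = 0; docId < values.length; docId++) {
      points[docId] = new Point(values[docId], docId);
    }
    System.out.println(orderedByValueThenDocId(sortByValueOnly(points)));
  }
}
```

If the input docIds were not already increasing, the same comparator would leave ties in arbitrary docId order, which is presumably why the issue proposes a switch for the stable mode.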
[jira] [Closed] (LUCENE-9985) Upgrade Jetty to 9.4.41
[ https://issues.apache.org/jira/browse/LUCENE-9985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayya Sharipova closed LUCENE-9985. --- Closing after the 8.9.0 release > Upgrade Jetty to 9.4.41 > --- > > Key: LUCENE-9985 > URL: https://issues.apache.org/jira/browse/LUCENE-9985 > Project: Lucene - Core > Issue Type: Task >Reporter: Jan Høydahl >Assignee: Jan Høydahl >Priority: Major > Fix For: main (9.0), 8.9 > > Time Spent: 20m > Remaining Estimate: 0h > > As Solr is upgrading jetty dependency in 8.9 (shared with lucene), Lucene > main should also do the same -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Closed] (LUCENE-9827) Small segments are slower to merge due to stored fields since 8.7
[ https://issues.apache.org/jira/browse/LUCENE-9827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mayya Sharipova closed LUCENE-9827.
-----------------------------------
Closing after the 8.9.0 release

> Small segments are slower to merge due to stored fields since 8.7
> -----------------------------------------------------------------
>
> Key: LUCENE-9827
> URL: https://issues.apache.org/jira/browse/LUCENE-9827
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Adrien Grand
> Priority: Minor
> Fix For: main (9.0), 8.9
>
> Attachments: Indexer.java, log-and-lucene-9827.patch,
> merge-count-by-num-docs.png, merge-type-by-version.png,
> total-merge-time-by-num-docs-on-small-segments.png,
> total-merge-time-by-num-docs.png
>
> Time Spent: 2h 10m
> Remaining Estimate: 0h
>
> [~dm] and [~dimitrisli] looked into an interesting case where indexing slowed
> down after upgrading to 8.7. After digging we identified that this was due to
> the merging of stored fields, which had become slower on average.
> This is due to changes to stored fields, which now have top-level blocks that
> are then split into sub-blocks and compressed using shared dictionaries (one
> dictionary per top-level block). As the top-level blocks are larger than they
> were before, segments are more likely to be considered "dirty" by the merging
> logic. Dirty segments are segments where 1% of the data or more consists of
> incomplete blocks. For large segments, the size of blocks doesn't really
> affect the dirtiness of segments: if you flush a segment that has 100 blocks
> or more, it will never be considered dirty as only the last block may be
> incomplete. But for small segments it does: for instance if your segment is
> only 10 blocks, it is very likely considered dirty given that the last block
> is always incomplete. And the fact that we increased the top-level block size
> means that segments that used to be considered clean might now be considered
> dirty.
> And indeed benchmarks reported that while large stored fields merges became
> slightly faster after upgrading to 8.7, the smaller merges actually became
> slower. See attached chart, which gives the total merge time as a function of
> the number of documents in the segment.
> I don't know how we can address this, this is a natural consequence of the
> larger block size, which is needed to achieve better compression ratios. But
> I wanted to open an issue about it in case someone has a bright idea how we
> could make things better.
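The 1% rule and its effect on small segments can be worked through numerically. The sketch below is not Lucene's merging code: it assumes "data" is measured in some unit such as bytes, that only the last block can be incomplete, and that the check is "incomplete data is at least 1% of all data". All names are hypothetical.

```java
final class DirtySegmentSketch {
  /**
   * @param fullBlocks number of complete blocks in the segment
   * @param blockSize size of a complete block (any unit, e.g. bytes)
   * @param lastBlockSize size of the trailing incomplete block, in [0, blockSize)
   */
  static boolean isDirty(long fullBlocks, long blockSize, long lastBlockSize) {
    long total = fullBlocks * blockSize + lastBlockSize;
    // Dirty when the incomplete part is 1% or more of the segment.
    return total > 0 && lastBlockSize * 100 >= total;
  }

  public static void main(String[] args) {
    // 99 full blocks plus an almost-full last block: still below 1%.
    System.out.println(isDirty(99, 1000, 999));
    // 9 full blocks plus a half-full last block: well above 1%.
    System.out.println(isDirty(9, 1000, 500));
  }
}
```

With 99 full blocks the incomplete block can never reach 1% of the data, while with 9 full blocks it crosses the threshold as soon as it is roughly a tenth full; increasing the block size reduces the number of blocks per segment and so pushes more segments into the second case.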
[jira] [Closed] (LUCENE-9980) Do not expose deleted commits in IndexDeletionPolicy#onCommit
[ https://issues.apache.org/jira/browse/LUCENE-9980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayya Sharipova closed LUCENE-9980. --- Closing after the 8.9.0 release > Do not expose deleted commits in IndexDeletionPolicy#onCommit > - > > Key: LUCENE-9980 > URL: https://issues.apache.org/jira/browse/LUCENE-9980 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Affects Versions: 8.8.1 >Reporter: Nhat Nguyen >Priority: Major > Fix For: 8.9, 9.0 > > Time Spent: 40m > Remaining Estimate: 0h > > If we fail to delete files that belong to a commit point, then we will expose > that deleted commit in the next calls of IndexDeletionPolicy#onCommit(). I > think we should never expose those deleted commit points as some of their > files might have been deleted already. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Closed] (LUCENE-9888) Re-instate CheckIndex's attempts to confirm index sort is consistent across all segments
[ https://issues.apache.org/jira/browse/LUCENE-9888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayya Sharipova closed LUCENE-9888. --- Closing after the 8.9.0 release > Re-instate CheckIndex's attempts to confirm index sort is consistent across > all segments > > > Key: LUCENE-9888 > URL: https://issues.apache.org/jira/browse/LUCENE-9888 > Project: Lucene - Core > Issue Type: Bug >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Major > Fix For: main (9.0), 8.9 > > Time Spent: 20m > Remaining Estimate: 0h > > [~rmuir] opened this awesome PR to enable ecj redundant {{null}} checking: > [https://github.com/apache/lucene/pull/44] > But one of the chunks of dead code we removed from {{CheckIndex}} was spooky: > [https://github.com/apache/lucene/pull/44/files#r602733991] > I think the intention here was to confirm that each segment's {{indexSort}} > is the same, but because the {{Sort previousIndexSort = null}} declaration > was *inside* the {{for}} body, it made the check pointless! > I'll make a simple PR to re-instate the code and move the declaration outside > the loop. Who knows, maybe fixing this long latent bug in {{CheckIndex}} > will catch a fly? And maybe we could do some git archaeology to understand > how the code became zombified? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
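The effect of declaring the "previous" variable inside the loop body can be reproduced in a few lines. The sketch below uses strings in place of `Sort` objects and hypothetical method names; it is an illustration of the pattern, not the CheckIndex code.

```java
import java.util.Arrays;
import java.util.List;

final class IndexSortCheckSketch {
  /** Declaration inside the loop: "previous" is always null when tested, so nothing is checked. */
  static boolean consistentDeclaredInside(List<String> segmentSorts) {
    for (String sort : segmentSorts) {
      String previous = null; // reset on every iteration
      if (previous != null && !previous.equals(sort)) {
        return false; // unreachable in practice
      }
      previous = sort; // assignment is lost when the iteration ends
    }
    return true;
  }

  /** Declaration outside the loop: each segment is compared with the one before it. */
  static boolean consistentDeclaredOutside(List<String> segmentSorts) {
    String previous = null;
    for (String sort : segmentSorts) {
      if (previous != null && !previous.equals(sort)) {
        return false;
      }
      previous = sort;
    }
    return true;
  }

  public static void main(String[] args) {
    List<String> mismatched = Arrays.asList("timestamp:desc", "id:asc");
    System.out.println(consistentDeclaredInside(mismatched));
    System.out.println(consistentDeclaredOutside(mismatched));
  }
}
```

A compiler check for redundant null comparisons flags the first variant because `previous != null` can never be true there, which is how the dead code surfaced in the first place.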
[jira] [Closed] (LUCENE-9722) Aborted merge can leak readers if the output is empty
[ https://issues.apache.org/jira/browse/LUCENE-9722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayya Sharipova closed LUCENE-9722. --- Closing after the 8.9.0 release > Aborted merge can leak readers if the output is empty > - > > Key: LUCENE-9722 > URL: https://issues.apache.org/jira/browse/LUCENE-9722 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Affects Versions: main (9.0), 8.8 >Reporter: Nhat Nguyen >Assignee: Nhat Nguyen >Priority: Major > Fix For: main (9.0), 8.9, 8.8.1 > > Time Spent: 1h > Remaining Estimate: 0h > > We fail to close the merged readers of an aborted merge if its output segment > contains no document. > This bug was discovered by a test in Elasticsearch > ([elastic/elasticsearch#67884|https://github.com/elastic/elasticsearch/issues/67884]). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9613) Create blocks for ords when it helps in Lucene80DocValuesFormat
[ https://issues.apache.org/jira/browse/LUCENE-9613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17368151#comment-17368151 ]

ASF subversion and git services commented on LUCENE-9613:
---------------------------------------------------------

Commit 1d5d4589606e5acbc1f7f6059c8f76965f472435 in lucene's branch refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=1d5d458 ]

LUCENE-9613: Encode ordinals like numerics. (#186)

This helps simplify the code, and also adds some optimizations to ordinals like better compression for long runs of equal values or fields that are used in index sorts.

> Create blocks for ords when it helps in Lucene80DocValuesFormat
> ---------------------------------------------------------------
>
> Key: LUCENE-9613
> URL: https://issues.apache.org/jira/browse/LUCENE-9613
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Minor
> Fix For: main (9.0)
>
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> Currently for sorted(-set) values, we always write ords using
> log2(valueCount) bits per entry. However in several cases, like when the field
> is used in the index sort or if one value is _very_ common, splitting into
> blocks like we do for numerics would help.
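The potential saving for an index-sorted field can be estimated with a short calculation. The sketch below is an approximation under stated assumptions, not the doc values format: it ignores per-block header overhead, uses a block size of 128 purely for illustration, and all names are hypothetical.

```java
final class OrdBitsSketch {
  /** Number of bits needed to represent values in [0, maxValue]. */
  static int bitsRequired(long maxValue) {
    return maxValue == 0 ? 0 : 64 - Long.numberOfLeadingZeros(maxValue);
  }

  /** One global width for all entries, roughly log2(valueCount) bits each. */
  static long flatBits(long[] ords, long valueCount) {
    return (long) ords.length * bitsRequired(valueCount - 1);
  }

  /** Per-block width based on (max - min) within each block; block headers are ignored. */
  static long blockBits(long[] ords, int blockSize) {
    long total = 0;
    for (int start = 0; start < ords.length; start += blockSize) {
      int end = Math.min(start + blockSize, ords.length);
      long min = Long.MAX_VALUE;
      long max = Long.MIN_VALUE;
      for (int i = start; i < end; i++) {
        min = Math.min(min, ords[i]);
        max = Math.max(max, ords[i]);
      }
      total += (long) (end - start) * bitsRequired(max - min);
    }
    return total;
  }

  public static void main(String[] args) {
    // 256 distinct ords, each repeated 4 times, in sorted order (as with an index sort).
    long[] ords = new long[1024];
    for (int i = 0; i < ords.length; i++) {
      ords[i] = i / 4;
    }
    System.out.println("flat:   " + flatBits(ords, 256));
    System.out.println("blocks: " + blockBits(ords, 128));
  }
}
```

A block that contains a single repeated value needs 0 bits per entry under this scheme, which corresponds to the "long runs of equal values" case mentioned in the commit message.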
[jira] [Closed] (LUCENE-9507) Custom order for leaves in DirectoryReader, IndexWriter and searcher
[ https://issues.apache.org/jira/browse/LUCENE-9507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayya Sharipova closed LUCENE-9507. --- Closing after the 8.9.0 release > Custom order for leaves in DirectoryReader, IndexWriter and searcher > > > Key: LUCENE-9507 > URL: https://issues.apache.org/jira/browse/LUCENE-9507 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Jim Ferenczi >Priority: Minor > Fix For: main (9.0), 8.9 > > Time Spent: 5h 50m > Remaining Estimate: 0h > > Now that we're able [to skip documents efficiently when sorting by a numeric > field|https://issues.apache.org/jira/browse/LUCENE-9280], I was wondering if > we could optimize sorted queries further by also sorting the leaf readers > based on the primary sort. > For time-based indices in Elasticsearch, we've implemented an optimization > that does that at query time. If the query is sorted by a numeric docvalue > field, prior to search, we sort the leaves according to the query sort. When > sorting by timestamp this small optimization can have a big impact since > early termination can be reached much faster if the sort values in the > segments don't overlap too much. Applying this optimization at query time is > challenging , it has the benefit to work on any numeric field sort and order > but it requires to use a multi-reader that will reorganize the segments. It > can also be deceptive that after a force merge to 1 segment sorted queries > may be slower since there is nothing to sort anymore. > So, another option that I look at is to add the ability to provide a leaf > order directly in the IndexWriter and DirectoryReader. That could be similar > to an index sort or even complementary to it since sorting segments based on > the index sort could also help at query time. For time-based indices that > cannot afford index sorting but have lots of sorted queries on timestamp, > forcing the order of segments could speed up sorted queries significantly. 
> The advantage of forcing a single leaf sort in the writer/reader is that we > can also use it to influence the merges by putting the segments with the > highest value first. That would help with the case of indices that are merged > to a single segment but would like to keep the sorted queries fast but also > for the multi-segments case since big segments would have more chance to have > highest values first too. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
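The idea of visiting segments in the order of the primary sort can be sketched with a plain comparator. The `Leaf` class below is a hypothetical stand-in for a segment reader that knows the minimum and maximum timestamp it contains; nothing here is taken from the Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class LeafOrderSketch {
  /** Hypothetical per-segment summary: a name plus the timestamp range it covers. */
  static final class Leaf {
    final String name;
    final long minTimestamp;
    final long maxTimestamp;

    Leaf(String name, long minTimestamp, long maxTimestamp) {
      this.name = name;
      this.minTimestamp = minTimestamp;
      this.maxTimestamp = maxTimestamp;
    }
  }

  /** Orders leaves for a "timestamp descending" query: highest max timestamp first. */
  static List<Leaf> newestFirst(List<Leaf> leaves) {
    List<Leaf> sorted = new ArrayList<>(leaves);
    sorted.sort(Comparator.comparingLong((Leaf leaf) -> leaf.maxTimestamp).reversed());
    return sorted;
  }

  public static void main(String[] args) {
    List<Leaf> leaves = Arrays.asList(
        new Leaf("seg_a", 0, 100),
        new Leaf("seg_b", 201, 300),
        new Leaf("seg_c", 101, 200));
    for (Leaf leaf : newestFirst(leaves)) {
      System.out.println(leaf.name);
    }
  }
}
```

When the timestamp ranges barely overlap, as in this example, a top-N query sorted by descending timestamp can fill its queue from the first leaf and skip most of the others, which is the early-termination effect the description refers to.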
[jira] [Closed] (LUCENE-9687) Hunspell support improvements
[ https://issues.apache.org/jira/browse/LUCENE-9687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mayya Sharipova closed LUCENE-9687.
-----------------------------------
Closing after the 8.9.0 release

> Hunspell support improvements
> -----------------------------
>
> Key: LUCENE-9687
> URL: https://issues.apache.org/jira/browse/LUCENE-9687
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Peter Gromov
> Priority: Major
> Fix For: main (9.0), 8.9
>
> Time Spent: 1h 20m
> Remaining Estimate: 0h
>
> I'd like Lucene's Hunspell support to be on a par with the native C++
> Hunspell for spellchecking and suggestions, at least for some languages. So I
> propose to:
> * support the affix rules necessary for English, German, French, Spanish and
> Russian dictionaries, possibly more languages later
> * mirror Hunspell's suggestion algorithm in Lucene
> * provide public APIs for spellchecking, suggestion, stemming,
> morphological data
> * check corpora for specific languages to find and fix
> spellchecking/suggestion discrepancies between Lucene's implementation and
> Hunspell/C++
[GitHub] [lucene] uschindler commented on pull request #177: Initial rewrite of MMapDirectory for JDK-17 preview (incubating) Panama APIs (>= JDK-17-ea-b25)
uschindler commented on pull request #177:
URL: https://github.com/apache/lucene/pull/177#issuecomment-866967688

> > Here are the berlinbuzzwords slides about this: https://2021.berlinbuzzwords.de/sites/berlinbuzzwords.de/files/2021-06/The%20future%20of%20Lucene%27s%20MMapDirectory.pdf
>
> Oooh, thanks for sharing! The talk looks AWESOME! I will watch recording when it's out :) You should share these slides on Twitter too?

I got the link yesterday from Nina, will post it later on Twitter. I just had no time.

> Yeah, that is true! But perhaps it shouldn't be the case :) Maybe Elasticsearch/OpenSearch/Solr should spawn JVM a few times until they get a "good" readVInt compilation! The noisy mis-compilation was such a sizable impact (back then, hopefully not anymore?).

If that is still there, show dumps of assembly and I will for sure open a bug report. This should not happen. At least not with tiered compilation. If you use batch compilation, of course it could be problematic, because it can't "re-optimize" easily. It has to wait for a trap caused by a wrong assumption and switch to the interpreter first before trying again.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on pull request #177: Initial rewrite of MMapDirectory for JDK-17 preview (incubating) Panama APIs (>= JDK-17-ea-b25)
uschindler commented on pull request #177: URL: https://github.com/apache/lucene/pull/177#issuecomment-866980038 > > But I think it would be improved by providing some more diagnostics (LogCompilation or whatever, maybe JIT stats in the JFR output). Let it be a "canary" to find little ways to improve. > > +1. I wonder if we could tap into those in real-time and get a sense of when the JVM really is roughly "warmed up", instead of the static "discard first N samples for each task" that we do now. Or maybe to detect mis-compilation of `readVInt`! Unfortunately, you can't get the compilation events from inside the JVM, but with the help of the outer Python process it might be possible: The inner Java process just benchmarks every round/query and does not throw away anything. After each round it prints the information in "machine readable form" to stdout. In addition we turn on `-XX:+PrintCompilation` on the JVM command line. The outer Python process just reads the process output and reacts to events: - if a benchmark query has finished, it records the machine readable number - if it gets a compilation event on stdout (some regex can catch it), it greps for some "hot method" like "readVInt" and once it sees a compilation event (with tiered you have to look for compilation stage 4, also known as C2), it switches the flag and from now on it can use the numbers recorded That's just an idea. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
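The outer-process idea in the comment above could be sketched roughly as follows. This is only an illustration, written in Java rather than Python: the class name `WarmupFilter`, the `RESULT <task> <millis>` line format, and the regex for `-XX:+PrintCompilation` lines are all assumptions, not something the benchmark tooling defines. The compile-line regex assumes the usual "timestamp, compile id, flags, tier, method" column layout and should be checked against real JVM output before being relied on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: keep benchmark samples only after the hot method reached tier 4 (C2). */
final class WarmupFilter {
  // Assumed PrintCompilation layout: "<ms> <id> <flags> <tier> <method> (<n> bytes)".
  private static final Pattern COMPILE =
      Pattern.compile("^\\s*\\d+\\s+\\d+\\s+[%s!bn ]*?\\s*(\\d)\\s+(\\S+)\\s+.*$");
  // Hypothetical machine-readable benchmark line: "RESULT <task> <millis>".
  private static final Pattern RESULT =
      Pattern.compile("^RESULT\\s+(\\S+)\\s+(\\d+(?:\\.\\d+)?)$");

  private final String hotMethod;
  private final List<Double> samples = new ArrayList<>();
  private boolean warmedUp = false;

  WarmupFilter(String hotMethod) {
    this.hotMethod = hotMethod;
  }

  /** Feed one line of the inner JVM's stdout. */
  void accept(String line) {
    if (line.contains("made not entrant")) {
      return; // deoptimization notice, not a fresh compilation
    }
    Matcher c = COMPILE.matcher(line);
    if (c.matches()) {
      // Flip the flag once the hot method is compiled at tier 4.
      if (c.group(1).equals("4") && c.group(2).endsWith("::" + hotMethod)) {
        warmedUp = true;
      }
      return;
    }
    Matcher r = RESULT.matcher(line);
    if (r.matches() && warmedUp) {
      samples.add(Double.parseDouble(r.group(2)));
    }
  }

  boolean isWarmedUp() {
    return warmedUp;
  }

  List<Double> samples() {
    return samples;
  }
}
```

In a real setup the lines would come from the child process's stdout (for example via `ProcessBuilder`), and the filter would probably want one flag per hot method rather than a single one.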
[GitHub] [lucene] mikemccand merged pull request #163: LUCENE-9983: Stop sorting determinize powersets unnecessarily
mikemccand merged pull request #163: URL: https://github.com/apache/lucene/pull/163 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily
[ https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17368351#comment-17368351 ] ASF subversion and git services commented on LUCENE-9983: - Commit 48ff29c8f358f4dc4fad48997b8ebfde5d2e5751 in lucene's branch refs/heads/main from Patrick Zhai [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=48ff29c ] LUCENE-9983: Stop sorting determinize powersets unnecessarily (#163) * LUCENE-9983: Stop sorting determinize powersets unnecessarily > Stop sorting determinize powersets unnecessarily > > > Key: LUCENE-9983 > URL: https://issues.apache.org/jira/browse/LUCENE-9983 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Priority: Major > Time Spent: 5h 40m > Remaining Estimate: 0h > > Spinoff from LUCENE-9981. > Today, our {{Operations.determinize}} implementation builds powersets of all > subsets of NFA states that "belong" in the same determinized state, using > [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction]. > To hold each powerset, we use a malleable {{SortedIntSet}} and periodically > freeze it to a {{FrozenIntSet}}, also sorted. We pay a high price to keep > these growing maps of int key, int value sorted by key, e.g. upgrading to a > {{TreeMap}} once the map is large enough (> 30 entries). > But I think sorting is entirely unnecessary here! Really all we need is the > ability to add/delete keys from the map, and hashCode / equals (by key only – > ignoring value!), and to freeze the map (a small optimization that we could > skip initially). We only use these maps to lookup in the (growing) > determinized automaton whether this powerset has already been seen. > Maybe we could simply poach the {{IntIntScatterMap}} implementation from > [HPPC|https://github.com/carrotsearch/hppc]? And then change its > {{hashCode}}/{{equals }}to only use keys (not values). 
> This change should be a big speedup for the kinds of (admittedly adversarial) > regexps we saw on LUCENE-9981. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
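The requirement stated in the issue (add/delete keys, with `hashCode`/`equals` computed over keys only and independent of order) can be illustrated with a small sketch. This is not the HPPC `IntIntScatterMap` and not the code that was committed for LUCENE-9983; the class `KeyOnlyStateSet` and its `incr`/`decr` methods are hypothetical names, and a plain `java.util.HashMap` stands in for a primitive map.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: an unsorted int -> count map whose identity depends on its keys only. */
final class KeyOnlyStateSet {
  private final Map<Integer, Integer> counts = new HashMap<>();
  private int keyHash = 0; // sum of mixed key hashes, so insertion order does not matter

  private static int mix(int key) {
    return key * 0x9E3779B9; // cheap multiplicative scramble; overflow wraps, which is fine
  }

  /** Add one reference to {@code state}; the key appears on its first reference. */
  void incr(int state) {
    Integer old = counts.get(state);
    if (old == null) {
      counts.put(state, 1);
      keyHash += mix(state);
    } else {
      counts.put(state, old + 1);
    }
  }

  /** Drop one reference to {@code state}; the key disappears with its last reference. */
  void decr(int state) {
    Integer old = counts.get(state);
    if (old == null) {
      throw new IllegalStateException("state " + state + " not present");
    }
    if (old == 1) {
      counts.remove(state);
      keyHash -= mix(state);
    } else {
      counts.put(state, old - 1);
    }
  }

  @Override
  public int hashCode() {
    return keyHash; // values are deliberately ignored
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof KeyOnlyStateSet
        && counts.keySet().equals(((KeyOnlyStateSet) o).counts.keySet());
  }
}
```

Because the hash is maintained incrementally, adding or removing a key costs constant time and no sorting is needed. A mutable object like this should still be frozen (copied to an immutable form) before being stored as a key in another map.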
[GitHub] [lucene-solr] gsmiller commented on pull request #2507: LUCENE-9946: Support multi-value fields in range facet counting
gsmiller commented on pull request #2507: URL: https://github.com/apache/lucene-solr/pull/2507#issuecomment-867053369 I'm planning to push this later today or tomorrow unless I hear any objections. The change is identical to the one I introduced on main (just backported). Please speak up if you object :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] gsmiller commented on pull request #2506: LUCENE-9962, LUCENE-9944, LUCENE-9988: DrillSideways improvement backports
gsmiller commented on pull request #2506: URL: https://github.com/apache/lucene-solr/pull/2506#issuecomment-867053930 I'm planning to push this later today or tomorrow unless I hear any objections. The change is identical to the ones I introduced on main (just backported). Please speak up if you object :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] thelabdude opened a new pull request #2515: SOLR-15472: New shards.preference option for preferring replicas based on their leader status
thelabdude opened a new pull request #2515: URL: https://github.com/apache/lucene-solr/pull/2515 Backport to 8x, see original PR: https://github.com/apache/solr/pull/188 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] gautamworah96 opened a new pull request #2516: Backport LUCENE-9902 Minor fixes to the faceting API (#62)
gautamworah96 opened a new pull request #2516: URL: https://github.com/apache/lucene-solr/pull/2516 Backported from the original [PR](https://github.com/apache/lucene/pull/62) in apache/lucene -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9204) Move span queries to the queries module
[ https://issues.apache.org/jira/browse/LUCENE-9204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17368432#comment-17368432 ] Michael Gibney commented on LUCENE-9204: Regarding exponential cost in the number of clauses, essentially [the same approach|https://github.com/apache/lucene/blob/baceb1690442c2cdd6164f1faa34d65b54786a04/lucene/core/src/java/org/apache/lucene/util/QueryBuilder.java#L568-L582] is taken at the moment by {{QueryBuilder.analyzeGraphPhrase(...)}}. See: LUCENE-8531 (v7.6+, for slop>0), LUCENE-9207 (v9.0, for all slop values). Even with maxBooleanClauses as a safety valve, this can be problematic. This [recent thread|https://markmail.org/message/n4p2jmsdys6s6buo] on the solr users list is relevant. [~jimczi] mentioned in a [comment on LUCENE-9207|https://issues.apache.org/jira/browse/LUCENE-9207?focusedCommentId=17031526#comment-17031526] that Elasticsearch mitigates this performance issue by disabling graph queries in certain cases. I initially wonder whether this might amount to disabling graph queries in the very cases where graph queries would be most useful? That said, I suppose a similar approach could indeed be prudent in Solr, pending a solution that more directly addresses the issue. (I'm curious, but I have yet to go digging in the Elasticsearch code/docs to find where the disabling of graph queries happens, and to better understand what the tradeoffs are). > Move span queries to the queries module > --- > > Key: LUCENE-9204 > URL: https://issues.apache.org/jira/browse/LUCENE-9204 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: main (9.0) > > Time Spent: 1h > Remaining Estimate: 0h > > We have a slightly odd situation currently, with two parallel query > structures for building complex positional queries: the long-standing span > queries, in core; and interval queries, in the queries module. 
Given that > interval queries solve at least some of the problems we've had with Spans, I > think we should be pushing users more towards these implementations. It's > counter-intuitive to do that when Spans are in core though. I've opened this > issue to discuss moving the spans package as a whole to the queries module. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
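To make the "exponential cost" in the comment above concrete: if a phrase is expanded into one clause per path through the token graph, and each position offers some number of alternatives (for example synonyms), the clause count is the product of those numbers. The following is a back-of-the-envelope sketch under that assumption; `GraphPathCount` is a hypothetical helper and not part of `QueryBuilder`.

```java
/** Sketch: how many distinct paths a token graph has when every position branches. */
final class GraphPathCount {
  /**
   * @param alternatives number of alternative tokens at each position
   * @return the number of paths, saturating at Long.MAX_VALUE on overflow
   */
  static long countPaths(int[] alternatives) {
    long paths = 1;
    for (int a : alternatives) {
      try {
        paths = Math.multiplyExact(paths, (long) a);
      } catch (ArithmeticException overflow) {
        return Long.MAX_VALUE;
      }
    }
    return paths;
  }
}
```

Ten positions with three alternatives each already give 3^10 = 59049 paths, which is why a limit such as maxBooleanClauses only caps the damage instead of removing the cost.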
[jira] [Commented] (LUCENE-9902) Update faceting API to use modern Java features
[ https://issues.apache.org/jira/browse/LUCENE-9902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17368434#comment-17368434 ] Gautam Worah commented on LUCENE-9902: -- I've added a [PR|https://github.com/apache/lucene-solr/pull/2516] to the apache/lucene-solr repo to backport it to 8.10 > Update faceting API to use modern Java features > --- > > Key: LUCENE-9902 > URL: https://issues.apache.org/jira/browse/LUCENE-9902 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Gautam Worah >Priority: Minor > Time Spent: 1h 50m > Remaining Estimate: 0h > > I was using the {{public int getOrdinal(String dim, String[] path)}} API for > a single {{path}} String and found myself creating an array with a single > element. We can start using variable length args for this method. > I also propose this change: > I wanted to know the specific count of an ordinal using the > {{getValue}} API from {{IntTaxonomyFacets}} but the method is private. It > would be good if we could change it to {{protected}} so that users can know > the value of an ordinal without looking up the {{FacetLabel}} and then > checking its value. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
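The convenience of switching a `String[]` parameter to variable-length args can be shown with a tiny stand-in. The method below is hypothetical and only mimics the parameter shape of `getOrdinal(String dim, String[] path)`; it does not touch the taxonomy API.

```java
/** Sketch: a varargs parameter accepts a single String as well as an existing array. */
final class VarargsPathDemo {
  static String describe(String dim, String... path) {
    return dim + "/" + String.join("/", path);
  }
}
```

Callers can then write `describe("Author", "Lisa")` without wrapping the single element in an array, while existing callers that pass a `String[]` keep compiling unchanged.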
[GitHub] [lucene-solr] zhaih opened a new pull request #2517: Backport LUCENE-9142 and LUCENE-9983
zhaih opened a new pull request #2517: URL: https://github.com/apache/lucene-solr/pull/2517 ### Changes Cherry-picked the LUCENE-9142 and LUCENE-9983 changes. LUCENE-9142 is a refactoring change that LUCENE-9983 depends on. LUCENE-9983 is a change that speeds up the `determinize` process when a large number of states are created. ### Test `ant precommit` && unit tests for lucene-core -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] mikemccand merged pull request #2516: Backport LUCENE-9902 Minor fixes to the faceting API (#62)
mikemccand merged pull request #2516: URL: https://github.com/apache/lucene-solr/pull/2516 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9902) Update faceting API to use modern Java features
[ https://issues.apache.org/jira/browse/LUCENE-9902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17368445#comment-17368445 ] ASF subversion and git services commented on LUCENE-9902: - Commit 2056d61c6f4546cd1086f6314c27aac1747d43a5 in lucene-solr's branch refs/heads/branch_8x from Gautam Worah [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=2056d61 ] Backport LUCENE-9902 Minor fixes to the faceting API (#62) (#2516) Co-authored-by: Gautam Worah > Update faceting API to use modern Java features > --- > > Key: LUCENE-9902 > URL: https://issues.apache.org/jira/browse/LUCENE-9902 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Gautam Worah >Priority: Minor > Time Spent: 2h > Remaining Estimate: 0h > > I was using the {{public int getOrdinal(String dim, String[] path)}} API for > a single {{path}} String and found myself creating an array with a single > element. We can start using variable length args for this method. > I also propose this change: > I wanted to know the specific count of an ordinal using the > {{getValue}} API from {{IntTaxonomyFacets}} but the method is private. It > would be good if we could change it to {{protected}} so that users can know > the value of an ordinal without looking up the {{FacetLabel}} and then > checking its value. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9902) Update faceting API to use modern Java features
[ https://issues.apache.org/jira/browse/LUCENE-9902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17368446#comment-17368446 ] ASF subversion and git services commented on LUCENE-9902: - Commit db26215f156d956143e29f1ce43f90c30cd8a107 in lucene's branch refs/heads/main from Michael McCandless [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=db26215 ] LUCENE-9902: move CHANGES entry to 8.10.0 > Update faceting API to use modern Java features > --- > > Key: LUCENE-9902 > URL: https://issues.apache.org/jira/browse/LUCENE-9902 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Gautam Worah >Priority: Minor > Time Spent: 2h > Remaining Estimate: 0h > > I was using the {{public int getOrdinal(String dim, String[] path)}} API for > a single {{path}} String and found myself creating an array with a single > element. We can start using variable length args for this method. > I also propose this change: > I wanted to know the specific count of an ordinal using the > {{getValue}} API from {{IntTaxonomyFacets}} but the method is private. It > would be good if we could change it to {{protected}} so that users can know > the value of an ordinal without looking up the {{FacetLabel}} and then > checking its value. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9204) Move span queries to the queries module
[ https://issues.apache.org/jira/browse/LUCENE-9204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17368474#comment-17368474 ] Michael Gibney commented on LUCENE-9204: I think I see what's going on with disabling graph queries for certain analysis chains in Elasticsearch: # ShingleFilter (via [ShingleTokenFilterFactory|https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/index/analysis/ShingleTokenFilterFactory.java#L116-L124] and [CommonAnalysisPlugin|https://github.com/elastic/elasticsearch/blob/master/modules/analysis-common/src/main/java/org/elasticsearch/analysis/common/CommonAnalysisPlugin.java#L475-L485]) # [CJKBigramFilterFactory|https://github.com/elastic/elasticsearch/blob/master/modules/analysis-common/src/main/java/org/elasticsearch/analysis/common/CJKBigramFilterFactory.java#L70-L78] This makes sense; and if Solr's not already doing this, then it should. In any case these are _definitely_ not, as I had wondered, "the very cases where graph queries would be most useful" :) However, this still leaves SynonymGraphTokenFilterFactory and WordDelimiterGraphTokenFilterFactory (in Elasticsearch) as potentially triggering this kind of expansion (in a manner identical to what's reported in the above-referenced thread from the solr users list). > Move span queries to the queries module > --- > > Key: LUCENE-9204 > URL: https://issues.apache.org/jira/browse/LUCENE-9204 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: main (9.0) > > Time Spent: 1h > Remaining Estimate: 0h > > We have a slightly odd situation currently, with two parallel query > structures for building complex positional queries: the long-standing span > queries, in core; and interval queries, in the queries module. 
Given that > interval queries solve at least some of the problems we've had with Spans, I > think we should be pushing users more towards these implementations. It's > counter-intuitive to do that when Spans are in core though. I've opened this > issue to discuss moving the spans package as a whole to the queries module. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-9204) Move span queries to the queries module
[ https://issues.apache.org/jira/browse/LUCENE-9204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17368474#comment-17368474 ] Michael Gibney edited comment on LUCENE-9204 at 6/23/21, 9:30 PM: -- I think I see what's going on with disabling graph queries for certain analysis chains in Elasticsearch: # ShingleFilter (via [ShingleTokenFilterFactory|https://github.com/elastic/elasticsearch/blob/d9259ccb3f2881d7e77178f091f1f662a47e9cc0/server/src/main/java/org/elasticsearch/index/analysis/ShingleTokenFilterFactory.java#L116-L124] and [CommonAnalysisPlugin|https://github.com/elastic/elasticsearch/blob/d9259ccb3f2881d7e77178f091f1f662a47e9cc0/modules/analysis-common/src/main/java/org/elasticsearch/analysis/common/CommonAnalysisPlugin.java#L475-L485]) # [CJKBigramFilterFactory|https://github.com/elastic/elasticsearch/blob/d9259ccb3f2881d7e77178f091f1f662a47e9cc0/modules/analysis-common/src/main/java/org/elasticsearch/analysis/common/CJKBigramFilterFactory.java#L70-L78] This makes sense; and if Solr's not already doing this, then it should. In any case these are _definitely_ not, as I had wondered, "the very cases where graph queries would be most useful" :) However, this still leaves SynonymGraphTokenFilterFactory and WordDelimiterGraphTokenFilterFactory (in Elasticsearch) as potentially triggering this kind of expansion (in a manner identical to what's reported in the above-referenced thread from the solr users list). 
was (Author: mgibney): I think I see what's going on with disabling graph queries for certain analysis chains in Elasticsearch: # ShingleFilter (via [ShingleTokenFilterFactory|https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/index/analysis/ShingleTokenFilterFactory.java#L116-L124] and [CommonAnalysisPlugin|https://github.com/elastic/elasticsearch/blob/master/modules/analysis-common/src/main/java/org/elasticsearch/analysis/common/CommonAnalysisPlugin.java#L475-L485]) # [CJKBigramFilterFactory|https://github.com/elastic/elasticsearch/blob/master/modules/analysis-common/src/main/java/org/elasticsearch/analysis/common/CJKBigramFilterFactory.java#L70-L78] This makes sense; and if Solr's not already doing this, then it should. In any case these are _definitely_ not, as I had wondered, "the very cases where graph queries would be most useful" :) However, this still leaves SynonymGraphTokenFilterFactory and WordDelimiterGraphTokenFilterFactory (in Elasticsearch) as potentially triggering this kind of expansion (in a manner identical to what's reported in the above-referenced thread from the solr users list). > Move span queries to the queries module > --- > > Key: LUCENE-9204 > URL: https://issues.apache.org/jira/browse/LUCENE-9204 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: main (9.0) > > Time Spent: 1h > Remaining Estimate: 0h > > We have a slightly odd situation currently, with two parallel query > structures for building complex positional queries: the long-standing span > queries, in core; and interval queries, in the queries module. Given that > interval queries solve at least some of the problems we've had with Spans, I > think we should be pushing users more towards these implementations. It's > counter-intuitive to do that when Spans are in core though. 
I've opened this > issue to discuss moving the spans package as a whole to the queries module. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jtibshirani commented on pull request #185: LUCENE-9999: CombinedFieldQuery can fail with an exception when document is missing fields
jtibshirani commented on pull request #185: URL: https://github.com/apache/lucene/pull/185#issuecomment-867185368 @jimczi asked if I could finish the PR -- I pushed some changes: * Add a check that either all fields or no fields have norms enabled * Rework the testing strategy Let me know if it still looks okay or if you have other comments. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] mikemccand commented on pull request #2517: Backport LUCENE-9142 and LUCENE-9983
mikemccand commented on pull request #2517: URL: https://github.com/apache/lucene-solr/pull/2517#issuecomment-867190900 Thanks @zhaih! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] mikemccand merged pull request #2517: Backport LUCENE-9142 and LUCENE-9983
mikemccand merged pull request #2517: URL: https://github.com/apache/lucene-solr/pull/2517 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9142) Add documentation to Operations.determinize, SortedIntSet, and FrozenSet
[ https://issues.apache.org/jira/browse/LUCENE-9142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17368484#comment-17368484 ] ASF subversion and git services commented on LUCENE-9142: - Commit c6b9dd95c9f4b9ab9bac5904988432bba2ad4bd3 in lucene-solr's branch refs/heads/branch_8x from Patrick Zhai [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c6b9dd9 ] Backport LUCENE-9142 and LUCENE-9983 (#2517) * LUCENE-9142 Refactor IntSet operations for determinize (#1184) Co-authored-by: Mike > Add documentation to Operations.determinize, SortedIntSet, and FrozenSet > > > Key: LUCENE-9142 > URL: https://issues.apache.org/jira/browse/LUCENE-9142 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Reporter: Mike Drob >Assignee: Mike Drob >Priority: Major > Fix For: main (9.0) > > Time Spent: 4h 40m > Remaining Estimate: 0h > > Was tracing through the fuzzy query code, and IntelliJ helpfully pointed out > that we have mismatched types when trying to reuse states, and so we may be > creating more states than we need to. > Relevant snippets: > {code:title=Operations.java} > Map newstate = new HashMap<>(); > final SortedIntSet statesSet = new SortedIntSet(5); > Integer q = newstate.get(statesSet); > {code} > {{q}} is always going to be null in this path because there are no > SortedIntSet keys in the map. > There is also very little javadoc on SortedIntSet, so I'm having trouble > following the precise relationship between all the pieces here. > cc: [~mikemccand] [~romseygeek] - I would appreciate any pointers if you have > them -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
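The general hazard the reporter describes, looking up a map with a key of a different type than the one it stores, can be reproduced in isolation. The classes below are hypothetical stand-ins and make no claim about how `SortedIntSet` or its frozen counterpart actually implement `equals`. The lookup compiles because `Map.get` takes an `Object`, and whether it can ever hit depends entirely on the lookup key's `equals` and `hashCode`.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Sketch: a lookup with the "wrong" key type compiles but may never match. */
final class MismatchedKeyDemo {
  /** Immutable key type actually stored in the map. */
  static final class FrozenKey {
    private final int[] values;

    FrozenKey(int... values) {
      this.values = values.clone();
    }

    @Override
    public boolean equals(Object o) {
      return o instanceof FrozenKey && Arrays.equals(values, ((FrozenKey) o).values);
    }

    @Override
    public int hashCode() {
      return Arrays.hashCode(values);
    }
  }

  /** Mutable builder type; it hashes like FrozenKey but keeps identity equals. */
  static final class MutableKey {
    final int[] values;

    MutableKey(int... values) {
      this.values = values;
    }

    @Override
    public int hashCode() {
      return Arrays.hashCode(values);
    }

    FrozenKey freeze() {
      return new FrozenKey(values);
    }
  }

  /** Compiles without complaint because Map.get accepts any Object. */
  static Integer lookup(Map<FrozenKey, Integer> map, Object key) {
    return map.get(key);
  }
}
```

With these definitions, `lookup(map, mutableKey)` returns `null` even when an equal-looking `FrozenKey` is present, while `lookup(map, mutableKey.freeze())` finds it. A mutable type can make the direct lookup work by overriding `equals` to compare itself against the frozen type, which is exactly the kind of relationship that benefits from javadoc.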
[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily
[ https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17368483#comment-17368483 ]

ASF subversion and git services commented on LUCENE-9983:

Commit c6b9dd95c9f4b9ab9bac5904988432bba2ad4bd3 in lucene-solr's branch refs/heads/branch_8x from Patrick Zhai
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c6b9dd9 ]

Backport LUCENE-9142 and LUCENE-9983 (#2517)

* LUCENE-9142 Refactor IntSet operations for determinize (#1184)

Co-authored-by: Mike

> Stop sorting determinize powersets unnecessarily
>
> Key: LUCENE-9983
> URL: https://issues.apache.org/jira/browse/LUCENE-9983
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Priority: Major
> Time Spent: 5h 40m
> Remaining Estimate: 0h
>
> Spinoff from LUCENE-9981.
>
> Today, our {{Operations.determinize}} implementation builds powersets of all
> subsets of NFA states that "belong" in the same determinized state, using
> [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction].
>
> To hold each powerset, we use a malleable {{SortedIntSet}} and periodically
> freeze it to a {{FrozenIntSet}}, also sorted. We pay a high price to keep
> these growing maps of int key, int value sorted by key, e.g. upgrading to a
> {{TreeMap}} once the map is large enough (> 30 entries).
>
> But I think sorting is entirely unnecessary here! Really all we need is the
> ability to add/delete keys from the map, and hashCode / equals (by key only –
> ignoring value!), and to freeze the map (a small optimization that we could
> skip initially). We only use these maps to look up in the (growing)
> determinized automaton whether this powerset has already been seen.
>
> Maybe we could simply poach the {{IntIntScatterMap}} implementation from
> [HPPC|https://github.com/carrotsearch/hppc]? And then change its
> {{hashCode}}/{{equals}} to only use keys (not values).
>
> This change should be a big speedup for the kinds of (admittedly adversarial)
> regexps we saw on LUCENE-9981.
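
For illustration, the hash-based idea in the description above can be sketched with plain java.util collections. This is only a sketch under stated assumptions, not the committed change: the class name PowersetSketch and its methods incr, decr and freeze are hypothetical, and it uses HashMap/HashSet rather than HPPC's IntIntScatterMap or Lucene's own IntSet classes. It shows that an unsorted map is enough as long as equality and hashing look at the keys only.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: not Lucene's SortedIntSet/FrozenIntSet and not HPPC.
final class PowersetSketch {
  // NFA state -> number of times it is currently referenced (the "value").
  private final Map<Integer, Integer> counts = new HashMap<>();

  // Add one reference to an NFA state.
  void incr(int state) {
    counts.merge(state, 1, Integer::sum);
  }

  // Drop one reference; the key disappears when its count would reach zero.
  void decr(int state) {
    // Returning null from the remapping function removes the entry.
    counts.computeIfPresent(state, (k, v) -> v == 1 ? null : v - 1);
  }

  // Snapshot of the keys only. Set equality and hashing ignore insertion
  // order, and the counts are dropped, so no sorting is needed.
  Set<Integer> freeze() {
    return new HashSet<>(counts.keySet());
  }

  public static void main(String[] args) {
    // Frozen powerset -> id of the determinized state it was assigned.
    Map<Set<Integer>, Integer> seen = new HashMap<>();

    PowersetSketch a = new PowersetSketch();
    a.incr(7); a.incr(3); a.incr(3); a.incr(42);
    seen.put(a.freeze(), 0);

    // Same keys, different insertion order and different counts.
    PowersetSketch b = new PowersetSketch();
    b.incr(42); b.incr(3); b.incr(7);
    System.out.println(seen.containsKey(b.freeze()));

    // After removing state 7 the key set differs, so this is a new powerset.
    b.decr(7);
    System.out.println(seen.containsKey(b.freeze()));
  }
}
```

A primitive-keyed map such as the HPPC one mentioned in the issue would avoid the boxing this sketch incurs; the boxed collections are used here only to keep the example self-contained.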
[GitHub] [lucene] gautamworah96 closed pull request #190: Adjust CHANGES.txt to move LUCENE-9902 entry from release 8.9 to 9.0
gautamworah96 closed pull request #190: URL: https://github.com/apache/lucene/pull/190 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on pull request #190: Adjust CHANGES.txt to move LUCENE-9902 entry from release 8.9 to 9.0
gautamworah96 commented on pull request #190: URL: https://github.com/apache/lucene/pull/190#issuecomment-867217120 @mikemccand independently merged a [commit](https://github.com/apache/lucene/commit/db26215f156d956143e29f1ce43f90c30cd8a107) that made the changes proposed in this PR. Thanks @mikemccand ! Closing this PR
[GitHub] [lucene] gautamworah96 commented on a change in pull request #191: LUCENE-9964: Duplicate long values in a field should only be counted once when using SortedNumericDocValuesFields
gautamworah96 commented on a change in pull request #191:
URL: https://github.com/apache/lucene/pull/191#discussion_r657528665

## File path: lucene/facet/src/java/org/apache/lucene/facet/LongValueFacetCounts.java
## @@ -162,8 +164,14 @@ private void count(String field, List matchingDocs) throws IOExcep
       if (limit > 0) {
         totCount++;
       }
+      Set uniqueLongValues =
+          new HashSet<>(); // count each repeated long value in a field as the same

Review comment:
We could reuse a single set and then empty it out after each matching doc. I'll revise and use an HPPC set here.
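
As a rough sketch of the reuse suggested in the review comment: one set is allocated up front and cleared after each document, so a value repeated within a document is counted once for that document. The class UniqueValueCounts, the method count, and the long[][] input are hypothetical stand-ins for the facet-counting code and its doc values, and java.util.HashSet is used here instead of an HPPC set.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: not the code from the pull request.
final class UniqueValueCounts {
  // docs[i] holds all long values of one matching document (assumed input shape).
  static Map<Long, Integer> count(long[][] docs) {
    Map<Long, Integer> counts = new HashMap<>();
    Set<Long> seenInDoc = new HashSet<>(); // allocated once, reused for every doc
    for (long[] values : docs) {
      for (long v : values) {
        // add() returns false for a value already seen in this document.
        if (seenInDoc.add(v)) {
          counts.merge(v, 1, Integer::sum);
        }
      }
      seenInDoc.clear(); // empty the set before the next document
    }
    return counts;
  }

  public static void main(String[] args) {
    long[][] docs = {{5, 5, 7}, {5}, {7, 7, 7}};
    System.out.println(count(docs));
  }
}
```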
[GitHub] [lucene] LuXugang closed pull request #139: LUCENE-9957: Use DirectMonotonicWriter to store sorted Values in NumericDocValues/SortedNumericDocValues
LuXugang closed pull request #139: URL: https://github.com/apache/lucene/pull/139