[GitHub] [lucene] dweiss commented on a change in pull request #108: LUCENE-9897 Change dependency checking mechanism to use gradle checksum verification
dweiss commented on a change in pull request #108:
URL: https://github.com/apache/lucene/pull/108#discussion_r625549585

## File path: gradle/validation/jar-checks.gradle ##
@@ -242,62 +206,14 @@ subprojects {
     }
   }
 
-  licenses.dependsOn validateJarChecksums, validateJarLicenses
+  licenses.dependsOn validateJarLicenses
 }
 
 // Add top-project level tasks validating dangling files
 // and regenerating dependency checksums.
 configure(project(":lucene")) {
   def validationTasks = subprojects.collectMany { it.tasks.matching { it.name == "licenses" } }
-  def jarInfoTasks = subprojects.collectMany { it.tasks.matching { it.name == "collectJarInfos" } }
-
-  // Update dependency checksums.
-  task updateLicenses() {

Review comment:
Check is just a convention aggregation task, nothing else. We have *tons* of other stuff that isn't connected with check in the execution graph - I bet some of these dependencies/configurations are in plugins and it'd be difficult to even hook into them to disable automatic dependency verification.

The work on this attempt isn't lost though (thank you!). Let's keep an eye on what happens with gradle's built-in checksums and retry the attempt when it's more flexible.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] tflobbe opened a new pull request #2488: SOLR-15391: Enable 'canUsePoints' for PointFields in Solr
tflobbe opened a new pull request #2488:
URL: https://github.com/apache/lucene-solr/pull/2488

Just a draft for now, no tests or performance numbers. For 8.x only
[jira] [Created] (LUCENE-9951) Add an InfoStream to ReplicationService to facilitate debugging
Christoph Kaser created LUCENE-9951: --- Summary: Add an InfoStream to ReplicationService to facilitate debugging Key: LUCENE-9951 URL: https://issues.apache.org/jira/browse/LUCENE-9951 Project: Lucene - Core Issue Type: Improvement Components: modules/replicator Reporter: Christoph Kaser At the moment, when an exception occurs during replication, the ReplicationService tries to serialize it and send it to the client, which then reports it. This does not work when the exception occurs after the first part of the response has already been sent, or if there was a network error. In these cases, the exception is silently ignored (on the server side), and the client side will report a TruncatedChunkException, making it hard to find the exact cause of the problem. I propose to add an InfoStream to the ReplicationService (analogous to the ReplicationClient) which will log requests and errors that are sent back to the client. I will provide a PR for this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] ChristophKaser opened a new pull request #124: LUCENE-9951: Add InfoStream to ReplicationService
ChristophKaser opened a new pull request #124:
URL: https://github.com/apache/lucene/pull/124

An InfoStream is added to the ReplicationService (similar to the ReplicationClient) to allow debugging replication issues

# Description

Adds InfoStream to ReplicationService to facilitate debugging

# Solution

# Tests

# Checklist

Please review the following and check all that apply:

- [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code conforms to the standards described there to the best of my ability.
- [x] I have created a Jira issue and added the issue ID to my pull request title.
- [x] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [x] I have developed this patch against the `main` branch.
- [ ] I have run `./gradlew check`.
- [ ] I have added tests for my changes.
[jira] [Commented] (LUCENE-9334) Require consistency between data-structures on a per-field basis
[ https://issues.apache.org/jira/browse/LUCENE-9334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338845#comment-17338845 ]

Ignacio Vera commented on LUCENE-9334:
--------------------------------------

I had a look and I see the problem with the test. We need to add an IndexWriterConfig with SerialMergeScheduler in order to reproduce the failures:

{code}
IndexWriterConfig iwc = newIndexWriterConfig();
// Else seeds may not reproduce:
iwc.setMergeScheduler(new SerialMergeScheduler());
{code}

Adding that, the following seed reproduces the failure:

{code}
./gradlew cleanTest test --tests TestPerFieldConsistency -Dtests.seed=C40258ABF5E76DCB -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=te-IN -Dtests.timezone=SystemV/CST6CDT -Dtests.asserts=true -Dtests.file.encoding=UTF-8
{code}

> Require consistency between data-structures on a per-field basis
> ----------------------------------------------------------------
>
> Key: LUCENE-9334
> URL: https://issues.apache.org/jira/browse/LUCENE-9334
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Blocker
> Fix For: main (9.0)
> Time Spent: 14.5h
> Remaining Estimate: 0h
>
> Follow-up of https://lists.apache.org/thread.html/r747de568afd7502008c45783b74cc3aeb31dab8aa60fcafaf65d5431%40%3Cdev.lucene.apache.org%3E.
> We would like to start requiring consistency across data-structures on a per-field basis in order to make it easier to do the right thing by default: range queries can run faster if doc values are enabled, sorted queries can run faster if points are indexed, etc.
> This would be a big change, so it should be rolled out in a major release.
> Strict validation is tricky to implement, but we should still implement best-effort validation:
> - Documents all use the same data-structures, e.g. it is illegal for a document to only enable points and another document to only enable doc values,
> - When possible, check whether values are consistent too.
[GitHub] [lucene] muse-dev[bot] commented on a change in pull request #124: LUCENE-9951: Add InfoStream to ReplicationService
muse-dev[bot] commented on a change in pull request #124:
URL: https://github.com/apache/lucene/pull/124#discussion_r625612737

## File path: lucene/replicator/src/java/org/apache/lucene/replicator/http/ReplicationService.java ##
@@ -183,6 +203,17 @@ public void perform(HttpServletRequest req, HttpServletResponse resp)
           break;
       }
     } catch (Exception e) {
+      if (infoStream.isEnabled(INFO_STREAM_COMPONENT)) {
+        final StringWriter sw = new StringWriter();
+        sw.append("an error occurred during replication service call (");
+        sw.append(req.getRequestURI());
+        if (req.getQueryString() != null) {
+          sw.append('?').append(req.getQueryString());
+        }
+        sw.append("): ");
+        e.printStackTrace(new PrintWriter(sw));

Review comment:
*INFORMATION_EXPOSURE_THROUGH_AN_ERROR_MESSAGE:* Possible information exposure through an error message [(details)](https://find-sec-bugs.github.io/bugs.htm#INFORMATION_EXPOSURE_THROUGH_AN_ERROR_MESSAGE)
(at-me [in a reply](https://docs.muse.dev/docs/talk-to-muse/) with `help` or `ignore`)
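The message-building pattern in the hunk above (request URI, optional query string, then the stack trace captured through a `PrintWriter` wrapped around a `StringWriter`) can be tried with JDK classes only. The following is a minimal sketch, not the PR's code: the helper name `describeFailure` and the request values are hypothetical, and the `infoStream.isEnabled(...)` guard from the diff is left out because it needs Lucene on the classpath.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;

final class ErrorMessageSketch {

  /**
   * Hypothetical helper: builds one log message from a request URI, an optional
   * query string, and the exception that interrupted the call.
   */
  static String describeFailure(String requestUri, String queryString, Throwable error) {
    StringWriter sw = new StringWriter();
    sw.append("an error occurred during replication service call (");
    sw.append(requestUri);
    if (queryString != null) {
      // Only append the separator when there is a query string to show.
      sw.append('?').append(queryString);
    }
    sw.append("): ");
    // printStackTrace writes the exception's toString() and its frames into the writer.
    PrintWriter pw = new PrintWriter(sw);
    error.printStackTrace(pw);
    pw.flush();
    return sw.toString();
  }

  public static void main(String[] args) {
    // Made-up request values, for illustration only.
    String message =
        describeFailure("/replicate/shard1/update", "version=5", new IOException("connection reset"));
    System.out.println(message);
  }
}
```

Because the query string ends up in the logged text, whether this counts as information exposure depends on what a request can carry, which is the point the bot raises here.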
[jira] [Commented] (LUCENE-9334) Require consistency between data-structures on a per-field basis
[ https://issues.apache.org/jira/browse/LUCENE-9334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338864#comment-17338864 ]

Ignacio Vera commented on LUCENE-9334:
--------------------------------------

The test assumes there will be no merges in the background which is not true. Maybe an easy fix is to disable merges:

{code:java}
iwc.setMergePolicy(NoMergePolicy.INSTANCE);
{code}
[GitHub] [lucene] muse-dev[bot] commented on a change in pull request #124: LUCENE-9951: Add InfoStream to ReplicationService
muse-dev[bot] commented on a change in pull request #124:
URL: https://github.com/apache/lucene/pull/124#discussion_r625635306

## File path: lucene/replicator/src/java/org/apache/lucene/replicator/http/ReplicationService.java ##
@@ -183,6 +203,17 @@ public void perform(HttpServletRequest req, HttpServletResponse resp)
           break;
       }
     } catch (Exception e) {
+      if (infoStream.isEnabled(INFO_STREAM_COMPONENT)) {
+        final StringWriter sw = new StringWriter();
+        sw.append("an error occurred during replication service call (");
+        sw.append(req.getRequestURI());
+        if (req.getQueryString() != null) {
+          sw.append('?').append(req.getQueryString());
+        }
+        sw.append("): ");
+        e.printStackTrace(new PrintWriter(sw));

Review comment:
I've recorded this as ignored for this pull request. If you change your mind, just comment `@muse-dev unignore`.
[GitHub] [lucene] ChristophKaser commented on a change in pull request #124: LUCENE-9951: Add InfoStream to ReplicationService
ChristophKaser commented on a change in pull request #124:
URL: https://github.com/apache/lucene/pull/124#discussion_r625635261

## File path: lucene/replicator/src/java/org/apache/lucene/replicator/http/ReplicationService.java ##
@@ -183,6 +203,17 @@ public void perform(HttpServletRequest req, HttpServletResponse resp)
           break;
       }
     } catch (Exception e) {
+      if (infoStream.isEnabled(INFO_STREAM_COMPONENT)) {
+        final StringWriter sw = new StringWriter();
+        sw.append("an error occurred during replication service call (");
+        sw.append(req.getRequestURI());
+        if (req.getQueryString() != null) {
+          sw.append('?').append(req.getQueryString());
+        }
+        sw.append("): ");
+        e.printStackTrace(new PrintWriter(sw));

Review comment:
There is no sensitive information in a replication request
@muse-dev ignore
[jira] [Commented] (LUCENE-9334) Require consistency between data-structures on a per-field basis
[ https://issues.apache.org/jira/browse/LUCENE-9334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338891#comment-17338891 ]

Dawid Weiss commented on LUCENE-9334:
-------------------------------------

I think that's a good first step - I don't know this patch. [~mayyas] may have a better insight.
[GitHub] [lucene] mayya-sharipova opened a new pull request #125: Fix occasional failures in TestPerFieldConsistency
mayya-sharipova opened a new pull request #125:
URL: https://github.com/apache/lucene/pull/125

This test assumes that there is no merging, and was failing when there were merges. This fixes the test by setting NoMergePolicy for IndexWriter.

Relates to #11
[jira] [Commented] (LUCENE-9334) Require consistency between data-structures on a per-field basis
[ https://issues.apache.org/jira/browse/LUCENE-9334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338908#comment-17338908 ]

Mayya Sharipova commented on LUCENE-9334:
-----------------------------------------

[~dweiss] Thanks for raising the failure, and thanks [~ivera] for investigation. Indeed the test assumes no merging. I've created a fix in https://github.com/apache/lucene/pull/125, and will merge it today.
[jira] [Comment Edited] (LUCENE-9334) Require consistency between data-structures on a per-field basis
[ https://issues.apache.org/jira/browse/LUCENE-9334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338908#comment-17338908 ]

Mayya Sharipova edited comment on LUCENE-9334 at 5/4/21, 10:50 AM:
-------------------------------------------------------------------

[~dweiss] Thanks for raising the failure, and thanks [~ivera] for investigation. [~ivera] Indeed, the test assumes no merging. I've created a fix in [PR|https://github.com/apache/lucene/pull/125], and will merge it today.

was (Author: mayyas):
[~dweiss] Thanks for raising the failure, and thanks [~ivera] for investigation. Indeed the test assumes no merging. I've created a fix in https://github.com/apache/lucene/pull/125, and will merge it today.
[GitHub] [lucene] msokolov commented on pull request #114: LUCENE-9905: PerFieldVectorFormat
msokolov commented on pull request #114:
URL: https://github.com/apache/lucene/pull/114#issuecomment-831919682

no comments here it seems; Anyway, we're really just moving the deck chairs around to be more future-extensible; I'll take the silence as consensus and merge later today.
[GitHub] [lucene] mayya-sharipova merged pull request #125: Fix occasional failures in TestPerFieldConsistency
mayya-sharipova merged pull request #125:
URL: https://github.com/apache/lucene/pull/125
[jira] [Commented] (LUCENE-9334) Require consistency between data-structures on a per-field basis
[ https://issues.apache.org/jira/browse/LUCENE-9334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339005#comment-17339005 ]

ASF subversion and git services commented on LUCENE-9334:
---------------------------------------------------------

Commit b5a77de5126c36582a1beb0fc763b47745d46417 in lucene's branch refs/heads/main from Mayya Sharipova
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=b5a77de ]

Fix failures in TestPerFieldConsistency (#125)

This test assumes that there is no merging, and was failing when there were merges. This fixes the test by setting NoMergePolicy for IndexWriter.

Relates to LUCENE-9334
Relates to #11
[GitHub] [lucene] rmuir commented on a change in pull request #114: LUCENE-9905: PerFieldVectorFormat
rmuir commented on a change in pull request #114:
URL: https://github.com/apache/lucene/pull/114#discussion_r625823163

## File path: lucene/core/src/resources/META-INF/services/org.apache.lucene.codecs.VectorFormat ##
@@ -0,0 +1,33 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Licensed to the Apache Software Foundation (ASF) under one or more

Review comment:
all these SPI files seem to have double copyrights
[GitHub] [lucene] rmuir commented on pull request #114: LUCENE-9905: PerFieldVectorFormat
rmuir commented on pull request #114:
URL: https://github.com/apache/lucene/pull/114#issuecomment-831983465

is the plan to do a separate followup to break out euclidean and dot product into codec parameter and remove from FieldInfo? as these are hnsw-specific parameters, they really belong in that codec versus FieldInfo.
[GitHub] [lucene] jpountz commented on a change in pull request #114: LUCENE-9905: PerFieldVectorFormat
jpountz commented on a change in pull request #114:
URL: https://github.com/apache/lucene/pull/114#discussion_r625821205

## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorFormat.java ##
@@ -77,7 +77,9 @@
   static final int VERSION_CURRENT = VERSION_START;
 
   /** Sole constructor */
-  public Lucene90HnswVectorFormat() {}
+  public Lucene90HnswVectorFormat() {
+    super("Lucene90VectorFormat");

Review comment:
historically we've used the class name as a format name, should we use
```suggestion
    super("Lucene90HnswVectorFormat");
```
?

## File path: lucene/core/src/java/org/apache/lucene/codecs/perfield/PerFieldVectorFormat.java ##
@@ -0,0 +1,293 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.perfield;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.ServiceLoader;
+import java.util.TreeMap;
+import org.apache.lucene.codecs.VectorFormat;
+import org.apache.lucene.codecs.VectorReader;
+import org.apache.lucene.codecs.VectorWriter;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.SegmentReadState;
+import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.index.VectorValues;
+import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.util.IOUtils;
+
+/**
+ * Enables per field numeric vector support.
+ *
+ * Note, when extending this class, the name ({@link #getName}) is written into the index. In
+ * order for the field to be read, the name must resolve to your implementation via {@link
+ * #forName(String)}. This method uses Java's {@link ServiceLoader Service Provider Interface} to
+ * resolve format names.
+ *
+ * Files written by each numeric vectors format have an additional suffix containing the format
+ * name. For example, in a per-field configuration instead of _1.dat filenames would
+ * look like _1_Lucene40_0.dat.
+ *
+ * @see ServiceLoader
+ * @lucene.experimental
+ */
+public abstract class PerFieldVectorFormat extends VectorFormat {
+  /** Name of this {@link VectorFormat}. */
+  public static final String PER_FIELD_NAME = "PerFieldVectors90";
+
+  /** {@link FieldInfo} attribute name used to store the format name for each field. */
+  public static final String PER_FIELD_FORMAT_KEY =
+      PerFieldVectorFormat.class.getSimpleName() + ".format";
+
+  /** {@link FieldInfo} attribute name used to store the segment suffix name for each field. */
+  public static final String PER_FIELD_SUFFIX_KEY =
+      PerFieldVectorFormat.class.getSimpleName() + ".suffix";
+
+  /** Sole constructor. */
+  protected PerFieldVectorFormat() {
+    super(PER_FIELD_NAME);
+  }
+
+  @Override
+  public VectorWriter fieldsWriter(SegmentWriteState state) throws IOException {
+    return new FieldsWriter(state);
+  }
+
+  @Override
+  public VectorReader fieldsReader(SegmentReadState state) throws IOException {
+    return new FieldsReader(state);
+  }
+
+  /**
+   * Returns the numeric vector format that should be used for writing new segments of field.
+   *
+   * The field to format mapping is written to the index, so this method is only invoked when
+   * writing, not when reading.
+   */
+  public abstract VectorFormat getVectorFormatForField(String field);
+
+  private class FieldsWriter extends VectorWriter {
+    private final Map<VectorFormat, WriterAndSuffix> formats;
+    private final Map<String, Integer> suffixes = new HashMap<>();
+    private final SegmentWriteState segmentWriteState;
+
+    FieldsWriter(SegmentWriteState segmentWriteState) {
+      this.segmentWriteState = segmentWriteState;
+      formats = new HashMap<>();
+    }
+
+    @Override
+    public void writeField(FieldInfo fieldInfo, VectorValues values) throws IOException {
+      getInstance(fieldInfo).writeField(fieldInfo, values);
+    }
+
+    @Override
+    public void finish() throws IOException {
+      for (WriterAndSuffix was : formats.values()) {
+        was.writer.finish();
+      }
+    }
+
+    @Override
+    public void close() throws IOException {
+      IOUtils.close(formats.values());
+    }
+
+    private VectorWriter getInstance
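The Javadoc quoted above says that per-field files carry an extra suffix containing the format name, so that `_1.dat` becomes `_1_Lucene40_0.dat`. The quoted diff is cut off before the code that assigns suffixes, so the following is only a sketch of one way such a scheme could work, written against the JDK alone. The `Format` class and the `suffixFor` and `fileName` helpers are hypothetical names, and the numbering rule (one counter per format name, one suffix per format instance) is an assumption rather than something the diff confirms.

```java
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

final class PerFieldSuffixSketch {

  /** Hypothetical stand-in for a vector format; only its name matters here. */
  static final class Format {
    final String name;

    Format(String name) {
      this.name = name;
    }
  }

  // Suffix already assigned to each format instance (compared by identity).
  private final Map<Format, String> suffixByFormat = new IdentityHashMap<>();
  // Last number handed out for each format name.
  private final Map<String, Integer> lastNumberByName = new HashMap<>();

  /** Returns the segment suffix for a format instance, assigning one on first use. */
  String suffixFor(Format format) {
    String suffix = suffixByFormat.get(format);
    if (suffix == null) {
      Integer last = lastNumberByName.get(format.name);
      int next = (last == null) ? 0 : last + 1;
      lastNumberByName.put(format.name, next);
      suffix = format.name + "_" + next;
      suffixByFormat.put(format, suffix);
    }
    return suffix;
  }

  /** Builds a file name from segment name, segment suffix, and extension. */
  static String fileName(String segment, String suffix, String extension) {
    return segment + "_" + suffix + "." + extension;
  }

  public static void main(String[] args) {
    PerFieldSuffixSketch sketch = new PerFieldSuffixSketch();
    Format first = new Format("Lucene40");
    Format second = new Format("Lucene40"); // same name, separately configured instance
    System.out.println(fileName("_1", sketch.suffixFor(first), "dat"));
    System.out.println(fileName("_1", sketch.suffixFor(second), "dat"));
  }
}
```

Under these assumptions, two fields that share one format instance write to the same suffixed files, while two instances with the same name get distinct numbers and therefore do not overwrite each other's files.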
[jira] [Commented] (LUCENE-9843) Remove compression option on doc values
[ https://issues.apache.org/jira/browse/LUCENE-9843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339052#comment-17339052 ]

Adrien Grand commented on LUCENE-9843:
--------------------------------------

The patch looks good. This makes me wonder whether we should remove the threshold that only enables compression on the terms dict for non-tiny dictionaries: I believe that it hurts test coverage since our tests rarely index many documents, yet I'm not sure whether it brings real benefits to our users: iterating the terms dict is going to be super fast anyway if you only have a few terms?

> Remove compression option on doc values
> ---------------------------------------
>
> Key: LUCENE-9843
> URL: https://issues.apache.org/jira/browse/LUCENE-9843
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Blocker
> Attachments: LUCENE-9843.patch
>
> Options on file formats add complexity and put a big tax on backward-compatibility testing. I'm the one who introduced it in LUCENE-9378 but I would now like to think about what we can do to remove this option.
> For the record, compression was initially introduced because some binary fields have so much redundancy that it's wasteful not to compress them at all. But unfortunately, this slowed down some search workloads and we decided to introduce this option as a way to let users choose the trade-off they want.
[jira] [Commented] (LUCENE-9843) Remove compression option on doc values
[ https://issues.apache.org/jira/browse/LUCENE-9843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339084#comment-17339084 ]

Robert Muir commented on LUCENE-9843:
-------------------------------------

+1 let's simplify and have better test coverage. it does not impact the speed for ord lookup in any way.
[jira] [Commented] (LUCENE-9936) update gradle build to support gpg signing of tgz/zip distributions
[ https://issues.apache.org/jira/browse/LUCENE-9936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339165#comment-17339165 ] ASF subversion and git services commented on LUCENE-9936: - Commit a6cf46dadabfa7f76a645001d5158f818499de8e in lucene's branch refs/heads/main from Chris M. Hostetter [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=a6cf46d ] LUCENE-9936: Add gpg signing of the tgz & zip distribution files > update gradle build to support gpg signing of tgz/zip distributions > --- > > Key: LUCENE-9936 > URL: https://issues.apache.org/jira/browse/LUCENE-9936 > Project: Lucene - Core > Issue Type: Task >Reporter: Chris M. Hostetter >Assignee: Chris M. Hostetter >Priority: Major > Attachments: LUCENE-9936.patch, LUCENE-9936.patch > > > the gradle build does not currently have any support for gpg signing the > distributions we produce. > this is necessary for releases. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-9936) update gradle build to support gpg signing of tgz/zip distributions
[ https://issues.apache.org/jira/browse/LUCENE-9936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris M. Hostetter resolved LUCENE-9936. Fix Version/s: main (9.0) Resolution: Fixed > update gradle build to support gpg signing of tgz/zip distributions > --- > > Key: LUCENE-9936 > URL: https://issues.apache.org/jira/browse/LUCENE-9936 > Project: Lucene - Core > Issue Type: Task >Reporter: Chris M. Hostetter >Assignee: Chris M. Hostetter >Priority: Major > Fix For: main (9.0) > > Attachments: LUCENE-9936.patch, LUCENE-9936.patch > > > the gradle build does not currently have any support for gpg signing the > distributions we produce. > this is necessary for releases. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on a change in pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query
jpountz commented on a change in pull request #101:
URL: https://github.com/apache/lucene/pull/101#discussion_r625979068

##
File path: lucene/core/src/java/org/apache/lucene/search/BlockMaxMaxscoreScorer.java
##
@@ -0,0 +1,339 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import static org.apache.lucene.search.ScorerUtil.costWithMinShouldMatch;
+
+import java.io.IOException;
+import java.util.*;
+
+/** Scorer implementing Block-Max Maxscore algorithm */
+public class BlockMaxMaxscoreScorer extends Scorer {
+  private final ScoreMode scoreMode;
+  private final int scalingFactor;
+
+  // current doc ID of the leads
+  private int doc;
+
+  // doc id boundary that all scorers maxScore are valid
+  private int upTo = -1;
+
+  // heap of scorers ordered by doc ID
+  private final DisiPriorityQueue essentialsScorers;
+
+  // list of scorers whose sum of maxScore is less than minCompetitiveScore, ordered by maxScore
+  private final List<DisiWrapper> nonEssentialScorers;
+
+  // sum of max scores of scorers in nonEssentialScorers list
+  private long nonEssentialMaxScoreSum;
+
+  // sum of score of scorers in essentialScorers list that are positioned on matching doc
+  private long matchedDocScoreSum;
+
+  private long cost;
+
+  private final MaxScoreSumPropagator maxScoreSumPropagator;
+
+  private final List<Scorer> scorers;
+
+  // scaled min competitive score
+  private long minCompetitiveScore = 0;
+
+  /**
+   * Constructs a Scorer
+   *
+   * @param weight The weight to be used.
+   * @param scorers The sub scorers this Scorer should iterate on for optional clauses
+   * @param scoreMode The scoreMode
+   */
+  public BlockMaxMaxscoreScorer(Weight weight, List<Scorer> scorers, ScoreMode scoreMode)
+      throws IOException {
+    super(weight);
+    assert scoreMode == ScoreMode.TOP_SCORES;
+
+    this.scoreMode = scoreMode;
+    this.doc = -1;
+    this.scorers = scorers;
+    this.cost =
+        costWithMinShouldMatch(
+            scorers.stream().map(Scorer::iterator).mapToLong(DocIdSetIterator::cost),
+            scorers.size(),
+            1);
+
+    essentialsScorers = new DisiPriorityQueue(scorers.size());
+    nonEssentialScorers = new LinkedList<>();
+
+    scalingFactor = WANDScorer.getScalingFactor(scorers);
+    maxScoreSumPropagator = new MaxScoreSumPropagator(scorers);
+
+    for (Scorer scorer : scorers) {
+      nonEssentialScorers.add(new DisiWrapper(scorer));
+    }
+  }
+
+  @Override
+  public DocIdSetIterator iterator() {
+    return TwoPhaseIterator.asDocIdSetIterator(twoPhaseIterator());
+  }
+
+  @Override
+  public TwoPhaseIterator twoPhaseIterator() {
+    DocIdSetIterator approximation =
+        new DocIdSetIterator() {
+          private long lastMinCompetitiveScore;
+
+          @Override
+          public int docID() {
+            return doc;
+          }
+
+          @Override
+          public int nextDoc() throws IOException {
+            return advance(doc + 1);
+          }
+
+          @Override
+          public int advance(int target) throws IOException {
+            doAdvance(target);
+
+            while (doc != DocIdSetIterator.NO_MORE_DOCS
+                && nonEssentialMaxScoreSum + matchedDocScoreSum < minCompetitiveScore) {
+              doAdvance(doc + 1);
+            }
+
+            return doc;
+          }
+
+          private void doAdvance(int target) throws IOException {
+            matchedDocScoreSum = 0;
+            // Find next smallest doc id that is larger than or equal to target from the essential
+            // scorers
+
+            // If the next candidate doc id is still within interval boundary,
+            if (lastMinCompetitiveScore == minCompetitiveScore && target <= upTo) {
+              while (essentialsScorers.top().doc < target) {
+                DisiWrapper w = essentialsScorers.pop();
+                w.doc = w.iterator.advance(target);
+                essentialsScorers.add(w);

Review comment:
can you use updateTop instead? It's usually faster than pop+add

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the spec
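A minimal sketch of the review suggestion above, written against a toy `IntMinHeap` rather than Lucene's `DisiPriorityQueue`. The class, its methods, and the two `advanceWith...` helpers are assumptions made for illustration and are not part of the pull request. The point it tries to show: pop followed by add pays for a sift-down and then a sift-up per entry, while overwriting the top entry in place needs a single sift-down, and both leave the heap holding the same keys.

```java
/** Toy min-heap over int keys (think doc IDs). Hypothetical stand-in, not Lucene code. */
final class IntMinHeap {
  private final int[] heap;
  private int size;

  IntMinHeap(int capacity) {
    heap = new int[capacity];
  }

  /** Smallest key; only valid while the heap is non-empty. */
  int top() {
    return heap[0];
  }

  void add(int value) {
    heap[size] = value;
    siftUp(size++);
  }

  int pop() {
    int result = heap[0];
    heap[0] = heap[--size];
    if (size > 0) {
      siftDown(0);
    }
    return result;
  }

  /** Overwrites the smallest key in place and restores heap order with one sift-down. */
  int updateTop(int newValue) {
    heap[0] = newValue;
    siftDown(0);
    return heap[0];
  }

  /** Removes all keys in ascending order. */
  int[] drain() {
    int[] out = new int[size];
    for (int i = 0; size > 0; i++) {
      out[i] = pop();
    }
    return out;
  }

  private void siftUp(int i) {
    while (i > 0) {
      int parent = (i - 1) >>> 1;
      if (heap[parent] <= heap[i]) {
        break;
      }
      swap(i, parent);
      i = parent;
    }
  }

  private void siftDown(int i) {
    while (true) {
      int left = 2 * i + 1;
      int right = left + 1;
      int smallest = i;
      if (left < size && heap[left] < heap[smallest]) {
        smallest = left;
      }
      if (right < size && heap[right] < heap[smallest]) {
        smallest = right;
      }
      if (smallest == i) {
        return;
      }
      swap(i, smallest);
      i = smallest;
    }
  }

  private void swap(int a, int b) {
    int tmp = heap[a];
    heap[a] = heap[b];
    heap[b] = tmp;
  }

  /** Shape of the loop under review: remove the top entry, advance it, re-insert it. */
  static void advanceWithPopAdd(IntMinHeap h, int target) {
    while (h.top() < target) {
      h.pop();
      h.add(target); // pretend the underlying iterator landed exactly on target
    }
  }

  /** Suggested shape: advance the top entry in place, then fix the heap once. */
  static void advanceWithUpdateTop(IntMinHeap h, int target) {
    while (h.top() < target) {
      h.updateTop(target); // same pretend-advance as above
    }
  }
}
```

In the scorer itself the equivalent change would presumably keep a reference to the top `DisiWrapper`, advance its iterator, and call `updateTop()` on the queue instead of `pop()` and `add(w)`.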
[jira] [Commented] (LUCENE-9946) Support multi-value fields in range facet counting
[ https://issues.apache.org/jira/browse/LUCENE-9946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339187#comment-17339187 ] Greg Miller commented on LUCENE-9946: - Just as a small update on this, I've hit a little speed bump in my implementation due to a bug in my approach I discovered when writing tests. The logic for rolling up range counts (in {{LongRangeCounter}}) needs to be revisited to support multi-value cases, which is a little non-trivial. A few cases to think through:
# A multi-valued field contributes counts to multiple elementary intervals in the segment tree that roll up to different ranges. Each range should get a count of {{1}} from the doc. The doc should only contribute {{1}} to {{FacetResult#value}}.
# A multi-valued field contributes counts to multiple elementary intervals in the segment tree that roll up to some of the same ranges. Each range should receive a count of {{1}} from the doc (need to ensure multiple elementary ranges rolling up to the same range don't double-count). The doc should only contribute {{1}} to {{FacetResult#value}}.
# A multi-valued field contributes counts to the same elementary interval in the segment tree. The individual ranges that the elementary interval rolls up into should all only receive a count of {{1}} from the doc (need to ensure the elementary interval doesn't get double counted, contributing > {{1}} to the ranges it rolls up to). The doc should only contribute {{1}} to {{FacetResult#value}}.
I'll circle back to this in a few days as I have more time to work on it.
> Support multi-value fields in range facet counting > -- > > Key: LUCENE-9946 > URL: https://issues.apache.org/jira/browse/LUCENE-9946 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Affects Versions: main (9.0) >Reporter: Greg Miller >Priority: Minor > > The {{RangeFacetCounts}} implementations ({{LongRangeFacetCounts}} and > {{DoubleRangeFacetCounts}}) only work on single-valued fields today. In > contrast, the more recently added {{LongValueFacetCounts}} implementation > supports both single- and multi-valued fields (LUCENE-7927). I'd like to > extend multi-value support to both of the {{RangeFacetCounts}} > implementations as well. > Looking through the implementations, I can't think of a good reason to _not_ > support this, but maybe I'm overlooking something? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
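The three cases listed in the comment above share one invariant: a document adds at most 1 to each range and at most 1 to the overall total, however many of its values land in that range. The sketch below illustrates only that counting rule. `MultiValueRangeCounts` is a hypothetical class; it tests every range directly and does not model the elementary intervals or segment tree used by {{LongRangeCounter}}, and treating the total as "documents that matched at least one range" is an assumption.

```java
import java.util.BitSet;

/** Toy per-document de-duplicating range counter. Hypothetical, not the Lucene implementation. */
final class MultiValueRangeCounts {
  private final long[] mins;  // inclusive lower bound of each range
  private final long[] maxes; // inclusive upper bound of each range
  final int[] rangeCounts;    // number of documents counted for each range
  int totalDocs;              // documents that matched at least one range (assumed semantics)

  MultiValueRangeCounts(long[] mins, long[] maxes) {
    if (mins.length != maxes.length) {
      throw new IllegalArgumentException("mins and maxes must have the same length");
    }
    this.mins = mins.clone();
    this.maxes = maxes.clone();
    this.rangeCounts = new int[mins.length];
  }

  /** Counts one document that carries zero or more values for the field. */
  void addDoc(long... values) {
    // Collect the set of matched ranges first, so that several values falling
    // into the same range cannot increment that range more than once.
    BitSet matched = new BitSet(mins.length);
    for (long value : values) {
      for (int r = 0; r < mins.length; r++) {
        if (value >= mins[r] && value <= maxes[r]) {
          matched.set(r);
        }
      }
    }
    for (int r = matched.nextSetBit(0); r >= 0; r = matched.nextSetBit(r + 1)) {
      rangeCounts[r]++;
    }
    if (!matched.isEmpty()) {
      totalDocs++;
    }
  }
}
```

With ranges [0, 10], [5, 20] and [30, 40], the documents {2, 35}, {3, 7} and {6, 8} roughly correspond to cases 1, 2 and 3 respectively: each touches a range through one or more values, yet each range is incremented once per document.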
[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning
[ https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339188#comment-17339188 ] Adrien Grand commented on LUCENE-9335: -- Thanks for writing two scorers to test this out! Would you be able to run queries under a profiler to see where your new scorers are spending most time? This might help identify how we could make them faster. Also thanks for testing with more queries, FWIW it would be good enough to only add 4-5 new queries to the tasks file to play with the change. By the way I'd be curious to see how your new scorers perform with 5 "Med" terms, which should be a worst-case scenario for BMW as all terms should have similar max scores. Since the queries you ran have a "Low" term, I wonder whether this term drives iteration, which prevents BMM from showing the lower overhead it has compared to BMW. > Add a bulk scorer for disjunctions that does dynamic pruning > > > Key: LUCENE-9335 > URL: https://issues.apache.org/jira/browse/LUCENE-9335 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Attachments: wikimedium.10M.nostopwords.tasks > > Time Spent: 2h 40m > Remaining Estimate: 0h > > Lucene often gets benchmarked against other engines, e.g. against Tantivy and > PISA at [https://tantivy-search.github.io/bench/] or against research > prototypes in Table 1 of > [https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf]. > Given that top-level disjunctions of term queries are commonly used for > benchmarking, it would be nice to optimize this case a bit more, I suspect > that we could make fewer per-document decisions by implementing a BulkScorer > instead of a Scorer. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-9843) Remove compression option on doc values
[ https://issues.apache.org/jira/browse/LUCENE-9843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Conradson updated LUCENE-9843: --- Attachment: LUCENE-9843.patch > Remove compression option on doc values > --- > > Key: LUCENE-9843 > URL: https://issues.apache.org/jira/browse/LUCENE-9843 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Blocker > Attachments: LUCENE-9843.patch, LUCENE-9843.patch, LUCENE-9843.patch > > > Options on file formats add complexity and put a big tax on > backward-compatibility testing. I'm the one who introduced it LUCENE-9378 but > I would now like to think about what we can do to remove this option. > For the record, compression was initially introduced because some binary > fields have so much redundancy that it's wasteful not to compress them at > all. But unfortunately, this slowed down some search workloads and we decided > to introduce this option as a way to let users choose the trade-off they want. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9843) Remove compression option on doc values
[ https://issues.apache.org/jira/browse/LUCENE-9843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339245#comment-17339245 ] Jack Conradson commented on LUCENE-9843: I have attached a new patch ([^LUCENE-9843.patch]) with the additional change of *always* compressing the terms dictionaries. This removes the {color:#9876aa}TERMS_DICT_BLOCK_COMPRESSION_THRESHOLD{color} constant and removes all the if/else blocks that related to compression in Lucene90DocValuesConsumer#addTermsDict. > Remove compression option on doc values > --- > > Key: LUCENE-9843 > URL: https://issues.apache.org/jira/browse/LUCENE-9843 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Blocker > Attachments: LUCENE-9843.patch, LUCENE-9843.patch, LUCENE-9843.patch > > > Options on file formats add complexity and put a big tax on > backward-compatibility testing. I'm the one who introduced it LUCENE-9378 but > I would now like to think about what we can do to remove this option. > For the record, compression was initially introduced because some binary > fields have so much redundancy that it's wasteful not to compress them at > all. But unfortunately, this slowed down some search workloads and we decided > to introduce this option as a way to let users choose the trade-off they want. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-9843) Remove compression option on doc values
[ https://issues.apache.org/jira/browse/LUCENE-9843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339245#comment-17339245 ] Jack Conradson edited comment on LUCENE-9843 at 5/4/21, 7:18 PM: - Thank you [~jpountz] and [~rcmuir] for the feedback! I have attached a new patch ([^LUCENE-9843.patch]) with the additional change of *always* compressing the terms dictionaries. This removes the {color:#9876aa}TERMS_DICT_BLOCK_COMPRESSION_THRESHOLD{color} constant and removes all the if/else blocks that related to compression in Lucene90DocValuesConsumer#addTermsDict. was (Author: jdconradson): I have attached a new patch ([^LUCENE-9843.patch]) with the additional change of *always* compressing the terms dictionaries. This removes the {color:#9876aa}TERMS_DICT_BLOCK_COMPRESSION_THRESHOLD{color} constant and removes all the if/else blocks that related to compression in Lucene90DocValuesConsumer#addTermsDict. > Remove compression option on doc values > --- > > Key: LUCENE-9843 > URL: https://issues.apache.org/jira/browse/LUCENE-9843 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Blocker > Attachments: LUCENE-9843.patch, LUCENE-9843.patch, LUCENE-9843.patch > > > Options on file formats add complexity and put a big tax on > backward-compatibility testing. I'm the one who introduced it LUCENE-9378 but > I would now like to think about what we can do to remove this option. > For the record, compression was initially introduced because some binary > fields have so much redundancy that it's wasteful not to compress them at > all. But unfortunately, this slowed down some search workloads and we decided > to introduce this option as a way to let users choose the trade-off they want. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-9843) Remove compression option on doc values
[ https://issues.apache.org/jira/browse/LUCENE-9843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339245#comment-17339245 ] Jack Conradson edited comment on LUCENE-9843 at 5/4/21, 7:19 PM: - Thank you [~jpountz] and [~rcmuir] for the feedback! I have attached a new patch ([^LUCENE-9843.patch]) with the additional change of *always* compressing the terms dictionaries. This removes the {color:#9876aa}TERMS_DICT_BLOCK_COMPRESSION_THRESHOLD{color} constant and removes all the if/else blocks that related to compression in Lucene90DocValuesConsumer#addTermsDict. was (Author: jdconradson): Thank you [~jpountz] and [~rcmuir] for the feedback! I have attached a new patch ([^LUCENE-9843.patch]) with the additional change of *always* compressing the terms dictionaries. This removes the {color:#9876aa}TERMS_DICT_BLOCK_COMPRESSION_THRESHOLD{color} constant and removes all the if/else blocks that related to compression in Lucene90DocValuesConsumer#addTermsDict. > Remove compression option on doc values > --- > > Key: LUCENE-9843 > URL: https://issues.apache.org/jira/browse/LUCENE-9843 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Blocker > Attachments: LUCENE-9843.patch, LUCENE-9843.patch, LUCENE-9843.patch > > > Options on file formats add complexity and put a big tax on > backward-compatibility testing. I'm the one who introduced it LUCENE-9378 but > I would now like to think about what we can do to remove this option. > For the record, compression was initially introduced because some binary > fields have so much redundancy that it's wasteful not to compress them at > all. But unfortunately, this slowed down some search workloads and we decided > to introduce this option as a way to let users choose the trade-off they want. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-9843) Remove compression option on doc values
[ https://issues.apache.org/jira/browse/LUCENE-9843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Conradson updated LUCENE-9843: --- Attachment: (was: LUCENE-9843.patch) > Remove compression option on doc values > --- > > Key: LUCENE-9843 > URL: https://issues.apache.org/jira/browse/LUCENE-9843 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Blocker > Attachments: LUCENE-9843.patch, LUCENE-9843.patch > > > Options on file formats add complexity and put a big tax on > backward-compatibility testing. I'm the one who introduced it LUCENE-9378 but > I would now like to think about what we can do to remove this option. > For the record, compression was initially introduced because some binary > fields have so much redundancy that it's wasteful not to compress them at > all. But unfortunately, this slowed down some search workloads and we decided > to introduce this option as a way to let users choose the trade-off they want. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #15: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
rmuir commented on pull request #15: URL: https://github.com/apache/lucene/pull/15#issuecomment-832353332 I got the precommit "working" by just disabling a bunch of build checks with corresponding `TODO` in the source code, reducing visibility of some stuff that didn't need to be public, etc. I haven't really looked at the code yet, best to start with the automated checks. Looks to me like removing the old transform impl/tests would really simplify the process too. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #15: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
rmuir commented on pull request #15: URL: https://github.com/apache/lucene/pull/15#issuecomment-832354853 ugh, and i guess that `spotlessApply` really made some of the code ugly, especially comments. maybe we can manually wrap them in a way that the spotless checker still accepts. sorry, was just trying to get thru the gauntlet of all the build checks... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] magibney commented on pull request #15: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
magibney commented on pull request #15: URL: https://github.com/apache/lucene/pull/15#issuecomment-832355590 Yeah; the heavy hand of spotlessApply was the main reason I didn't fuss with getting the precommit checks to pass. I understand if you want to wait for the build checks to pass before digging into this, and would be happy to (as you suggest) work that out manually. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #15: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
rmuir commented on pull request #15: URL: https://github.com/apache/lucene/pull/15#issuecomment-832362947 yes, it is much easier for me to help out if the build and tests are working, I can't really review otherwise because I rarely write java these days. So to suggest something I usually have to test it out locally -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] magibney commented on pull request #15: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
magibney commented on pull request #15: URL: https://github.com/apache/lucene/pull/15#issuecomment-832365501 That makes sense; apologies for the rough state wrt precommit (though fwiw the _tests_ have been my focus, and those should be solid). I'll get precommit passing with any necessary comment formatting handled manually. Unless you suggest otherwise I'll also rip out all the "rollback"-approach stuff (related to the original approach taken in this PR). It was helpful during development to have that as a point of reference, but it ultimately should not be committed, and at this point I'm confident enough in the streaming approach that the "rollback" stuff has probably outlived its usefulness (and it'll be in the commit history if anyone feels a need to crosscheck against it). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #15: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
rmuir commented on pull request #15: URL: https://github.com/apache/lucene/pull/15#issuecomment-832367527 the tests are failing for me locally too. Mostly it seemed to be the previous implementation's test? It does `assertEquals(AnalysisResult a, AnalysisResult b)` but AnalysisResult has no equals()... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
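To illustrate why such an assertion fails: JUnit's `assertEquals(Object, Object)` relies on `equals()`, and a class that does not override it inherits reference identity from `Object`, so two separately built results never compare equal even when their contents match. The sketch below uses `java.util.Objects.equals` in place of JUnit to stay self-contained. Both result classes and their `tokens` field are hypothetical; the real `AnalysisResult` in the pull request may hold different data.

```java
import java.util.List;

/** Hypothetical result type that does not override equals(): compared by identity. */
final class ResultWithoutEquals {
  final List<String> tokens;

  ResultWithoutEquals(List<String> tokens) {
    this.tokens = tokens;
  }
}

/** Same data, but with value-based equals() and a matching hashCode(). */
final class ResultWithEquals {
  final List<String> tokens;

  ResultWithEquals(List<String> tokens) {
    this.tokens = tokens;
  }

  @Override
  public boolean equals(Object other) {
    if (this == other) {
      return true;
    }
    if (!(other instanceof ResultWithEquals)) {
      return false;
    }
    return tokens.equals(((ResultWithEquals) other).tokens);
  }

  @Override
  public int hashCode() {
    return tokens.hashCode();
  }
}
```

Comparing two `ResultWithoutEquals` instances built from the same token list yields `false`, while the `ResultWithEquals` pair yields `true`. An alternative that avoids touching the class is to assert on the individual fields instead.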
[jira] [Updated] (LUCENE-9843) Remove compression option on doc values
[ https://issues.apache.org/jira/browse/LUCENE-9843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-9843: Attachment: LUCENE-9843.patch mods.patch Status: Open (was: Open) [~jdconradson] I played with the patch and found some more code that could be removed now that terms dict compression is no longer conditional. For example we no longer need to write a special code in the metadata to indicate that the terms dict is compressed, terms dict block shift amounts can just be constants, and some {{if (compressed) }} conditionals can go away. I uploaded a new {{LUCENE-9843.patch}} and a smaller {{mods.patch}} just showing what i changed from your patch. > Remove compression option on doc values > --- > > Key: LUCENE-9843 > URL: https://issues.apache.org/jira/browse/LUCENE-9843 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Blocker > Attachments: LUCENE-9843.patch, LUCENE-9843.patch, LUCENE-9843.patch, > mods.patch > > > Options on file formats add complexity and put a big tax on > backward-compatibility testing. I'm the one who introduced it LUCENE-9378 but > I would now like to think about what we can do to remove this option. > For the record, compression was initially introduced because some binary > fields have so much redundancy that it's wasteful not to compress them at > all. But unfortunately, this slowed down some search workloads and we decided > to introduce this option as a way to let users choose the trade-off they want. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] magibney commented on pull request #15: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
magibney commented on pull request #15: URL: https://github.com/apache/lucene/pull/15#issuecomment-832369359 Ah, sorry! yeah, now that you mention it I'm afraid I'm not surprised. I'm going to just remove the previous impl (as you suggested would make things clearer). I think that's the right way to go, and new impl tests should definitely be solid. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9843) Remove compression option on doc values
[ https://issues.apache.org/jira/browse/LUCENE-9843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339374#comment-17339374 ] Robert Muir commented on LUCENE-9843: - looks like we can do the same trick for binary case. remove BinaryEntry's no-longer needed variables and dead code should light up in your IDE. I only looked at the terms dict with my changes. > Remove compression option on doc values > --- > > Key: LUCENE-9843 > URL: https://issues.apache.org/jira/browse/LUCENE-9843 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Blocker > Attachments: LUCENE-9843.patch, LUCENE-9843.patch, LUCENE-9843.patch, > mods.patch > > > Options on file formats add complexity and put a big tax on > backward-compatibility testing. I'm the one who introduced it LUCENE-9378 but > I would now like to think about what we can do to remove this option. > For the record, compression was initially introduced because some binary > fields have so much redundancy that it's wasteful not to compress them at > all. But unfortunately, this slowed down some search workloads and we decided > to introduce this option as a way to let users choose the trade-off they want. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org