[GitHub] [lucene] dweiss commented on a change in pull request #108: LUCENE-9897 Change dependency checking mechanism to use gradle checksum verification

2021-05-04 Thread GitBox


dweiss commented on a change in pull request #108:
URL: https://github.com/apache/lucene/pull/108#discussion_r625549585



##
File path: gradle/validation/jar-checks.gradle
##
@@ -242,62 +206,14 @@ subprojects {
 }
   }
 
-  licenses.dependsOn validateJarChecksums, validateJarLicenses
+  licenses.dependsOn validateJarLicenses
 }
 
 // Add top-project level tasks validating dangling files
 // and regenerating dependency checksums.
 
 configure(project(":lucene")) {
   def validationTasks = subprojects.collectMany { it.tasks.matching { it.name 
== "licenses" } }
-  def jarInfoTasks = subprojects.collectMany { it.tasks.matching { it.name == 
"collectJarInfos" } }
-
-  // Update dependency checksums.
-  task updateLicenses() {

Review comment:
   `check` is just a convention aggregation task, nothing else. We have 
*tons* of other stuff that isn't connected with `check` in the execution graph - 
I bet some of these dependencies/configurations are in plugins and it'd be 
difficult to even hook into them to disable automatic dependency verification.
   
   The work on this attempt isn't lost though (thank you!). Let's keep an eye 
on what happens with Gradle's built-in checksums and retry the attempt when it's 
more flexible.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] tflobbe opened a new pull request #2488: SOLR-15391: Enable 'canUsePoints' for PointFields in Solr

2021-05-04 Thread GitBox


tflobbe opened a new pull request #2488:
URL: https://github.com/apache/lucene-solr/pull/2488


   Just a draft for now, no tests or performance numbers.
   
   For 8.x only





[jira] [Created] (LUCENE-9951) Add an InfoStream to ReplicationService to facilitate debugging

2021-05-04 Thread Christoph Kaser (Jira)
Christoph Kaser created LUCENE-9951:
---

 Summary: Add an InfoStream to ReplicationService to facilitate 
debugging
 Key: LUCENE-9951
 URL: https://issues.apache.org/jira/browse/LUCENE-9951
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/replicator
Reporter: Christoph Kaser


At the moment, when an exception occurs during replication, the 
ReplicationService tries to serialize it and send it to the client, which then 
reports it.

This does not work when the exception occurs after the first part of the 
response has already been sent, or if there was a network error. In these 
cases, the exception is silently ignored (on the server side), and the client 
side will report a TruncatedChunkException, making it hard to find the exact 
cause of the problem.

I propose to add an InfoStream to the ReplicationService (analogous to the 
ReplicationClient) which will log requests and errors that are sent back to the 
client.

I will provide a PR for this issue.
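
The kind of server-side message proposed above could be assembled roughly as follows. This is a minimal sketch in plain Java, not the patch itself: `Sink`, `COMPONENT`, `describeFailure` and `logFailure` are hypothetical names standing in for an InfoStream-like logger, and the request URI and query string are passed as plain strings so the example needs neither the servlet API nor Lucene.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

class ReplicationLogSketch {
  /** Hypothetical logging sink, loosely modeled on the idea of an InfoStream. */
  interface Sink {
    boolean isEnabled(String component);

    void message(String component, String message);
  }

  /** Hypothetical component name used to filter messages. */
  static final String COMPONENT = "ReplicationService";

  /** Builds one log message from the request location and the full stack trace. */
  static String describeFailure(String requestUri, String queryString, Throwable error) {
    StringWriter sw = new StringWriter();
    sw.append("an error occurred during replication service call (");
    sw.append(requestUri);
    if (queryString != null) {
      sw.append('?').append(queryString);
    }
    sw.append("): ");
    PrintWriter pw = new PrintWriter(sw);
    error.printStackTrace(pw);
    pw.flush();
    return sw.toString();
  }

  /** Logs on the server side only; nothing here is sent back to the client. */
  static void logFailure(Sink sink, String requestUri, String queryString, Throwable error) {
    if (sink.isEnabled(COMPONENT)) {
      sink.message(COMPONENT, describeFailure(requestUri, queryString, error));
    }
  }
}
```

Checking `isEnabled` before building the message keeps the cost of formatting the stack trace out of the request path when logging is switched off.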



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] ChristophKaser opened a new pull request #124: LUCENE-9951: Add InfoStream to ReplicationService

2021-05-04 Thread GitBox


ChristophKaser opened a new pull request #124:
URL: https://github.com/apache/lucene/pull/124


   An InfoStream is added to the ReplicationService (similar to the 
ReplicationClient) to allow debugging replication issues
   
   
   
   
   # Description
   
   Adds InfoStream to ReplicationService to facilitate debugging
   
   # Solution
   
   
   # Tests
   
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to 
Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code 
conforms to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request 
title.
   - [x] I have given Lucene maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `main` branch.
   - [ ] I have run `./gradlew check`.
   - [ ] I have added tests for my changes.
   





[jira] [Commented] (LUCENE-9334) Require consistency between data-structures on a per-field basis

2021-05-04 Thread Ignacio Vera (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338845#comment-17338845
 ] 

Ignacio Vera commented on LUCENE-9334:
--

I had a look and I see the problem with the test. We need to add an 
IndexWriterConfig with SerialMergeScheduler in order to reproduce the failures: 

{code}
IndexWriterConfig iwc = newIndexWriterConfig();
// Else seeds may not reproduce:
iwc.setMergeScheduler(new SerialMergeScheduler());
{code}

Adding that, the following seed reproduces the failure:

{code}
 ./gradlew cleanTest test --tests TestPerFieldConsistency 
-Dtests.seed=C40258ABF5E76DCB -Dtests.multiplier=3 -Dtests.slow=true 
-Dtests.locale=te-IN -Dtests.timezone=SystemV/CST6CDT -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8
{code}



> Require consistency between data-structures on a per-field basis
> 
>
> Key: LUCENE-9334
> URL: https://issues.apache.org/jira/browse/LUCENE-9334
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Blocker
> Fix For: main (9.0)
>
>  Time Spent: 14.5h
>  Remaining Estimate: 0h
>
> Follow-up of 
> https://lists.apache.org/thread.html/r747de568afd7502008c45783b74cc3aeb31dab8aa60fcafaf65d5431%40%3Cdev.lucene.apache.org%3E.
> We would like to start requiring consistency across data-structures on a 
> per-field basis in order to make it easier to do the right thing by default: 
> range queries can run faster if doc values are enabled, sorted queries can 
> run faster if points are indexed, etc.
> This would be a big change, so it should be rolled out in a major.
> Strict validation is tricky to implement, but we should still implement 
> best-effort validation:
>  - Documents all use the same data-structures, e.g. it is illegal for a 
> document to only enable points and another document to only enable doc values,
>  - When possible, check whether values are consistent too.
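
The first best-effort rule in the description above can be illustrated with a small standalone check. This is only a sketch under stated assumptions: `FieldSchemaSketch`, `Structure` and `checkField` are hypothetical names, the set of structures is a made-up subset, and none of this is the validation code that went into Lucene.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class FieldSchemaSketch {
  /** Hypothetical subset of the per-field data structures a document may enable. */
  enum Structure {
    POINTS,
    DOC_VALUES,
    INVERTED_INDEX
  }

  /** Structures recorded for each field by the first document that used it. */
  private final Map<String, Set<Structure>> schema = new HashMap<>();

  /**
   * Records the structures enabled for a field, or rejects the document if an earlier document
   * enabled a different set for the same field.
   */
  void checkField(String field, Set<Structure> enabled) {
    Set<Structure> copy = EnumSet.noneOf(Structure.class);
    copy.addAll(enabled);
    Set<Structure> previous = schema.putIfAbsent(field, copy);
    if (previous != null && !previous.equals(copy)) {
      throw new IllegalArgumentException(
          "field '" + field + "' was indexed with " + previous
              + " but this document uses " + copy);
    }
  }
}
```

With this sketch, a document that enables only points for a field is rejected once another document has enabled points and doc values for it, which is the situation the description calls illegal.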






[GitHub] [lucene] muse-dev[bot] commented on a change in pull request #124: LUCENE-9951: Add InfoStream to ReplicationService

2021-05-04 Thread GitBox


muse-dev[bot] commented on a change in pull request #124:
URL: https://github.com/apache/lucene/pull/124#discussion_r625612737



##
File path: 
lucene/replicator/src/java/org/apache/lucene/replicator/http/ReplicationService.java
##
@@ -183,6 +203,17 @@ public void perform(HttpServletRequest req, 
HttpServletResponse resp)
   break;
   }
 } catch (Exception e) {
+  if (infoStream.isEnabled(INFO_STREAM_COMPONENT)) {
+final StringWriter sw = new StringWriter();
+sw.append("an error occurred during replication service call (");
+sw.append(req.getRequestURI());
+if (req.getQueryString() != null) {
+  sw.append('?').append(req.getQueryString());
+}
+sw.append("): ");
+e.printStackTrace(new PrintWriter(sw));

Review comment:
   *INFORMATION_EXPOSURE_THROUGH_AN_ERROR_MESSAGE:*  Possible information 
exposure through an error message 
[(details)](https://find-sec-bugs.github.io/bugs.htm#INFORMATION_EXPOSURE_THROUGH_AN_ERROR_MESSAGE)
   (at-me [in a reply](https://docs.muse.dev/docs/talk-to-muse/) with `help` or 
`ignore`)







[jira] [Commented] (LUCENE-9334) Require consistency between data-structures on a per-field basis

2021-05-04 Thread Ignacio Vera (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338864#comment-17338864
 ] 

Ignacio Vera commented on LUCENE-9334:
--

The test assumes there will be no merges in the background, which is not true. 
Maybe an easy fix is to disable merges:
{code:java}
iwc.setMergePolicy(NoMergePolicy.INSTANCE);
{code}




[GitHub] [lucene] muse-dev[bot] commented on a change in pull request #124: LUCENE-9951: Add InfoStream to ReplicationService

2021-05-04 Thread GitBox


muse-dev[bot] commented on a change in pull request #124:
URL: https://github.com/apache/lucene/pull/124#discussion_r625635306



##
File path: 
lucene/replicator/src/java/org/apache/lucene/replicator/http/ReplicationService.java
##
@@ -183,6 +203,17 @@ public void perform(HttpServletRequest req, 
HttpServletResponse resp)
   break;
   }
 } catch (Exception e) {
+  if (infoStream.isEnabled(INFO_STREAM_COMPONENT)) {
+final StringWriter sw = new StringWriter();
+sw.append("an error occurred during replication service call (");
+sw.append(req.getRequestURI());
+if (req.getQueryString() != null) {
+  sw.append('?').append(req.getQueryString());
+}
+sw.append("): ");
+e.printStackTrace(new PrintWriter(sw));

Review comment:
   I've recorded this as ignored for this pull request. If you change your 
mind, just comment `@muse-dev unignore`.







[GitHub] [lucene] ChristophKaser commented on a change in pull request #124: LUCENE-9951: Add InfoStream to ReplicationService

2021-05-04 Thread GitBox


ChristophKaser commented on a change in pull request #124:
URL: https://github.com/apache/lucene/pull/124#discussion_r625635261



##
File path: 
lucene/replicator/src/java/org/apache/lucene/replicator/http/ReplicationService.java
##
@@ -183,6 +203,17 @@ public void perform(HttpServletRequest req, 
HttpServletResponse resp)
   break;
   }
 } catch (Exception e) {
+  if (infoStream.isEnabled(INFO_STREAM_COMPONENT)) {
+final StringWriter sw = new StringWriter();
+sw.append("an error occurred during replication service call (");
+sw.append(req.getRequestURI());
+if (req.getQueryString() != null) {
+  sw.append('?').append(req.getQueryString());
+}
+sw.append("): ");
+e.printStackTrace(new PrintWriter(sw));

Review comment:
   There is no sensitive information in a replication request
   @muse-dev ignore







[jira] [Commented] (LUCENE-9334) Require consistency between data-structures on a per-field basis

2021-05-04 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338891#comment-17338891
 ] 

Dawid Weiss commented on LUCENE-9334:
-

I think that's a good first step - I don't know this patch. [~mayyas] may have 
a better insight.




[GitHub] [lucene] mayya-sharipova opened a new pull request #125: Fix occasional failures in TestPerFieldConsistency

2021-05-04 Thread GitBox


mayya-sharipova opened a new pull request #125:
URL: https://github.com/apache/lucene/pull/125


   This test assumes that there is no merging,
   and was failing when there were merges.
   This fixes the test by setting NoMergePolicy for
   IndexWriter.
   
   Relates to #11





[jira] [Commented] (LUCENE-9334) Require consistency between data-structures on a per-field basis

2021-05-04 Thread Mayya Sharipova (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338908#comment-17338908
 ] 

Mayya Sharipova commented on LUCENE-9334:
-

[~dweiss] Thanks for raising the failure, and thanks [~ivera] for 
investigation. Indeed the test assumes no merging. I've created a fix in 
https://github.com/apache/lucene/pull/125, and will merge it today.




[jira] [Comment Edited] (LUCENE-9334) Require consistency between data-structures on a per-field basis

2021-05-04 Thread Mayya Sharipova (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338908#comment-17338908
 ] 

Mayya Sharipova edited comment on LUCENE-9334 at 5/4/21, 10:50 AM:
---

[~dweiss] Thanks for raising the failure, and thanks [~ivera] for 
investigation. [~ivera] Indeed, the test assumes no merging. I've created a fix 
in [PR|https://github.com/apache/lucene/pull/125], and will merge it today.


was (Author: mayyas):
[~dweiss] Thanks for raising the failure, and thanks [~ivera] for 
investigation. Indeed the test assumes no merging. I've created a fix in 
https://github.com/apache/lucene/pull/125, and will merge it today.




[GitHub] [lucene] msokolov commented on pull request #114: LUCENE-9905: PerFieldVectorFormat

2021-05-04 Thread GitBox


msokolov commented on pull request #114:
URL: https://github.com/apache/lucene/pull/114#issuecomment-831919682


   No comments here, it seems. Anyway, we're really just moving the deck chairs 
around to be more future-extensible; I'll take the silence as consensus and 
merge later today.





[GitHub] [lucene] mayya-sharipova merged pull request #125: Fix occasional failures in TestPerFieldConsistency

2021-05-04 Thread GitBox


mayya-sharipova merged pull request #125:
URL: https://github.com/apache/lucene/pull/125


   





[jira] [Commented] (LUCENE-9334) Require consistency between data-structures on a per-field basis

2021-05-04 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339005#comment-17339005
 ] 

ASF subversion and git services commented on LUCENE-9334:
-

Commit b5a77de5126c36582a1beb0fc763b47745d46417 in lucene's branch 
refs/heads/main from Mayya Sharipova
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=b5a77de ]

Fix failures in TestPerFieldConsistency (#125)

This test assumes that there is no merging,
and was failing when there were merges.
This fixes the test by setting NoMergePolicy for
IndexWriter.

Relates to LUCENE-9334
Relates to #11




[GitHub] [lucene] rmuir commented on a change in pull request #114: LUCENE-9905: PerFieldVectorFormat

2021-05-04 Thread GitBox


rmuir commented on a change in pull request #114:
URL: https://github.com/apache/lucene/pull/114#discussion_r625823163



##
File path: 
lucene/core/src/resources/META-INF/services/org.apache.lucene.codecs.VectorFormat
##
@@ -0,0 +1,33 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+#  Licensed to the Apache Software Foundation (ASF) under one or more

Review comment:
   all these SPI files seem to have double copyrights







[GitHub] [lucene] rmuir commented on pull request #114: LUCENE-9905: PerFieldVectorFormat

2021-05-04 Thread GitBox


rmuir commented on pull request #114:
URL: https://github.com/apache/lucene/pull/114#issuecomment-831983465


   Is the plan to do a separate follow-up to break out euclidean and dot product 
into a codec parameter and remove them from FieldInfo? As these are HNSW-specific 
parameters, they really belong in that codec rather than in FieldInfo.





[GitHub] [lucene] jpountz commented on a change in pull request #114: LUCENE-9905: PerFieldVectorFormat

2021-05-04 Thread GitBox


jpountz commented on a change in pull request #114:
URL: https://github.com/apache/lucene/pull/114#discussion_r625821205



##
File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorFormat.java
##
@@ -77,7 +77,9 @@
   static final int VERSION_CURRENT = VERSION_START;
 
   /** Sole constructor */
-  public Lucene90HnswVectorFormat() {}
+  public Lucene90HnswVectorFormat() {
+super("Lucene90VectorFormat");

Review comment:
   historically we've used the class name as a format name, should we use
   ```suggestion
   super("Lucene90HnswVectorFormat");
   ```
   ?
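
The naming convention matters because, per the PerFieldVectorFormat Javadoc quoted in this review, the format name is written into the index and must resolve back to an implementation when the field is read. A minimal sketch of that idea, assuming a plain in-memory map in place of the `ServiceLoader`-based lookup; `NamedFormat`, `HnswFormat`, `register` and `forName` here are hypothetical stand-ins, not Lucene classes:

```java
import java.util.HashMap;
import java.util.Map;

class FormatNameSketch {
  /** Hypothetical stand-in for a format that is identified by a name. */
  abstract static class NamedFormat {
    private final String name;

    NamedFormat(String name) {
      this.name = name;
    }

    String getName() {
      return name;
    }
  }

  /** Uses its own simple class name as the format name, following the convention. */
  static final class HnswFormat extends NamedFormat {
    HnswFormat() {
      super(HnswFormat.class.getSimpleName());
    }
  }

  /** Hypothetical registry; a real lookup would go through ServiceLoader instead. */
  private static final Map<String, NamedFormat> REGISTRY = new HashMap<>();

  static void register(NamedFormat format) {
    REGISTRY.put(format.getName(), format);
  }

  /** Resolves a name read back from an index; fails loudly if nothing is registered. */
  static NamedFormat forName(String name) {
    NamedFormat format = REGISTRY.get(name);
    if (format == null) {
      throw new IllegalArgumentException("no format registered under '" + name + "'");
    }
    return format;
  }
}
```

Deriving the name from the class keeps the two from drifting apart, so a reader who sees a name in an index can find the implementing class directly.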

##
File path: 
lucene/core/src/java/org/apache/lucene/codecs/perfield/PerFieldVectorFormat.java
##
@@ -0,0 +1,293 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.perfield;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.ServiceLoader;
+import java.util.TreeMap;
+import org.apache.lucene.codecs.VectorFormat;
+import org.apache.lucene.codecs.VectorReader;
+import org.apache.lucene.codecs.VectorWriter;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.SegmentReadState;
+import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.index.VectorValues;
+import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.util.IOUtils;
+
+/**
+ * Enables per field numeric vector support.
+ *
+ * Note, when extending this class, the name ({@link #getName}) is written 
into the index. In
+ * order for the field to be read, the name must resolve to your 
implementation via {@link
+ * #forName(String)}. This method uses Java's {@link ServiceLoader Service 
Provider Interface} to
+ * resolve format names.
+ *
+ * Files written by each numeric vectors format have an additional suffix 
containing the format
+ * name. For example, in a per-field configuration instead of 
_1.dat filenames would
+ * look like _1_Lucene40_0.dat.
+ *
+ * @see ServiceLoader
+ * @lucene.experimental
+ */
+public abstract class PerFieldVectorFormat extends VectorFormat {
+  /** Name of this {@link VectorFormat}. */
+  public static final String PER_FIELD_NAME = "PerFieldVectors90";
+
+  /** {@link FieldInfo} attribute name used to store the format name for each 
field. */
+  public static final String PER_FIELD_FORMAT_KEY =
+  PerFieldVectorFormat.class.getSimpleName() + ".format";
+
+  /** {@link FieldInfo} attribute name used to store the segment suffix name 
for each field. */
+  public static final String PER_FIELD_SUFFIX_KEY =
+  PerFieldVectorFormat.class.getSimpleName() + ".suffix";
+
+  /** Sole constructor. */
+  protected PerFieldVectorFormat() {
+super(PER_FIELD_NAME);
+  }
+
+  @Override
+  public VectorWriter fieldsWriter(SegmentWriteState state) throws IOException 
{
+return new FieldsWriter(state);
+  }
+
+  @Override
+  public VectorReader fieldsReader(SegmentReadState state) throws IOException {
+return new FieldsReader(state);
+  }
+
+  /**
+   * Returns the numeric vector format that should be used for writing new 
segments of field
+   * .
+   *
+   * The field to format mapping is written to the index, so this method is 
only invoked when
+   * writing, not when reading.
+   */
+  public abstract VectorFormat getVectorFormatForField(String field);
+
+  private class FieldsWriter extends VectorWriter {
+private final Map formats;
+private final Map suffixes = new HashMap<>();
+private final SegmentWriteState segmentWriteState;
+
+FieldsWriter(SegmentWriteState segmentWriteState) {
+  this.segmentWriteState = segmentWriteState;
+  formats = new HashMap<>();
+}
+
+@Override
+public void writeField(FieldInfo fieldInfo, VectorValues values) throws 
IOException {
+  getInstance(fieldInfo).writeField(fieldInfo, values);
+}
+
+@Override
+public void finish() throws IOException {
+  for (WriterAndSuffix was : formats.values()) {
+was.writer.finish();
+  }
+}
+
+@Override
+public void close() throws IOException {
+  IOUtils.close(formats.values());
+}
+
+private VectorWriter getInstance

[jira] [Commented] (LUCENE-9843) Remove compression option on doc values

2021-05-04 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339052#comment-17339052
 ] 

Adrien Grand commented on LUCENE-9843:
--

The patch looks good. This makes me wonder whether we should remove the 
threshold that only enables compression on the terms dict for non-tiny 
dictionaries. I believe that it hurts test coverage since our tests rarely 
index many documents, and I'm not sure whether it brings real benefits to our 
users: iterating the terms dict is going to be super fast anyway if you only 
have a few terms?

> Remove compression option on doc values
> ---
>
> Key: LUCENE-9843
> URL: https://issues.apache.org/jira/browse/LUCENE-9843
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Blocker
> Attachments: LUCENE-9843.patch
>
>
> Options on file formats add complexity and put a big tax on 
> backward-compatibility testing. I'm the one who introduced it in LUCENE-9378 but 
> I would now like to think about what we can do to remove this option.
> For the record, compression was initially introduced because some binary 
> fields have so much redundancy that it's wasteful not to compress them at 
> all. But unfortunately, this slowed down some search workloads and we decided 
> to introduce this option as a way to let users choose the trade-off they want.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9843) Remove compression option on doc values

2021-05-04 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339084#comment-17339084
 ] 

Robert Muir commented on LUCENE-9843:
-

+1 let's simplify and have better test coverage.  it does not impact the speed 
for ord lookup in any way.

> Remove compression option on doc values
> ---
>
> Key: LUCENE-9843
> URL: https://issues.apache.org/jira/browse/LUCENE-9843
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Blocker
> Attachments: LUCENE-9843.patch
>
>
> Options on file formats add complexity and put a big tax on 
> backward-compatibility testing. I'm the one who introduced it in LUCENE-9378 but 
> I would now like to think about what we can do to remove this option.
> For the record, compression was initially introduced because some binary 
> fields have so much redundancy that it's wasteful not to compress them at 
> all. But unfortunately, this slowed down some search workloads and we decided 
> to introduce this option as a way to let users choose the trade-off they want.






[jira] [Commented] (LUCENE-9936) update gradle build to support gpg signing of tgz/zip distributions

2021-05-04 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339165#comment-17339165
 ] 

ASF subversion and git services commented on LUCENE-9936:
-

Commit a6cf46dadabfa7f76a645001d5158f818499de8e in lucene's branch 
refs/heads/main from Chris M. Hostetter
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=a6cf46d ]

LUCENE-9936: Add gpg signing of the tgz & zip distribution files


> update gradle build to support gpg signing of tgz/zip distributions
> ---
>
> Key: LUCENE-9936
> URL: https://issues.apache.org/jira/browse/LUCENE-9936
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Chris M. Hostetter
>Assignee: Chris M. Hostetter
>Priority: Major
> Attachments: LUCENE-9936.patch, LUCENE-9936.patch
>
>
> the gradle build does not currently have any support for gpg signing the 
> distributions we produce.
> this is necessary for releases.






[jira] [Resolved] (LUCENE-9936) update gradle build to support gpg signing of tgz/zip distributions

2021-05-04 Thread Chris M. Hostetter (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris M. Hostetter resolved LUCENE-9936.

Fix Version/s: main (9.0)
   Resolution: Fixed

> update gradle build to support gpg signing of tgz/zip distributions
> ---
>
> Key: LUCENE-9936
> URL: https://issues.apache.org/jira/browse/LUCENE-9936
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Chris M. Hostetter
>Assignee: Chris M. Hostetter
>Priority: Major
> Fix For: main (9.0)
>
> Attachments: LUCENE-9936.patch, LUCENE-9936.patch
>
>
> the gradle build does not currently have any support for gpg signing the 
> distributions we produce.
> this is necessary for releases.






[GitHub] [lucene] jpountz commented on a change in pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query

2021-05-04 Thread GitBox


jpountz commented on a change in pull request #101:
URL: https://github.com/apache/lucene/pull/101#discussion_r625979068



##
File path: lucene/core/src/java/org/apache/lucene/search/BlockMaxMaxscoreScorer.java
##
@@ -0,0 +1,339 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import static org.apache.lucene.search.ScorerUtil.costWithMinShouldMatch;
+
+import java.io.IOException;
+import java.util.*;
+
+/** Scorer implementing Block-Max Maxscore algorithm */
+public class BlockMaxMaxscoreScorer extends Scorer {
+  private final ScoreMode scoreMode;
+  private final int scalingFactor;
+
+  // current doc ID of the leads
+  private int doc;
+
+  // doc id boundary that all scorers maxScore are valid
+  private int upTo = -1;
+
+  // heap of scorers ordered by doc ID
+  private final DisiPriorityQueue essentialsScorers;
+
+  // list of scorers whose sum of maxScore is less than minCompetitiveScore, ordered by maxScore
+  private final List<DisiWrapper> nonEssentialScorers;
+
+  // sum of max scores of scorers in nonEssentialScorers list
+  private long nonEssentialMaxScoreSum;
+
+  // sum of score of scorers in essentialScorers list that are positioned on matching doc
+  private long matchedDocScoreSum;
+
+  private long cost;
+
+  private final MaxScoreSumPropagator maxScoreSumPropagator;
+
+  private final List<Scorer> scorers;
+
+  // scaled min competitive score
+  private long minCompetitiveScore = 0;
+
+  /**
+   * Constructs a Scorer
+   *
+   * @param weight The weight to be used.
+   * @param scorers The sub scorers this Scorer should iterate on for optional clauses
+   * @param scoreMode The scoreMode
+   */
+  public BlockMaxMaxscoreScorer(Weight weight, List<Scorer> scorers, ScoreMode scoreMode)
+  throws IOException {
+super(weight);
+assert scoreMode == ScoreMode.TOP_SCORES;
+
+this.scoreMode = scoreMode;
+this.doc = -1;
+this.scorers = scorers;
+this.cost =
+costWithMinShouldMatch(
+scorers.stream().map(Scorer::iterator).mapToLong(DocIdSetIterator::cost),
+scorers.size(),
+1);
+
+essentialsScorers = new DisiPriorityQueue(scorers.size());
+nonEssentialScorers = new LinkedList<>();
+
+scalingFactor = WANDScorer.getScalingFactor(scorers);
+maxScoreSumPropagator = new MaxScoreSumPropagator(scorers);
+
+for (Scorer scorer : scorers) {
+  nonEssentialScorers.add(new DisiWrapper(scorer));
+}
+  }
+
+  @Override
+  public DocIdSetIterator iterator() {
+return TwoPhaseIterator.asDocIdSetIterator(twoPhaseIterator());
+  }
+
+  @Override
+  public TwoPhaseIterator twoPhaseIterator() {
+DocIdSetIterator approximation =
+new DocIdSetIterator() {
+  private long lastMinCompetitiveScore;
+
+  @Override
+  public int docID() {
+return doc;
+  }
+
+  @Override
+  public int nextDoc() throws IOException {
+return advance(doc + 1);
+  }
+
+  @Override
+  public int advance(int target) throws IOException {
+doAdvance(target);
+
+while (doc != DocIdSetIterator.NO_MORE_DOCS
+&& nonEssentialMaxScoreSum + matchedDocScoreSum < minCompetitiveScore) {
+  doAdvance(doc + 1);
+}
+
+return doc;
+  }
+
+  private void doAdvance(int target) throws IOException {
+matchedDocScoreSum = 0;
+// Find next smallest doc id that is larger than or equal to target from the essential
+// scorers
+
+// If the next candidate doc id is still within interval boundary,
+if (lastMinCompetitiveScore == minCompetitiveScore && target <= upTo) {
+  while (essentialsScorers.top().doc < target) {
+DisiWrapper w = essentialsScorers.pop();
+w.doc = w.iterator.advance(target);
+essentialsScorers.add(w);

Review comment:
   can you use updateTop instead? It's usually faster than pop+add
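
   To illustrate the suggestion, here is a minimal sketch, assuming a plain binary min-heap of doc IDs. `DocHeap` and its methods are hypothetical stand-ins written for this example, not Lucene's `DisiPriorityQueue`, and "advancing" an entry is simplified to setting it to `target`. The pop+add variant pays for one sift-down and one sift-up per advanced entry, while the updateTop variant pays for a single sift-down.

```java
/** Hypothetical minimal min-heap of doc IDs; not Lucene's DisiPriorityQueue. */
final class DocHeap {
  private final int[] heap;
  private int size;

  DocHeap(int capacity) {
    heap = new int[capacity];
  }

  /** Smallest doc ID; only meaningful while the heap is non-empty. */
  int top() {
    return heap[0];
  }

  void add(int doc) {
    heap[size] = doc;
    siftUp(size++);
  }

  int pop() {
    int result = heap[0];
    heap[0] = heap[--size];
    siftDown(0);
    return result;
  }

  /** Replaces the top in place and restores heap order with one sift-down. */
  void updateTop(int newDoc) {
    heap[0] = newDoc;
    siftDown(0);
  }

  private void siftUp(int i) {
    while (i > 0) {
      int parent = (i - 1) >>> 1;
      if (heap[parent] <= heap[i]) {
        break;
      }
      swap(parent, i);
      i = parent;
    }
  }

  private void siftDown(int i) {
    while (true) {
      int left = 2 * i + 1;
      int right = left + 1;
      int smallest = i;
      if (left < size && heap[left] < heap[smallest]) {
        smallest = left;
      }
      if (right < size && heap[right] < heap[smallest]) {
        smallest = right;
      }
      if (smallest == i) {
        return;
      }
      swap(smallest, i);
      i = smallest;
    }
  }

  private void swap(int a, int b) {
    int tmp = heap[a];
    heap[a] = heap[b];
    heap[b] = tmp;
  }

  /** Moves every entry below target up to target using pop followed by add. */
  static int advanceWithPopAdd(DocHeap h, int target) {
    while (h.top() < target) {
      h.pop();
      h.add(target); // stand-in for w.doc = w.iterator.advance(target)
    }
    return h.top();
  }

  /** Same result, but each entry costs a single sift-down. */
  static int advanceWithUpdateTop(DocHeap h, int target) {
    while (h.top() < target) {
      h.updateTop(target);
    }
    return h.top();
  }
}
```

   Both helpers leave the heap in the same logical state; the difference is only in how much heap maintenance is done per entry.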




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the spec

[jira] [Commented] (LUCENE-9946) Support multi-value fields in range facet counting

2021-05-04 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339187#comment-17339187
 ] 

Greg Miller commented on LUCENE-9946:
-

Just as a small update on this, I've hit a little speed bump in my 
implementation due to a bug in my approach I discovered when writing tests. The 
logic for rolling up range counts (in {{LongRangeCounter}}) needs to be 
revisited to support multi-value cases, which is a little non-trivial. A few 
cases to think through:
 # A multi-valued field contributes counts to multiple elementary intervals in 
the segment tree that roll up to different ranges. Each range should get a 
count of {{1}} from the doc. The doc should only contribute {{1}} to 
{{FacetResult#value}}.
 # A multi-valued field contributes counts to multiple elementary intervals in 
the segment tree that roll up to some of the same ranges. Each range should 
receive a count of {{1}} from the doc (need to ensure multiple elementary 
ranges rolling up to the same range don't double-count). The doc should only 
contribute {{1}} to {{FacetResult#value}}.
 # A multi-valued field contributes counts to the same elementary interval in 
the segment tree. The individual ranges that the elementary interval rolls up 
into should all only receive a count of {{1}} from the doc (need to ensure the 
elementary interval doesn't get double counted, contributing > {{1}} to the 
ranges it rolls up to). The doc should only contribute {{1}} to 
{{FacetResult#value}}.

I'll circle back to this in a few days as I have more time to work on it.
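
The counting rule shared by the three cases above can be sketched as follows. This is a naive illustration under stated assumptions, not the segment-tree logic in {{LongRangeCounter}}: the class `MultiValueRangeCountSketch`, its inclusive `mins`/`maxes` arrays, and the `totalCount` field (standing in for what ends up in {{FacetResult#value}}) are all hypothetical names. Each document first collects the set of ranges it hits, then bumps every hit range once and the total once.

```java
import java.util.BitSet;

/** Naive per-document range counting; hypothetical, not Lucene's LongRangeCounter. */
final class MultiValueRangeCountSketch {
  private final long[] mins; // inclusive lower bounds, one per range
  private final long[] maxes; // inclusive upper bounds, one per range
  private final BitSet matched; // ranges hit by the current doc, reused across docs
  final int[] counts; // per-range document counts
  int totalCount; // documents that matched at least one range

  MultiValueRangeCountSketch(long[] mins, long[] maxes) {
    this.mins = mins;
    this.maxes = maxes;
    this.matched = new BitSet(mins.length);
    this.counts = new int[mins.length];
  }

  /** Counts one document that carries any number of values for the field. */
  void addDoc(long... values) {
    matched.clear();
    for (long v : values) {
      for (int r = 0; r < mins.length; r++) {
        if (v >= mins[r] && v <= maxes[r]) {
          matched.set(r); // setting an already-set bit is a no-op, so no double count
        }
      }
    }
    // Each matched range receives exactly 1 from this doc.
    for (int r = matched.nextSetBit(0); r >= 0; r = matched.nextSetBit(r + 1)) {
      counts[r]++;
    }
    // The doc contributes at most 1 to the overall total.
    if (!matched.isEmpty()) {
      totalCount++;
    }
  }
}
```

This costs O(values x ranges) per document, so it is only a reference for the expected counts, for example as an oracle in a randomized test, rather than a replacement for the roll-up logic.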

> Support multi-value fields in range facet counting
> --
>
> Key: LUCENE-9946
> URL: https://issues.apache.org/jira/browse/LUCENE-9946
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: main (9.0)
>Reporter: Greg Miller
>Priority: Minor
>
> The {{RangeFacetCounts}} implementations ({{LongRangeFacetCounts}} and 
> {{DoubleRangeFacetCounts}}) only work on single-valued fields today. In 
> contrast, the more recently added {{LongValueFacetCounts}} implementation 
> supports both single- and multi-valued fields (LUCENE-7927). I'd like to 
> extend multi-value support to both of the {{RangeFacetCounts}} 
> implementations as well.
> Looking through the implementations, I can't think of a good reason to _not_ 
> support this, but maybe I'm overlooking something?






[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning

2021-05-04 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339188#comment-17339188
 ] 

Adrien Grand commented on LUCENE-9335:
--

Thanks for writing two scorers to test this out! Would you be able to run 
queries under a profiler to see where your new scorers are spending most time? 
This might help identify how we could make them faster.

Also thanks for testing with more queries, FWIW it would be good enough to only 
add 4-5 new queries to the tasks file to play with the change. By the way I'd 
be curious to see how your new scorers perform with 5 "Med" terms, which should 
be a worst-case scenario for BMW as all terms should have similar max scores. 
Since the queries you ran have a "Low" term, I wonder whether this term drives 
iteration, which prevents BMM from showing the lower overhead it has compared 
to BMW.

> Add a bulk scorer for disjunctions that does dynamic pruning
> 
>
> Key: LUCENE-9335
> URL: https://issues.apache.org/jira/browse/LUCENE-9335
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: wikimedium.10M.nostopwords.tasks
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Lucene often gets benchmarked against other engines, e.g. against Tantivy and 
> PISA at [https://tantivy-search.github.io/bench/] or against research 
> prototypes in Table 1 of 
> [https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf].
>  Given that top-level disjunctions of term queries are commonly used for 
> benchmarking, it would be nice to optimize this case a bit more, I suspect 
> that we could make fewer per-document decisions by implementing a BulkScorer 
> instead of a Scorer.






[jira] [Updated] (LUCENE-9843) Remove compression option on doc values

2021-05-04 Thread Jack Conradson (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Conradson updated LUCENE-9843:
---
Attachment: LUCENE-9843.patch

> Remove compression option on doc values
> ---
>
> Key: LUCENE-9843
> URL: https://issues.apache.org/jira/browse/LUCENE-9843
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Blocker
> Attachments: LUCENE-9843.patch, LUCENE-9843.patch, LUCENE-9843.patch
>
>
> Options on file formats add complexity and put a big tax on 
> backward-compatibility testing. I'm the one who introduced it in LUCENE-9378 but 
> I would now like to think about what we can do to remove this option.
> For the record, compression was initially introduced because some binary 
> fields have so much redundancy that it's wasteful not to compress them at 
> all. But unfortunately, this slowed down some search workloads and we decided 
> to introduce this option as a way to let users choose the trade-off they want.






[jira] [Commented] (LUCENE-9843) Remove compression option on doc values

2021-05-04 Thread Jack Conradson (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339245#comment-17339245
 ] 

Jack Conradson commented on LUCENE-9843:


I have attached a new patch ([^LUCENE-9843.patch]) with the additional change 
of *always* compressing the terms dictionaries. This removes the 
{color:#9876aa}TERMS_DICT_BLOCK_COMPRESSION_THRESHOLD{color} constant and 
removes all the if/else blocks that relate to compression in 
Lucene90DocValuesConsumer#addTermsDict.

> Remove compression option on doc values
> ---
>
> Key: LUCENE-9843
> URL: https://issues.apache.org/jira/browse/LUCENE-9843
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Blocker
> Attachments: LUCENE-9843.patch, LUCENE-9843.patch, LUCENE-9843.patch
>
>
> Options on file formats add complexity and put a big tax on 
> backward-compatibility testing. I'm the one who introduced it in LUCENE-9378 but 
> I would now like to think about what we can do to remove this option.
> For the record, compression was initially introduced because some binary 
> fields have so much redundancy that it's wasteful not to compress them at 
> all. But unfortunately, this slowed down some search workloads and we decided 
> to introduce this option as a way to let users choose the trade-off they want.






[jira] [Comment Edited] (LUCENE-9843) Remove compression option on doc values

2021-05-04 Thread Jack Conradson (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339245#comment-17339245
 ] 

Jack Conradson edited comment on LUCENE-9843 at 5/4/21, 7:18 PM:
-

Thank you [~jpountz]  and [~rcmuir]  for the feedback! I have attached a new 
patch ([^LUCENE-9843.patch]) with the additional change of *always* compressing 
the terms dictionaries. This removes the 
{color:#9876aa}TERMS_DICT_BLOCK_COMPRESSION_THRESHOLD{color} constant and 
removes all the if/else blocks that relate to compression in 
Lucene90DocValuesConsumer#addTermsDict.


was (Author: jdconradson):
I have attached a new patch ([^LUCENE-9843.patch]) with the additional change 
of *always* compressing the terms dictionaries. This removes the 
{color:#9876aa}TERMS_DICT_BLOCK_COMPRESSION_THRESHOLD{color} constant and 
removes all the if/else blocks that relate to compression in 
Lucene90DocValuesConsumer#addTermsDict.

> Remove compression option on doc values
> ---
>
> Key: LUCENE-9843
> URL: https://issues.apache.org/jira/browse/LUCENE-9843
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Blocker
> Attachments: LUCENE-9843.patch, LUCENE-9843.patch, LUCENE-9843.patch
>
>
> Options on file formats add complexity and put a big tax on 
> backward-compatibility testing. I'm the one who introduced it in LUCENE-9378 but 
> I would now like to think about what we can do to remove this option.
> For the record, compression was initially introduced because some binary 
> fields have so much redundancy that it's wasteful not to compress them at 
> all. But unfortunately, this slowed down some search workloads and we decided 
> to introduce this option as a way to let users choose the trade-off they want.






[jira] [Comment Edited] (LUCENE-9843) Remove compression option on doc values

2021-05-04 Thread Jack Conradson (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339245#comment-17339245
 ] 

Jack Conradson edited comment on LUCENE-9843 at 5/4/21, 7:19 PM:
-

Thank you [~jpountz] and [~rcmuir] for the feedback! I have attached a new 
patch ([^LUCENE-9843.patch]) with the additional change of *always* compressing 
the terms dictionaries. This removes the 
{color:#9876aa}TERMS_DICT_BLOCK_COMPRESSION_THRESHOLD{color} constant and 
removes all the if/else blocks that relate to compression in 
Lucene90DocValuesConsumer#addTermsDict.


was (Author: jdconradson):
Thank you [~jpountz]  and [~rcmuir]  for the feedback! I have attached a new 
patch ([^LUCENE-9843.patch]) with the additional change of *always* compressing 
the terms dictionaries. This removes the 
{color:#9876aa}TERMS_DICT_BLOCK_COMPRESSION_THRESHOLD{color} constant and 
removes all the if/else blocks that relate to compression in 
Lucene90DocValuesConsumer#addTermsDict.

> Remove compression option on doc values
> ---
>
> Key: LUCENE-9843
> URL: https://issues.apache.org/jira/browse/LUCENE-9843
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Blocker
> Attachments: LUCENE-9843.patch, LUCENE-9843.patch, LUCENE-9843.patch
>
>
> Options on file formats add complexity and put a big tax on 
> backward-compatibility testing. I'm the one who introduced it in LUCENE-9378 but 
> I would now like to think about what we can do to remove this option.
> For the record, compression was initially introduced because some binary 
> fields have so much redundancy that it's wasteful not to compress them at 
> all. But unfortunately, this slowed down some search workloads and we decided 
> to introduce this option as a way to let users choose the trade-off they want.






[jira] [Updated] (LUCENE-9843) Remove compression option on doc values

2021-05-04 Thread Jack Conradson (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Conradson updated LUCENE-9843:
---
Attachment: (was: LUCENE-9843.patch)

> Remove compression option on doc values
> ---
>
> Key: LUCENE-9843
> URL: https://issues.apache.org/jira/browse/LUCENE-9843
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Blocker
> Attachments: LUCENE-9843.patch, LUCENE-9843.patch
>
>
> Options on file formats add complexity and put a big tax on 
> backward-compatibility testing. I'm the one who introduced it in LUCENE-9378 but 
> I would now like to think about what we can do to remove this option.
> For the record, compression was initially introduced because some binary 
> fields have so much redundancy that it's wasteful not to compress them at 
> all. But unfortunately, this slowed down some search workloads and we decided 
> to introduce this option as a way to let users choose the trade-off they want.






[GitHub] [lucene] rmuir commented on pull request #15: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2021-05-04 Thread GitBox


rmuir commented on pull request #15:
URL: https://github.com/apache/lucene/pull/15#issuecomment-832353332


   I got the precommit "working" by just disabling a bunch of build checks with 
corresponding `TODO` in the source code, reducing visibility of some stuff that 
didn't need to be public, etc.
   
   I haven't really looked at the code yet, best to start with the automated 
checks.
   
   Looks to me like removing the old transform impl/tests would really simplify 
the process too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #15: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2021-05-04 Thread GitBox


rmuir commented on pull request #15:
URL: https://github.com/apache/lucene/pull/15#issuecomment-832354853


   ugh, and I guess that `spotlessApply` really made some of the code ugly, 
especially comments. Maybe we can manually wrap them in a way that the spotless 
checker still accepts. Sorry, was just trying to get through the gauntlet of all 
the build checks...





[GitHub] [lucene] magibney commented on pull request #15: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2021-05-04 Thread GitBox


magibney commented on pull request #15:
URL: https://github.com/apache/lucene/pull/15#issuecomment-832355590


   Yeah; the heavy hand of spotlessApply was the main reason I didn't fuss with 
getting the precommit checks to pass. I understand if you want to wait for the 
build checks to pass before digging into this, and would be happy to (as you 
suggest) work that out manually.





[GitHub] [lucene] rmuir commented on pull request #15: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2021-05-04 Thread GitBox


rmuir commented on pull request #15:
URL: https://github.com/apache/lucene/pull/15#issuecomment-832362947


   yes, it is much easier for me to help out if the build and tests are 
working, I can't really review otherwise because I rarely write java these 
days. So to suggest something I usually have to test it out locally





[GitHub] [lucene] magibney commented on pull request #15: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2021-05-04 Thread GitBox


magibney commented on pull request #15:
URL: https://github.com/apache/lucene/pull/15#issuecomment-832365501


   That makes sense; apologies for the rough state wrt precommit (though fwiw 
the _tests_  have been my focus, and those should be solid). I'll get precommit 
passing with any necessary comment formatting handled manually.
   
   Unless you suggest otherwise I'll also rip out all the "rollback"-approach 
stuff (related to the original approach taken in this PR). It was helpful 
during development to have that as a point of reference, but it ultimately 
should not be committed, and at this point I'm confident enough in the 
streaming approach that the "rollback" stuff has probably outlived its 
usefulness (and it'll be in the commit history if anyone feels a need to 
crosscheck against it).





[GitHub] [lucene] rmuir commented on pull request #15: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2021-05-04 Thread GitBox


rmuir commented on pull request #15:
URL: https://github.com/apache/lucene/pull/15#issuecomment-832367527


   the tests are failing for me locally too. Mostly it seemed to be the previous 
implementation's tests? They call `assertEquals(AnalysisResult a, AnalysisResult 
b)` but AnalysisResult has no equals()...
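
   A minimal sketch of why that comparison fails, assuming a result holder shaped roughly like the one in the test. `PlainResult`, `ValueResult`, and `wouldPassAssertEquals` are hypothetical names made up for this example, and the real `AnalysisResult` may hold different fields. JUnit's `assertEquals(Object, Object)` ends up calling `expected.equals(actual)`, so a class that does not override `equals()` is compared by identity and two separately built results never match.

```java
import java.util.List;
import java.util.Objects;

final class EqualsSketch {
  /** Hypothetical result holder that does not override equals(). */
  static final class PlainResult {
    final List<String> tokens;

    PlainResult(List<String> tokens) {
      this.tokens = tokens;
    }
  }

  /** Same data, but with value-based equals() and hashCode(). */
  static final class ValueResult {
    final List<String> tokens;

    ValueResult(List<String> tokens) {
      this.tokens = tokens;
    }

    @Override
    public boolean equals(Object o) {
      return o instanceof ValueResult && tokens.equals(((ValueResult) o).tokens);
    }

    @Override
    public int hashCode() {
      return tokens.hashCode();
    }
  }

  /** Mirrors the check assertEquals performs: null-safe expected.equals(actual). */
  static boolean wouldPassAssertEquals(Object expected, Object actual) {
    return Objects.equals(expected, actual);
  }
}
```

   With identical token lists, two `PlainResult` instances compare unequal while two `ValueResult` instances compare equal, so either adding `equals()`/`hashCode()` or asserting on the individual fields would make such a test meaningful.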





[jira] [Updated] (LUCENE-9843) Remove compression option on doc values

2021-05-04 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-9843:

Attachment: LUCENE-9843.patch
mods.patch
Status: Open  (was: Open)

[~jdconradson] I played with the patch and found some more code that could be 
removed now that terms dict compression is no longer conditional. 

For example, we no longer need to write a special code in the metadata to 
indicate that the terms dict is compressed, terms dict block shift amounts can 
just be constants, and some {{if (compressed)}} conditionals can go away.

I uploaded a new {{LUCENE-9843.patch}} and a smaller {{mods.patch}} just 
showing what i changed from your patch.

> Remove compression option on doc values
> ---
>
> Key: LUCENE-9843
> URL: https://issues.apache.org/jira/browse/LUCENE-9843
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Blocker
> Attachments: LUCENE-9843.patch, LUCENE-9843.patch, LUCENE-9843.patch, 
> mods.patch
>
>
> Options on file formats add complexity and put a big tax on 
> backward-compatibility testing. I'm the one who introduced it in LUCENE-9378 but 
> I would now like to think about what we can do to remove this option.
> For the record, compression was initially introduced because some binary 
> fields have so much redundancy that it's wasteful not to compress them at 
> all. But unfortunately, this slowed down some search workloads and we decided 
> to introduce this option as a way to let users choose the trade-off they want.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[GitHub] [lucene] magibney commented on pull request #15: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2021-05-04 Thread GitBox


magibney commented on pull request #15:
URL: https://github.com/apache/lucene/pull/15#issuecomment-832369359


   Ah, sorry! Yeah, now that you mention it, I'm afraid I'm not surprised. I'm 
going to just remove the previous impl (as you suggested, to make things 
clearer). I think that's the right way to go, and the new impl's tests should 
definitely be solid.





[jira] [Commented] (LUCENE-9843) Remove compression option on doc values

2021-05-04 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339374#comment-17339374
 ] 

Robert Muir commented on LUCENE-9843:
-

Looks like we can do the same trick for the binary case. Remove BinaryEntry's 
no-longer-needed variables and dead code should light up in your IDE.

I only looked at the terms dict with my changes.



