[jira] [Created] (LUCENE-10543) Achieve contribution workflow perfection (with progress)
Robert Muir created LUCENE-10543:

Summary: Achieve contribution workflow perfection (with progress)
Key: LUCENE-10543
URL: https://issues.apache.org/jira/browse/LUCENE-10543
Project: Lucene - Core
Issue Type: Task
Reporter: Robert Muir

Inspired by Dawid's build issue, which has worked out well for us: LUCENE-9871. He hasn't even linked 10% of the issues/subtasks involved in that work, but we know.

I think we need a similar approach for the contribution workflow. There have been some major improvements recently; a couple that come to mind:

* Tomoko made a CONTRIBUTING.md file, which GitHub recognizes and which is far better than the wiki material.
* Some hazards/error messages/mazes in the build process and so on have been fixed.

But there is more to do, in my opinion. Here are three ideas:

* Creating a PR still presents a massive checklist template. Now that this template links to CONTRIBUTING.md, why include the rest of the checklist? Isn't it enough to link to CONTRIBUTING.md and fix that file as needed?
* Creating a PR still requires signing up for Apache JIRA and creating a JIRA issue. There is zero value in this additional process. We often end up with JIRAs and/or PRs that have no content, or conflicting/outdated content. This is an unnecessary dance; can we use GitHub issues instead?
* I haven't dug into the GitHub Actions or configs very deeply. Maybe there is simple stuff we can do, such as posting useful notifications when checks fail and guiding the user to run ./gradlew check and fix the problem. It is tedious to review, look at logs, and manually add comments to do this.

So let's have an issue to improve this area.
[jira] [Commented] (LUCENE-10543) Achieve contribution workflow perfection (with progress)
[ https://issues.apache.org/jira/browse/LUCENE-10543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529236#comment-17529236 ]
Robert Muir commented on LUCENE-10543:

Another idea is to use GitHub's wiki functionality instead of the Confluence wiki we have today. The Confluence content is badly out of date, and almost nobody has permissions to edit it. Buried somewhere are instructions to ping the mailing list and ask for permissions, and occasionally someone does, but come on. I think using the GitHub wiki would help it stay up to date.
[jira] [Commented] (LUCENE-10543) Achieve contribution workflow perfection (with progress)
[ https://issues.apache.org/jira/browse/LUCENE-10543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529240#comment-17529240 ]
Robert Muir commented on LUCENE-10543:

Another idea: add a simple "fork me on GitHub" link to the website. Just look at our website: getting to the sources is not the easiest task. I have to evaluate a lot of open-source software myself and always appreciate a quick link to the sources; let's not make this difficult.
[jira] [Commented] (LUCENE-10543) Achieve contribution workflow perfection (with progress)
[ https://issues.apache.org/jira/browse/LUCENE-10543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529245#comment-17529245 ]
Robert Muir commented on LUCENE-10543:

Believe it or not, there is actually no link to the source code anywhere, only command-line instructions to run {{git clone https://github.com/apache/lucene.git}}. Yet you can contribute changes to things like documentation straight from the GitHub UI, with no command line at all. Linking to the sources appropriately (especially given that the repo has moved from lucene-solr to lucene) seems like priority number one.
[jira] [Commented] (LUCENE-10543) Achieve contribution workflow perfection (with progress)
[ https://issues.apache.org/jira/browse/LUCENE-10543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529255#comment-17529255 ]
Dawid Weiss commented on LUCENE-10543:

("with progress"... yeah, that's why LUCENE-9871 is still open :) )
[jira] [Commented] (LUCENE-10541) What to do about massive terms in our Wikipedia EN LineFileDocs?
[ https://issues.apache.org/jira/browse/LUCENE-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529261#comment-17529261 ]
Dawid Weiss commented on LUCENE-10541:

Filed a PR at https://github.com/apache/lucene/pull/850. I picked the default from CharTokenizer.DEFAULT_MAX_WORD_LEN, although I can't reference that constant directly (it isn't accessible from the test framework). I had to tweak the defaults in one or two failing tests that expected the tokenizer to return longer tokens, so a second set of eyes would be good.

The enwiki line file contains 2 million lines. It'd be nice to calculate the probability of any of the k faulty (long-term) lines being drawn in n tries and distribute it over time; this would address Mike's question about why it took so long to discover them. :)

> What to do about massive terms in our Wikipedia EN LineFileDocs?
>
> Key: LUCENE-10541
> URL: https://issues.apache.org/jira/browse/LUCENE-10541
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Priority: Major
>
> Spinoff from this fun build failure that [~dweiss] root caused:
> [https://lucene.markmail.org/thread/pculfuazll4oebra]
>
> Thank you and sorry [~dweiss]!!
>
> This test failure happened because the test case randomly indexed a chunk of the nightly (many GBs) LineFileDocs Wikipedia file that had a massive (> IW's ~32 KB limit) term, and IW threw an {{IllegalArgumentException}}, failing the test.
>
> It's crazy that it took so long for Lucene's randomized tests to discover this too-massive term in Lucene's nightly benchmarks. It's like searching for Nessie, or [SETI|https://en.wikipedia.org/wiki/Search_for_extraterrestrial_intelligence].
>
> We need to prevent such false failures, somehow, and there are multiple options: fix this test to not use {{LineFileDocs}}, remove all "massive" terms from all tests (nightly and git) {{LineFileDocs}}, or fix {{MockTokenizer}} to trim such ridiculous terms (I think this is the best option?), ...
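For the probability Dawid mentions, a rough model helps (hedged: k and the per-run sample size n below are invented illustrative numbers, not measurements). If a test run draws n lines uniformly at random from the N = 2,000,000-line file, of which k are faulty, then

$$ P(\text{hit in one run}) \;=\; 1 - \Bigl(1 - \frac{k}{N}\Bigr)^{n} \;\approx\; 1 - e^{-kn/N} $$

For example, with k = 5 and n = 1000, that is roughly $1 - e^{-0.0025} \approx 0.25\%$ per run, i.e. an expectation of about 400 runs before the first failure, which would be consistent with Mike's observation that it took so long to surface.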
[jira] [Commented] (LUCENE-9871) Achieve build system perfection (with progress)
[ https://issues.apache.org/jira/browse/LUCENE-9871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529263#comment-17529263 ]
Robert Muir commented on LUCENE-9871:

Just a reminder: I haven't forgotten about the Java version constraints here. We've added special logic to make failures clearer, but there are version numbers everywhere. Ideally we could consolidate a lot of them into a simple .properties file containing the min/max major version numbers, which could then be consumed by:

* gradle logic
* Java logic, such as the checks done in WrapperDownloader
* bash logic, such as the error messaging in ./gradlew.sh
* the Python smoketester logic?

> Achieve build system perfection (with progress)
>
> Key: LUCENE-9871
> URL: https://issues.apache.org/jira/browse/LUCENE-9871
> Project: Lucene - Core
> Issue Type: Improvement
> Affects Versions: 9.0
> Reporter: Dawid Weiss
> Assignee: Dawid Weiss
> Priority: Minor
> Time Spent: 20m
> Remaining Estimate: 0h
>
> This issue is an aggregate of various build-related improvements I have in the back of my mind. The current state is not bad... but it's not perfect. Hello, [~mikemccand]. ;)
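A minimal sketch of that idea (hedged: the file name build-options.properties and the key names below are invented for illustration; no such file exists in the build today). The Java side could look roughly like this WrapperDownloader-style check, and because a .properties file is just key=value lines, gradle, bash, and Python could each parse the same file trivially:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Hypothetical sketch only: file name and keys are made up for illustration.
public class JavaVersionCheck {
  public static void main(String[] args) throws IOException {
    Properties props = new Properties();
    try (InputStream in = Files.newInputStream(Path.of("build-options.properties"))) {
      props.load(in); // plain key=value lines, readable from bash/python too
    }
    int min = Integer.parseInt(props.getProperty("java.version.min"));
    int max = Integer.parseInt(props.getProperty("java.version.max"));
    int current = Runtime.version().feature(); // major version of the running JVM
    if (current < min || current > max) {
      System.err.printf(
          "Java %d is unsupported; this build requires Java %d..%d.%n", current, min, max);
      System.exit(1);
    }
  }
}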
[jira] [Commented] (LUCENE-10543) Achieve contribution workflow perfection (with progress)
[ https://issues.apache.org/jira/browse/LUCENE-10543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529270#comment-17529270 ]
Robert Muir commented on LUCENE-10543:

It was entirely too difficult to find this issue! I knew it had "perfection" in the name, too.
[GitHub] [lucene] uschindler commented on pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
uschindler commented on PR #850:
URL: https://github.com/apache/lucene/pull/850#issuecomment-877517

Maybe we should set the limit to the maximum term size. That would also allow us to test longer terms.
[jira] [Commented] (LUCENE-10541) What to do about massive terms in our Wikipedia EN LineFileDocs?
[ https://issues.apache.org/jira/browse/LUCENE-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529286#comment-17529286 ]
Uwe Schindler commented on LUCENE-10541:

I think another option might be to use IndexWriter.MAX_TERM_LENGTH as the default. That would let us test the full range and make sure the block tree terms implementation works as expected.
[GitHub] [lucene] uschindler commented on a diff in pull request #851: LUCENE-10088: backport simple text codec suppression and fs handle limit
uschindler commented on code in PR #851:
URL: https://github.com/apache/lucene/pull/851#discussion_r860615393

## lucene/test-framework/src/java/org/apache/lucene/tests/mockfile/HandleLimitFS.java:
@@ -17,16 +17,38 @@
 package org.apache.lucene.tests.mockfile;

 import java.io.IOException;
+import java.lang.annotation.Documented;
+import java.lang.annotation.ElementType;
+import java.lang.annotation.Inherited;
+import java.lang.annotation.Retention;
+import java.lang.annotation.RetentionPolicy;
+import java.lang.annotation.Target;
 import java.nio.file.FileSystem;
 import java.nio.file.FileSystemException;
 import java.nio.file.Path;
 import java.util.concurrent.atomic.AtomicInteger;

-/** FileSystem that throws exception if file handles in use exceeds a specified limit */
+/**
+ * FileSystem that throws exception if file handles in use exceeds a specified limit.
+ *
+ * @see MaxOpenHandles
+ */
 public class HandleLimitFS extends HandleTrackingFS {
   final int limit;
   final AtomicInteger count = new AtomicInteger();

+  /** An annotation */
+  @Documented
+  @Inherited
+  @Retention(RetentionPolicy.RUNTIME)
+  @Target(ElementType.TYPE)
+  public static @interface MaxOpenHandles {
+    // TODO: can we make the default even lower?
+    public static final int MAX_OPEN_FILES = 2048;

Review Comment: Jenkins sets this limit to 2048 at the OS level, too, so tests that use a greater limit together with FSDirectory may fail.
[GitHub] [lucene] uschindler commented on a diff in pull request #851: LUCENE-10088: backport simple text codec suppression and fs handle limit
uschindler commented on code in PR #851 (same HandleLimitFS hunk as above):
URL: https://github.com/apache/lucene/pull/851#discussion_r860615990

Review Comment: At least on macOS.
[GitHub] [lucene] dweiss commented on a diff in pull request #851: LUCENE-10088: backport simple text codec suppression and fs handle limit
dweiss commented on code in PR #851 (same HandleLimitFS hunk as above):
URL: https://github.com/apache/lucene/pull/851#discussion_r860616950

Review Comment: I just backported what's on main, didn't touch it otherwise. :)
[GitHub] [lucene] uschindler commented on a diff in pull request #851: LUCENE-10088: backport simple text codec suppression and fs handle limit
uschindler commented on code in PR #851 (same HandleLimitFS hunk as above):
URL: https://github.com/apache/lucene/pull/851#discussion_r860618388

Review Comment: I know. I can also reply to Robert's mail. I think the macOS Jenkins did not hit that because it does not run nightly tests.
[GitHub] [lucene] rmuir commented on a diff in pull request #851: LUCENE-10088: backport simple text codec suppression and fs handle limit
rmuir commented on code in PR #851 (same HandleLimitFS hunk as above):
URL: https://github.com/apache/lucene/pull/851#discussion_r860619792

Review Comment: Yeah, as mentioned on the dev list, I don't like to see that we added this annotation as an escape mechanism to raise the file count even higher. Personally I would prefer 1024 or smaller. I'm using Linux 5.17.4 with all defaults, and my system shows 1024 in ulimit.
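For context, a hedged usage sketch of the escape mechanism being discussed. The diff above cuts off before the annotation's members, so the limit() element below is an assumption (consistent with how the thread describes per-test overrides), not confirmed code:

import org.apache.lucene.tests.mockfile.HandleLimitFS;
import org.apache.lucene.tests.util.LuceneTestCase;

// Hypothetical: a test that legitimately needs more handles opts in
// explicitly, instead of raising the default for the whole suite.
// The limit() element is assumed; it is not shown in the truncated diff.
@HandleLimitFS.MaxOpenHandles(limit = 4096)
public class TestWithManyOpenFiles extends LuceneTestCase {
  // ... test methods that open many files concurrently ...
}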
[GitHub] [lucene] rmuir commented on a diff in pull request #851: LUCENE-10088: backport simple text codec suppression and fs handle limit
rmuir commented on code in PR #851 (same HandleLimitFS hunk as above):
URL: https://github.com/apache/lucene/pull/851#discussion_r860621619

Review Comment: At the same time, I would prefer we stabilize the tests in Jenkins; we can open follow-up issues for the file-handle stuff. It is just silly: allowing huge values totally defeats the purpose of HandleLimitFS. The whole point of this delegator is to have a limit that works everywhere, to equalize it.
[GitHub] [lucene] uschindler commented on a diff in pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
uschindler commented on code in PR #850:
URL: https://github.com/apache/lucene/pull/850#discussion_r860622081

## lucene/test-framework/src/java/org/apache/lucene/tests/analysis/MockTokenizer.java:
@@ -66,11 +67,11 @@ public class MockTokenizer extends Tokenizer {
    * Limit the default token length to a size that doesn't cause random analyzer failures on
    * unpredictable data like the enwiki data set.
    *
-   * This value defaults to {@code CharTokenizer.DEFAULT_MAX_WORD_LEN}.
+   * This value defaults to {@link IndexWriter#MAX_TERM_LENGTH}.
    *
    * @see "https://issues.apache.org/jira/browse/LUCENE-10541"
    */
-  public static final int DEFAULT_MAX_TOKEN_LENGTH = 255;
+  public static final int DEFAULT_MAX_TOKEN_LENGTH = IndexWriter.MAX_TERM_LENGTH;

Review Comment: After thinking about it, we may need to be careful: I think the maximum term length in IW is in bytes, while the tokenizer counts UTF-16 chars, so it may overflow. @rmuir? Not sure what to do. I like this maximum approach, but it may fail.
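One hedged back-of-the-envelope reading of this overflow concern, assuming MAX_TERM_LENGTH is a UTF-8 byte limit of 32766 (the "~32 KB" limit mentioned in LUCENE-10541): a single UTF-16 code unit can encode to as many as 3 UTF-8 bytes (a BMP character in U+0800..U+FFFF), while a surrogate pair is 2 units for 4 bytes, i.e. only 2 bytes per unit. So in the worst case,

$$ 3 \times \frac{32766}{2} = 49149 > 32766, \qquad \Bigl\lfloor \frac{32766}{3} \Bigr\rfloor = 10922 $$

meaning a cap of MAX_TERM_LENGTH/2 UTF-16 units could still overflow on adversarial all-3-byte input, and MAX_TERM_LENGTH/3 units would be the conservative bound.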
[GitHub] [lucene] uschindler commented on a diff in pull request #851: LUCENE-10088: backport simple text codec suppression and fs handle limit
uschindler commented on code in PR #851 (same HandleLimitFS hunk as above):
URL: https://github.com/apache/lucene/pull/851#discussion_r860623990

Review Comment: On macOS it was lower at the time I set up Jenkins, too, so I raised it.
[GitHub] [lucene] dweiss commented on a diff in pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
dweiss commented on code in PR #850 (same MockTokenizer hunk as above):
URL: https://github.com/apache/lucene/pull/850#discussion_r860625375

Review Comment: We can make it safe: IndexWriter.MAX_TERM_LENGTH/2?
[GitHub] [lucene] rmuir commented on a diff in pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
rmuir commented on code in PR #850 (same MockTokenizer hunk as above):
URL: https://github.com/apache/lucene/pull/850#discussion_r860626938

Review Comment: This tokenizer counts UTF-32 codepoints, I think? Which is another thing entirely :) But 255 is still fine, as 255*4 is far less than IndexWriter's limit. I think it is a good change. *ALL* the "real" analyzers have limits here; they have setters to tweak them, too. Even KeywordTokenizer! So the problem of a tokenizer emitting a huge term is limited to MockTokenizer; let's fix it.
[GitHub] [lucene] rmuir commented on a diff in pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
rmuir commented on code in PR #850 (same MockTokenizer hunk as above):
URL: https://github.com/apache/lucene/pull/850#discussion_r860628445

Review Comment: I would propose 255, yeah, the constant from StandardAnalyzer. It seems to me that most of the other tokenizers steal that default.
[GitHub] [lucene] rmuir commented on a diff in pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
rmuir commented on code in PR #850 (same MockTokenizer hunk as above):
URL: https://github.com/apache/lucene/pull/850#discussion_r860631960

Review Comment:
> We can make it safe: IndexWriter.MAX_TERM_LENGTH/2?

I just checked: currently it indeed counts UTF-16 code units. This tokenizer is quite slow and processes a single codepoint at a time, but the check for this looks at String.length(). Possibly correct or incorrect, but it is a check that should work.
[GitHub] [lucene] rmuir commented on a diff in pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
rmuir commented on code in PR #850 (same MockTokenizer hunk as above):
URL: https://github.com/apache/lucene/pull/850#discussion_r860635303

Review Comment: So yeah, IndexWriter.MAX_TERM_LENGTH/2 should be safe given how this thing currently counts; just watch the off-by-one possibility. Maybe we can reproduce the original failure with the patch. Still, I think MockTokenizer should have a default limit of something reasonable like 255, to prevent having to debug other crazy tests. Sometimes stemmers make terms longer and such; let's not set ourselves up for failure.
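To make the off-by-one trap concrete, here is an illustrative sketch of a codepoint-at-a-time cap (hedged: this is not MockTokenizer's actual loop, just the pattern under discussion). The guard runs before appending, so a surrogate pair (2 UTF-16 units) near the limit truncates the token rather than overflowing it:

// Illustrative only; maxTokenLength counts UTF-16 code units
// (String.length() semantics, as discussed in this thread).
final class TokenCap {
  static String capToken(String input, int maxTokenLength) {
    StringBuilder token = new StringBuilder();
    int i = 0;
    while (i < input.length()) {
      int cp = input.codePointAt(i);       // one codepoint at a time
      int units = Character.charCount(cp); // 1 for BMP, 2 for surrogate pairs
      if (token.length() + units > maxTokenLength) {
        break;                             // truncate instead of exceeding the cap
      }
      token.appendCodePoint(cp);
      i += units;
    }
    return token.toString();
  }
}

For example, capToken("a".repeat(10), 5) returns "aaaaa", and a 4-unit buffer with a limit of 5 rejects a following surrogate pair rather than growing to 6 units.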
[GitHub] [lucene] dweiss commented on a diff in pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
dweiss commented on code in PR #850 (same MockTokenizer hunk as above):
URL: https://github.com/apache/lucene/pull/850#discussion_r860641028

Review Comment: Umm... so which one should it be then? IndexWriter.MAX_TERM_LENGTH/2, IndexWriter.MAX_TERM_LENGTH/2 - 1, or 255? :)
[GitHub] [lucene] mocobeta commented on a diff in pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
mocobeta commented on code in PR #850 (same MockTokenizer hunk as above):
URL: https://github.com/apache/lucene/pull/850#discussion_r860642306

Review Comment: I'm not so knowledgeable about it, but it might be understandable (or reasonable; I can't find the right English word) to respect `StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH`, in other words, 255 here? I mean, people would expect the mock tokenizer to behave similarly to the standard tokenizer by default.
[GitHub] [lucene] uschindler commented on a diff in pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
uschindler commented on code in PR #850 (same MockTokenizer hunk as above):
URL: https://github.com/apache/lucene/pull/850#discussion_r860657767

Review Comment: So it looks like some tests are now using a larger default. Maybe we should change more tests to use MockTokenizer with a larger size and keep the default at 255. If we have other tests checking the boundaries of IndexWriter terms, I am fine. It just came to my mind not to use the "arbitrary" 255 limit and instead use something else here, but the UTF-16 vs UTF-8 problem is a bit problematic. So I am fine with any solution.
[GitHub] [lucene] uschindler commented on a diff in pull request #851: LUCENE-10088: backport simple text codec suppression and fs handle limit
uschindler commented on code in PR #851 (same HandleLimitFS hunk as above):
URL: https://github.com/apache/lucene/pull/851#discussion_r860662889

Review Comment: I changed the limit on the macOS slave to 8192.
[jira] [Commented] (LUCENE-10543) Achieve contribution workflow perfection (with progress)
[ https://issues.apache.org/jira/browse/LUCENE-10543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529333#comment-17529333 ]
Tomoko Uchida commented on LUCENE-10543:

bq. Creating a PR still requires signing up for Apache JIRA and creating a JIRA issue. There is zero value in this additional process. We often end up with JIRAs and/or PRs that have no content, or conflicting/outdated content. This is an unnecessary dance; can we use GitHub issues instead?

Yes, I think it's an obstacle to contributing (in this GitHub era) when developers already have a working patch... As far as I know, the only technical reason to ask contributors to open a JIRA issue is that our CHANGES file (or {{changes2html.pl}}) requires it. (I have no opinion about replacing JIRA with GitHub issues.)
[jira] [Commented] (LUCENE-10543) Achieve contribution workflow perfection (with progress)
[ https://issues.apache.org/jira/browse/LUCENE-10543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529343#comment-17529343 ]
Tomoko Uchida commented on LUCENE-10543:

I just found that {{changes2html.pl}} appears to support GitHub PRs, though the URL it uses is obsolete: https://github.com/apache/lucene/blob/a8d86ea6e8b89ea0324e7f8b6e1d5213254274d5/gradle/documentation/changes-to-html/changes2html.pl#L28
[GitHub] [lucene] jpountz commented on a diff in pull request #842: LUCENE-10518: Relax field consistency check for old indices
jpountz commented on code in PR #842: URL: https://github.com/apache/lucene/pull/842#discussion_r860789641

## lucene/core/src/java/org/apache/lucene/index/FieldInfos.java:
@@ -350,6 +360,11 @@ static final class FieldNumbers {
       this.omitNorms = new HashMap<>();
       this.storeTermVectors = new HashMap<>();
       this.softDeletesFieldName = softDeletesFieldName;
+      this.strictlyConsistent = indexCreatedVersionMajor >= 9;
+    }
+
+    FieldNumbers(String softDeletesFieldName) {

Review Comment: Does this ctor have many call sites? Could we always use the 2-arg ctor?

## lucene/CHANGES.txt:
@@ -94,8 +94,10 @@ Bug Fixes
   no documents instead of throwing an NPE. (Greg Miller)

 * LUCENE-10470: Check if polygon has been successfully tessellated before we fail (we are failing some valid
-  tessellations) and allow filtering edges that fold on top of the previous one. (Ignacio Vera)
-
+  tessellations) and allow filtering edges that fold on top of the previous one. (Ignacio Vera)
+ 
Review Comment: remove the space on this empty line?

## lucene/core/src/java/org/apache/lucene/index/FieldInfos.java:
@@ -178,7 +179,15 @@ public static FieldInfos getMergedFieldInfos(IndexReader reader) {
             .filter(Objects::nonNull)
             .findAny()
             .orElse(null);
-    final Builder builder = new Builder(new FieldNumbers(softDeletesField));
+    final int indexCreatedVersionMajor =
+        leaves.stream()
+            .map(l -> l.reader().getMetaData())
+            .filter(Objects::nonNull)
+            .mapToInt(r -> r.getCreatedVersionMajor())
+            .min()
+            .orElse(Version.LATEST.major);

Review Comment: I'd like it a bit better if we moved to a for-each loop and threw an error if two segments happen to have different versions, since we'd be in trouble in that case.
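For illustration, the for-each variant suggested above could look roughly like this (a sketch, not the actual patch; it assumes the surrounding method's locals and imports, e.g. `leaves`, `LeafReaderContext`, `LeafMetaData`, `Version`):

```java
// Fail fast if two segments report different index-creation major versions,
// instead of silently taking the minimum.
int indexCreatedVersionMajor = -1;
for (LeafReaderContext leaf : leaves) {
  LeafMetaData metaData = leaf.reader().getMetaData();
  if (metaData == null) {
    continue;
  }
  int version = metaData.getCreatedVersionMajor();
  if (indexCreatedVersionMajor == -1) {
    indexCreatedVersionMajor = version;
  } else if (indexCreatedVersionMajor != version) {
    throw new IllegalStateException(
        "segments were created by different major versions: "
            + indexCreatedVersionMajor + " vs " + version);
  }
}
if (indexCreatedVersionMajor == -1) {
  indexCreatedVersionMajor = Version.LATEST.major; // no segments: use latest
}
```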
[GitHub] [lucene] jpountz commented on pull request #780: LUCENE-10496: avoid unnecessary attempts to evaluate skipping doc if index sort and search sort are in opposite direction
jpountz commented on PR #780: URL: https://github.com/apache/lucene/pull/780#issuecomment-1112113792 If you can find an approach that is less intrusive, maybe I would reconsider, but it still looks to me like we're adding complexity to work around a bad user decision, which doesn't feel like the right trade-off to me.
[GitHub] [lucene] wjp719 commented on pull request #780: LUCENE-10496: avoid unnecessary attempts to evaluate skipping doc if index sort and search sort are in opposite direction
wjp719 commented on PR #780: URL: https://github.com/apache/lucene/pull/780#issuecomment-1112123414 @jpountz OK, thanks. Users often need to search in both directions, but the index data can only be sorted in one direction, so this situation may be common. > If you can find an approach that is less intrusive Can you give me any hints on how to achieve this? Thanks.
[GitHub] [lucene] jpountz commented on pull request #780: LUCENE-10496: avoid unnecessary attempts to evaluate skipping doc if index sort and search sort are in opposite direction
jpountz commented on PR #780: URL: https://github.com/apache/lucene/pull/780#issuecomment-1112127776 Why do you sort the index if you need to sort in both directions? Is it for range queries? Could you use points instead? Sorry, I don't have good ideas for making it less intrusive.
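For context, a minimal sketch of the points-based alternative being suggested (the "timestamp" field name is made up): a point field serves range queries while doc values support sorting in either direction at search time, with no index-time sort at all.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

public class PointsInsteadOfIndexSort {
  static Document newDoc(long timestamp) {
    Document doc = new Document();
    doc.add(new LongPoint("timestamp", timestamp));              // for range queries
    doc.add(new NumericDocValuesField("timestamp", timestamp));  // for sorting
    return doc;
  }

  static Query rangeQuery(long from, long to) {
    return LongPoint.newRangeQuery("timestamp", from, to);
  }

  static Sort descendingSort() {
    // reverse=true sorts descending; flip the flag to sort ascending.
    return new Sort(new SortField("timestamp", SortField.Type.LONG, true));
  }
}
```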
[GitHub] [lucene] mocobeta commented on pull request #846: LUCENE-10493: move n-best logic to analysis-common
mocobeta commented on PR #846: URL: https://github.com/apache/lucene/pull/846#issuecomment-1112127870 @mikemccand thank you for reviewing!
[jira] [Commented] (LUCENE-10541) What to do about massive terms in our Wikipedia EN LineFileDocs?
[ https://issues.apache.org/jira/browse/LUCENE-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529416#comment-17529416 ] Michael McCandless commented on LUCENE-10541: - {quote}enwiki lines contains 2 million lines. It'd be nice to calculate the probability of any of the k faulty (long-term) lines being drawn in n tries and distribute it over time - this would address Mike's question about why it took so long to discover them. :) {quote}

LOL this is indeed fun to work out. There are a couple of wrinkles to modeling this though :)

First, it's not really "randomly picking N lines for each test run"; it's seeking to one spot and then reading N sequential lines from there. Assuming the file is well shuffled (I think it is), this maybe does not change the result versus picking N random lines, since those N sequential lines were already randomized.

Second, the way the seeking works is to pick a random spot (byte location), seek there, scan to the end of that line, and start reading from the following line forwards. Many of the lines are very short, but some of them are longer, and even fewer of them are truly massive (and might have an evil Darth Term in there). One wrinkle here is that if you seek into the middle of one of the Darth Terms, you'll then seek to the end of line and skip that large term entirely. Given that these massive lines take more bytes, it seems more likely that the seeking will skip the Darth Term lines?

Finally, there is one more crazy wrinkle – the nightly LineFileDocs is no longer a simple text file – it also has a pre-chunked "index" so test randomization can jump to one of the pre-computed known skip points. Maybe that chunking introduced some sort of bias?

Fun to think about the Darth Terms!!

> What to do about massive terms in our Wikipedia EN LineFileDocs?
> ----------------------------------------------------------------
> Key: LUCENE-10541
> URL: https://issues.apache.org/jira/browse/LUCENE-10541
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Priority: Major
> Time Spent: 1h 50m
> Remaining Estimate: 0h
>
> Spinoff from this fun build failure that [~dweiss] root caused:
> [https://lucene.markmail.org/thread/pculfuazll4oebra]
> Thank you and sorry [~dweiss]!!
> This test failure happened because the test case randomly indexed a chunk of
> the nightly (many GBs) LineFileDocs Wikipedia file that had a massive (> IW's
> ~32 KB limit) term, and IW threw an {{IllegalArgumentException}} failing the
> test.
> It's crazy that it took so long for Lucene's randomized tests to discover
> this too-massive term in Lucene's nightly benchmarks. It's like searching
> for Nessie, or
> [SETI|https://en.wikipedia.org/wiki/Search_for_extraterrestrial_intelligence].
> We need to prevent such false failures, somehow, and there are multiple
> options: fix this test to not use {{LineFileDocs}}, remove all "massive"
> terms from all tests (nightly and git) {{LineFileDocs}}, fix
> {{MockTokenizer}} to trim such ridiculous terms (I think this is the best
> option?), ...
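For fun, a back-of-the-envelope model of the odds (a sketch that ignores the seek-bias wrinkles above; every number except the 2 million line count is a made-up placeholder):

{code:java}
// Crude model: N total lines, k faulty lines, each run reads c sequential
// lines from a random start, and there have been r runs so far. One run hits
// a faulty line with probability p ~= 1 - (1 - k/N)^c, and at least one of
// the r runs hits with probability 1 - (1 - p)^r.
public class DarthTermOdds {
  public static void main(String[] args) {
    double N = 2_000_000; // lines in the enwiki LineFileDocs
    double k = 3;         // hypothetical number of Darth Term lines
    double c = 1_000;     // hypothetical lines read per test run
    double r = 10_000;    // hypothetical number of runs so far
    double pRun = 1 - Math.pow(1 - k / N, c);
    double pAny = 1 - Math.pow(1 - pRun, r);
    System.out.printf("per-run: %.6f, across all runs: %.6f%n", pRun, pAny);
  }
}
{code}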
[GitHub] [lucene] gsmiller commented on pull request #848: LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations
gsmiller commented on PR #848: URL: https://github.com/apache/lucene/pull/848#issuecomment-1112187182 @gautamworah96 thanks for looking! The mismatch in the original Jira was a slightly different check, which is the one now set to a delta of `0`. The fix proposed here is to ensure the floats are summed in the exact same order, which leads to identical results. The mismatch originated because the floats were being summed in different orders. Instead of chasing a delta that will work for "all" cases (which may be difficult since the test is random), I propose fixing it in this way. There's still a different check that does require a delta because it's difficult to ensure the test case can match the same ordering used internally by the faceting implementation. That's the one that you might be seeing that's still set to 1? I think this is acceptable because it's only 3 floats being summed in this case and the "drift" will be much smaller.
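A tiny standalone demonstration of the underlying issue (the values are made up, not from the test): float addition is not associative, so summing the same numbers in a different order can produce a different result.

```java
public class FloatOrder {
  public static void main(String[] args) {
    float a = 1e8f, b = -1e8f, c = 0.5f;
    // Same three numbers, different grouping, different result:
    System.out.println((a + b) + c); // 0.5
    System.out.println(a + (b + c)); // 0.0 -- 0.5 is below float precision at 1e8
  }
}
```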
[GitHub] [lucene] gsmiller commented on a diff in pull request #849: LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations
gsmiller commented on code in PR #849: URL: https://github.com/apache/lucene/pull/849#discussion_r860876168

## lucene/facet/src/test/org/apache/lucene/facet/taxonomy/TestTaxonomyFacetAssociations.java:
@@ -142,6 +146,34 @@ public static void beforeClass() throws Exception {
     reader = writer.getReader();
     writer.close();
     taxoReader = new DirectoryTaxonomyReader(taxoDir);
+
+    // To avoid floating point precision issues, it's useful to keep track of the values in the

Review Comment: Yeah, it appears so. I'm not an expert on floating point arithmetic by any means, but that's what I tracked the issue down to.
[jira] [Commented] (LUCENE-10543) Achieve contribution workflow perfection (with progress)
[ https://issues.apache.org/jira/browse/LUCENE-10543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529437#comment-17529437 ] Michael McCandless commented on LUCENE-10543: - +1 to work out a migration plan to switch to GitHub issues. Can we preserve our whole history here? The most compelling reason to me is that our Jira instance still does not (cannot?) support Markdown. Maybe all these comments where we all optimistically tried to use Markdown will then render correctly on migration to GitHub issues!! I even see attempted Markdown in our CHANGES.txt, but does our `changes2html.pl` support rendering/translating MD to HTML? Hmm then I will have to figure out how to migrate [https://jirasearch.mikemccandless.com|https://jirasearch.mikemccandless.com/] near-real-time indexing onto GitHub issues too!
[jira] [Commented] (LUCENE-10543) Achieve contribution workflow perfection (with progress)
[ https://issues.apache.org/jira/browse/LUCENE-10543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529452#comment-17529452 ] Michael McCandless commented on LUCENE-10543: - Also, I love this new "Achieve XYZ perfection (with progress)" template ;)
[GitHub] [lucene] mikemccand commented on pull request #663: Lucene-10188: Give SortedSetDocValues a docValueCount()?
mikemccand commented on PR #663: URL: https://github.com/apache/lucene/pull/663#issuecomment-1112261833 Thanks @spike-liu -- this looks great to me! I'll leave it open for at least 48 hours to see if anyone else wants to approve/comment, else lazy consensus kicks in!
[jira] [Created] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?
Greg Miller created LUCENE-10544: Summary: Should ExitableTermsEnum wrap postings and impacts? Key: LUCENE-10544 URL: https://issues.apache.org/jira/browse/LUCENE-10544 Project: Lucene - Core Issue Type: Bug Components: core/index Reporter: Greg Miller

While looking into options for [LUCENE-10151|https://issues.apache.org/jira/browse/LUCENE-10151], I noticed that {{ExitableDirectoryReader}} doesn't actually do any timeout checking once you start iterating postings/impacts. It does create an {{ExitableTermsEnum}} wrapper when loading a {{TermsEnum}}, but that wrapper doesn't do anything to wrap postings or impacts. So timeouts will be enforced when moving to the "next" term, but not when iterating the postings/impacts associated with a term.

I think we ought to wrap the postings/impacts as well with some form of timeout checking so timeouts can be enforced on long-running queries. I'm not sure why this wasn't done originally (back in 2014), but it was questioned back in 2020 on the original Jira [SOLR-5986|https://issues.apache.org/jira/browse/SOLR-5986?focusedCommentId=17177009&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17177009]. Does anyone know of a good reason why we shouldn't enforce timeouts in this way?

Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} given that only {{next}} is being wrapped currently.
[jira] [Updated] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?
[ https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller updated LUCENE-10544: - Description: While looking into options for LUCENE-10151, I noticed that {{ExitableDirectoryReader}} doesn't actually do any timeout checking once you start iterating postings/impacts. It *does* create an {{ExitableTermsEnum}} wrapper when loading a {{TermsEnum}}, but that wrapper doesn't do anything to wrap postings or impacts. So timeouts will be enforced when moving to the "next" term, but not when iterating the postings/impacts associated with a term. I think we ought to wrap the postings/impacts as well with some form of timeout checking so timeouts can be enforced on long-running queries. I'm not sure why this wasn't done originally (back in 2014), but it was questioned back in 2020 on the original Jira SOLR-5986. Does anyone know of a good reason why we shouldn't enforce timeouts in this way? Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} given that only {{next}} is being wrapped currently.
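For illustration, a rough sketch of what timeout checking inside a postings wrapper could look like (the wrapper class is hypothetical, not an actual patch; {{QueryTimeout}} and {{FilterLeafReader.FilterPostingsEnum}} are real Lucene classes):

{code:java}
import java.io.IOException;
import org.apache.lucene.index.FilterLeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.QueryTimeout;

class ExitablePostingsEnum extends FilterLeafReader.FilterPostingsEnum {
  private final QueryTimeout queryTimeout;
  private int callsSinceCheck;

  ExitablePostingsEnum(PostingsEnum in, QueryTimeout queryTimeout) {
    super(in);
    this.queryTimeout = queryTimeout;
  }

  @Override
  public int nextDoc() throws IOException {
    // Checking the clock on every doc would be costly; sample every 256 calls.
    if ((callsSinceCheck++ & 0xFF) == 0 && queryTimeout.shouldExit()) {
      throw new RuntimeException("Time limit exceeded while iterating postings");
    }
    return in.nextDoc();
  }
}
{code}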
[jira] [Commented] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?
[ https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529470#comment-17529470 ] Adrien Grand commented on LUCENE-10544: --- I suspect that this is due to the fact that the original implementation for timeouts consisted of a collector wrapper that would check the current runtime of the query in collect() and throw an exception if exceeded. This approach tends to work well to abort long-running queries that are heavy on postings, but it does not work for e.g. query rewrites that are heavy on terms like FuzzyQuery, so ExitableDirectoryReader tried to make it possible to stop every other operation than consumption of postings that could make a query run for a long time.
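For illustration, the mechanism described above looks roughly like this (a sketch only; Lucene's real {{TimeLimitingCollector}} implements this idea, and {{FilterCollector}}/{{FilterLeafCollector}} are real classes):

{code:java}
import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.FilterCollector;
import org.apache.lucene.search.FilterLeafCollector;
import org.apache.lucene.search.LeafCollector;

class TimeoutCheckingCollector extends FilterCollector {
  private final long deadlineNanos;

  TimeoutCheckingCollector(Collector in, long timeoutNanos) {
    super(in);
    this.deadlineNanos = System.nanoTime() + timeoutNanos;
  }

  @Override
  public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
    return new FilterLeafCollector(super.getLeafCollector(context)) {
      @Override
      public void collect(int doc) throws IOException {
        // Only reached for confirmed hits, which is why postings-heavy or
        // rewrite-heavy queries can run long without ever hitting this check.
        if (System.nanoTime() > deadlineNanos) {
          throw new RuntimeException("Query timed out");
        }
        in.collect(doc);
      }
    };
  }
}
{code}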
[jira] [Commented] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?
[ https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529481#comment-17529481 ] Greg Miller commented on LUCENE-10544: -- Thanks [~jpountz]. One issue with the collector approach is that it doesn't catch two-phase iterator situations where there are many approximate hits but very few confirmed matches, since the collector will only be invoked after a match is confirmed. If the second phase check is costly, this can be particularly problematic. So it would be nice to enforce the check at a lower level and solve for this issue if possible.
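A sketch of checking the timeout in the second phase instead (a hypothetical wrapper, not an existing Lucene class; {{TwoPhaseIterator}} and {{QueryTimeout}} are real):

{code:java}
import java.io.IOException;
import org.apache.lucene.index.QueryTimeout;
import org.apache.lucene.search.TwoPhaseIterator;

class ExitableTwoPhaseIterator extends TwoPhaseIterator {
  private final TwoPhaseIterator in;
  private final QueryTimeout timeout;

  ExitableTwoPhaseIterator(TwoPhaseIterator in, QueryTimeout timeout) {
    super(in.approximation());
    this.in = in;
    this.timeout = timeout;
  }

  @Override
  public boolean matches() throws IOException {
    // Runs on every approximate hit, so costly second-phase checks are
    // covered even when few hits are ever confirmed and collected.
    if (timeout.shouldExit()) {
      throw new RuntimeException("Time limit exceeded during two-phase match");
    }
    return in.matches();
  }

  @Override
  public float matchCost() {
    return in.matchCost();
  }
}
{code}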
[GitHub] [lucene] mocobeta commented on a diff in pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
mocobeta commented on code in PR #850: URL: https://github.com/apache/lucene/pull/850#discussion_r861028741

## lucene/test-framework/src/java/org/apache/lucene/tests/analysis/MockTokenizer.java:
@@ -66,11 +67,11 @@ public class MockTokenizer extends Tokenizer {
    * Limit the default token length to a size that doesn't cause random analyzer failures on
    * unpredictable data like the enwiki data set.
    *
-   * This value defaults to {@code CharTokenizer.DEFAULT_MAX_WORD_LEN}.
+   * This value defaults to {@link IndexWriter#MAX_TERM_LENGTH}.
    *
    * @see "https://issues.apache.org/jira/browse/LUCENE-10541"
    */
-  public static final int DEFAULT_MAX_TOKEN_LENGTH = 255;
+  public static final int DEFAULT_MAX_TOKEN_LENGTH = IndexWriter.MAX_TERM_LENGTH;

Review Comment: I have no strong opinion about which proposal is best, though when the conversation gets to the point where "either one is fine", the first proposal may often win out? I browsed through the MockTokenizer usages with the default max token length, and it looks like it's safe to trim the default to 255. Most tests use fixed-size short strings for testing, or generate random strings via TestUtil.randomAnalysisString(random, maxWordLength, simple) with maxWordLength=20 and then split them with the MockTokenizer.WHITESPACE automaton. I'd cast +1 to 255 for the default and encourage people to explicitly set larger values when needed.
[GitHub] [lucene] uschindler commented on a diff in pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
uschindler commented on code in PR #850: URL: https://github.com/apache/lucene/pull/850#discussion_r861032318 Thanks for the analysis, so it looks like 255 is a good default.
[GitHub] [lucene] mocobeta commented on a diff in pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
mocobeta commented on code in PR #850: URL: https://github.com/apache/lucene/pull/850#discussion_r861050340 One exception: TestWordDelimiterFilter.testRandomHugeStrings() generates large random words (maxWordLength=8192); it may be better to set the maxTokenLength explicitly for it.
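For example, such a test could opt into longer tokens explicitly (a sketch; the class name is made up, the three-arg MockTokenizer constructor is real, and 8192 mirrors the case above):

```java
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.tests.analysis.MockTokenizer;

public class HugeTokenTokenizerExample {
  static Tokenizer newHugeTokenTokenizer() {
    // WHITESPACE automaton, lowerCase=true, explicit maxTokenLength=8192
    // instead of relying on the (smaller) default.
    return new MockTokenizer(MockTokenizer.WHITESPACE, true, 8192);
  }
}
```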
[jira] [Commented] (LUCENE-10534) MinFloatFunction / MaxFloatFunction exists check can be slow
[ https://issues.apache.org/jira/browse/LUCENE-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529513#comment-17529513 ] Kevin Risden commented on LUCENE-10534: --- I reviewed the jmh tests again and realized I didn't test the new MaxFloatFunction logic - I had copied the original code over :( Rerunning the benchmarks to see if there is an improvement.

> MinFloatFunction / MaxFloatFunction exists check can be slow
> ------------------------------------------------------------
> Key: LUCENE-10534
> URL: https://issues.apache.org/jira/browse/LUCENE-10534
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Kevin Risden
> Assignee: Kevin Risden
> Priority: Minor
> Attachments: flamegraph.png, flamegraph_getValueForDoc.png
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> MinFloatFunction
> (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MinFloatFunction.java)
> and MaxFloatFunction
> (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MaxFloatFunction.java)
> both check if values exist. This is needed since the underlying valuesource
> returns 0.0f as either a valid value or as a value when the document doesn't
> have a value.
> Even though this is changed to anyExists and short circuits in the case a
> value is found in any document, the worst case is that there is no value
> found and requires checking all the way through to the raw data. This is only
> needed when 0.0f is returned and we need to determine if it is a valid value
> or the not-found case.
[GitHub] [lucene] dnhatn commented on a diff in pull request #842: LUCENE-10518: Relax field consistency check for old indices
dnhatn commented on code in PR #842: URL: https://github.com/apache/lucene/pull/842#discussion_r861098734 +1, fixed in https://github.com/apache/lucene/pull/842/commits/10c110500e4d0f78fcdf0b167e4061025d6b6927
[GitHub] [lucene] dnhatn commented on pull request #842: LUCENE-10518: Relax field consistency check for old indices
dnhatn commented on PR #842: URL: https://github.com/apache/lucene/pull/842#issuecomment-1112420113 @mayya-sharipova @jpountz Thanks for the reviews.
[GitHub] [lucene] dnhatn merged pull request #842: LUCENE-10518: Relax field consistency check for old indices
dnhatn merged PR #842: URL: https://github.com/apache/lucene/pull/842
[jira] [Commented] (LUCENE-10518) FieldInfos consistency check can refuse to open Lucene 8 index
[ https://issues.apache.org/jira/browse/LUCENE-10518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529522#comment-17529522 ] ASF subversion and git services commented on LUCENE-10518: -- Commit 67e6e864bdf8489f16aa0b686790e5721794e187 in lucene's branch refs/heads/branch_9x from Nhat Nguyen [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=67e6e864bdf ] LUCENE-10518: Relax field consistency check for old indices (#842) This change relaxes the field consistency check for old indices as we didn't enforce that in the previous versions. This commit also disables the optimization that relies on the field consistency for old indices.

> FieldInfos consistency check can refuse to open Lucene 8 index
> --------------------------------------------------------------
> Key: LUCENE-10518
> URL: https://issues.apache.org/jira/browse/LUCENE-10518
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/index
> Affects Versions: 8.10.1
> Reporter: Nhat Nguyen
> Priority: Major
> Time Spent: 1h
> Remaining Estimate: 0h
>
> A field-infos consistency check introduced in Lucene 9 (LUCENE-9334) can
> refuse to open a Lucene 8 index. Lucene 8 can create a partial FieldInfo if
> hitting a non-aborting exception (for example [term is too
> long|https://github.com/apache/lucene-solr/blob/6a6484ba396927727b16e5061384d3cd80d616b2/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L944])
> during processing fields of a document. We don't have this problem in Lucene
> 9 as we process fields in two phases with the [first
> phase|https://github.com/apache/lucene/blob/10ebc099c846c7d96f4ff5f9b7853df850fa8442/lucene/core/src/java/org/apache/lucene/index/IndexingChain.java#L589-L614]
> processing only FieldInfos.
> The issue can be reproduced with this snippet.
> {code:java}
> public void testWriteIndexOn8x() throws Exception {
>   FieldType KeywordField = new FieldType();
>   KeywordField.setTokenized(false);
>   KeywordField.setOmitNorms(true);
>   KeywordField.setIndexOptions(IndexOptions.DOCS);
>   KeywordField.freeze();
>   try (Directory dir = newDirectory()) {
>     IndexWriterConfig config = new IndexWriterConfig();
>     config.setCommitOnClose(false);
>     config.setMergePolicy(NoMergePolicy.INSTANCE);
>     try (IndexWriter writer = new IndexWriter(dir, config)) {
>       // first segment
>       writer.addDocument(new Document()); // an empty doc
>       Document d1 = new Document();
>       byte[] chars = new byte[IndexWriter.MAX_STORED_STRING_LENGTH + 1];
>       Arrays.fill(chars, (byte) 'a');
>       d1.add(new Field("field", new BytesRef(chars), KeywordField));
>       d1.add(new BinaryDocValuesField("field", new BytesRef(chars)));
>       expectThrows(IllegalArgumentException.class, () -> writer.addDocument(d1));
>       writer.flush();
>       // second segment
>       Document d2 = new Document();
>       d2.add(new Field("field", new BytesRef("hello world"), KeywordField));
>       d2.add(new SortedDocValuesField("field", new BytesRef("hello world")));
>       writer.addDocument(d2);
>       writer.flush();
>       writer.commit();
>       // Check for doc values types consistency
>       Map<String, DocValuesType> docValuesTypes = new HashMap<>();
>       try (DirectoryReader reader = DirectoryReader.open(dir)) {
>         for (LeafReaderContext leaf : reader.leaves()) {
>           for (FieldInfo fi : leaf.reader().getFieldInfos()) {
>             DocValuesType current = docValuesTypes.putIfAbsent(fi.name, fi.getDocValuesType());
>             if (current != null && current != fi.getDocValuesType()) {
>               fail("cannot change DocValues type from " + current + " to "
>                   + fi.getDocValuesType() + " for field \"" + fi.name + "\"");
>             }
>           }
>         }
>       }
>     }
>   }
> }
> {code}
> I would like to propose to:
> - Backport the two-phase fields processing from Lucene 9 to Lucene 8. The patch should be small and contained.
> - Introduce an option in Lucene 9 to skip checking field-infos consistency (i.e., behave like Lucene 8 when the option is enabled).
> /cc [~mayya] and [~jpountz]
[GitHub] [lucene] dnhatn opened a new pull request, #852: LUCENE-10518: Relax field consistency check for old indices (#842)
dnhatn opened a new pull request, #852: URL: https://github.com/apache/lucene/pull/852 This change relaxes the field consistency check for old indices as we didn't enforce that in the previous versions. This commit also disables the optimization that relies on the field consistency for old indices.
[jira] [Commented] (LUCENE-10534) MinFloatFunction / MaxFloatFunction exists check can be slow
[ https://issues.apache.org/jira/browse/LUCENE-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529572#comment-17529572 ] Kevin Risden commented on LUCENE-10534: --- Updated metrics - there is a benefit to the new MaxFloatFunction logic that avoids duplicate exists() calls, even when combined with the new FieldSource changes from LUCENE-10542.

| Benchmark | Mode | Cnt | Score and Error | Units |
|-|---|-|--|---|
| MyBenchmark.testMaxFloatFunction | thrpt | 25 | 69.949 ± 4.043 | ops/s |
| MyBenchmark.testMaxFloatFunctionNewFloatFieldSource | thrpt | 25 | 112.326 ± 3.228 | ops/s |
| MyBenchmark.testNewMaxFloatFunction | thrpt | 25 | 93.216 ± 2.757 | ops/s |
| MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSource | thrpt | 25 | 123.364 ± 7.861 | ops/s |
| MyBenchmark.testMaxFloatFunctionRareField | thrpt | 25 | 257.339 ± 33.849 | ops/s |
| MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField | thrpt | 25 | 287.175 ± 22.840 | ops/s |
| MyBenchmark.testNewMaxFloatFunctionRareField | thrpt | 25 | 235.268 ± 4.103 | ops/s |
| MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSourceRareField | thrpt | 25 | 272.397 ± 8.406 | ops/s |
[jira] [Updated] (LUCENE-10542) FieldSource exists implementation can avoid value retrieval
[ https://issues.apache.org/jira/browse/LUCENE-10542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Risden updated LUCENE-10542: -- Description:

While looking at LUCENE-10534, found that *FieldSource exists implementations after LUCENE-7407 can avoid the value lookup when just checking for exists.

Flamegraphs - x axis = time spent as a percentage of time being profiled, y axis = stack trace, bottom being the first call, top being the last call.

Looking only at the left-most getValueForDoc highlight (and it helps to make it bigger or download the original): !flamegraph_getValueForDoc.png|height=410,width=1000!

LongFieldSource#exists spends MOST of its time in LongFieldSource#getValueForDoc. LongFieldSource#getValueForDoc spends its time doing two things primarily:
* FilterNumericDocValues#longValue()
* advance()

This makes sense based on looking at the code (copied below to make it easier to see at once): https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L72

{code:java}
private long getValueForDoc(int doc) throws IOException {
  if (doc < lastDocID) {
    throw new IllegalArgumentException(
        "docs were sent out-of-order: lastDocID=" + lastDocID + " vs docID=" + doc);
  }
  lastDocID = doc;
  int curDocID = arr.docID();
  if (doc > curDocID) {
    curDocID = arr.advance(doc);
  }
  if (doc == curDocID) {
    return arr.longValue();
  } else {
    return 0;
  }
}
{code}

LongFieldSource#exists doesn't care about the actual longValue - just that there was a value found when iterating through the doc values. https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L95

{code:java}
@Override
public boolean exists(int doc) throws IOException {
  getValueForDoc(doc);
  return arr.docID() == doc;
}
{code}

So putting this all together for exists calling getValueForDoc: we spend ~50% of the time trying to get the long value when we don't need it in exists. We can save that 50% by making exists not care about the actual value and just return whether doc == curDocID. This extra 50% is exaggerated in MaxFloatFunction (and other places) since exists() is being called a bunch. Eventually the value will be needed from longVal(), but if we call exists() say 3 times for every longVal(), we are spending a lot of time computing the value when we only need to check for existence.

I found the same pattern in DoubleFieldSource, EnumFieldSource, FloatFieldSource, IntFieldSource, LongFieldSource. I put together a change showing what this would look like.

Simple JMH performance tests comparing the original FloatFieldSource to the new ones from PR #847:

| Benchmark | Mode | Cnt | Score and Error | Units |
|-|---|-|--|---|
| MyBenchmark.testMaxFloatFunction | thrpt | 25 | 69.949 ± 4.043 | ops/s |
| MyBenchmark.testMaxFloatFunctionNewFloatFieldSource | thrpt | 25 | 112.326 ± 3.228 | ops/s |
| MyBenchmark.testNewMaxFloatFunction | thrpt | 25 | 93.216 ± 2.757 | ops/s |
| MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSource | thrpt | 25 | 123.364 ± 7.861 | ops/s |
| MyBenchmark.testMaxFloatFunctionRareField | thrpt | 25 | 257.339 ± 33.849 | ops/s |
| MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField | thrpt | 25 | 287.175 ± 22.840 | ops/s |
| MyBenchmark.testNewMaxFloatFunctionRareField | thrpt | 25 | 235.268 ± 4.103 | ops/s |
| MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSourceRareField | thrpt | 25 | 272.397 ± 8.406 | ops/s |

Source: https://github.com/risdenk/lucene-jmh
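For illustration, the optimized exists() could look roughly like this (a sketch of the idea described above; see PR #847 for the actual change):

{code:java}
@Override
public boolean exists(int doc) throws IOException {
  if (doc < lastDocID) {
    throw new IllegalArgumentException(
        "docs were sent out-of-order: lastDocID=" + lastDocID + " vs docID=" + doc);
  }
  lastDocID = doc;
  int curDocID = arr.docID();
  if (doc > curDocID) {
    curDocID = arr.advance(doc);
  }
  // Position the iterator only; skip the arr.longValue() call entirely.
  return doc == curDocID;
}
{code}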
[GitHub] [lucene] dnhatn merged pull request #852: LUCENE-10518: Relax field consistency check for old indices (#842)
dnhatn merged PR #852: URL: https://github.com/apache/lucene/pull/852
[jira] [Commented] (LUCENE-10518) FieldInfos consistency check can refuse to open Lucene 8 index
[ https://issues.apache.org/jira/browse/LUCENE-10518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529578#comment-17529578 ] ASF subversion and git services commented on LUCENE-10518: -- Commit 62ebb22cc07510ac477d6bc76d17e65710395bd6 in lucene's branch refs/heads/branch_9_1 from Nhat Nguyen [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=62ebb22cc07 ] LUCENE-10518: Relax field consistency check for old indices (#842) This change relaxes the field consistency check for old indices as we didn't enforce that in the previous versions. This commit also disables the optimization that relies on the field consistency for old indices.
[jira] [Commented] (LUCENE-10542) FieldSource exists implementation can avoid value retrieval
[ https://issues.apache.org/jira/browse/LUCENE-10542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529580#comment-17529580 ] Robert Muir commented on LUCENE-10542: -- Hi [~krisden], I think there might be more optimizations possible. I'm thinking of the (common) case where all documents actually have a value for the field: you can use index statistics to detect this situation and avoid per-document checks, simply returning true. A similar optimization exists in FieldExistsQuery. There is an explanation of it here: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/FieldExistsQuery.java#L100-L105 I think it can be a separate issue/PR from what you have going on here, but it is related and maybe could be a followup, if you are trying to speed up performance of these checks.
> FieldSource exists implementation can avoid value retrieval
> ---
>
> Key: LUCENE-10542
> URL: https://issues.apache.org/jira/browse/LUCENE-10542
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Kevin Risden
> Assignee: Kevin Risden
> Priority: Minor
> Attachments: flamegraph_getValueForDoc.png
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> While looking at LUCENE-10534, found that *FieldSource exists implementation after LUCENE-7407 can avoid value lookup when just checking for exists.
> Flamegraphs - x axis = time spent as a percentage of time being profiled, y axis = stack trace, bottom being first call, top being last call.
> Looking only at the left most getValueForDoc highlight only (and it helps to make it bigger or download the original)
> !flamegraph_getValueForDoc.png|height=410,width=1000!
> LongFieldSource#exists spends MOST of its time doing a LongFieldSource#getValueForDoc. LongFieldSource#getValueForDoc spends its time doing two things primarily:
> * FilterNumericDocValues#longValue()
> * advance()
> This makes sense based on looking at the code (copied below to make it easier to see at once)
> https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L72
> {code:java}
> private long getValueForDoc(int doc) throws IOException {
>   if (doc < lastDocID) {
>     throw new IllegalArgumentException(
>         "docs were sent out-of-order: lastDocID=" + lastDocID + " vs docID=" + doc);
>   }
>   lastDocID = doc;
>   int curDocID = arr.docID();
>   if (doc > curDocID) {
>     curDocID = arr.advance(doc);
>   }
>   if (doc == curDocID) {
>     return arr.longValue();
>   } else {
>     return 0;
>   }
> }
> {code}
> LongFieldSource#exists - doesn't care about the actual longValue. Just that there was a value found when iterating through the doc values.
> https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L95
> {code:java}
> @Override
> public boolean exists(int doc) throws IOException {
>   getValueForDoc(doc);
>   return arr.docID() == doc;
> }
> {code}
> So putting this all together for exists calling getValueForDoc, we spend ~50% of the time trying to get the long value when we don't need it in exists. We can save that 50% of time by making exists not care about the actual value and just return whether doc == curDocID, basically.
> This 50% extra is exaggerated in MaxFloatFunction (and other places) since exists() is being called a bunch. Eventually the value will be needed from longVal(), but if we call exists() say 3 times for every longVal(), we are spending a lot of time computing the value when we only need to check for existence.
> I found the same pattern in DoubleFieldSource, EnumFieldSource, FloatFieldSource, IntFieldSource, LongFieldSource. I put together a change showing what this would look like:
>
> Simple JMH performance tests comparing the original FloatFieldSource to the new ones from PR #847.
>
> | Benchmark | Mode | Cnt | Score and Error | Units |
> |-|---|-|--|---|
> | MyBenchmark.testMaxFloatFunction | thrpt | 25 | 69.949 ± 4.043 | ops/s |
> | MyBenchmark.testMaxFloatFunctionNewFloatFieldSource | thrpt | 25 | 112.326 ± 3.228 | ops/s |
> | MyBenchmark.testNewMaxFloatFunction | thrpt | 25 | 93.216 ± 2.757 | ops/s |
> | MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSource | thrpt | 25 | 123.364 ± 7.861 | ops/s |
> | MyBenchmark.t
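The change described in the quoted text, and the index-statistics shortcut Muir suggests, might look roughly like this. This is a sketch only, reusing the arr/lastDocID fields from the LongFieldSource snippet above; allDocsHaveValue is a hypothetical helper named for illustration, not an existing Lucene method, and FieldExistsQuery's actual rewrite also consults doc-values, norms and points statistics, not just terms.

{code:java}
// Sketch: existence only needs the iterator position, not the value,
// so exists() can advance without ever calling arr.longValue().
@Override
public boolean exists(int doc) throws IOException {
  if (doc < lastDocID) {
    throw new IllegalArgumentException(
        "docs were sent out-of-order: lastDocID=" + lastDocID + " vs docID=" + doc);
  }
  lastDocID = doc;
  int curDocID = arr.docID();
  if (doc > curDocID) {
    curDocID = arr.advance(doc);
  }
  return doc == curDocID; // no value retrieval
}

// Hypothetical helper in the spirit of FieldExistsQuery's optimization:
// if index statistics say every document has the field, exists() is
// trivially true and the per-document advance can be skipped entirely.
static boolean allDocsHaveValue(LeafReader reader, String field) throws IOException {
  Terms terms = reader.terms(field);
  return terms != null && terms.getDocCount() == reader.maxDoc();
}
{code}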
[jira] [Resolved] (LUCENE-10518) FieldInfos consistency check can refuse to open Lucene 8 index
[ https://issues.apache.org/jira/browse/LUCENE-10518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nhat Nguyen resolved LUCENE-10518. -- Fix Version/s: 9.1.1, 9.2 Resolution: Fixed
> FieldInfos consistency check can refuse to open Lucene 8 index
> --
>
> Key: LUCENE-10518
> URL: https://issues.apache.org/jira/browse/LUCENE-10518
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/index
> Affects Versions: 8.10.1
> Reporter: Nhat Nguyen
> Priority: Major
> Fix For: 9.1.1, 9.2
>
> Time Spent: 1h 20m
> Remaining Estimate: 0h
>
> A field-infos consistency check introduced in Lucene 9 (LUCENE-9334) can refuse to open a Lucene 8 index. Lucene 8 can create a partial FieldInfo if hitting a non-aborting exception (for example [term is too long|https://github.com/apache/lucene-solr/blob/6a6484ba396927727b16e5061384d3cd80d616b2/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L944]) during processing fields of a document. We don't have this problem in Lucene 9 as we process fields in two phases with the [first phase|https://github.com/apache/lucene/blob/10ebc099c846c7d96f4ff5f9b7853df850fa8442/lucene/core/src/java/org/apache/lucene/index/IndexingChain.java#L589-L614] processing only FieldInfos.
> The issue can be reproduced with this snippet.
> {code:java}
> public void testWriteIndexOn8x() throws Exception {
>   FieldType KeywordField = new FieldType();
>   KeywordField.setTokenized(false);
>   KeywordField.setOmitNorms(true);
>   KeywordField.setIndexOptions(IndexOptions.DOCS);
>   KeywordField.freeze();
>   try (Directory dir = newDirectory()) {
>     IndexWriterConfig config = new IndexWriterConfig();
>     config.setCommitOnClose(false);
>     config.setMergePolicy(NoMergePolicy.INSTANCE);
>     try (IndexWriter writer = new IndexWriter(dir, config)) {
>       // first segment
>       writer.addDocument(new Document()); // an empty doc
>       Document d1 = new Document();
>       byte[] chars = new byte[IndexWriter.MAX_STORED_STRING_LENGTH + 1];
>       Arrays.fill(chars, (byte) 'a');
>       d1.add(new Field("field", new BytesRef(chars), KeywordField));
>       d1.add(new BinaryDocValuesField("field", new BytesRef(chars)));
>       expectThrows(IllegalArgumentException.class, () -> writer.addDocument(d1));
>       writer.flush();
>       // second segment
>       Document d2 = new Document();
>       d2.add(new Field("field", new BytesRef("hello world"), KeywordField));
>       d2.add(new SortedDocValuesField("field", new BytesRef("hello world")));
>       writer.addDocument(d2);
>       writer.flush();
>       writer.commit();
>       // Check for doc values types consistency
>       Map<String, DocValuesType> docValuesTypes = new HashMap<>();
>       try (DirectoryReader reader = DirectoryReader.open(dir)) {
>         for (LeafReaderContext leaf : reader.leaves()) {
>           for (FieldInfo fi : leaf.reader().getFieldInfos()) {
>             DocValuesType current = docValuesTypes.putIfAbsent(fi.name, fi.getDocValuesType());
>             if (current != null && current != fi.getDocValuesType()) {
>               fail("cannot change DocValues type from " + current + " to " + fi.getDocValuesType() + " for field \"" + fi.name + "\"");
>             }
>           }
>         }
>       }
>     }
>   }
> }
> {code}
> I would like to propose to:
> - Backport the two-phase fields processing from Lucene9 to Lucene8. The patch should be small and contained.
> - Introduce an option in Lucene9 to skip checking field-infos consistency (i.e., behave like Lucene 8 when the option is enabled).
> /cc [~mayya] and [~jpountz] -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dweiss commented on a diff in pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
dweiss commented on code in PR #850: URL: https://github.com/apache/lucene/pull/850#discussion_r861243269

## lucene/test-framework/src/java/org/apache/lucene/tests/analysis/MockTokenizer.java:

@@ -66,11 +67,11 @@ public class MockTokenizer extends Tokenizer {
   * Limit the default token length to a size that doesn't cause random analyzer failures on
   * unpredictable data like the enwiki data set.
   *
-  * This value defaults to {@code CharTokenizer.DEFAULT_MAX_WORD_LEN}.
+  * This value defaults to {@link IndexWriter#MAX_TERM_LENGTH}.
   *
   * @see "https://issues.apache.org/jira/browse/LUCENE-10541"
   */
- public static final int DEFAULT_MAX_TOKEN_LENGTH = 255;
+ public static final int DEFAULT_MAX_TOKEN_LENGTH = IndexWriter.MAX_TERM_LENGTH;

Review Comment: Ok. I'll go with 255 and tweak those tests that need it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dweiss merged pull request #851: LUCENE-10088: backport simple text codec suppression and fs handle limit
dweiss merged PR #851: URL: https://github.com/apache/lucene/pull/851 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10088) Too many open files in TestIndexWriterMergePolicy.testStressUpdateSameDocumentWithMergeOnGetReader
[ https://issues.apache.org/jira/browse/LUCENE-10088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529603#comment-17529603 ] ASF subversion and git services commented on LUCENE-10088: -- Commit 49c8cee06f01e79a1b890d9b542ada79bacf9dd4 in lucene's branch refs/heads/branch_9x from Dawid Weiss [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=49c8cee06f0 ] LUCENE-10088: allow per-class override in HandleLimitFS. Bump the limit a bit for nightlies in TestIndexWriterMergePolicy. (#424). Suppress SimpleTextCodec on TestIndexWriterMergePolicy. (#851) > Too many open files in > TestIndexWriterMergePolicy.testStressUpdateSameDocumentWithMergeOnGetReader > -- > > Key: LUCENE-10088 > URL: https://issues.apache.org/jira/browse/LUCENE-10088 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Priority: Major > Time Spent: 1h 40m > Remaining Estimate: 0h > > [This build > failure|https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/386/] > reproduces for me. I'll try to dig. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10088) Too many open files in TestIndexWriterMergePolicy.testStressUpdateSameDocumentWithMergeOnGetReader
[ https://issues.apache.org/jira/browse/LUCENE-10088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529611#comment-17529611 ] ASF subversion and git services commented on LUCENE-10088: -- Commit cb206196e4c929b6e639c58891950e5d6164f63d in lucene's branch refs/heads/branch_9_1 from Dawid Weiss [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=cb206196e4c ] LUCENE-10088: allow per-class override in HandleLimitFS. Bump the limit a bit for nightlies in TestIndexWriterMergePolicy. (#424). Suppress SimpleTextCodec on TestIndexWriterMergePolicy. (#851) > Too many open files in > TestIndexWriterMergePolicy.testStressUpdateSameDocumentWithMergeOnGetReader > -- > > Key: LUCENE-10088 > URL: https://issues.apache.org/jira/browse/LUCENE-10088 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Priority: Major > Time Spent: 1h 40m > Remaining Estimate: 0h > > [This build > failure|https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/386/] > reproduces for me. I'll try to dig. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dweiss commented on pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
dweiss commented on PR #850: URL: https://github.com/apache/lucene/pull/850#issuecomment-1112594924 I ran nightly tests as well, they passed.

BUILD SUCCESSFUL in 24m 6s
832 actionable tasks: 480 executed, 352 up-to-date

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren
gsmiller commented on PR #843: URL: https://github.com/apache/lucene/pull/843#issuecomment-1112602017 Thanks for all your thoughts @Yuti-G! I think I still disagree with changing the current behavior. If we want to implement the contract of `getTopChildren` properly, we'd need to sort by count. That's what "top" means in this API, and I think it's a bit confusing to start enforcing a "top n" parameter when "top" isn't really properly defined here. If you have a use-case that needs to actually truncate to `n` ranges while retaining the existing order of ranges originally provided as a user, why not do it outside of the facet code? There's no reason you can't get back all ranges as it works today and then just truncate to the number you want right? That's essentially what you're proposing putting in the faceting code, but I'm not sure doing that is the right thing to do here. All this said, I'm in favor of exploring the addition of a `getAllChildren` method that would retain this functionality and then properly implementing `getTopChildren` that orders by count if there's a use-case for that. I suspect there is. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
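For reference, the caller-side truncation gsmiller suggests above is a short snippet against the existing `Facets`/`FacetResult` API. This is a sketch under assumptions: `facets` is an existing `RangeFacetCounts` instance, and the dimension name "price" and count `n` are made up for illustration.

```java
// RangeFacetCounts today returns every range in user-provided order,
// regardless of the topN argument, so truncation can happen at the caller.
FacetResult result = facets.getTopChildren(10, "price"); // topN is ignored today

// Keep only the first n ranges, preserving the order the user defined them in.
int n = 3; // illustrative
LabelAndValue[] firstN =
    java.util.Arrays.copyOf(result.labelValues, Math.min(n, result.labelValues.length));
```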
[GitHub] [lucene] Yuti-G commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren
Yuti-G commented on PR #843: URL: https://github.com/apache/lucene/pull/843#issuecomment-1112633027 Hi @gsmiller, thanks for your feedback! Yes, I was proposing keeping the existing behavior of getTopChildren but utilizing the top-n param, as I thought it would be safe because we would not break the current use cases, but you are absolutely right that we should stick with what top-n means here. Please correct me if I misunderstood your proposal. Are you suggesting that I add another getAllChildren API to retain the current behavior of getTopChildren, and then fix getTopChildren by returning ranges sorted by counts, or do you mean we should just keep this API untouched? Thanks! Best, Yuting -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10292) AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()
[ https://issues.apache.org/jira/browse/LUCENE-10292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529668#comment-17529668 ] Chris M. Hostetter commented on LUCENE-10292: - {quote} I'm still not sure whether these tests make sense without explicitly stating ... {quote} All I was really trying to do with these tests was demonstrate that data you get out of the Lookup before you call build() can still be gotten from the Lookup while build() is incrementally consuming an iterator (which may take a long time if you are building up from a long iterator), and that this behavior is consistent across Lookup impls (as opposed to before I filed this issue, when most Lookups worked that way, but AnalyzingInfixSuggester would throw an ugly exception – which was certainly confusing to users who might switch from one impl to another). I didn't set out to make any hard & fast guarantee about the thread safety of all lookups – just improve the one that was obviously inconsistent with the others (progress, not perfection). I'm happy to rename/re-document/remove the test code if you'd like - or confine it to just AnalyzingInfixSuggester - but I think (assume?) we can probably agree that the src/main code changes I've made in this issue are generally for the better? {quote}I think it'd be a much more user-friendly API if Lookup was actually detached entirely from its build process (for example by replacing the current build method with a builder() that would return a new immutable Lookup instance). ... I'm not saying this should be implemented here - perhaps it's worth a new issue to do this refactoring. {quote} +1 ... although some of the lookup impls explicitly support update()-ing individual suggestions – so you'd either have to remove that functionality or refactor the API in some way that it can still be optional. {quote}Separately from the above, if the test fails, it'll leak threads ... It should be replaced with a try/catch that rethrows an unchecked exception {quote} Yeah, that was lazy & sloppy of me – sorry about that. I'll fix it ASAP. > AnalyzingInfixSuggester thread safety: lookup() fails during (re)build() > > > Key: LUCENE-10292 > URL: https://issues.apache.org/jira/browse/LUCENE-10292 > Project: Lucene - Core > Issue Type: Bug >Reporter: Chris M. Hostetter >Assignee: Chris M. Hostetter >Priority: Major > Fix For: 10.0 (main), 9.2 > > Attachments: LUCENE-10292-1.patch, LUCENE-10292-2.patch, > LUCENE-10292-3.patch, LUCENE-10292.patch > > > I'm filing this based on anecdotal information from a Solr user w/o > experiencing it first hand (and I don't have a test case to demonstrate it) > but based on a reading of the code the underlying problem seems self > evident... > With all other Lookup implementations I've examined, it is possible to call > {{lookup()}} regardless of whether another thread is concurrently calling > {{build()}} – in all cases I've seen, it is even possible to call > {{lookup()}} even if {{build()}} has never been called: the result is just an > "empty" {{List}} > Typically this is works because the {{build()}} method uses temporary > datastructures until it's "build logic" is complete, at which point it > atomically replaces the datastructures used by the {{lookup()}} method.
In > the case of {{AnalyzingInfixSuggester}} however, the {{build()}} method > starts by closing & null'ing out the {{protected SearcherManager > searcherMgr}} (which it only populates again once it's completed building up > it's index) and then the lookup method starts with... > {code:java} > if (searcherMgr == null) { > throw new IllegalStateException("suggester was not built"); > } > {code} > ... meaning it is unsafe to call {{AnalyzingInfixSuggester.lookup()}} in any > situation where another thread may be calling > {{AnalyzingInfixSuggester.build()}} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
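The "build into temporary structures, then publish atomically" behavior described in the quoted issue text can be sketched generically. This is illustrative only, not the actual suggester code (which manages a SearcherManager over a Lucene index rather than a map); it just shows why lookup() stays safe while build() runs concurrently.

{code:java}
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class SwapOnBuildLookup {
  // lookup() reads this; build() replaces it in a single volatile write,
  // so readers always see either the old structure or the fully built one.
  private volatile Map<String, Long> data = Collections.emptyMap();

  public long lookup(String key) {
    // Safe during a concurrent build(): never observes a half-built map,
    // and before the first build() it simply sees the empty map.
    return data.getOrDefault(key, 0L);
  }

  public void build(Iterator<Map.Entry<String, Long>> entries) {
    Map<String, Long> tmp = new HashMap<>(); // private temporary structure
    while (entries.hasNext()) {
      Map.Entry<String, Long> e = entries.next();
      tmp.put(e.getKey(), e.getValue());
    }
    data = Collections.unmodifiableMap(tmp); // atomic publish
  }
}
{code}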
[jira] [Resolved] (LUCENE-9476) Add a bulk ordinal->FacetLabel API
[ https://issues.apache.org/jira/browse/LUCENE-9476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gautam Worah resolved LUCENE-9476. -- Resolution: Fixed The PR was merged into the Lucene 9 branch and released in version Lucene 9.0. Resolving > Add a bulk ordinal->FacetLabel API > -- > > Key: LUCENE-9476 > URL: https://issues.apache.org/jira/browse/LUCENE-9476 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Affects Versions: 8.6.1 >Reporter: Gautam Worah >Priority: Minor > Labels: performance > Time Spent: 12h 40m > Remaining Estimate: 0h > > This issue is a spillover from the > [PR|https://github.com/apache/lucene-solr/pull/1733/files] for LUCENE 9450 > The idea here is to share a single {{BinaryDocValues}} instance per leaf per > query instead of creating a new one each time in the > {{DirectoryTaxonomyReader}}. > Suggested by [~mikemccand] > > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gautamworah96 opened a new pull request, #853: LUCENE-10524 Add benchmark suite details to CONTRIBUTING.md
gautamworah96 opened a new pull request, #853: URL: https://github.com/apache/lucene/pull/853 Today, new contributors are usually unaware of where luceneutil benchmarks are and when/how to run them. Committers usually end up pointing contributors to the benchmarks package when they make perf impacting changes and then they run the benchmarks. Adding benchmark details to the Lucene repo will also make them more accessible to other researchers who want to experiment/benchmark their own custom task implementation with Java Lucene. JIRA: https://issues.apache.org/jira/browse/LUCENE-10524 cc: @mocobeta -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10292) AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()
[ https://issues.apache.org/jira/browse/LUCENE-10292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529676#comment-17529676 ] ASF subversion and git services commented on LUCENE-10292: -- Commit 6afb9bc25a5935a3cba0a06e67b27fe290255467 in lucene's branch refs/heads/main from Chris M. Hostetter [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=6afb9bc25a5 ] LUCENE-10292: prevent thread leak (or test timeout) if exception/assertion failure in test iterator > AnalyzingInfixSuggester thread safety: lookup() fails during (re)build() > > > Key: LUCENE-10292 > URL: https://issues.apache.org/jira/browse/LUCENE-10292 > Project: Lucene - Core > Issue Type: Bug >Reporter: Chris M. Hostetter >Assignee: Chris M. Hostetter >Priority: Major > Fix For: 10.0 (main), 9.2 > > Attachments: LUCENE-10292-1.patch, LUCENE-10292-2.patch, > LUCENE-10292-3.patch, LUCENE-10292.patch > > > I'm filing this based on anecdotal information from a Solr user w/o > experiencing it first hand (and I don't have a test case to demonstrate it) > but based on a reading of the code the underlying problem seems self > evident... > With all other Lookup implementations I've examined, it is possible to call > {{lookup()}} regardless of whether another thread is concurrently calling > {{build()}} – in all cases I've seen, it is even possible to call > {{lookup()}} even if {{build()}} has never been called: the result is just an > "empty" {{List}} > Typically this is works because the {{build()}} method uses temporary > datastructures until it's "build logic" is complete, at which point it > atomically replaces the datastructures used by the {{lookup()}} method. In > the case of {{AnalyzingInfixSuggester}} however, the {{build()}} method > starts by closing & null'ing out the {{protected SearcherManager > searcherMgr}} (which it only populates again once it's completed building up > it's index) and then the lookup method starts with... > {code:java} > if (searcherMgr == null) { > throw new IllegalStateException("suggester was not built"); > } > {code} > ... meaning it is unsafe to call {{AnalyzingInfixSuggester.lookup()}} in any > situation where another thread may be calling > {{AnalyzingInfixSuggester.build()}} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] vigyasharma commented on a diff in pull request #848: LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations
vigyasharma commented on code in PR #848: URL: https://github.com/apache/lucene/pull/848#discussion_r861353455

## lucene/facet/src/test/org/apache/lucene/facet/taxonomy/TestTaxonomyFacetAssociations.java:

@@ -500,10 +500,7 @@ private void validateFloats(
       assertNull(facetResult);
     } else {
       assertEquals(dim, facetResult.dim);
-      // We can expect the floats to be exactly equal here since we're ensuring that we sum them
-      // in the same order when determining expected values and when computing facets. See
-      // LUCENE-10530:
-      assertEquals(aggregatedValue, facetResult.value.floatValue(), 0f);
+      assertEquals(aggregatedValue, facetResult.value.floatValue(), 1f);

Review Comment: +1, I was going to comment this in the last iteration. There may also be some internal drift in the way the aggregation function works on floats. Found a nice read on it [here](https://www.reidatcheson.com/statistics/floating%20point/error/2018/01/15/float-sum-errors.html). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
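The order-dependence being discussed is easy to reproduce in isolation. A standalone demonstration, not Lucene code; the constants are chosen so that the unit of least precision (ulp) of a float near 1e8 is 8:

```java
// Float addition is not associative: summing identical values under a
// different grouping can change the result, which is why the test above
// now allows a small delta instead of exact equality.
public class FloatSumOrder {
  public static void main(String[] args) {
    float big = 1e8f; // ulp at 1e8 is 8, so adding 4 alone is rounded away
    System.out.println((big + 4f) + 4f); // 1.0E8        (each 4 is absorbed)
    System.out.println(big + (4f + 4f)); // 1.00000008E8 (the combined 8 survives)
  }
}
```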
[jira] [Commented] (LUCENE-10292) AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()
[ https://issues.apache.org/jira/browse/LUCENE-10292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529677#comment-17529677 ] ASF subversion and git services commented on LUCENE-10292: -- Commit 115bcd9c66d9723a7a1ea5131aa2f5d16a6867b7 in lucene's branch refs/heads/branch_9x from Chris M. Hostetter [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=115bcd9c66d ] LUCENE-10292: prevent thread leak (or test timeout) if exception/assertion failure in test iterator (cherry picked from commit 6afb9bc25a5935a3cba0a06e67b27fe290255467) > AnalyzingInfixSuggester thread safety: lookup() fails during (re)build() > > > Key: LUCENE-10292 > URL: https://issues.apache.org/jira/browse/LUCENE-10292 > Project: Lucene - Core > Issue Type: Bug >Reporter: Chris M. Hostetter >Assignee: Chris M. Hostetter >Priority: Major > Fix For: 10.0 (main), 9.2 > > Attachments: LUCENE-10292-1.patch, LUCENE-10292-2.patch, > LUCENE-10292-3.patch, LUCENE-10292.patch > > > I'm filing this based on anecdotal information from a Solr user w/o > experiencing it first hand (and I don't have a test case to demonstrate it) > but based on a reading of the code the underlying problem seems self > evident... > With all other Lookup implementations I've examined, it is possible to call > {{lookup()}} regardless of whether another thread is concurrently calling > {{build()}} – in all cases I've seen, it is even possible to call > {{lookup()}} even if {{build()}} has never been called: the result is just an > "empty" {{List}} > Typically this is works because the {{build()}} method uses temporary > datastructures until it's "build logic" is complete, at which point it > atomically replaces the datastructures used by the {{lookup()}} method. In > the case of {{AnalyzingInfixSuggester}} however, the {{build()}} method > starts by closing & null'ing out the {{protected SearcherManager > searcherMgr}} (which it only populates again once it's completed building up > it's index) and then the lookup method starts with... > {code:java} > if (searcherMgr == null) { > throw new IllegalStateException("suggester was not built"); > } > {code} > ... meaning it is unsafe to call {{AnalyzingInfixSuggester.lookup()}} in any > situation where another thread may be calling > {{AnalyzingInfixSuggester.build()}} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10524) Augment CONTRIBUTING.md guide with instructions on how/when to benchmark
[ https://issues.apache.org/jira/browse/LUCENE-10524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529692#comment-17529692 ] Gautam Worah commented on LUCENE-10524: --- I've submitted a pretty basic patch. It may be better to redirect users to the actual benchmarking package and let that package have help instructions in case they get stuck, rather than keeping help instructions in Apache Lucene? The Github patch does that at the moment. Looking forward to seeing what you think. > Augment CONTRIBUTING.md guide with instructions on how/when to benchmark > > > Key: LUCENE-10524 > URL: https://issues.apache.org/jira/browse/LUCENE-10524 > Project: Lucene - Core > Issue Type: Wish >Reporter: Gautam Worah >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > This came up when I was trying to think about improving the experience for > new contributors. > Today, new contributors are usually unaware of where luceneutil benchmarks > are and when/how to run them. Committers usually end up pointing contributors > to the benchmarks package when they make perf impacting changes and then they > run the benchmarks. > > Adding benchmark details to the Lucene repo will also make them more > accessible to other researchers who want to experiment/benchmark their own > custom task implementation with Java Lucene. > > What does the community think? > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mocobeta commented on pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
mocobeta commented on PR #850: URL: https://github.com/apache/lucene/pull/850#issuecomment-1112791650 I traced method calls of `BaseTokenStreamTestCase.checkRandomData(Random random, Analyzer a, int iterations, int maxWordLength)` with larger values than 255 for `maxWordLength`. Those might also need larger `maxTokenLength`; I don't know whether they must be changed or the reduced default value is okay, but just in case...
```
TestShingleFilter.testRandomHugeStrings()
TestICUNormalizer2CharFilter.testRandomStringsNFKC()
TestICUNormalizer2CharFilter.testRandomStringsNFKC_CF()
TestICUNormalizer2CharFilter.testRandomStringsNFKD()
TestSynonymMapFilter.testRandomHuge()
TestHTMLStripCharFilter.testRandomHugeStrings()
TestLookaheadTokenFilter.testRandomStrings()
TestLookaheadTokenFilter.testNeverCallingPeek()
TestSynonymGraphFilter.testRandomHuge()
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
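If any of those tests do need a larger limit, MockTokenizer already accepts a per-instance max token length, so a test can opt out of the reduced default. A sketch under assumptions: it relies on the existing `MockTokenizer(CharacterRunAutomaton, boolean, int)` constructor and the test-framework helpers named in the comment above, and the word length 8192 is an arbitrary illustration:

```java
// Sketch: an analyzer whose MockTokenizer keeps accepting long tokens,
// for tests that generate random words above the reduced default length.
Analyzer a = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    // the third constructor argument overrides DEFAULT_MAX_TOKEN_LENGTH
    Tokenizer tokenizer =
        new MockTokenizer(MockTokenizer.WHITESPACE, false, IndexWriter.MAX_TERM_LENGTH);
    return new TokenStreamComponents(tokenizer);
  }
};
// maxWordLength larger than 255, as in the tests listed above
checkRandomData(random(), a, 100 * RANDOM_MULTIPLIER, 8192);
```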
[GitHub] [lucene] gsmiller commented on a diff in pull request #848: LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations
gsmiller commented on code in PR #848: URL: https://github.com/apache/lucene/pull/848#discussion_r861408137

## lucene/facet/src/test/org/apache/lucene/facet/taxonomy/TestTaxonomyFacetAssociations.java:

@@ -500,10 +500,7 @@ private void validateFloats(
       assertNull(facetResult);
     } else {
       assertEquals(dim, facetResult.dim);
-      // We can expect the floats to be exactly equal here since we're ensuring that we sum them
-      // in the same order when determining expected values and when computing facets. See
-      // LUCENE-10530:
-      assertEquals(aggregatedValue, facetResult.value.floatValue(), 0f);
+      assertEquals(aggregatedValue, facetResult.value.floatValue(), 1f);

Review Comment: Thanks for sharing! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #848: LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations
gsmiller commented on PR #848: URL: https://github.com/apache/lucene/pull/848#issuecomment-1112794825 @vigyasharma yeah exactly. So the faceting implementation was summing the doc values in a different order from the test case. I wish I could do something similar for the overall aggregation value as well, but it depends on the order that siblings are vended by the taxonomy index. While it's true today that it's reverse-ordinal order, that seems pretty fragile to rely on. I'm more confident relying on the fact that documents are visited in the order they are collected since that feels pretty stable to the internal workings of Lucene's query evaluation. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren
gsmiller commented on PR #843: URL: https://github.com/apache/lucene/pull/843#issuecomment-1112797399 > Are you suggesting that I add another getAllChildren API to retain the current behavior of getTopChildren, and then fix getTopChildren by returning ranges sorted by counts, or do you mean we should just keep this API untouched? Thanks! I'm suggesting the former, but I think that's a bigger project than you meant to capture with this PR and associated Jira. I would recommend opening a separate Jira issue to propose the idea of adding `getAllChildren` to `Facets`. We have use-cases for that functionality here (but it's just sort of improperly implemented under `getTopChildren`) and also in `LongValueFacetCounts` (see `#getAllChildrenSortByValue`). I think it's reasonable to add that functionality to all faceting implementations and then come up with a migration plan to move the current `RangeFacetCounts#getTopChildren` over and properly implement `RangeFacetCounts#getTopChildren`. But by opening a Jira, maybe others will weigh in and agree/disagree there, which would be a healthy conversation. Thanks again for digging into this and working on improving faceting! This is great! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mocobeta commented on pull request #853: LUCENE-10524 Add benchmark suite details to CONTRIBUTING.md
mocobeta commented on PR #853: URL: https://github.com/apache/lucene/pull/853#issuecomment-1112798859 @gautamworah96 This looks good to me! I wonder if it makes sense to mention the off-the-shelf https://github.com/apache/lucene/tree/main/lucene/benchmark module here too (I didn't find instructions for running it, but a mention could still be helpful). I'd also suggest letting @mikemccand know about the change, thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mocobeta merged pull request #846: LUCENE-10493: move n-best logic to analysis-common
mocobeta merged PR #846: URL: https://github.com/apache/lucene/pull/846 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10493) Can we unify the viterbi search logic in the tokenizers of kuromoji and nori?
[ https://issues.apache.org/jira/browse/LUCENE-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529710#comment-17529710 ] ASF subversion and git services commented on LUCENE-10493: -- Commit c28f575b6db1ece837e1cba3fa5526e30135eb5a in lucene's branch refs/heads/main from Tomoko Uchida [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=c28f575b6db ] LUCENE-10493: move n-best logic to analysis-common (#846) > Can we unify the viterbi search logic in the tokenizers of kuromoji and nori? > - > > Key: LUCENE-10493 > URL: https://issues.apache.org/jira/browse/LUCENE-10493 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Reporter: Tomoko Uchida >Priority: Major > Time Spent: 4h 20m > Remaining Estimate: 0h > > We now have common dictionary interfaces for kuromoji and nori > ([LUCENE-10393]). A natural question would be: is it possible to unify the > Japanese/Korean tokenizers? > The core methods of the two tokenizers are `parse()` and `backtrace()` to > calculate the minimum cost path by Viterbi search. I'd set the goal of this > issue to factoring out them into a separate class (in analysis-common) that > is shared between JapaneseTokenizer and KoreanTokenizer. > The algorithm to solve the minimum cost path itself is of course > language-agnostic, so I think it should be theoretically possible; the most > difficult part here might be the N-best path calculation - which is supported > only by JapaneseTokenizer and not by KoreanTokenizer. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
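As background for the refactoring above: the shared core of both tokenizers is a minimum-cost path search over a token lattice. A toy version follows — schematic only, far simpler than the real parse()/backtrace() code, which also adds bigram connection costs between adjacent tokens and records back-pointers so the cheapest path can be walked back into tokens:

{code:java}
import java.util.Arrays;
import java.util.List;

public class ToyViterbi {
  // Positions 0..n in the input; each candidate token starting at position i
  // is an edge {endPos, cost} in edgesFrom.get(i).
  static int minCost(int n, List<List<int[]>> edgesFrom) {
    int[] best = new int[n + 1];
    Arrays.fill(best, Integer.MAX_VALUE);
    best[0] = 0;
    for (int i = 0; i < n; i++) {
      if (best[i] == Integer.MAX_VALUE) continue; // position unreachable
      for (int[] e : edgesFrom.get(i)) {
        int end = e[0], cost = e[1];
        if (best[i] + cost < best[end]) {
          best[end] = best[i] + cost; // real code also records a back-pointer
        }
      }
    }
    return best[n]; // backtrace() would walk back-pointers to emit tokens
  }
}
{code}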
[GitHub] [lucene] Yuti-G commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren
Yuti-G commented on PR #843: URL: https://github.com/apache/lucene/pull/843#issuecomment-1112806474 Thanks @gsmiller for confirming! I will create another Jira issue to propose adding getAllChildren to Facets, and revisit this issue after getting feedback from the community. Thanks again for the great suggestion! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-10545) Allow Github PR link in CHANGES.html
Tomoko Uchida created LUCENE-10545: -- Summary: Allow Github PR link in CHANGES.html Key: LUCENE-10545 URL: https://issues.apache.org/jira/browse/LUCENE-10545 Project: Lucene - Core Issue Type: Sub-task Reporter: Tomoko Uchida {\{changes2html}} already supports links to Github PRs, but the link is obsoleted so a small modification is needed to make it work again. With this change, developers can directly link to their pull requests from CHANGES.txt, skipping creating Jira issues. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mocobeta opened a new pull request, #854: LUCENE-10393: allow to link github pr from changes
mocobeta opened a new pull request, #854: URL: https://github.com/apache/lucene/pull/854 changes2html already supports links to Github PRs, but the link is obsoleted so a small modification is needed to make it work again. With this change, developers can directly link to their pull requests from CHANGES.txt, skipping creating Jira issues. I'll also show an example in this PR. Here is the CHANGES.html generated by the `changesToHtml` task. The "GITHUB-740" hyperlink correctly points to #740. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mocobeta commented on a diff in pull request #854: LUCENE-10393: allow to link github pr from changes
mocobeta commented on code in PR #854: URL: https://github.com/apache/lucene/pull/854#discussion_r861431387

## lucene/CHANGES.txt:

@@ -53,7 +53,7 @@ Other
 * LUCENE-10253: The @BadApple annotation has been removed from the test framework. (Adrien Grand)
-* LUCENE-10393: Unify binary dictionary and dictionary writer in Kuromoji and Nori.
+* GITHUB-740: Unify binary dictionary and dictionary writer in Kuromoji and Nori.

Review Comment: This mention of the Github PR could also be written in a more verbose style like `GITHUB Pull Request #740`; I'd just prefer the shorter string. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mocobeta commented on a diff in pull request #854: LUCENE-10393: allow to link github pr from changes
mocobeta commented on code in PR #854: URL: https://github.com/apache/lucene/pull/854#discussion_r861432053

## gradle/documentation/changes-to-html/changes2html.pl:

@@ -572,7 +572,7 @@
   $item =~ s{((LUCENE|SOLR|INFRA)\s+(\d{3,}))} {$1}gi;
   # Link "[ github | gh ] pull request [ # ] X+" to Github pull request
-  $item =~ s{((?:(?:(?:github|gh)\s+)?pull\s+request\s*(?:\#?\s*)?|gh-)(\d+))}
+  $item =~ s{((?:(?:(?:github|gh)\s+)?pull\s+request\s*(?:\#?\s*)?|gh-|github-)(\d+))}

Review Comment: The acronym `GH-` may be okay, but `GITHUB-` is also good for understandability and still succinct, I think? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
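The effect of the added `|github-` alternative can be sanity-checked with a rough Java translation of the Perl alternation. This is illustration only; the real script also builds the replacement hyperlink, omitted here, and the translation simplifies the optional `#` handling:

```java
import java.util.regex.Pattern;

public class GithubLinkPattern {
  public static void main(String[] args) {
    // Rough equivalent of the amended alternation in changes2html.pl.
    Pattern p = Pattern.compile(
        "(?:(?:(?:github|gh)\\s+)?pull\\s+request\\s*#?\\s*|gh-|github-)(\\d+)",
        Pattern.CASE_INSENSITIVE);
    System.out.println(p.matcher("GITHUB-740").find());        // true (enabled by this change)
    System.out.println(p.matcher("gh-740").find());            // true (already supported)
    System.out.println(p.matcher("pull request #740").find()); // true (already supported)
  }
}
```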
[jira] [Updated] (LUCENE-10545) Allow Github PR link in CHANGES.html
[ https://issues.apache.org/jira/browse/LUCENE-10545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida updated LUCENE-10545: --- Description: {{changes2html}} already supports links to Github PRs ([LUCENE-5383]), but the link is obsoleted so a small modification is needed to make it work again. With this change, developers can directly link to their pull requests from CHANGES.txt, skipping creating Jira issues. was: {\{changes2html}} already supports links to Github PRs, but the link is obsoleted so a small modification is needed to make it work again. With this change, developers can directly link to their pull requests from CHANGES.txt, skipping creating Jira issues. > Allow Github PR link in CHANGES.html > > > Key: LUCENE-10545 > URL: https://issues.apache.org/jira/browse/LUCENE-10545 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Tomoko Uchida >Priority: Minor > > {{changes2html}} already supports links to Github PRs ([LUCENE-5383]), but > the link is obsoleted so a small modification is needed to make it work again. > With this change, developers can directly link to their pull requests from > CHANGES.txt, skipping creating Jira issues. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10545) Allow Github PR link in CHANGES.html
[ https://issues.apache.org/jira/browse/LUCENE-10545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida updated LUCENE-10545: --- Description: {{changes2html}} already supports links to Github PRs (LUCENE-5383), but the link is obsoleted so a small modification is needed to make it work again. With this change, developers can directly link to their pull requests from CHANGES.txt, skipping creating Jira issues. When the change was made (2014), Github was still a "third party" to Apache and the integration had been on experimental status; the situation has changed - now GitHub is the first-class code hosting service for ASF projects. I think the requirement to open a Jira issue can be relaxed. was: {{changes2html}} already supports links to Github PRs ([LUCENE-5383]), but the link is obsoleted so a small modification is needed to make it work again. With this change, developers can directly link to their pull requests from CHANGES.txt, skipping creating Jira issues. > Allow Github PR link in CHANGES.html > > > Key: LUCENE-10545 > URL: https://issues.apache.org/jira/browse/LUCENE-10545 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Tomoko Uchida >Priority: Minor > > {{changes2html}} already supports links to Github PRs (LUCENE-5383), but the > link is obsoleted so a small modification is needed to make it work again. > With this change, developers can directly link to their pull requests from > CHANGES.txt, skipping creating Jira issues. > When the change was made (2014), Github was still a "third party" to Apache and > the integration had been on experimental status; the situation has changed - > now GitHub is the first-class code hosting service for ASF projects. I think > the requirement to open a Jira issue can be relaxed. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10545) Allow Github PR link in CHANGES.html
[ https://issues.apache.org/jira/browse/LUCENE-10545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529732#comment-17529732 ] Tomoko Uchida commented on LUCENE-10545: This PR gives a concrete example of how it'll work. https://github.com/apache/lucene/pull/854 > Allow Github PR link in CHANGES.html > > > Key: LUCENE-10545 > URL: https://issues.apache.org/jira/browse/LUCENE-10545 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Tomoko Uchida >Priority: Minor > > {{changes2html}} already supports links to Github PRs (LUCENE-5383), but the > link is obsoleted so a small modification is needed to make it work again. > With this change, developers can directly link to their pull requests from > CHANGES.txt, skipping creating Jira issues. > When the change was made (2014), Github was still a "third party" to Apache and > the integration had been on experimental status; the situation has changed - > now GitHub is the first-class code hosting service for ASF projects. I think > the requirement to open a Jira issue can be relaxed. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mocobeta commented on a diff in pull request #854: LUCENE-10393: allow to link github pr from changes
mocobeta commented on code in PR #854: URL: https://github.com/apache/lucene/pull/854#discussion_r861437113

## lucene/CHANGES.txt:

@@ -53,7 +53,7 @@ Other
 * LUCENE-10253: The @BadApple annotation has been removed from the test framework. (Adrien Grand)
-* LUCENE-10393: Unify binary dictionary and dictionary writer in Kuromoji and Nori.
+* GITHUB-740: Unify binary dictionary and dictionary writer in Kuromoji and Nori.

Review Comment: Even if we start to use Github issues in the future, I think this is still valid - CHANGES should directly point to PRs, not issues, in my opinion. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-10543) Achieve contribution workflow perfection (with progress)
[ https://issues.apache.org/jira/browse/LUCENE-10543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529735#comment-17529735 ] Tomoko Uchida edited comment on LUCENE-10543 at 4/29/22 3:21 AM: - I opened [LUCENE-10545] and a PR for it. Can you please take a look? was (Author: tomoko uchida): I opened [LUCENE-10393] and a PR for it. Can you please take a look? > Achieve contribution workflow perfection (with progress) > > > Key: LUCENE-10543 > URL: https://issues.apache.org/jira/browse/LUCENE-10543 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > > Inspired by Dawid's build issue which has worked out for us: LUCENE-9871 > He hasn't even linked 10% of the issues/subtasks involved in that work > either, but we know. > I think we need a similar approach for the contribution workflow. There has > been some major improvements recently, a couple that come to mind: > * Tomoko made a CONTRIBUTING.md file which github recognizes and is way > better than the wiki stuff > * Some hazards/error messages/mazes in the build process and so on have > gotten fixed. > But there is more to do in my opinion, here is 3 ideas: > * Creating a PR still has a massive checklist template. But now this template > links to CONTRIBUTING.md, so why include the other stuff/checklist? Isn't it > enough to just link to CONTRIBUTING.md and fix that as needed? > * Creating a PR still requires signing up for Apache JIRA and creating a JIRA > issue. There is zero value to this additional process. We often end out with > either JIRAs and/or PRs that have zero content, or maybe conflicting/outdated > content. This is just an unnecessary dance, can we use github issues instead? > * Haven't dug into the github actions or configs very deeply. Maybe there's > simple stuff we can do such as give useful notifications if checks fail. Try > to guide the user to run ./gradlew check and fix it. It sucks to have to > review, look at logs, and manually add comments to do this stuff. > So let's have an issue to improve this area. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10545) Allow Github PR link in CHANGES.html
[ https://issues.apache.org/jira/browse/LUCENE-10545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529748#comment-17529748 ] Tomoko Uchida commented on LUCENE-10545: Note that this does not replace Jira issues with Github (we still absolutely need issues for discussion), but provides an option for developers who already have working patches. > Allow Github PR link in CHANGES.html > > > Key: LUCENE-10545 > URL: https://issues.apache.org/jira/browse/LUCENE-10545 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Tomoko Uchida >Priority: Minor > > {{changes2html}} already supports links to Github PRs (LUCENE-5383), but the > link is obsoleted so a small modification is needed to make it work again. > With this change, developers can directly link to their pull requests from > CHANGES.txt, skipping creating Jira issues. > When the change was made (2014), Github was still a "third party" to Apache and > the integration had been on experimental status; the situation has changed - > now GitHub is the first-class code hosting service for ASF projects. I think > the requirement to open a Jira issue can be relaxed. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-10543) Achieve contribution workflow perfection (with progress)
[ https://issues.apache.org/jira/browse/LUCENE-10543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529735#comment-17529735 ] Tomoko Uchida edited comment on LUCENE-10543 at 4/29/22 6:36 AM: - I opened [LUCENE-10545] and a PR for it. Can you please take a look? It does not aim to replace Jira with GitHub, but relaxes the requirement for a Jira account/issue for contributors who are ready to start the conversation with a working PR. This would make some progress, I think. For the future, we could consider replacing Jira with Github issues - I'm not against it, and it may be technically possible, but I think we'd need some additional discussion (also with ASF, maybe?) for such a big transition. I have no personal opinion about the migration to GitHub - Jira works well for me so far and I'm already accustomed to the concurrent use of Jira and GitHub. was (Author: tomoko uchida): I opened [LUCENE-10545] and a PR for it. Can you please take a look? > Achieve contribution workflow perfection (with progress) > > > Key: LUCENE-10543 > URL: https://issues.apache.org/jira/browse/LUCENE-10543 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > > Inspired by Dawid's build issue which has worked out for us: LUCENE-9871 > He hasn't even linked 10% of the issues/subtasks involved in that work > either, but we know. > I think we need a similar approach for the contribution workflow. There has > been some major improvements recently, a couple that come to mind: > * Tomoko made a CONTRIBUTING.md file which github recognizes and is way > better than the wiki stuff > * Some hazards/error messages/mazes in the build process and so on have > gotten fixed. > But there is more to do in my opinion, here is 3 ideas: > * Creating a PR still has a massive checklist template. But now this template > links to CONTRIBUTING.md, so why include the other stuff/checklist? Isn't it > enough to just link to CONTRIBUTING.md and fix that as needed? > * Creating a PR still requires signing up for Apache JIRA and creating a JIRA > issue. There is zero value to this additional process. We often end out with > either JIRAs and/or PRs that have zero content, or maybe conflicting/outdated > content. This is just an unnecessary dance, can we use github issues instead? > * Haven't dug into the github actions or configs very deeply. Maybe there's > simple stuff we can do such as give useful notifications if checks fail. Try > to guide the user to run ./gradlew check and fix it. It sucks to have to > review, look at logs, and manually add comments to do this stuff. > So let's have an issue to improve this area. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org