[jira] [Created] (LUCENE-10543) Achieve contribution workflow perfection (with progress)
Robert Muir created LUCENE-10543:

Summary: Achieve contribution workflow perfection (with progress)
Key: LUCENE-10543
URL: https://issues.apache.org/jira/browse/LUCENE-10543
Project: Lucene - Core
Issue Type: Task
Reporter: Robert Muir

Inspired by Dawid's build issue, which has worked out well for us: LUCENE-9871. He hasn't even linked 10% of the issues/subtasks involved in that work, but we know.

I think we need a similar approach for the contribution workflow. There have been some major improvements recently; a couple that come to mind:

* Tomoko made a CONTRIBUTING.md file, which GitHub recognizes and which is far better than the wiki material.
* Some hazards/error messages/mazes in the build process and so on have been fixed.

But there is more to do, in my opinion. Here are three ideas:

* Creating a PR still presents a massive checklist template. Now that this template links to CONTRIBUTING.md, why include the rest of the checklist? Isn't it enough to link to CONTRIBUTING.md and fix that file as needed?
* Creating a PR still requires signing up for Apache JIRA and creating a JIRA issue. There is zero value in this additional process. We often end up with JIRAs and/or PRs that have no content, or conflicting/outdated content. This is an unnecessary dance; can we use GitHub issues instead?
* I haven't dug into the GitHub Actions or configs very deeply. Maybe there is simple stuff we can do, such as posting useful notifications when checks fail and guiding the user to run ./gradlew check and fix the problem. It is tedious to review, look at logs, and manually add comments to do this.

So let's have an issue to improve this area.
[jira] [Commented] (LUCENE-10543) Achieve contribution workflow perfection (with progress)
[ https://issues.apache.org/jira/browse/LUCENE-10543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529236#comment-17529236 ]
Robert Muir commented on LUCENE-10543:

Another idea is to use GitHub's wiki functionality instead of the Confluence wiki we have today. The Confluence content is badly out of date, and almost nobody has permissions to edit it. Buried somewhere are instructions to ping the mailing list and ask for permissions, and occasionally someone does, but come on. I think using the GitHub wiki would help it stay up to date.
[jira] [Commented] (LUCENE-10543) Achieve contribution workflow perfection (with progress)
[ https://issues.apache.org/jira/browse/LUCENE-10543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529240#comment-17529240 ]
Robert Muir commented on LUCENE-10543:

Another idea: add a simple "fork me on GitHub" link to the website. Just look at our website: getting to the sources is not the easiest task. I have to evaluate a lot of open-source software myself and always appreciate a quick link to the sources; let's not make this difficult.
[jira] [Commented] (LUCENE-10543) Achieve contribution workflow perfection (with progress)
[ https://issues.apache.org/jira/browse/LUCENE-10543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529245#comment-17529245 ]
Robert Muir commented on LUCENE-10543:

Believe it or not, there is actually no link to the source code anywhere, only command-line instructions to run {{git clone https://github.com/apache/lucene.git}}. Yet you can contribute changes to things like documentation straight from the GitHub UI, with no command line at all. Linking to the sources appropriately (especially given that the repo has moved from lucene-solr to lucene) seems like priority number one.
[jira] [Commented] (LUCENE-10543) Achieve contribution workflow perfection (with progress)
[ https://issues.apache.org/jira/browse/LUCENE-10543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529255#comment-17529255 ]
Dawid Weiss commented on LUCENE-10543:

("with progress"... yeah, that's why LUCENE-9871 is still open :) )
[jira] [Commented] (LUCENE-10541) What to do about massive terms in our Wikipedia EN LineFileDocs?
[ https://issues.apache.org/jira/browse/LUCENE-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529261#comment-17529261 ]
Dawid Weiss commented on LUCENE-10541:

Filed a PR at https://github.com/apache/lucene/pull/850. I picked the default from CharTokenizer.DEFAULT_MAX_WORD_LEN, although I can't reference that constant directly (it isn't accessible from the test framework). I had to tweak the defaults in one or two failing tests that expected the tokenizer to return longer tokens, so a second set of eyes would be good.

The enwiki line file contains 2 million lines. It'd be nice to calculate the probability of any of the k faulty (long-term) lines being drawn in n tries and distribute it over time; this would address Mike's question about why it took so long to discover them. :)

> What to do about massive terms in our Wikipedia EN LineFileDocs?
>
> Key: LUCENE-10541
> URL: https://issues.apache.org/jira/browse/LUCENE-10541
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Priority: Major
>
> Spinoff from this fun build failure that [~dweiss] root caused:
> [https://lucene.markmail.org/thread/pculfuazll4oebra]
>
> Thank you and sorry [~dweiss]!!
>
> This test failure happened because the test case randomly indexed a chunk of the nightly (many GBs) LineFileDocs Wikipedia file that had a massive (> IW's ~32 KB limit) term, and IW threw an {{IllegalArgumentException}}, failing the test.
>
> It's crazy that it took so long for Lucene's randomized tests to discover this too-massive term in Lucene's nightly benchmarks. It's like searching for Nessie, or [SETI|https://en.wikipedia.org/wiki/Search_for_extraterrestrial_intelligence].
>
> We need to prevent such false failures, somehow, and there are multiple options: fix this test to not use {{LineFileDocs}}, remove all "massive" terms from all tests (nightly and git) {{LineFileDocs}}, or fix {{MockTokenizer}} to trim such ridiculous terms (I think this is the best option?), ...
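For the probability Dawid mentions, a rough model helps (hedged: k and the per-run sample size n below are invented illustrative numbers, not measurements). If a test run draws n lines uniformly at random from the N = 2,000,000-line file, of which k are faulty, then

$$ P(\text{hit in one run}) \;=\; 1 - \Bigl(1 - \frac{k}{N}\Bigr)^{n} \;\approx\; 1 - e^{-kn/N} $$

For example, with k = 5 and n = 1000, that is roughly $1 - e^{-0.0025} \approx 0.25\%$ per run, i.e. an expectation of about 400 runs before the first failure, which would be consistent with Mike's observation that it took so long to surface.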
[jira] [Commented] (LUCENE-9871) Achieve build system perfection (with progress)
[ https://issues.apache.org/jira/browse/LUCENE-9871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529263#comment-17529263 ]
Robert Muir commented on LUCENE-9871:

Just a reminder: I haven't forgotten about the Java version constraints here. We've added special logic to make failures clearer, but there are version numbers everywhere. Ideally we could consolidate a lot of them into a simple .properties file containing the min/max major version numbers, which could then be consumed by:

* gradle logic
* Java logic, such as the checks done in WrapperDownloader
* bash logic, such as the error messaging in ./gradlew.sh
* the Python smoketester logic?

> Achieve build system perfection (with progress)
>
> Key: LUCENE-9871
> URL: https://issues.apache.org/jira/browse/LUCENE-9871
> Project: Lucene - Core
> Issue Type: Improvement
> Affects Versions: 9.0
> Reporter: Dawid Weiss
> Assignee: Dawid Weiss
> Priority: Minor
> Time Spent: 20m
> Remaining Estimate: 0h
>
> This issue is an aggregate of various build-related improvements I have in the back of my mind. The current state is not bad... but it's not perfect. Hello, [~mikemccand]. ;)
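A minimal sketch of that idea (hedged: the file name build-options.properties and the key names below are invented for illustration; no such file exists in the build today). The Java side could look roughly like this WrapperDownloader-style check, and because a .properties file is just key=value lines, gradle, bash, and Python could each parse the same file trivially:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Hypothetical sketch only: file name and keys are made up for illustration.
public class JavaVersionCheck {
  public static void main(String[] args) throws IOException {
    Properties props = new Properties();
    try (InputStream in = Files.newInputStream(Path.of("build-options.properties"))) {
      props.load(in); // plain key=value lines, readable from bash/python too
    }
    int min = Integer.parseInt(props.getProperty("java.version.min"));
    int max = Integer.parseInt(props.getProperty("java.version.max"));
    int current = Runtime.version().feature(); // major version of the running JVM
    if (current < min || current > max) {
      System.err.printf(
          "Java %d is unsupported; this build requires Java %d..%d.%n", current, min, max);
      System.exit(1);
    }
  }
}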
[jira] [Commented] (LUCENE-10543) Achieve contribution workflow perfection (with progress)
[ https://issues.apache.org/jira/browse/LUCENE-10543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529270#comment-17529270 ]
Robert Muir commented on LUCENE-10543:

It was entirely too difficult to find this issue! I knew it had "perfection" in the name, too.
[GitHub] [lucene] uschindler commented on pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
uschindler commented on PR #850:
URL: https://github.com/apache/lucene/pull/850#issuecomment-877517

Maybe we should set the limit to the maximum term size. That would also allow us to test longer terms.
[jira] [Commented] (LUCENE-10541) What to do about massive terms in our Wikipedia EN LineFileDocs?
[ https://issues.apache.org/jira/browse/LUCENE-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529286#comment-17529286 ]
Uwe Schindler commented on LUCENE-10541:

I think another option might be to use IndexWriter.MAX_TERM_LENGTH as the default. That would let us test the full range and make sure the block tree terms implementation works as expected.
[GitHub] [lucene] uschindler commented on a diff in pull request #851: LUCENE-10088: backport simple text codec suppression and fs handle limit
uschindler commented on code in PR #851:
URL: https://github.com/apache/lucene/pull/851#discussion_r860615393

## lucene/test-framework/src/java/org/apache/lucene/tests/mockfile/HandleLimitFS.java:
@@ -17,16 +17,38 @@
 package org.apache.lucene.tests.mockfile;

 import java.io.IOException;
+import java.lang.annotation.Documented;
+import java.lang.annotation.ElementType;
+import java.lang.annotation.Inherited;
+import java.lang.annotation.Retention;
+import java.lang.annotation.RetentionPolicy;
+import java.lang.annotation.Target;
 import java.nio.file.FileSystem;
 import java.nio.file.FileSystemException;
 import java.nio.file.Path;
 import java.util.concurrent.atomic.AtomicInteger;

-/** FileSystem that throws exception if file handles in use exceeds a specified limit */
+/**
+ * FileSystem that throws exception if file handles in use exceeds a specified limit.
+ *
+ * @see MaxOpenHandles
+ */
 public class HandleLimitFS extends HandleTrackingFS {
   final int limit;
   final AtomicInteger count = new AtomicInteger();

+  /** An annotation */
+  @Documented
+  @Inherited
+  @Retention(RetentionPolicy.RUNTIME)
+  @Target(ElementType.TYPE)
+  public static @interface MaxOpenHandles {
+    // TODO: can we make the default even lower?
+    public static final int MAX_OPEN_FILES = 2048;

Review Comment: Jenkins sets this limit to 2048 at the OS level, too, so tests that use a greater limit together with FSDirectory may fail.
[GitHub] [lucene] uschindler commented on a diff in pull request #851: LUCENE-10088: backport simple text codec suppression and fs handle limit
uschindler commented on code in PR #851 (same HandleLimitFS hunk as above):
URL: https://github.com/apache/lucene/pull/851#discussion_r860615990

Review Comment: At least on macOS.
[GitHub] [lucene] dweiss commented on a diff in pull request #851: LUCENE-10088: backport simple text codec suppression and fs handle limit
dweiss commented on code in PR #851 (same HandleLimitFS hunk as above):
URL: https://github.com/apache/lucene/pull/851#discussion_r860616950

Review Comment: I just backported what's on main, didn't touch it otherwise. :)
[GitHub] [lucene] uschindler commented on a diff in pull request #851: LUCENE-10088: backport simple text codec suppression and fs handle limit
uschindler commented on code in PR #851 (same HandleLimitFS hunk as above):
URL: https://github.com/apache/lucene/pull/851#discussion_r860618388

Review Comment: I know. I can also reply to Robert's mail. I think the macOS Jenkins did not hit that because it does not run nightly tests.
[GitHub] [lucene] rmuir commented on a diff in pull request #851: LUCENE-10088: backport simple text codec suppression and fs handle limit
rmuir commented on code in PR #851 (same HandleLimitFS hunk as above):
URL: https://github.com/apache/lucene/pull/851#discussion_r860619792

Review Comment: Yeah, as mentioned on the dev list, I don't like to see that we added this annotation as an escape mechanism to raise the file count even higher. Personally I would prefer 1024 or smaller. I'm using Linux 5.17.4 with all defaults, and my system shows 1024 in ulimit.
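For context, a hedged usage sketch of the escape mechanism being discussed. The diff above cuts off before the annotation's members, so the limit() element below is an assumption (consistent with how the thread describes per-test overrides), not confirmed code:

import org.apache.lucene.tests.mockfile.HandleLimitFS;
import org.apache.lucene.tests.util.LuceneTestCase;

// Hypothetical: a test that legitimately needs more handles opts in
// explicitly, instead of raising the default for the whole suite.
// The limit() element is assumed; it is not shown in the truncated diff.
@HandleLimitFS.MaxOpenHandles(limit = 4096)
public class TestWithManyOpenFiles extends LuceneTestCase {
  // ... test methods that open many files concurrently ...
}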
[GitHub] [lucene] rmuir commented on a diff in pull request #851: LUCENE-10088: backport simple text codec suppression and fs handle limit
rmuir commented on code in PR #851 (same HandleLimitFS hunk as above):
URL: https://github.com/apache/lucene/pull/851#discussion_r860621619

Review Comment: At the same time, I would prefer we stabilize the tests in Jenkins; we can open follow-up issues for the file-handle stuff. It is just silly: allowing huge values totally defeats the purpose of HandleLimitFS. The whole point of this delegator is to have a limit that works everywhere, to equalize it.
[GitHub] [lucene] uschindler commented on a diff in pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
uschindler commented on code in PR #850:
URL: https://github.com/apache/lucene/pull/850#discussion_r860622081

## lucene/test-framework/src/java/org/apache/lucene/tests/analysis/MockTokenizer.java:
@@ -66,11 +67,11 @@ public class MockTokenizer extends Tokenizer {
    * Limit the default token length to a size that doesn't cause random analyzer failures on
    * unpredictable data like the enwiki data set.
    *
-   * This value defaults to {@code CharTokenizer.DEFAULT_MAX_WORD_LEN}.
+   * This value defaults to {@link IndexWriter#MAX_TERM_LENGTH}.
    *
    * @see "https://issues.apache.org/jira/browse/LUCENE-10541"
    */
-  public static final int DEFAULT_MAX_TOKEN_LENGTH = 255;
+  public static final int DEFAULT_MAX_TOKEN_LENGTH = IndexWriter.MAX_TERM_LENGTH;

Review Comment: After thinking about it, we may need to be careful: I think the maximum term length in IW is in bytes, while the tokenizer counts UTF-16 chars, so it may overflow. @rmuir? Not sure what to do. I like this maximum approach, but it may fail.
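One hedged back-of-the-envelope reading of this overflow concern, assuming MAX_TERM_LENGTH is a UTF-8 byte limit of 32766 (the "~32 KB" limit mentioned in LUCENE-10541): a single UTF-16 code unit can encode to as many as 3 UTF-8 bytes (a BMP character in U+0800..U+FFFF), while a surrogate pair is 2 units for 4 bytes, i.e. only 2 bytes per unit. So in the worst case,

$$ 3 \times \frac{32766}{2} = 49149 > 32766, \qquad \Bigl\lfloor \frac{32766}{3} \Bigr\rfloor = 10922 $$

meaning a cap of MAX_TERM_LENGTH/2 UTF-16 units could still overflow on adversarial all-3-byte input, and MAX_TERM_LENGTH/3 units would be the conservative bound.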
[GitHub] [lucene] uschindler commented on a diff in pull request #851: LUCENE-10088: backport simple text codec suppression and fs handle limit
uschindler commented on code in PR #851 (same HandleLimitFS hunk as above):
URL: https://github.com/apache/lucene/pull/851#discussion_r860623990

Review Comment: On macOS it was lower at the time I set up Jenkins, too, so I raised it.
[GitHub] [lucene] dweiss commented on a diff in pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
dweiss commented on code in PR #850 (same MockTokenizer hunk as above):
URL: https://github.com/apache/lucene/pull/850#discussion_r860625375

Review Comment: We can make it safe: IndexWriter.MAX_TERM_LENGTH/2?
[GitHub] [lucene] rmuir commented on a diff in pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
rmuir commented on code in PR #850 (same MockTokenizer hunk as above):
URL: https://github.com/apache/lucene/pull/850#discussion_r860626938

Review Comment: This tokenizer counts UTF-32 codepoints, I think? Which is another thing entirely :) But 255 is still fine, as 255*4 is far less than IndexWriter's limit. I think it is a good change. *ALL* the "real" analyzers have limits here; they have setters to tweak them, too. Even KeywordTokenizer! So the problem of a tokenizer emitting a huge term is limited to MockTokenizer; let's fix it.
[GitHub] [lucene] rmuir commented on a diff in pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
rmuir commented on code in PR #850 (same MockTokenizer hunk as above):
URL: https://github.com/apache/lucene/pull/850#discussion_r860628445

Review Comment: I would propose 255, yeah, the constant from StandardAnalyzer. It seems to me that most of the other tokenizers steal that default.
[GitHub] [lucene] rmuir commented on a diff in pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
rmuir commented on code in PR #850 (same MockTokenizer hunk as above):
URL: https://github.com/apache/lucene/pull/850#discussion_r860631960

Review Comment:
> We can make it safe: IndexWriter.MAX_TERM_LENGTH/2?

I just checked: currently it indeed counts UTF-16 code units. This tokenizer is quite slow and processes a single codepoint at a time, but the check for this looks at String.length(). Possibly correct or incorrect, but it is a check that should work.
[GitHub] [lucene] rmuir commented on a diff in pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
rmuir commented on code in PR #850 (same MockTokenizer hunk as above):
URL: https://github.com/apache/lucene/pull/850#discussion_r860635303

Review Comment: So yeah, IndexWriter.MAX_TERM_LENGTH/2 should be safe given how this thing currently counts; just watch the off-by-one possibility. Maybe we can reproduce the original failure with the patch. Still, I think MockTokenizer should have a default limit of something reasonable like 255, to prevent having to debug other crazy tests. Sometimes stemmers make terms longer and such; let's not set ourselves up for failure.
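To make the off-by-one trap concrete, here is an illustrative sketch of a codepoint-at-a-time cap (hedged: this is not MockTokenizer's actual loop, just the pattern under discussion). The guard runs before appending, so a surrogate pair (2 UTF-16 units) near the limit truncates the token rather than overflowing it:

// Illustrative only; maxTokenLength counts UTF-16 code units
// (String.length() semantics, as discussed in this thread).
final class TokenCap {
  static String capToken(String input, int maxTokenLength) {
    StringBuilder token = new StringBuilder();
    int i = 0;
    while (i < input.length()) {
      int cp = input.codePointAt(i);       // one codepoint at a time
      int units = Character.charCount(cp); // 1 for BMP, 2 for surrogate pairs
      if (token.length() + units > maxTokenLength) {
        break;                             // truncate instead of exceeding the cap
      }
      token.appendCodePoint(cp);
      i += units;
    }
    return token.toString();
  }
}

For example, capToken("a".repeat(10), 5) returns "aaaaa", and a 4-unit buffer with a limit of 5 rejects a following surrogate pair rather than growing to 6 units.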
[GitHub] [lucene] dweiss commented on a diff in pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
dweiss commented on code in PR #850 (same MockTokenizer hunk as above):
URL: https://github.com/apache/lucene/pull/850#discussion_r860641028

Review Comment: Umm... so which one should it be then? IndexWriter.MAX_TERM_LENGTH/2, IndexWriter.MAX_TERM_LENGTH/2 - 1, or 255? :)
[GitHub] [lucene] mocobeta commented on a diff in pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
mocobeta commented on code in PR #850 (same MockTokenizer hunk as above):
URL: https://github.com/apache/lucene/pull/850#discussion_r860642306

Review Comment: I'm not so knowledgeable about it, but it might be understandable (or reasonable; I can't find the right English word) to respect `StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH`, in other words, 255 here? I mean, people would expect the mock tokenizer to behave similarly to the standard tokenizer by default.
[GitHub] [lucene] uschindler commented on a diff in pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
uschindler commented on code in PR #850 (same MockTokenizer hunk as above):
URL: https://github.com/apache/lucene/pull/850#discussion_r860657767

Review Comment: So it looks like some tests are now using a larger default. Maybe we should change more tests to use MockTokenizer with a larger size and keep the default at 255. If we have other tests checking the boundaries of IndexWriter terms, I am fine. It just came to my mind not to use the "arbitrary" 255 limit and instead use something else here, but the UTF-16 vs UTF-8 problem is a bit problematic. So I am fine with any solution.
[GitHub] [lucene] uschindler commented on a diff in pull request #851: LUCENE-10088: backport simple text codec suppression and fs handle limit
uschindler commented on code in PR #851 (same HandleLimitFS hunk as above):
URL: https://github.com/apache/lucene/pull/851#discussion_r860662889

Review Comment: I changed the limit on the macOS slave to 8192.
[jira] [Commented] (LUCENE-10543) Achieve contribution workflow perfection (with progress)
[ https://issues.apache.org/jira/browse/LUCENE-10543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529333#comment-17529333 ]
Tomoko Uchida commented on LUCENE-10543:

bq. Creating a PR still requires signing up for Apache JIRA and creating a JIRA issue. There is zero value in this additional process. We often end up with JIRAs and/or PRs that have no content, or conflicting/outdated content. This is an unnecessary dance; can we use GitHub issues instead?

Yes, I think it's an obstacle to contributing (in this GitHub era) when developers already have a working patch... As far as I know, the only technical reason to ask contributors to open a JIRA issue is that our CHANGES file (or {{changes2html.pl}}) requires it. (I have no opinion about replacing JIRA with GitHub issues.)
[jira] [Commented] (LUCENE-10543) Achieve contribution workflow perfection (with progress)
[ https://issues.apache.org/jira/browse/LUCENE-10543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529343#comment-17529343 ]
Tomoko Uchida commented on LUCENE-10543:

I just found that {{changes2html.pl}} appears to support GitHub PRs, though the URL it uses is obsolete: https://github.com/apache/lucene/blob/a8d86ea6e8b89ea0324e7f8b6e1d5213254274d5/gradle/documentation/changes-to-html/changes2html.pl#L28
[GitHub] [lucene] jpountz commented on a diff in pull request #842: LUCENE-10518: Relax field consistency check for old indices
jpountz commented on code in PR #842: URL: https://github.com/apache/lucene/pull/842#discussion_r860789641

## lucene/core/src/java/org/apache/lucene/index/FieldInfos.java:
@@ -350,6 +360,11 @@ static final class FieldNumbers {
       this.omitNorms = new HashMap<>();
       this.storeTermVectors = new HashMap<>();
       this.softDeletesFieldName = softDeletesFieldName;
+      this.strictlyConsistent = indexCreatedVersionMajor >= 9;
+    }
+
+    FieldNumbers(String softDeletesFieldName) {

Review Comment: Does this ctor have many call sites? Could we always use the 2-arg ctor?

## lucene/CHANGES.txt:
@@ -94,8 +94,10 @@ Bug Fixes
   no documents instead of throwing an NPE. (Greg Miller)

 * LUCENE-10470: Check if polygon has been successfully tessellated before we fail (we are failing some valid
-  tessellations) and allow filtering edges that fold on top of the previous one. (Ignacio Vera)
-
+  tessellations) and allow filtering edges that fold on top of the previous one. (Ignacio Vera)
+ 
Review Comment: remove the space on this empty line?

## lucene/core/src/java/org/apache/lucene/index/FieldInfos.java:
@@ -178,7 +179,15 @@ public static FieldInfos getMergedFieldInfos(IndexReader reader) {
             .filter(Objects::nonNull)
             .findAny()
             .orElse(null);
-    final Builder builder = new Builder(new FieldNumbers(softDeletesField));
+    final int indexCreatedVersionMajor =
+        leaves.stream()
+            .map(l -> l.reader().getMetaData())
+            .filter(Objects::nonNull)
+            .mapToInt(r -> r.getCreatedVersionMajor())
+            .min()
+            .orElse(Version.LATEST.major);

Review Comment: I'd like it a bit better if we moved to a for-each loop and threw an error if two segments happen to have different versions, since we'd be in trouble in that case.
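For illustration, the for-each variant suggested above could look roughly like this (a sketch, not the actual patch; it assumes the surrounding method's locals and imports, e.g. `leaves`, `LeafReaderContext`, `LeafMetaData`, `Version`):

```java
// Fail fast if two segments report different index-creation major versions,
// instead of silently taking the minimum.
int indexCreatedVersionMajor = -1;
for (LeafReaderContext leaf : leaves) {
  LeafMetaData metaData = leaf.reader().getMetaData();
  if (metaData == null) {
    continue;
  }
  int version = metaData.getCreatedVersionMajor();
  if (indexCreatedVersionMajor == -1) {
    indexCreatedVersionMajor = version;
  } else if (indexCreatedVersionMajor != version) {
    throw new IllegalStateException(
        "segments were created by different major versions: "
            + indexCreatedVersionMajor + " vs " + version);
  }
}
if (indexCreatedVersionMajor == -1) {
  indexCreatedVersionMajor = Version.LATEST.major; // no segments: use latest
}
```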
[GitHub] [lucene] jpountz commented on pull request #780: LUCENE-10496: avoid unnecessary attempts to evaluate skipping doc if index sort and search sort are in opposite direction
jpountz commented on PR #780: URL: https://github.com/apache/lucene/pull/780#issuecomment-1112113792 If you can find an approach that is less intrusive, maybe I would reconsider, but it still looks to me like we're adding complexity to work around a bad user decision, which doesn't feel like the right trade-off to me.
[GitHub] [lucene] wjp719 commented on pull request #780: LUCENE-10496: avoid unnecessary attempts to evaluate skipping doc if index sort and search sort are in opposite direction
wjp719 commented on PR #780: URL: https://github.com/apache/lucene/pull/780#issuecomment-1112123414 @jpountz OK, thanks. Users often need to search in both directions, but the index data can only be sorted in one direction, so this situation may be common. > If you can find an approach that is less intrusive Can you give me any hints on how to achieve this? Thanks.
[GitHub] [lucene] jpountz commented on pull request #780: LUCENE-10496: avoid unnecessary attempts to evaluate skipping doc if index sort and search sort are in opposite direction
jpountz commented on PR #780: URL: https://github.com/apache/lucene/pull/780#issuecomment-1112127776 Why do you sort the index if you need to sort in both directions? Is it for range queries? Could you use points instead? Sorry, I don't have good ideas for making it less intrusive.
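For context, a minimal sketch of the points-based alternative being suggested (the "timestamp" field name is made up): a point field serves range queries while doc values support sorting in either direction at search time, with no index-time sort at all.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

public class PointsInsteadOfIndexSort {
  static Document newDoc(long timestamp) {
    Document doc = new Document();
    doc.add(new LongPoint("timestamp", timestamp));              // for range queries
    doc.add(new NumericDocValuesField("timestamp", timestamp));  // for sorting
    return doc;
  }

  static Query rangeQuery(long from, long to) {
    return LongPoint.newRangeQuery("timestamp", from, to);
  }

  static Sort descendingSort() {
    // reverse=true sorts descending; flip the flag to sort ascending.
    return new Sort(new SortField("timestamp", SortField.Type.LONG, true));
  }
}
```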
[GitHub] [lucene] mocobeta commented on pull request #846: LUCENE-10493: move n-best logic to analysis-common
mocobeta commented on PR #846: URL: https://github.com/apache/lucene/pull/846#issuecomment-1112127870 @mikemccand thank you for reviewing!
[jira] [Commented] (LUCENE-10541) What to do about massive terms in our Wikipedia EN LineFileDocs?
[ https://issues.apache.org/jira/browse/LUCENE-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529416#comment-17529416 ] Michael McCandless commented on LUCENE-10541: - {quote}enwiki lines contains 2 million lines. It'd be nice to calculate the probability of any of the k faulty (long-term) lines being drawn in n tries and distribute it over time - this would address Mike's question about why it took so long to discover them. :) {quote}

LOL this is indeed fun to work out. There are a couple of wrinkles to modeling this though :)

First, it's not really "randomly picking N lines for each test run"; it's seeking to one spot and then reading N sequential lines from there. Assuming the file is well shuffled (I think it is), this maybe does not change the result versus picking N random lines, since those N sequential lines were already randomized.

Second, the way the seeking works is to pick a random spot (byte location), seek there, scan to the end of that line, and start reading from the following line forwards. Many of the lines are very short, but some of them are longer, and even fewer of them are truly massive (and might have an evil Darth Term in there). One wrinkle here is that if you seek into the middle of one of the Darth Terms, you'll then seek to the end of line and skip that large term entirely. Given that these massive lines take more bytes, it seems more likely that the seeking will skip the Darth Term lines?

Finally, there is one more crazy wrinkle – the nightly LineFileDocs is no longer a simple text file – it also has a pre-chunked "index" so test randomization can jump to one of the pre-computed known skip points. Maybe that chunking introduced some sort of bias?

Fun to think about the Darth Terms!!

> What to do about massive terms in our Wikipedia EN LineFileDocs?
> ----------------------------------------------------------------
> Key: LUCENE-10541
> URL: https://issues.apache.org/jira/browse/LUCENE-10541
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Priority: Major
> Time Spent: 1h 50m
> Remaining Estimate: 0h
>
> Spinoff from this fun build failure that [~dweiss] root caused:
> [https://lucene.markmail.org/thread/pculfuazll4oebra]
> Thank you and sorry [~dweiss]!!
> This test failure happened because the test case randomly indexed a chunk of
> the nightly (many GBs) LineFileDocs Wikipedia file that had a massive (> IW's
> ~32 KB limit) term, and IW threw an {{IllegalArgumentException}} failing the
> test.
> It's crazy that it took so long for Lucene's randomized tests to discover
> this too-massive term in Lucene's nightly benchmarks. It's like searching
> for Nessie, or
> [SETI|https://en.wikipedia.org/wiki/Search_for_extraterrestrial_intelligence].
> We need to prevent such false failures, somehow, and there are multiple
> options: fix this test to not use {{LineFileDocs}}, remove all "massive"
> terms from all tests (nightly and git) {{LineFileDocs}}, fix
> {{MockTokenizer}} to trim such ridiculous terms (I think this is the best
> option?), ...
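For fun, a back-of-the-envelope model of the odds (a sketch that ignores the seek-bias wrinkles above; every number except the 2 million line count is a made-up placeholder):

{code:java}
// Crude model: N total lines, k faulty lines, each run reads c sequential
// lines from a random start, and there have been r runs so far. One run hits
// a faulty line with probability p ~= 1 - (1 - k/N)^c, and at least one of
// the r runs hits with probability 1 - (1 - p)^r.
public class DarthTermOdds {
  public static void main(String[] args) {
    double N = 2_000_000; // lines in the enwiki LineFileDocs
    double k = 3;         // hypothetical number of Darth Term lines
    double c = 1_000;     // hypothetical lines read per test run
    double r = 10_000;    // hypothetical number of runs so far
    double pRun = 1 - Math.pow(1 - k / N, c);
    double pAny = 1 - Math.pow(1 - pRun, r);
    System.out.printf("per-run: %.6f, across all runs: %.6f%n", pRun, pAny);
  }
}
{code}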
[GitHub] [lucene] gsmiller commented on pull request #848: LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations
gsmiller commented on PR #848: URL: https://github.com/apache/lucene/pull/848#issuecomment-1112187182 @gautamworah96 thanks for looking! The mismatch in the original Jira was a slightly different check, which is the one now set to a delta of `0`. The fix proposed here is to ensure the floats are summed in the exact same order, which leads to identical results. The mismatch originated because the floats were being summed in different orders. Instead of chasing a delta that will work for "all" cases (which may be difficult since the test is random), I propose fixing it in this way. There's still a different check that does require a delta because it's difficult to ensure the test case can match the same ordering used internally by the faceting implementation. That's the one that you might be seeing that's still set to 1? I think this is acceptable because it's only 3 floats being summed in this case and the "drift" will be much smaller.
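A tiny standalone demonstration of the underlying issue (the values are made up, not from the test): float addition is not associative, so summing the same numbers in a different order can produce a different result.

```java
public class FloatOrder {
  public static void main(String[] args) {
    float a = 1e8f, b = -1e8f, c = 0.5f;
    // Same three numbers, different grouping, different result:
    System.out.println((a + b) + c); // 0.5
    System.out.println(a + (b + c)); // 0.0 -- 0.5 is below float precision at 1e8
  }
}
```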
[GitHub] [lucene] gsmiller commented on a diff in pull request #849: LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations
gsmiller commented on code in PR #849: URL: https://github.com/apache/lucene/pull/849#discussion_r860876168

## lucene/facet/src/test/org/apache/lucene/facet/taxonomy/TestTaxonomyFacetAssociations.java:
@@ -142,6 +146,34 @@ public static void beforeClass() throws Exception {
     reader = writer.getReader();
     writer.close();
     taxoReader = new DirectoryTaxonomyReader(taxoDir);
+
+    // To avoid floating point precision issues, it's useful to keep track of the values in the

Review Comment: Yeah, it appears so. I'm not an expert on floating point arithmetic by any means, but that's what I tracked the issue down to.
[jira] [Commented] (LUCENE-10543) Achieve contribution workflow perfection (with progress)
[ https://issues.apache.org/jira/browse/LUCENE-10543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529437#comment-17529437 ] Michael McCandless commented on LUCENE-10543: - +1 to work out a migration plan to switch to GitHub issues. Can we preserve our whole history here? The most compelling reason to me is that our Jira instance still does not (cannot?) support Markdown. Maybe all these comments where we all optimistically tried to use Markdown will then render correctly on migration to GitHub issues!! I even see attempted Markdown in our CHANGES.txt, but does our `changes2html.pl` support rendering/translating MD to HTML? Hmm then I will have to figure out how to migrate [https://jirasearch.mikemccandless.com|https://jirasearch.mikemccandless.com/] near-real-time indexing onto GitHub issues too!
[jira] [Commented] (LUCENE-10543) Achieve contribution workflow perfection (with progress)
[ https://issues.apache.org/jira/browse/LUCENE-10543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529452#comment-17529452 ] Michael McCandless commented on LUCENE-10543: - Also, I love this new "Achieve XYZ perfection (with progress)" template ;)
[GitHub] [lucene] mikemccand commented on pull request #663: Lucene-10188: Give SortedSetDocValues a docValueCount()?
mikemccand commented on PR #663: URL: https://github.com/apache/lucene/pull/663#issuecomment-1112261833 Thanks @spike-liu -- this looks great to me! I'll leave it open for at least 48 hours to see if anyone else wants to approve/comment, else lazy consensus kicks in!
[jira] [Created] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?
Greg Miller created LUCENE-10544: Summary: Should ExitableTermsEnum wrap postings and impacts? Key: LUCENE-10544 URL: https://issues.apache.org/jira/browse/LUCENE-10544 Project: Lucene - Core Issue Type: Bug Components: core/index Reporter: Greg Miller

While looking into options for [LUCENE-10151|https://issues.apache.org/jira/browse/LUCENE-10151], I noticed that {{ExitableDirectoryReader}} doesn't actually do any timeout checking once you start iterating postings/impacts. It does create an {{ExitableTermsEnum}} wrapper when loading a {{TermsEnum}}, but that wrapper doesn't do anything to wrap postings or impacts. So timeouts will be enforced when moving to the "next" term, but not when iterating the postings/impacts associated with a term.

I think we ought to wrap the postings/impacts as well with some form of timeout checking so timeouts can be enforced on long-running queries. I'm not sure why this wasn't done originally (back in 2014), but it was questioned back in 2020 on the original Jira [SOLR-5986|https://issues.apache.org/jira/browse/SOLR-5986?focusedCommentId=17177009&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17177009]. Does anyone know of a good reason why we shouldn't enforce timeouts in this way?

Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} given that only {{next}} is being wrapped currently.
[jira] [Updated] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?
[ https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller updated LUCENE-10544: - Description: While looking into options for LUCENE-10151, I noticed that {{ExitableDirectoryReader}} doesn't actually do any timeout checking once you start iterating postings/impacts. It *does* create an {{ExitableTermsEnum}} wrapper when loading a {{TermsEnum}}, but that wrapper doesn't do anything to wrap postings or impacts. So timeouts will be enforced when moving to the "next" term, but not when iterating the postings/impacts associated with a term. I think we ought to wrap the postings/impacts as well with some form of timeout checking so timeouts can be enforced on long-running queries. I'm not sure why this wasn't done originally (back in 2014), but it was questioned back in 2020 on the original Jira SOLR-5986. Does anyone know of a good reason why we shouldn't enforce timeouts in this way? Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} given that only {{next}} is being wrapped currently.
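For illustration, a rough sketch of what timeout checking inside a postings wrapper could look like (the wrapper class is hypothetical, not an actual patch; {{QueryTimeout}} and {{FilterLeafReader.FilterPostingsEnum}} are real Lucene classes):

{code:java}
import java.io.IOException;
import org.apache.lucene.index.FilterLeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.QueryTimeout;

class ExitablePostingsEnum extends FilterLeafReader.FilterPostingsEnum {
  private final QueryTimeout queryTimeout;
  private int callsSinceCheck;

  ExitablePostingsEnum(PostingsEnum in, QueryTimeout queryTimeout) {
    super(in);
    this.queryTimeout = queryTimeout;
  }

  @Override
  public int nextDoc() throws IOException {
    // Checking the clock on every doc would be costly; sample every 256 calls.
    if ((callsSinceCheck++ & 0xFF) == 0 && queryTimeout.shouldExit()) {
      throw new RuntimeException("Time limit exceeded while iterating postings");
    }
    return in.nextDoc();
  }
}
{code}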
[jira] [Commented] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?
[ https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529470#comment-17529470 ] Adrien Grand commented on LUCENE-10544: --- I suspect that this is due to the fact that the original implementation for timeouts consisted of a collector wrapper that would check the current runtime of the query in collect() and throw an exception if exceeded. This approach tends to work well to abort long-running queries that are heavy on postings, but it does not work for e.g. query rewrites that are heavy on terms like FuzzyQuery, so ExitableDirectoryReader tried to make it possible to stop every other operation than consumption of postings that could make a query run for a long time.
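For illustration, the mechanism described above looks roughly like this (a sketch only; Lucene's real {{TimeLimitingCollector}} implements this idea, and {{FilterCollector}}/{{FilterLeafCollector}} are real classes):

{code:java}
import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.FilterCollector;
import org.apache.lucene.search.FilterLeafCollector;
import org.apache.lucene.search.LeafCollector;

class TimeoutCheckingCollector extends FilterCollector {
  private final long deadlineNanos;

  TimeoutCheckingCollector(Collector in, long timeoutNanos) {
    super(in);
    this.deadlineNanos = System.nanoTime() + timeoutNanos;
  }

  @Override
  public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
    return new FilterLeafCollector(super.getLeafCollector(context)) {
      @Override
      public void collect(int doc) throws IOException {
        // Only reached for confirmed hits, which is why postings-heavy or
        // rewrite-heavy queries can run long without ever hitting this check.
        if (System.nanoTime() > deadlineNanos) {
          throw new RuntimeException("Query timed out");
        }
        in.collect(doc);
      }
    };
  }
}
{code}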
[jira] [Commented] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?
[ https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529481#comment-17529481 ] Greg Miller commented on LUCENE-10544: -- Thanks [~jpountz]. One issue with the collector approach is that it doesn't catch two-phase iterator situations where there are many approximate hits but very few confirmed matches, since the collector will only be invoked after a match is confirmed. If the second phase check is costly, this can be particularly problematic. So it would be nice to enforce the check at a lower level and solve for this issue if possible.
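A sketch of checking the timeout in the second phase instead (a hypothetical wrapper, not an existing Lucene class; {{TwoPhaseIterator}} and {{QueryTimeout}} are real):

{code:java}
import java.io.IOException;
import org.apache.lucene.index.QueryTimeout;
import org.apache.lucene.search.TwoPhaseIterator;

class ExitableTwoPhaseIterator extends TwoPhaseIterator {
  private final TwoPhaseIterator in;
  private final QueryTimeout timeout;

  ExitableTwoPhaseIterator(TwoPhaseIterator in, QueryTimeout timeout) {
    super(in.approximation());
    this.in = in;
    this.timeout = timeout;
  }

  @Override
  public boolean matches() throws IOException {
    // Runs on every approximate hit, so costly second-phase checks are
    // covered even when few hits are ever confirmed and collected.
    if (timeout.shouldExit()) {
      throw new RuntimeException("Time limit exceeded during two-phase match");
    }
    return in.matches();
  }

  @Override
  public float matchCost() {
    return in.matchCost();
  }
}
{code}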
[GitHub] [lucene] mocobeta commented on a diff in pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
mocobeta commented on code in PR #850: URL: https://github.com/apache/lucene/pull/850#discussion_r861028741

## lucene/test-framework/src/java/org/apache/lucene/tests/analysis/MockTokenizer.java:
@@ -66,11 +67,11 @@ public class MockTokenizer extends Tokenizer {
    * Limit the default token length to a size that doesn't cause random analyzer failures on
    * unpredictable data like the enwiki data set.
    *
-   * This value defaults to {@code CharTokenizer.DEFAULT_MAX_WORD_LEN}.
+   * This value defaults to {@link IndexWriter#MAX_TERM_LENGTH}.
    *
    * @see "https://issues.apache.org/jira/browse/LUCENE-10541"
    */
-  public static final int DEFAULT_MAX_TOKEN_LENGTH = 255;
+  public static final int DEFAULT_MAX_TOKEN_LENGTH = IndexWriter.MAX_TERM_LENGTH;

Review Comment: I have no strong opinion about which proposal is best, though when the conversation gets to the point where "either one is fine", the first proposal may often win out? I browsed through the MockTokenizer usages with the default max token length, and it looks like it's safe to trim the default to 255. Most tests use fixed-size short strings for testing, or generate random strings via TestUtil.randomAnalysisString(random, maxWordLength, simple) with maxWordLength=20 and then split them with the MockTokenizer.WHITESPACE automaton. I'd cast +1 to 255 for the default and encourage people to explicitly set larger values when needed.
[GitHub] [lucene] uschindler commented on a diff in pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
uschindler commented on code in PR #850: URL: https://github.com/apache/lucene/pull/850#discussion_r861032318 Thanks for the analysis, so it looks like 255 is a good default.
[GitHub] [lucene] mocobeta commented on a diff in pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
mocobeta commented on code in PR #850: URL: https://github.com/apache/lucene/pull/850#discussion_r861050340 One exception: TestWordDelimiterFilter.testRandomHugeStrings() generates large random words (maxWordLength=8192); it may be better to set the maxTokenLength explicitly for it.
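For example, such a test could opt into longer tokens explicitly (a sketch; the class name is made up, the three-arg MockTokenizer constructor is real, and 8192 mirrors the case above):

```java
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.tests.analysis.MockTokenizer;

public class HugeTokenTokenizerExample {
  static Tokenizer newHugeTokenTokenizer() {
    // WHITESPACE automaton, lowerCase=true, explicit maxTokenLength=8192
    // instead of relying on the (smaller) default.
    return new MockTokenizer(MockTokenizer.WHITESPACE, true, 8192);
  }
}
```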
[jira] [Commented] (LUCENE-10534) MinFloatFunction / MaxFloatFunction exists check can be slow
[ https://issues.apache.org/jira/browse/LUCENE-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529513#comment-17529513 ] Kevin Risden commented on LUCENE-10534: --- I reviewed the jmh tests again and realized I didn't test the new MaxFloatFunction logic - I had copied the original code over :( Rerunning the benchmarks to see if there is an improvement.

> MinFloatFunction / MaxFloatFunction exists check can be slow
> ------------------------------------------------------------
> Key: LUCENE-10534
> URL: https://issues.apache.org/jira/browse/LUCENE-10534
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Kevin Risden
> Assignee: Kevin Risden
> Priority: Minor
> Attachments: flamegraph.png, flamegraph_getValueForDoc.png
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> MinFloatFunction
> (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MinFloatFunction.java)
> and MaxFloatFunction
> (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MaxFloatFunction.java)
> both check if values exist. This is needed since the underlying valuesource
> returns 0.0f as either a valid value or as a value when the document doesn't
> have a value.
> Even though this is changed to anyExists and short circuits in the case a
> value is found in any document, the worst case is that there is no value
> found and requires checking all the way through to the raw data. This is only
> needed when 0.0f is returned and we need to determine if it is a valid value
> or the not-found case.
[GitHub] [lucene] dnhatn commented on a diff in pull request #842: LUCENE-10518: Relax field consistency check for old indices
dnhatn commented on code in PR #842: URL: https://github.com/apache/lucene/pull/842#discussion_r861098734 +1, fixed in https://github.com/apache/lucene/pull/842/commits/10c110500e4d0f78fcdf0b167e4061025d6b6927
[GitHub] [lucene] dnhatn commented on pull request #842: LUCENE-10518: Relax field consistency check for old indices
dnhatn commented on PR #842: URL: https://github.com/apache/lucene/pull/842#issuecomment-1112420113 @mayya-sharipova @jpountz Thanks for the reviews.
[GitHub] [lucene] dnhatn merged pull request #842: LUCENE-10518: Relax field consistency check for old indices
dnhatn merged PR #842: URL: https://github.com/apache/lucene/pull/842
[jira] [Commented] (LUCENE-10518) FieldInfos consistency check can refuse to open Lucene 8 index
[ https://issues.apache.org/jira/browse/LUCENE-10518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529522#comment-17529522 ] ASF subversion and git services commented on LUCENE-10518: -- Commit 67e6e864bdf8489f16aa0b686790e5721794e187 in lucene's branch refs/heads/branch_9x from Nhat Nguyen [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=67e6e864bdf ] LUCENE-10518: Relax field consistency check for old indices (#842) This change relaxes the field consistency check for old indices as we didn't enforce that in the previous versions. This commit also disables the optimization that relies on the field consistency for old indices.

> FieldInfos consistency check can refuse to open Lucene 8 index
> --------------------------------------------------------------
> Key: LUCENE-10518
> URL: https://issues.apache.org/jira/browse/LUCENE-10518
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/index
> Affects Versions: 8.10.1
> Reporter: Nhat Nguyen
> Priority: Major
> Time Spent: 1h
> Remaining Estimate: 0h
>
> A field-infos consistency check introduced in Lucene 9 (LUCENE-9334) can
> refuse to open a Lucene 8 index. Lucene 8 can create a partial FieldInfo if
> hitting a non-aborting exception (for example [term is too
> long|https://github.com/apache/lucene-solr/blob/6a6484ba396927727b16e5061384d3cd80d616b2/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L944])
> during processing fields of a document. We don't have this problem in Lucene
> 9 as we process fields in two phases with the [first
> phase|https://github.com/apache/lucene/blob/10ebc099c846c7d96f4ff5f9b7853df850fa8442/lucene/core/src/java/org/apache/lucene/index/IndexingChain.java#L589-L614]
> processing only FieldInfos.
> The issue can be reproduced with this snippet.
> {code:java}
> public void testWriteIndexOn8x() throws Exception {
>   FieldType KeywordField = new FieldType();
>   KeywordField.setTokenized(false);
>   KeywordField.setOmitNorms(true);
>   KeywordField.setIndexOptions(IndexOptions.DOCS);
>   KeywordField.freeze();
>   try (Directory dir = newDirectory()) {
>     IndexWriterConfig config = new IndexWriterConfig();
>     config.setCommitOnClose(false);
>     config.setMergePolicy(NoMergePolicy.INSTANCE);
>     try (IndexWriter writer = new IndexWriter(dir, config)) {
>       // first segment
>       writer.addDocument(new Document()); // an empty doc
>       Document d1 = new Document();
>       byte[] chars = new byte[IndexWriter.MAX_STORED_STRING_LENGTH + 1];
>       Arrays.fill(chars, (byte) 'a');
>       d1.add(new Field("field", new BytesRef(chars), KeywordField));
>       d1.add(new BinaryDocValuesField("field", new BytesRef(chars)));
>       expectThrows(IllegalArgumentException.class, () -> writer.addDocument(d1));
>       writer.flush();
>       // second segment
>       Document d2 = new Document();
>       d2.add(new Field("field", new BytesRef("hello world"), KeywordField));
>       d2.add(new SortedDocValuesField("field", new BytesRef("hello world")));
>       writer.addDocument(d2);
>       writer.flush();
>       writer.commit();
>       // Check for doc values types consistency
>       Map<String, DocValuesType> docValuesTypes = new HashMap<>();
>       try (DirectoryReader reader = DirectoryReader.open(dir)) {
>         for (LeafReaderContext leaf : reader.leaves()) {
>           for (FieldInfo fi : leaf.reader().getFieldInfos()) {
>             DocValuesType current = docValuesTypes.putIfAbsent(fi.name, fi.getDocValuesType());
>             if (current != null && current != fi.getDocValuesType()) {
>               fail("cannot change DocValues type from " + current + " to "
>                   + fi.getDocValuesType() + " for field \"" + fi.name + "\"");
>             }
>           }
>         }
>       }
>     }
>   }
> }
> {code}
> I would like to propose to:
> - Backport the two-phase fields processing from Lucene 9 to Lucene 8. The patch should be small and contained.
> - Introduce an option in Lucene 9 to skip checking field-infos consistency (i.e., behave like Lucene 8 when the option is enabled).
> /cc [~mayya] and [~jpountz]
[GitHub] [lucene] dnhatn opened a new pull request, #852: LUCENE-10518: Relax field consistency check for old indices (#842)
dnhatn opened a new pull request, #852: URL: https://github.com/apache/lucene/pull/852 This change relaxes the field consistency check for old indices as we didn't enforce that in the previous versions. This commit also disables the optimization that relies on the field consistency for old indices.
[jira] [Commented] (LUCENE-10534) MinFloatFunction / MaxFloatFunction exists check can be slow
[ https://issues.apache.org/jira/browse/LUCENE-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529572#comment-17529572 ] Kevin Risden commented on LUCENE-10534: --- Updated metrics - there is a benefit to the new MaxFloatFunction logic that avoids duplicate exists() calls, even when combined with the new FieldSource changes from LUCENE-10542.

| Benchmark | Mode | Cnt | Score and Error | Units |
|-|---|-|--|---|
| MyBenchmark.testMaxFloatFunction | thrpt | 25 | 69.949 ± 4.043 | ops/s |
| MyBenchmark.testMaxFloatFunctionNewFloatFieldSource | thrpt | 25 | 112.326 ± 3.228 | ops/s |
| MyBenchmark.testNewMaxFloatFunction | thrpt | 25 | 93.216 ± 2.757 | ops/s |
| MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSource | thrpt | 25 | 123.364 ± 7.861 | ops/s |
| MyBenchmark.testMaxFloatFunctionRareField | thrpt | 25 | 257.339 ± 33.849 | ops/s |
| MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField | thrpt | 25 | 287.175 ± 22.840 | ops/s |
| MyBenchmark.testNewMaxFloatFunctionRareField | thrpt | 25 | 235.268 ± 4.103 | ops/s |
| MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSourceRareField | thrpt | 25 | 272.397 ± 8.406 | ops/s |
[jira] [Updated] (LUCENE-10542) FieldSource exists implementation can avoid value retrieval
[ https://issues.apache.org/jira/browse/LUCENE-10542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Risden updated LUCENE-10542: -- Description:

While looking at LUCENE-10534, found that *FieldSource exists implementations after LUCENE-7407 can avoid the value lookup when just checking for exists.

Flamegraphs - x axis = time spent as a percentage of time being profiled, y axis = stack trace, bottom being the first call, top being the last call.

Looking only at the left-most getValueForDoc highlight (and it helps to make it bigger or download the original): !flamegraph_getValueForDoc.png|height=410,width=1000!

LongFieldSource#exists spends MOST of its time in LongFieldSource#getValueForDoc. LongFieldSource#getValueForDoc spends its time doing two things primarily:
* FilterNumericDocValues#longValue()
* advance()

This makes sense based on looking at the code (copied below to make it easier to see at once): https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L72

{code:java}
private long getValueForDoc(int doc) throws IOException {
  if (doc < lastDocID) {
    throw new IllegalArgumentException(
        "docs were sent out-of-order: lastDocID=" + lastDocID + " vs docID=" + doc);
  }
  lastDocID = doc;
  int curDocID = arr.docID();
  if (doc > curDocID) {
    curDocID = arr.advance(doc);
  }
  if (doc == curDocID) {
    return arr.longValue();
  } else {
    return 0;
  }
}
{code}

LongFieldSource#exists doesn't care about the actual longValue - just that there was a value found when iterating through the doc values. https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L95

{code:java}
@Override
public boolean exists(int doc) throws IOException {
  getValueForDoc(doc);
  return arr.docID() == doc;
}
{code}

So putting this all together for exists calling getValueForDoc: we spend ~50% of the time trying to get the long value when we don't need it in exists. We can save that 50% by making exists not care about the actual value and just return whether doc == curDocID. This extra 50% is exaggerated in MaxFloatFunction (and other places) since exists() is being called a bunch. Eventually the value will be needed from longVal(), but if we call exists() say 3 times for every longVal(), we are spending a lot of time computing the value when we only need to check for existence.

I found the same pattern in DoubleFieldSource, EnumFieldSource, FloatFieldSource, IntFieldSource, LongFieldSource. I put together a change showing what this would look like.

Simple JMH performance tests comparing the original FloatFieldSource to the new ones from PR #847:

| Benchmark | Mode | Cnt | Score and Error | Units |
|-|---|-|--|---|
| MyBenchmark.testMaxFloatFunction | thrpt | 25 | 69.949 ± 4.043 | ops/s |
| MyBenchmark.testMaxFloatFunctionNewFloatFieldSource | thrpt | 25 | 112.326 ± 3.228 | ops/s |
| MyBenchmark.testNewMaxFloatFunction | thrpt | 25 | 93.216 ± 2.757 | ops/s |
| MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSource | thrpt | 25 | 123.364 ± 7.861 | ops/s |
| MyBenchmark.testMaxFloatFunctionRareField | thrpt | 25 | 257.339 ± 33.849 | ops/s |
| MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField | thrpt | 25 | 287.175 ± 22.840 | ops/s |
| MyBenchmark.testNewMaxFloatFunctionRareField | thrpt | 25 | 235.268 ± 4.103 | ops/s |
| MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSourceRareField | thrpt | 25 | 272.397 ± 8.406 | ops/s |

Source: https://github.com/risdenk/lucene-jmh
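For illustration, the optimized exists() could look roughly like this (a sketch of the idea described above; see PR #847 for the actual change):

{code:java}
@Override
public boolean exists(int doc) throws IOException {
  if (doc < lastDocID) {
    throw new IllegalArgumentException(
        "docs were sent out-of-order: lastDocID=" + lastDocID + " vs docID=" + doc);
  }
  lastDocID = doc;
  int curDocID = arr.docID();
  if (doc > curDocID) {
    curDocID = arr.advance(doc);
  }
  // Position the iterator only; skip the arr.longValue() call entirely.
  return doc == curDocID;
}
{code}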
[GitHub] [lucene] dnhatn merged pull request #852: LUCENE-10518: Relax field consistency check for old indices (#842)
dnhatn merged PR #852: URL: https://github.com/apache/lucene/pull/852
[jira] [Commented] (LUCENE-10518) FieldInfos consistency check can refuse to open Lucene 8 index
[ https://issues.apache.org/jira/browse/LUCENE-10518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529578#comment-17529578 ] ASF subversion and git services commented on LUCENE-10518: -- Commit 62ebb22cc07510ac477d6bc76d17e65710395bd6 in lucene's branch refs/heads/branch_9_1 from Nhat Nguyen [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=62ebb22cc07 ] LUCENE-10518: Relax field consistency check for old indices (#842) This change relaxes the field consistency check for old indices as we didn't enforce that in the previous versions. This commit also disables the optimization that relies on the field consistency for old indices.
[jira] [Commented] (LUCENE-10542) FieldSource exists implementation can avoid value retrieval
[ https://issues.apache.org/jira/browse/LUCENE-10542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529580#comment-17529580 ] Robert Muir commented on LUCENE-10542: -- Hi [~krisden], I think there might be more optimizations possible. I'm thinking of the (common) case where all documents actually have a value for the field: you can use index statistics to detect this situation and avoid per-document checks, simply returning true. A similar optimization exists in FieldExistsQuery. There is an explanation of it here: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/FieldExistsQuery.java#L100-L105 I think it can be a separate issue/PR from what you have going on here, but it is related and maybe could be a followup, if you are trying to speed up performance of these checks.
> FieldSource exists implementation can avoid value retrieval
> ---
>
> Key: LUCENE-10542
> URL: https://issues.apache.org/jira/browse/LUCENE-10542
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Kevin Risden
> Assignee: Kevin Risden
> Priority: Minor
> Attachments: flamegraph_getValueForDoc.png
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> While looking at LUCENE-10534, found that *FieldSource exists implementation after LUCENE-7407 can avoid value lookup when just checking for exists.
> Flamegraphs - x axis = time spent as a percentage of time being profiled, y axis = stack trace, bottom being first call, top being last call.
> Looking only at the left most getValueForDoc highlight only (and it helps to make it bigger or download the original)
> !flamegraph_getValueForDoc.png|height=410,width=1000!
> LongFieldSource#exists spends MOST of its time doing a LongFieldSource#getValueForDoc. LongFieldSource#getValueForDoc spends its time doing two things primarily:
> * FilterNumericDocValues#longValue()
> * advance()
> This makes sense based on looking at the code (copied below to make it easier to see at once)
> https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L72
> {code:java}
> private long getValueForDoc(int doc) throws IOException {
>   if (doc < lastDocID) {
>     throw new IllegalArgumentException(
>         "docs were sent out-of-order: lastDocID=" + lastDocID + " vs docID=" + doc);
>   }
>   lastDocID = doc;
>   int curDocID = arr.docID();
>   if (doc > curDocID) {
>     curDocID = arr.advance(doc);
>   }
>   if (doc == curDocID) {
>     return arr.longValue();
>   } else {
>     return 0;
>   }
> }
> {code}
> LongFieldSource#exists - doesn't care about the actual longValue. Just that there was a value found when iterating through the doc values.
> https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L95
> {code:java}
> @Override
> public boolean exists(int doc) throws IOException {
>   getValueForDoc(doc);
>   return arr.docID() == doc;
> }
> {code}
> So putting this all together for exists calling getValueForDoc, we spend ~50% of the time trying to get the long value when we don't need it in exists. We can save that 50% of time by making exists not care about the actual value and just return whether doc == curDocID, basically.
> This 50% extra is exaggerated in MaxFloatFunction (and other places) since exists() is being called a bunch. Eventually the value will be needed from longVal(), but if we call exists() say 3 times for every longVal(), we are spending a lot of time computing the value when we only need to check for existence.
> I found the same pattern in DoubleFieldSource, EnumFieldSource, FloatFieldSource, IntFieldSource, LongFieldSource. I put together a change showing what this would look like:
>
> Simple JMH performance tests comparing the original FloatFieldSource to the new ones from PR #847.
>
> | Benchmark | Mode | Cnt | Score and Error | Units |
> |-|---|-|--|---|
> | MyBenchmark.testMaxFloatFunction | thrpt | 25 | 69.949 ± 4.043 | ops/s |
> | MyBenchmark.testMaxFloatFunctionNewFloatFieldSource | thrpt | 25 | 112.326 ± 3.228 | ops/s |
> | MyBenchmark.testNewMaxFloatFunction | thrpt | 25 | 93.216 ± 2.757 | ops/s |
> | MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSource | thrpt | 25 | 123.364 ± 7.861 | ops/s |
> | MyBenchmark.t
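The change described in the quoted text, and the index-statistics shortcut Muir suggests, might look roughly like this. This is a sketch only, reusing the arr/lastDocID fields from the LongFieldSource snippet above; allDocsHaveValue is a hypothetical helper named for illustration, not an existing Lucene method, and FieldExistsQuery's actual rewrite also consults doc-values, norms and points statistics, not just terms.

{code:java}
// Sketch: existence only needs the iterator position, not the value,
// so exists() can advance without ever calling arr.longValue().
@Override
public boolean exists(int doc) throws IOException {
  if (doc < lastDocID) {
    throw new IllegalArgumentException(
        "docs were sent out-of-order: lastDocID=" + lastDocID + " vs docID=" + doc);
  }
  lastDocID = doc;
  int curDocID = arr.docID();
  if (doc > curDocID) {
    curDocID = arr.advance(doc);
  }
  return doc == curDocID; // no value retrieval
}

// Hypothetical helper in the spirit of FieldExistsQuery's optimization:
// if index statistics say every document has the field, exists() is
// trivially true and the per-document advance can be skipped entirely.
static boolean allDocsHaveValue(LeafReader reader, String field) throws IOException {
  Terms terms = reader.terms(field);
  return terms != null && terms.getDocCount() == reader.maxDoc();
}
{code}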
[jira] [Resolved] (LUCENE-10518) FieldInfos consistency check can refuse to open Lucene 8 index
[ https://issues.apache.org/jira/browse/LUCENE-10518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nhat Nguyen resolved LUCENE-10518. -- Fix Version/s: 9.1.1, 9.2 Resolution: Fixed
> FieldInfos consistency check can refuse to open Lucene 8 index
> --
>
> Key: LUCENE-10518
> URL: https://issues.apache.org/jira/browse/LUCENE-10518
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/index
> Affects Versions: 8.10.1
> Reporter: Nhat Nguyen
> Priority: Major
> Fix For: 9.1.1, 9.2
>
> Time Spent: 1h 20m
> Remaining Estimate: 0h
>
> A field-infos consistency check introduced in Lucene 9 (LUCENE-9334) can refuse to open a Lucene 8 index. Lucene 8 can create a partial FieldInfo if hitting a non-aborting exception (for example [term is too long|https://github.com/apache/lucene-solr/blob/6a6484ba396927727b16e5061384d3cd80d616b2/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L944]) during processing fields of a document. We don't have this problem in Lucene 9 as we process fields in two phases with the [first phase|https://github.com/apache/lucene/blob/10ebc099c846c7d96f4ff5f9b7853df850fa8442/lucene/core/src/java/org/apache/lucene/index/IndexingChain.java#L589-L614] processing only FieldInfos.
> The issue can be reproduced with this snippet.
> {code:java}
> public void testWriteIndexOn8x() throws Exception {
>   FieldType KeywordField = new FieldType();
>   KeywordField.setTokenized(false);
>   KeywordField.setOmitNorms(true);
>   KeywordField.setIndexOptions(IndexOptions.DOCS);
>   KeywordField.freeze();
>   try (Directory dir = newDirectory()) {
>     IndexWriterConfig config = new IndexWriterConfig();
>     config.setCommitOnClose(false);
>     config.setMergePolicy(NoMergePolicy.INSTANCE);
>     try (IndexWriter writer = new IndexWriter(dir, config)) {
>       // first segment
>       writer.addDocument(new Document()); // an empty doc
>       Document d1 = new Document();
>       byte[] chars = new byte[IndexWriter.MAX_STORED_STRING_LENGTH + 1];
>       Arrays.fill(chars, (byte) 'a');
>       d1.add(new Field("field", new BytesRef(chars), KeywordField));
>       d1.add(new BinaryDocValuesField("field", new BytesRef(chars)));
>       expectThrows(IllegalArgumentException.class, () -> writer.addDocument(d1));
>       writer.flush();
>       // second segment
>       Document d2 = new Document();
>       d2.add(new Field("field", new BytesRef("hello world"), KeywordField));
>       d2.add(new SortedDocValuesField("field", new BytesRef("hello world")));
>       writer.addDocument(d2);
>       writer.flush();
>       writer.commit();
>       // Check for doc values types consistency
>       Map<String, DocValuesType> docValuesTypes = new HashMap<>();
>       try (DirectoryReader reader = DirectoryReader.open(dir)) {
>         for (LeafReaderContext leaf : reader.leaves()) {
>           for (FieldInfo fi : leaf.reader().getFieldInfos()) {
>             DocValuesType current = docValuesTypes.putIfAbsent(fi.name, fi.getDocValuesType());
>             if (current != null && current != fi.getDocValuesType()) {
>               fail("cannot change DocValues type from " + current + " to " + fi.getDocValuesType() + " for field \"" + fi.name + "\"");
>             }
>           }
>         }
>       }
>     }
>   }
> }
> {code}
> I would like to propose to:
> - Backport the two-phase fields processing from Lucene9 to Lucene8. The patch should be small and contained.
> - Introduce an option in Lucene9 to skip checking field-infos consistency (i.e., behave like Lucene 8 when the option is enabled).
> /cc [~mayya] and [~jpountz] -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dweiss commented on a diff in pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
dweiss commented on code in PR #850: URL: https://github.com/apache/lucene/pull/850#discussion_r861243269

## lucene/test-framework/src/java/org/apache/lucene/tests/analysis/MockTokenizer.java:

@@ -66,11 +67,11 @@ public class MockTokenizer extends Tokenizer {
   * Limit the default token length to a size that doesn't cause random analyzer failures on
   * unpredictable data like the enwiki data set.
   *
-  * This value defaults to {@code CharTokenizer.DEFAULT_MAX_WORD_LEN}.
+  * This value defaults to {@link IndexWriter#MAX_TERM_LENGTH}.
   *
   * @see "https://issues.apache.org/jira/browse/LUCENE-10541"
   */
- public static final int DEFAULT_MAX_TOKEN_LENGTH = 255;
+ public static final int DEFAULT_MAX_TOKEN_LENGTH = IndexWriter.MAX_TERM_LENGTH;

Review Comment: Ok. I'll go with 255 and tweak those tests that need it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dweiss merged pull request #851: LUCENE-10088: backport simple text codec suppression and fs handle limit
dweiss merged PR #851: URL: https://github.com/apache/lucene/pull/851 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10088) Too many open files in TestIndexWriterMergePolicy.testStressUpdateSameDocumentWithMergeOnGetReader
[ https://issues.apache.org/jira/browse/LUCENE-10088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529603#comment-17529603 ] ASF subversion and git services commented on LUCENE-10088: -- Commit 49c8cee06f01e79a1b890d9b542ada79bacf9dd4 in lucene's branch refs/heads/branch_9x from Dawid Weiss [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=49c8cee06f0 ] LUCENE-10088: allow per-class override in HandleLimitFS. Bump the limit a bit for nightlies in TestIndexWriterMergePolicy. (#424). Suppress SimpleTextCodec on TestIndexWriterMergePolicy. (#851) > Too many open files in > TestIndexWriterMergePolicy.testStressUpdateSameDocumentWithMergeOnGetReader > -- > > Key: LUCENE-10088 > URL: https://issues.apache.org/jira/browse/LUCENE-10088 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Priority: Major > Time Spent: 1h 40m > Remaining Estimate: 0h > > [This build > failure|https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/386/] > reproduces for me. I'll try to dig. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10088) Too many open files in TestIndexWriterMergePolicy.testStressUpdateSameDocumentWithMergeOnGetReader
[ https://issues.apache.org/jira/browse/LUCENE-10088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529611#comment-17529611 ] ASF subversion and git services commented on LUCENE-10088: -- Commit cb206196e4c929b6e639c58891950e5d6164f63d in lucene's branch refs/heads/branch_9_1 from Dawid Weiss [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=cb206196e4c ] LUCENE-10088: allow per-class override in HandleLimitFS. Bump the limit a bit for nightlies in TestIndexWriterMergePolicy. (#424). Suppress SimpleTextCodec on TestIndexWriterMergePolicy. (#851) > Too many open files in > TestIndexWriterMergePolicy.testStressUpdateSameDocumentWithMergeOnGetReader > -- > > Key: LUCENE-10088 > URL: https://issues.apache.org/jira/browse/LUCENE-10088 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Priority: Major > Time Spent: 1h 40m > Remaining Estimate: 0h > > [This build > failure|https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/386/] > reproduces for me. I'll try to dig. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dweiss commented on pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
dweiss commented on PR #850: URL: https://github.com/apache/lucene/pull/850#issuecomment-1112594924 I ran nightly tests as well, they passed.

BUILD SUCCESSFUL in 24m 6s
832 actionable tasks: 480 executed, 352 up-to-date

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren
gsmiller commented on PR #843: URL: https://github.com/apache/lucene/pull/843#issuecomment-1112602017 Thanks for all your thoughts @Yuti-G! I think I still disagree with changing the current behavior. If we want to implement the contract of `getTopChildren` properly, we'd need to sort by count. That's what "top" means in this API, and I think it's a bit confusing to start enforcing a "top n" parameter when "top" isn't really properly defined here. If you have a use-case that needs to actually truncate to `n` ranges while retaining the existing order of ranges originally provided as a user, why not do it outside of the facet code? There's no reason you can't get back all ranges as it works today and then just truncate to the number you want right? That's essentially what you're proposing putting in the faceting code, but I'm not sure doing that is the right thing to do here. All this said, I'm in favor of exploring the addition of a `getAllChildren` method that would retain this functionality and then properly implementing `getTopChildren` that orders by count if there's a use-case for that. I suspect there is. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
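For reference, the caller-side truncation gsmiller suggests above is a short snippet against the existing `Facets`/`FacetResult` API. This is a sketch under assumptions: `facets` is an existing `RangeFacetCounts` instance, and the dimension name "price" and count `n` are made up for illustration.

```java
// RangeFacetCounts today returns every range in user-provided order,
// regardless of the topN argument, so truncation can happen at the caller.
FacetResult result = facets.getTopChildren(10, "price"); // topN is ignored today

// Keep only the first n ranges, preserving the order the user defined them in.
int n = 3; // illustrative
LabelAndValue[] firstN =
    java.util.Arrays.copyOf(result.labelValues, Math.min(n, result.labelValues.length));
```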
[GitHub] [lucene] Yuti-G commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren
Yuti-G commented on PR #843: URL: https://github.com/apache/lucene/pull/843#issuecomment-1112633027 Hi @gsmiller, thanks for your feedback! Yes, I was proposing keeping the existing behavior of getTopChildren but utilizing the top-n param, as I thought it would be safe because we would not break the current use cases, but you are absolutely right that we should stick with what top-n means here. Please correct me if I misunderstood your proposal. Are you suggesting that I add another getAllChildren API to retain the current behavior of getTopChildren, and then fix getTopChildren by returning ranges sorted by counts, or do you mean we should just keep this API untouched? Thanks! Best, Yuting -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10292) AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()
[ https://issues.apache.org/jira/browse/LUCENE-10292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529668#comment-17529668 ] Chris M. Hostetter commented on LUCENE-10292: - {quote} I'm still not sure whether these tests make sense without explicitly stating ... {quote} All I was really trying to do with these tests was demonstrate that data you get out of the Lookup before you call build() can still be gotten from the Lookup while build() is incrementally consuming an iterator (which may take a long time if you are building up from a long iterator), and that this behavior is consistent across Lookup impls (as opposed to before I filed this issue, when most Lookups worked that way, but AnalyzingInfixSuggester would throw an ugly exception – which was certainly confusing to users who might switch from one impl to another). I didn't set out to make any hard & fast guarantee about the thread safety of all lookups – just improve the one that was obviously inconsistent with the others (progress, not perfection). I'm happy to rename/re-document/remove the test code if you'd like - or confine it to just AnalyzingInfixSuggester - but I think (assume?) we can probably agree that the src/main code changes I've made in this issue are generally for the better? {quote}I think it'd be a much more user-friendly API if Lookup was actually detached entirely from its build process (for example by replacing the current build method with a builder() that would return a new immutable Lookup instance). ... I'm not saying this should be implemented here - perhaps it's worth a new issue to do this refactoring. {quote} +1 ... although some of the lookup impls explicitly support update()-ing individual suggestions – so you'd either have to remove that functionality or refactor the API in some way that it can still be optional. {quote}Separately from the above, if the test fails, it'll leak threads ... It should be replaced with a try/catch that rethrows an unchecked exception {quote} Yeah, that was lazy & sloppy of me – sorry about that. I'll fix it ASAP. > AnalyzingInfixSuggester thread safety: lookup() fails during (re)build() > > > Key: LUCENE-10292 > URL: https://issues.apache.org/jira/browse/LUCENE-10292 > Project: Lucene - Core > Issue Type: Bug >Reporter: Chris M. Hostetter >Assignee: Chris M. Hostetter >Priority: Major > Fix For: 10.0 (main), 9.2 > > Attachments: LUCENE-10292-1.patch, LUCENE-10292-2.patch, > LUCENE-10292-3.patch, LUCENE-10292.patch > > > I'm filing this based on anecdotal information from a Solr user w/o > experiencing it first hand (and I don't have a test case to demonstrate it) > but based on a reading of the code the underlying problem seems self > evident... > With all other Lookup implementations I've examined, it is possible to call > {{lookup()}} regardless of whether another thread is concurrently calling > {{build()}} – in all cases I've seen, it is even possible to call > {{lookup()}} even if {{build()}} has never been called: the result is just an > "empty" {{List}} > Typically this is works because the {{build()}} method uses temporary > datastructures until it's "build logic" is complete, at which point it > atomically replaces the datastructures used by the {{lookup()}} method.
In > the case of {{AnalyzingInfixSuggester}} however, the {{build()}} method > starts by closing & null'ing out the {{protected SearcherManager > searcherMgr}} (which it only populates again once it's completed building up > it's index) and then the lookup method starts with... > {code:java} > if (searcherMgr == null) { > throw new IllegalStateException("suggester was not built"); > } > {code} > ... meaning it is unsafe to call {{AnalyzingInfixSuggester.lookup()}} in any > situation where another thread may be calling > {{AnalyzingInfixSuggester.build()}} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
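The "build into temporary structures, then publish atomically" behavior described in the quoted issue text can be sketched generically. This is illustrative only, not the actual suggester code (which manages a SearcherManager over a Lucene index rather than a map); it just shows why lookup() stays safe while build() runs concurrently.

{code:java}
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class SwapOnBuildLookup {
  // lookup() reads this; build() replaces it in a single volatile write,
  // so readers always see either the old structure or the fully built one.
  private volatile Map<String, Long> data = Collections.emptyMap();

  public long lookup(String key) {
    // Safe during a concurrent build(): never observes a half-built map,
    // and before the first build() it simply sees the empty map.
    return data.getOrDefault(key, 0L);
  }

  public void build(Iterator<Map.Entry<String, Long>> entries) {
    Map<String, Long> tmp = new HashMap<>(); // private temporary structure
    while (entries.hasNext()) {
      Map.Entry<String, Long> e = entries.next();
      tmp.put(e.getKey(), e.getValue());
    }
    data = Collections.unmodifiableMap(tmp); // atomic publish
  }
}
{code}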
[jira] [Resolved] (LUCENE-9476) Add a bulk ordinal->FacetLabel API
[ https://issues.apache.org/jira/browse/LUCENE-9476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gautam Worah resolved LUCENE-9476. -- Resolution: Fixed The PR was merged into the Lucene 9 branch and released in version Lucene 9.0. Resolving > Add a bulk ordinal->FacetLabel API > -- > > Key: LUCENE-9476 > URL: https://issues.apache.org/jira/browse/LUCENE-9476 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Affects Versions: 8.6.1 >Reporter: Gautam Worah >Priority: Minor > Labels: performance > Time Spent: 12h 40m > Remaining Estimate: 0h > > This issue is a spillover from the > [PR|https://github.com/apache/lucene-solr/pull/1733/files] for LUCENE 9450 > The idea here is to share a single {{BinaryDocValues}} instance per leaf per > query instead of creating a new one each time in the > {{DirectoryTaxonomyReader}}. > Suggested by [~mikemccand] > > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gautamworah96 opened a new pull request, #853: LUCENE-10524 Add benchmark suite details to CONTRIBUTING.md
gautamworah96 opened a new pull request, #853: URL: https://github.com/apache/lucene/pull/853 Today, new contributors are usually unaware of where luceneutil benchmarks are and when/how to run them. Committers usually end up pointing contributors to the benchmarks package when they make perf impacting changes and then they run the benchmarks. Adding benchmark details to the Lucene repo will also make them more accessible to other researchers who want to experiment/benchmark their own custom task implementation with Java Lucene. JIRA: https://issues.apache.org/jira/browse/LUCENE-10524 cc: @mocobeta -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10292) AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()
[ https://issues.apache.org/jira/browse/LUCENE-10292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529676#comment-17529676 ] ASF subversion and git services commented on LUCENE-10292: -- Commit 6afb9bc25a5935a3cba0a06e67b27fe290255467 in lucene's branch refs/heads/main from Chris M. Hostetter [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=6afb9bc25a5 ] LUCENE-10292: prevent thread leak (or test timeout) if exception/assertion failure in test iterator > AnalyzingInfixSuggester thread safety: lookup() fails during (re)build() > > > Key: LUCENE-10292 > URL: https://issues.apache.org/jira/browse/LUCENE-10292 > Project: Lucene - Core > Issue Type: Bug >Reporter: Chris M. Hostetter >Assignee: Chris M. Hostetter >Priority: Major > Fix For: 10.0 (main), 9.2 > > Attachments: LUCENE-10292-1.patch, LUCENE-10292-2.patch, > LUCENE-10292-3.patch, LUCENE-10292.patch > > > I'm filing this based on anecdotal information from a Solr user w/o > experiencing it first hand (and I don't have a test case to demonstrate it) > but based on a reading of the code the underlying problem seems self > evident... > With all other Lookup implementations I've examined, it is possible to call > {{lookup()}} regardless of whether another thread is concurrently calling > {{build()}} – in all cases I've seen, it is even possible to call > {{lookup()}} even if {{build()}} has never been called: the result is just an > "empty" {{List}} > Typically this is works because the {{build()}} method uses temporary > datastructures until it's "build logic" is complete, at which point it > atomically replaces the datastructures used by the {{lookup()}} method. In > the case of {{AnalyzingInfixSuggester}} however, the {{build()}} method > starts by closing & null'ing out the {{protected SearcherManager > searcherMgr}} (which it only populates again once it's completed building up > it's index) and then the lookup method starts with... > {code:java} > if (searcherMgr == null) { > throw new IllegalStateException("suggester was not built"); > } > {code} > ... meaning it is unsafe to call {{AnalyzingInfixSuggester.lookup()}} in any > situation where another thread may be calling > {{AnalyzingInfixSuggester.build()}} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] vigyasharma commented on a diff in pull request #848: LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations
vigyasharma commented on code in PR #848: URL: https://github.com/apache/lucene/pull/848#discussion_r861353455

## lucene/facet/src/test/org/apache/lucene/facet/taxonomy/TestTaxonomyFacetAssociations.java:

@@ -500,10 +500,7 @@ private void validateFloats(
       assertNull(facetResult);
     } else {
       assertEquals(dim, facetResult.dim);
-      // We can expect the floats to be exactly equal here since we're ensuring that we sum them
-      // in the same order when determining expected values and when computing facets. See
-      // LUCENE-10530:
-      assertEquals(aggregatedValue, facetResult.value.floatValue(), 0f);
+      assertEquals(aggregatedValue, facetResult.value.floatValue(), 1f);

Review Comment: +1, I was going to comment this in the last iteration. There may also be some internal drift in the way the aggregation function works on floats. Found a nice read on it [here](https://www.reidatcheson.com/statistics/floating%20point/error/2018/01/15/float-sum-errors.html). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
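The order-dependence being discussed is easy to reproduce in isolation. A standalone demonstration, not Lucene code; the constants are chosen so that the unit of least precision (ulp) of a float near 1e8 is 8:

```java
// Float addition is not associative: summing identical values under a
// different grouping can change the result, which is why the test above
// now allows a small delta instead of exact equality.
public class FloatSumOrder {
  public static void main(String[] args) {
    float big = 1e8f; // ulp at 1e8 is 8, so adding 4 alone is rounded away
    System.out.println((big + 4f) + 4f); // 1.0E8        (each 4 is absorbed)
    System.out.println(big + (4f + 4f)); // 1.00000008E8 (the combined 8 survives)
  }
}
```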
[jira] [Commented] (LUCENE-10292) AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()
[ https://issues.apache.org/jira/browse/LUCENE-10292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529677#comment-17529677 ] ASF subversion and git services commented on LUCENE-10292: -- Commit 115bcd9c66d9723a7a1ea5131aa2f5d16a6867b7 in lucene's branch refs/heads/branch_9x from Chris M. Hostetter [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=115bcd9c66d ] LUCENE-10292: prevent thread leak (or test timeout) if exception/assertion failure in test iterator (cherry picked from commit 6afb9bc25a5935a3cba0a06e67b27fe290255467) > AnalyzingInfixSuggester thread safety: lookup() fails during (re)build() > > > Key: LUCENE-10292 > URL: https://issues.apache.org/jira/browse/LUCENE-10292 > Project: Lucene - Core > Issue Type: Bug >Reporter: Chris M. Hostetter >Assignee: Chris M. Hostetter >Priority: Major > Fix For: 10.0 (main), 9.2 > > Attachments: LUCENE-10292-1.patch, LUCENE-10292-2.patch, > LUCENE-10292-3.patch, LUCENE-10292.patch > > > I'm filing this based on anecdotal information from a Solr user w/o > experiencing it first hand (and I don't have a test case to demonstrate it) > but based on a reading of the code the underlying problem seems self > evident... > With all other Lookup implementations I've examined, it is possible to call > {{lookup()}} regardless of whether another thread is concurrently calling > {{build()}} – in all cases I've seen, it is even possible to call > {{lookup()}} even if {{build()}} has never been called: the result is just an > "empty" {{List}} > Typically this is works because the {{build()}} method uses temporary > datastructures until it's "build logic" is complete, at which point it > atomically replaces the datastructures used by the {{lookup()}} method. In > the case of {{AnalyzingInfixSuggester}} however, the {{build()}} method > starts by closing & null'ing out the {{protected SearcherManager > searcherMgr}} (which it only populates again once it's completed building up > it's index) and then the lookup method starts with... > {code:java} > if (searcherMgr == null) { > throw new IllegalStateException("suggester was not built"); > } > {code} > ... meaning it is unsafe to call {{AnalyzingInfixSuggester.lookup()}} in any > situation where another thread may be calling > {{AnalyzingInfixSuggester.build()}} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10524) Augment CONTRIBUTING.md guide with instructions on how/when to benchmark
[ https://issues.apache.org/jira/browse/LUCENE-10524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529692#comment-17529692 ] Gautam Worah commented on LUCENE-10524: --- I've submitted a pretty basic patch. It may be better to redirect users to the actual benchmarking package and let that package have help instructions in case they get stuck, rather than keeping help instructions in Apache Lucene? The Github patch does that at the moment. Looking forward to seeing what you think. > Augment CONTRIBUTING.md guide with instructions on how/when to benchmark > > > Key: LUCENE-10524 > URL: https://issues.apache.org/jira/browse/LUCENE-10524 > Project: Lucene - Core > Issue Type: Wish >Reporter: Gautam Worah >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > This came up when I was trying to think about improving the experience for > new contributors. > Today, new contributors are usually unaware of where luceneutil benchmarks > are and when/how to run them. Committers usually end up pointing contributors > to the benchmarks package when they make perf impacting changes and then they > run the benchmarks. > > Adding benchmark details to the Lucene repo will also make them more > accessible to other researchers who want to experiment/benchmark their own > custom task implementation with Java Lucene. > > What does the community think? > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mocobeta commented on pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
mocobeta commented on PR #850: URL: https://github.com/apache/lucene/pull/850#issuecomment-1112791650 I traced method calls of `BaseTokenStreamTestCase.checkRandomData(Random random, Analyzer a, int iterations, int maxWordLength)` with larger values than 255 for `maxWordLength`. Those might also need larger `maxTokenLength`; I don't know whether they must be changed or the reduced default value is okay, but just in case...
```
TestShingleFilter.testRandomHugeStrings()
TestICUNormalizer2CharFilter.testRandomStringsNFKC()
TestICUNormalizer2CharFilter.testRandomStringsNFKC_CF()
TestICUNormalizer2CharFilter.testRandomStringsNFKD()
TestSynonymMapFilter.testRandomHuge()
TestHTMLStripCharFilter.testRandomHugeStrings()
TestLookaheadTokenFilter.testRandomStrings()
TestLookaheadTokenFilter.testNeverCallingPeek()
TestSynonymGraphFilter.testRandomHuge()
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
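If any of those tests do need a larger limit, MockTokenizer already accepts a per-instance max token length, so a test can opt out of the reduced default. A sketch under assumptions: it relies on the existing `MockTokenizer(CharacterRunAutomaton, boolean, int)` constructor and the test-framework helpers named in the comment above, and the word length 8192 is an arbitrary illustration:

```java
// Sketch: an analyzer whose MockTokenizer keeps accepting long tokens,
// for tests that generate random words above the reduced default length.
Analyzer a = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    // the third constructor argument overrides DEFAULT_MAX_TOKEN_LENGTH
    Tokenizer tokenizer =
        new MockTokenizer(MockTokenizer.WHITESPACE, false, IndexWriter.MAX_TERM_LENGTH);
    return new TokenStreamComponents(tokenizer);
  }
};
// maxWordLength larger than 255, as in the tests listed above
checkRandomData(random(), a, 100 * RANDOM_MULTIPLIER, 8192);
```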
[GitHub] [lucene] gsmiller commented on a diff in pull request #848: LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations
gsmiller commented on code in PR #848: URL: https://github.com/apache/lucene/pull/848#discussion_r861408137

## lucene/facet/src/test/org/apache/lucene/facet/taxonomy/TestTaxonomyFacetAssociations.java:

@@ -500,10 +500,7 @@ private void validateFloats(
       assertNull(facetResult);
     } else {
       assertEquals(dim, facetResult.dim);
-      // We can expect the floats to be exactly equal here since we're ensuring that we sum them
-      // in the same order when determining expected values and when computing facets. See
-      // LUCENE-10530:
-      assertEquals(aggregatedValue, facetResult.value.floatValue(), 0f);
+      assertEquals(aggregatedValue, facetResult.value.floatValue(), 1f);

Review Comment: Thanks for sharing! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #848: LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations
gsmiller commented on PR #848: URL: https://github.com/apache/lucene/pull/848#issuecomment-1112794825 @vigyasharma yeah exactly. So the faceting implementation was summing the doc values in a different order from the test case. I wish I could do something similar for the overall aggregation value as well, but it depends on the order that siblings are vended by the taxonomy index. While it's true today that it's reverse-ordinal order, that seems pretty fragile to rely on. I'm more confident relying on the fact that documents are visited in the order they are collected since that feels pretty stable to the internal workings of Lucene's query evaluation. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren
gsmiller commented on PR #843: URL: https://github.com/apache/lucene/pull/843#issuecomment-1112797399 > Are you suggesting that I add another getAllChildren API to retain the current behavior of getTopChildren, and then fix getTopChildren by returning ranges sorted by counts, or do you mean we should just keep this API untouched? Thanks! I'm suggesting the former, but I think that's a bigger project than you meant to capture with this PR and associated Jira. I would recommend opening a separate Jira issue to propose the idea of adding `getAllChildren` to `Facets`. We have use-cases for that functionality here (but it's just sort of improperly implemented under `getTopChildren`) and also in `LongValueFacetCounts` (see `#getAllChildrenSortByValue`). I think it's reasonable to add that functionality to all faceting implementations and then come up with a migration plan to move the current `RangeFacetCounts#getTopChildren` over and properly implement `RangeFacetCounts#getTopChildren`. But by opening a Jira, maybe others will weigh in and agree/disagree there, which would be a healthy conversation. Thanks again for digging into this and working on improving faceting! This is great! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mocobeta commented on pull request #853: LUCENE-10524 Add benchmark suite details to CONTRIBUTING.md
mocobeta commented on PR #853: URL: https://github.com/apache/lucene/pull/853#issuecomment-1112798859 @gautamworah96 This looks good to me! I wonder if it makes sense to mention the off-the-shelf https://github.com/apache/lucene/tree/main/lucene/benchmark module here too (I didn't find instructions for running it, but a mention could still be helpful). I'd also suggest letting @mikemccand know about the change, thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mocobeta merged pull request #846: LUCENE-10493: move n-best logic to analysis-common
mocobeta merged PR #846: URL: https://github.com/apache/lucene/pull/846 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10493) Can we unify the viterbi search logic in the tokenizers of kuromoji and nori?
[ https://issues.apache.org/jira/browse/LUCENE-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529710#comment-17529710 ] ASF subversion and git services commented on LUCENE-10493: -- Commit c28f575b6db1ece837e1cba3fa5526e30135eb5a in lucene's branch refs/heads/main from Tomoko Uchida [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=c28f575b6db ] LUCENE-10493: move n-best logic to analysis-common (#846) > Can we unify the viterbi search logic in the tokenizers of kuromoji and nori? > - > > Key: LUCENE-10493 > URL: https://issues.apache.org/jira/browse/LUCENE-10493 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Reporter: Tomoko Uchida >Priority: Major > Time Spent: 4h 20m > Remaining Estimate: 0h > > We now have common dictionary interfaces for kuromoji and nori > ([LUCENE-10393]). A natural question would be: is it possible to unify the > Japanese/Korean tokenizers? > The core methods of the two tokenizers are `parse()` and `backtrace()` to > calculate the minimum cost path by Viterbi search. I'd set the goal of this > issue to factoring out them into a separate class (in analysis-common) that > is shared between JapaneseTokenizer and KoreanTokenizer. > The algorithm to solve the minimum cost path itself is of course > language-agnostic, so I think it should be theoretically possible; the most > difficult part here might be the N-best path calculation - which is supported > only by JapaneseTokenizer and not by KoreanTokenizer. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
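As background for the refactoring above: the shared core of both tokenizers is a minimum-cost path search over a token lattice. A toy version follows — schematic only, far simpler than the real parse()/backtrace() code, which also adds bigram connection costs between adjacent tokens and records back-pointers so the cheapest path can be walked back into tokens:

{code:java}
import java.util.Arrays;
import java.util.List;

public class ToyViterbi {
  // Positions 0..n in the input; each candidate token starting at position i
  // is an edge {endPos, cost} in edgesFrom.get(i).
  static int minCost(int n, List<List<int[]>> edgesFrom) {
    int[] best = new int[n + 1];
    Arrays.fill(best, Integer.MAX_VALUE);
    best[0] = 0;
    for (int i = 0; i < n; i++) {
      if (best[i] == Integer.MAX_VALUE) continue; // position unreachable
      for (int[] e : edgesFrom.get(i)) {
        int end = e[0], cost = e[1];
        if (best[i] + cost < best[end]) {
          best[end] = best[i] + cost; // real code also records a back-pointer
        }
      }
    }
    return best[n]; // backtrace() would walk back-pointers to emit tokens
  }
}
{code}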
[GitHub] [lucene] Yuti-G commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren
Yuti-G commented on PR #843: URL: https://github.com/apache/lucene/pull/843#issuecomment-1112806474 Thanks @gsmiller for confirming! I will create another Jira issue to propose adding getAllChildren to Facets, and revisit this issue after getting feedback from the community. Thanks again for the great suggestion! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-10545) Allow Github PR link in CHANGES.html
Tomoko Uchida created LUCENE-10545: -- Summary: Allow Github PR link in CHANGES.html Key: LUCENE-10545 URL: https://issues.apache.org/jira/browse/LUCENE-10545 Project: Lucene - Core Issue Type: Sub-task Reporter: Tomoko Uchida {\{changes2html}} already supports links to Github PRs, but the link is obsoleted so a small modification is needed to make it work again. With this change, developers can directly link to their pull requests from CHANGES.txt, skipping creating Jira issues. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mocobeta opened a new pull request, #854: LUCENE-10393: allow to link github pr from changes
mocobeta opened a new pull request, #854: URL: https://github.com/apache/lucene/pull/854 changes2html already supports links to Github PRs, but the link is obsoleted so a small modification is needed to make it work again. With this change, developers can directly link to their pull requests from CHANGES.txt, skipping creating Jira issues. I'll also show an example in this PR. Here is the CHANGES.html generated by the `changesToHtml` task. The "GITHUB-740" hyperlink correctly points to #740. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mocobeta commented on a diff in pull request #854: LUCENE-10393: allow to link github pr from changes
mocobeta commented on code in PR #854: URL: https://github.com/apache/lucene/pull/854#discussion_r861431387

## lucene/CHANGES.txt:

@@ -53,7 +53,7 @@ Other
 * LUCENE-10253: The @BadApple annotation has been removed from the test framework. (Adrien Grand)
-* LUCENE-10393: Unify binary dictionary and dictionary writer in Kuromoji and Nori.
+* GITHUB-740: Unify binary dictionary and dictionary writer in Kuromoji and Nori.

Review Comment: This mention of the Github PR could also be written in a more verbose style like `GITHUB Pull Request #740`; I'd just prefer the shorter string. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mocobeta commented on a diff in pull request #854: LUCENE-10393: allow to link github pr from changes
mocobeta commented on code in PR #854: URL: https://github.com/apache/lucene/pull/854#discussion_r861432053

## gradle/documentation/changes-to-html/changes2html.pl:

@@ -572,7 +572,7 @@
   $item =~ s{((LUCENE|SOLR|INFRA)\s+(\d{3,}))} {$1}gi;
   # Link "[ github | gh ] pull request [ # ] X+" to Github pull request
-  $item =~ s{((?:(?:(?:github|gh)\s+)?pull\s+request\s*(?:\#?\s*)?|gh-)(\d+))}
+  $item =~ s{((?:(?:(?:github|gh)\s+)?pull\s+request\s*(?:\#?\s*)?|gh-|github-)(\d+))}

Review Comment: The acronym `GH-` may be okay, but `GITHUB-` is also good for understandability and still succinct, I think? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
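The effect of the added `|github-` alternative can be sanity-checked with a rough Java translation of the Perl alternation. This is illustration only; the real script also builds the replacement hyperlink, omitted here, and the translation simplifies the optional `#` handling:

```java
import java.util.regex.Pattern;

public class GithubLinkPattern {
  public static void main(String[] args) {
    // Rough equivalent of the amended alternation in changes2html.pl.
    Pattern p = Pattern.compile(
        "(?:(?:(?:github|gh)\\s+)?pull\\s+request\\s*#?\\s*|gh-|github-)(\\d+)",
        Pattern.CASE_INSENSITIVE);
    System.out.println(p.matcher("GITHUB-740").find());        // true (enabled by this change)
    System.out.println(p.matcher("gh-740").find());            // true (already supported)
    System.out.println(p.matcher("pull request #740").find()); // true (already supported)
  }
}
```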
[jira] [Updated] (LUCENE-10545) Allow Github PR link in CHANGES.html
[ https://issues.apache.org/jira/browse/LUCENE-10545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida updated LUCENE-10545: --- Description: {{changes2html}} already supports links to Github PRs ([LUCENE-5383]), but the link is obsoleted so a small modification is needed to make it work again. With this change, developers can directly link to their pull requests from CHANGES.txt, skipping creating Jira issues. was: {\{changes2html}} already supports links to Github PRs, but the link is obsoleted so a small modification is needed to make it work again. With this change, developers can directly link to their pull requests from CHANGES.txt, skipping creating Jira issues. > Allow Github PR link in CHANGES.html > > > Key: LUCENE-10545 > URL: https://issues.apache.org/jira/browse/LUCENE-10545 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Tomoko Uchida >Priority: Minor > > {{changes2html}} already supports links to Github PRs ([LUCENE-5383]), but > the link is obsoleted so a small modification is needed to make it work again. > With this change, developers can directly link to their pull requests from > CHANGES.txt, skipping creating Jira issues. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10545) Allow Github PR link in CHANGES.html
[ https://issues.apache.org/jira/browse/LUCENE-10545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida updated LUCENE-10545: --- Description: {{changes2html}} already supports links to Github PRs (LUCENE-5383), but the link is obsoleted so a small modification is needed to make it work again. With this change, developers can directly link to their pull requests from CHANGES.txt, skipping creating Jira issues. When the change was made (2014), Github was still a "third party" to Apache and the integration had been on experimental status; the situation has changed - now GitHub is the first-class code hosting service for ASF projects. I think the requirement to open a Jira issue can be relaxed. was: {{changes2html}} already supports links to Github PRs ([LUCENE-5383]), but the link is obsoleted so a small modification is needed to make it work again. With this change, developers can directly link to their pull requests from CHANGES.txt, skipping creating Jira issues. > Allow Github PR link in CHANGES.html > > > Key: LUCENE-10545 > URL: https://issues.apache.org/jira/browse/LUCENE-10545 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Tomoko Uchida >Priority: Minor > > {{changes2html}} already supports links to Github PRs (LUCENE-5383), but the > link is obsoleted so a small modification is needed to make it work again. > With this change, developers can directly link to their pull requests from > CHANGES.txt, skipping creating Jira issues. > When the change was made (2014), Github was still a "third party" to Apache and > the integration had been on experimental status; the situation has changed - > now GitHub is the first-class code hosting service for ASF projects. I think > the requirement to open a Jira issue can be relaxed. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10545) Allow Github PR link in CHANGES.html
[ https://issues.apache.org/jira/browse/LUCENE-10545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529732#comment-17529732 ] Tomoko Uchida commented on LUCENE-10545: This PR gives a concrete example of how it'll work. https://github.com/apache/lucene/pull/854 > Allow Github PR link in CHANGES.html > > > Key: LUCENE-10545 > URL: https://issues.apache.org/jira/browse/LUCENE-10545 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Tomoko Uchida >Priority: Minor > > {{changes2html}} already supports links to Github PRs (LUCENE-5383), but the > link is obsoleted so a small modification is needed to make it work again. > With this change, developers can directly link to their pull requests from > CHANGES.txt, skipping creating Jira issues. > When the change was made (2014), Github was still a "third party" to Apache and > the integration had been on experimental status; the situation has changed - > now GitHub is the first-class code hosting service for ASF projects. I think > the requirement to open a Jira issue can be relaxed. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mocobeta commented on a diff in pull request #854: LUCENE-10393: allow to link github pr from changes
mocobeta commented on code in PR #854: URL: https://github.com/apache/lucene/pull/854#discussion_r861437113

## lucene/CHANGES.txt:

@@ -53,7 +53,7 @@ Other
 * LUCENE-10253: The @BadApple annotation has been removed from the test framework. (Adrien Grand)
-* LUCENE-10393: Unify binary dictionary and dictionary writer in Kuromoji and Nori.
+* GITHUB-740: Unify binary dictionary and dictionary writer in Kuromoji and Nori.

Review Comment: Even if we start to use Github issues in the future, I think this is still valid - CHANGES should directly point to PRs, not issues, in my opinion. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-10543) Achieve contribution workflow perfection (with progress)
[ https://issues.apache.org/jira/browse/LUCENE-10543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529735#comment-17529735 ] Tomoko Uchida edited comment on LUCENE-10543 at 4/29/22 3:21 AM: - I opened [LUCENE-10545] and a PR for it. Can you please take a look? was (Author: tomoko uchida): I opened [LUCENE-10393] and a PR for it. Can you please take a look? > Achieve contribution workflow perfection (with progress) > > > Key: LUCENE-10543 > URL: https://issues.apache.org/jira/browse/LUCENE-10543 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > > Inspired by Dawid's build issue which has worked out for us: LUCENE-9871 > He hasn't even linked 10% of the issues/subtasks involved in that work > either, but we know. > I think we need a similar approach for the contribution workflow. There has > been some major improvements recently, a couple that come to mind: > * Tomoko made a CONTRIBUTING.md file which github recognizes and is way > better than the wiki stuff > * Some hazards/error messages/mazes in the build process and so on have > gotten fixed. > But there is more to do in my opinion, here is 3 ideas: > * Creating a PR still has a massive checklist template. But now this template > links to CONTRIBUTING.md, so why include the other stuff/checklist? Isn't it > enough to just link to CONTRIBUTING.md and fix that as needed? > * Creating a PR still requires signing up for Apache JIRA and creating a JIRA > issue. There is zero value to this additional process. We often end out with > either JIRAs and/or PRs that have zero content, or maybe conflicting/outdated > content. This is just an unnecessary dance, can we use github issues instead? > * Haven't dug into the github actions or configs very deeply. Maybe there's > simple stuff we can do such as give useful notifications if checks fail. Try > to guide the user to run ./gradlew check and fix it. It sucks to have to > review, look at logs, and manually add comments to do this stuff. > So let's have an issue to improve this area. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10545) Allow Github PR link in CHANGES.html
[ https://issues.apache.org/jira/browse/LUCENE-10545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529748#comment-17529748 ] Tomoko Uchida commented on LUCENE-10545: Note that this does not replace Jira issues with Github (we still absolutely need issues for discussion), but provides an option for developers who already have working patches. > Allow Github PR link in CHANGES.html > > > Key: LUCENE-10545 > URL: https://issues.apache.org/jira/browse/LUCENE-10545 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Tomoko Uchida >Priority: Minor > > {{changes2html}} already supports links to Github PRs (LUCENE-5383), but the > link is obsoleted so a small modification is needed to make it work again. > With this change, developers can directly link to their pull requests from > CHANGES.txt, skipping creating Jira issues. > When the change was made (2014), Github was still a "third party" to Apache and > the integration had been on experimental status; the situation has changed - > now GitHub is the first-class code hosting service for ASF projects. I think > the requirement to open a Jira issue can be relaxed. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-10543) Achieve contribution workflow perfection (with progress)
[ https://issues.apache.org/jira/browse/LUCENE-10543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529735#comment-17529735 ] Tomoko Uchida edited comment on LUCENE-10543 at 4/29/22 6:36 AM: - I opened [LUCENE-10545] and a PR for it. Can you please take a look? It does not aim to replace Jira with GitHub, but relaxes the requirement for a Jira account/issue for contributors who are ready to start the conversation with a working PR. This would make some progress, I think. For the future, we could consider replacing Jira with Github issues - I'm not against it, and it may be technically possible, but I think we'd need some additional discussion (also with ASF, maybe?) for such a big transition. I have no personal opinion about the migration to GitHub - Jira works well for me so far and I'm already accustomed to the concurrent use of Jira and GitHub. was (Author: tomoko uchida): I opened [LUCENE-10545] and a PR for it. Can you please take a look? > Achieve contribution workflow perfection (with progress) > > > Key: LUCENE-10543 > URL: https://issues.apache.org/jira/browse/LUCENE-10543 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > > Inspired by Dawid's build issue which has worked out for us: LUCENE-9871 > He hasn't even linked 10% of the issues/subtasks involved in that work > either, but we know. > I think we need a similar approach for the contribution workflow. There has > been some major improvements recently, a couple that come to mind: > * Tomoko made a CONTRIBUTING.md file which github recognizes and is way > better than the wiki stuff > * Some hazards/error messages/mazes in the build process and so on have > gotten fixed. > But there is more to do in my opinion, here is 3 ideas: > * Creating a PR still has a massive checklist template. But now this template > links to CONTRIBUTING.md, so why include the other stuff/checklist? Isn't it > enough to just link to CONTRIBUTING.md and fix that as needed? > * Creating a PR still requires signing up for Apache JIRA and creating a JIRA > issue. There is zero value to this additional process. We often end out with > either JIRAs and/or PRs that have zero content, or maybe conflicting/outdated > content. This is just an unnecessary dance, can we use github issues instead? > * Haven't dug into the github actions or configs very deeply. Maybe there's > simple stuff we can do such as give useful notifications if checks fail. Try > to guide the user to run ./gradlew check and fix it. It sucks to have to > review, look at logs, and manually add comments to do this stuff. > So let's have an issue to improve this area. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org