[jira] [Commented] (LUCENE-9204) Move span queries to the queries module
[ https://issues.apache.org/jira/browse/LUCENE-9204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353119#comment-17353119 ] Jim Ferenczi commented on LUCENE-9204: -- The rewriting of disjunctions that contain duplicates is simple to explain and to reason about. It is indeed slower, but there is nothing else to compare it with, since it is the key to correctness. Setting aside the differences between spans and intervals, I think this example demonstrates how we can approach this kind of issue. We have a complex semantics based on minimal intervals that comes with limitations. We can try to break these limitations, but most of the time that will affect another layer that expects correctness. So in this case we apply a simple transformation that doesn't require changing the semantics. I don't think we can/should allow for leniency here: it is either correct or it isn't. We can evaluate the cost of keeping correctness in some situations, but I don't think there are any trade-offs to find. If it's not correct, it's a bug, not a feature ;). > Move span queries to the queries module > --- > > Key: LUCENE-9204 > URL: https://issues.apache.org/jira/browse/LUCENE-9204 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: main (9.0) > > Time Spent: 1h > Remaining Estimate: 0h > > We have a slightly odd situation currently, with two parallel query > structures for building complex positional queries: the long-standing span > queries, in core; and interval queries, in the queries module. Given that > interval queries solve at least some of the problems we've had with Spans, I > think we should be pushing users more towards these implementations. It's > counter-intuitive to do that when Spans are in core though. I've opened this > issue to discuss moving the spans package as a whole to the queries module. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
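To make the intervals side of this concrete, here is a minimal sketch (not taken from the issue; the field and term names are made up) of a disjunction whose alternatives overlap, which is the situation where minimal-interval semantics can silently prefer the shorter alternative unless the disjunction is rewritten as described above. It uses the intervals API from the queries module that the issue proposes as the successor to spans:

```java
import org.apache.lucene.queries.intervals.IntervalQuery;
import org.apache.lucene.queries.intervals.Intervals;
import org.apache.lucene.queries.intervals.IntervalsSource;
import org.apache.lucene.search.Query;

public class IntervalDisjunctionExample {
  public static Query build() {
    // "alpha" followed by either "beta" or the phrase "beta gamma".
    // The bare term "beta" always yields the shorter (minimal) interval, so
    // without rewriting the disjunction, the enclosing ordered source can miss
    // matches that only the "beta gamma" alternative would have produced.
    IntervalsSource disjunction =
        Intervals.or(Intervals.term("beta"), Intervals.phrase("beta", "gamma"));
    IntervalsSource ordered = Intervals.ordered(Intervals.term("alpha"), disjunction);
    return new IntervalQuery("body", ordered); // the field name "body" is illustrative
  }
}
```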
[GitHub] [lucene] dweiss commented on pull request #156: LUCENE-9975: don't require signing of 'unsignedJars' publication
dweiss commented on pull request #156: URL: https://github.com/apache/lucene/pull/156#issuecomment-850298459

bq. I don't know how this works

It's simple but not straightforward. The two entries under "publications" (jars and unsignedJars) cause gradle to create a bunch of convention-named additional tasks actually connecting these publications to their target. So, when you run 'gradlew tasks' you'll see them and figure out the pattern:

```
generateMetadataFileForJarsPublication - Generates the Gradle metadata file for publication 'jars'.
generateMetadataFileForUnsignedJarsPublication - Generates the Gradle metadata file for publication 'unsignedJars'.
generatePomFileForJarsPublication - Generates the Maven POM file for publication 'jars'.
generatePomFileForUnsignedJarsPublication - Generates the Maven POM file for publication 'unsignedJars'.
mavenToApacheSnapshots - Publish Maven JARs and POMs to Apache Snapshots repository: https://repository.apache.org/content/repositories/snapshots
mavenToLocalFolder - Publish Maven JARs and POMs locally to C:\Work\apache\lucene\main\build\maven-local
mavenToLocalRepo - Publish Maven JARs and POMs to current user's local maven repository.
publish - Publishes all publications produced by this project.
publishAllPublicationsToApacheSnapshotsRepository - Publishes all Maven publications produced by this project to the ApacheSnapshots repository.
publishAllPublicationsToBuildRepository - Publishes all Maven publications produced by this project to the build repository.
publishJarsPublicationToApacheSnapshotsRepository - Publishes Maven publication 'jars' to Maven repository 'ApacheSnapshots'.
publishJarsPublicationToBuildRepository - Publishes Maven publication 'jars' to Maven repository 'build'.
publishJarsPublicationToMavenLocal - Publishes Maven publication 'jars' to the local Maven repository.
publishToMavenLocal - Publishes all Maven publications produced by this project to the local Maven cache.
publishUnsignedJarsPublicationToApacheSnapshotsRepository - Publishes Maven publication 'unsignedJars' to Maven repository 'ApacheSnapshots'.
publishUnsignedJarsPublicationToBuildRepository - Publishes Maven publication 'unsignedJars' to Maven repository 'build'.
publishUnsignedJarsPublicationToMavenLocal - Publishes Maven publication 'unsignedJars' to the local Maven repository.
```

I personally don't like typing so much so the rest of the code adds links between shorter aliases (mavenToLocalFolder, mavenToLocalRepo) and the right "convention" task. There is also a bit that configures both publications using the same closure to avoid some repetition. I agree this isn't super-clean but it's how gradle works. Perhaps there is a way to make it nicer, I'm not sure.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dweiss merged pull request #156: LUCENE-9975: don't require signing of 'unsignedJars' publication
dweiss merged pull request #156: URL: https://github.com/apache/lucene/pull/156 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-9975) Don't require artifact signing for local maven artifact publishing
[ https://issues.apache.org/jira/browse/LUCENE-9975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss resolved LUCENE-9975. - Fix Version/s: main (9.0) Resolution: Fixed > Don't require artifact signing for local maven artifact publishing > -- > > Key: LUCENE-9975 > URL: https://issues.apache.org/jira/browse/LUCENE-9975 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Minor > Fix For: main (9.0) > > Attachments: image-2021-05-26-11-58-45-713.png > > Time Spent: 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9975) Don't require artifact signing for local maven artifact publishing
[ https://issues.apache.org/jira/browse/LUCENE-9975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353136#comment-17353136 ] ASF subversion and git services commented on LUCENE-9975: - Commit 0a316b24950645cb5aca66b18d59a042ab1819e8 in lucene's branch refs/heads/main from Dawid Weiss [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=0a316b2 ] LUCENE-9975: don't require signing of 'unsignedJars' publication (maven artifacts published to the user's maven local repository, build folder and apache nexus). (#156) > Don't require artifact signing for local maven artifact publishing > -- > > Key: LUCENE-9975 > URL: https://issues.apache.org/jira/browse/LUCENE-9975 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Minor > Attachments: image-2021-05-26-11-58-45-713.png > > Time Spent: 0.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9975) Don't require artifact signing for local maven artifact publishing
[ https://issues.apache.org/jira/browse/LUCENE-9975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353231#comment-17353231 ] Uwe Schindler commented on LUCENE-9975: --- Hi, I removed the {{-x signJarsPublication}} from the command line on Jenkins. Thanks, Uwe > Don't require artifact signing for local maven artifact publishing > -- > > Key: LUCENE-9975 > URL: https://issues.apache.org/jira/browse/LUCENE-9975 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Minor > Fix For: main (9.0) > > Attachments: image-2021-05-26-11-58-45-713.png > > Time Spent: 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #128: LUCENE-9662: CheckIndex should be concurrent - parallelizing index parts check within each segment
rmuir commented on pull request #128: URL: https://github.com/apache/lucene/pull/128#issuecomment-850330270 > I was finally able to rebuild my local index with wikibigall generating 15 segments, and performed 2 test runs with different threadCount. With 11 threadCount, it took 359.293 sec in total to finish, and with 1 threadCount it took 378.583 sec in total, so about 5% time saving. I feel faster machine will have better time saving, but in general the speed up seems to be limited given the skewed distribution of checking speed of different segment parts (e.g. posting check can account for around 85% ~ 90% of the total time spent in the first segment check). I really don't think we should add this complexity to checkindex if it only improves it 5%. Let's take a step back and do this issue differently (e.g. just check whole segments concurrently with different threads). This should also require less complexity and be less invasive to the checkindex tool. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #128: LUCENE-9662: CheckIndex should be concurrent - parallelizing index parts check within each segment
rmuir commented on pull request #128: URL: https://github.com/apache/lucene/pull/128#issuecomment-850335925 > The python tool looks very cool and thanks for testing it! One issue though is that this bit flipping is causing checksum integrity check failures before the concurrent segment part checks kick in, so it may not test the changes here? I think we may actually need to write a semantically buggy segment file with good checksum verification to see the error still gets detected and propagated correctly? This is also why checkindex has its `-fast` option, to only verify file checksums, which is probably what most end-users actually want when trying to verify that their index is intact and correct. The current (historical) slow defaults are really only useful for debugging a bug in lucene itself. And there is even a `-slow` that can do even more of that. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #128: LUCENE-9662: CheckIndex should be concurrent - parallelizing index parts check within each segment
rmuir commented on pull request #128: URL: https://github.com/apache/lucene/pull/128#issuecomment-850340290 I'm gonna throw out the crazy idea to make `-fast` the new default. The previous `-slow` could be moved to `-slower`, and the current default behavior could be activated by `-slow`. I think the tool's defaults are unnecessarily slow purely for historical reasons (not having checksums originally). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
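For readers who have not used the tool, the checksum-only mode can also be driven programmatically. The sketch below assumes the `setChecksumsOnly` setter that, to my understanding, backs the `-fast` command-line flag; verify the exact method against the CheckIndex javadocs for your version:

```java
import java.nio.file.Paths;
import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class FastCheck {
  public static void main(String[] args) throws Exception {
    try (Directory dir = FSDirectory.open(Paths.get(args[0]));
        CheckIndex checker = new CheckIndex(dir)) {
      // Assumed setter corresponding to the -fast CLI option: verify file
      // checksums only, skipping the slow per-part structural checks.
      checker.setChecksumsOnly(true);
      CheckIndex.Status status = checker.checkIndex();
      System.out.println(status.clean ? "index is clean" : "index has problems");
    }
  }
}
```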
[GitHub] [lucene] janhoy merged pull request #136: LUCENE-9589 Swedish Minimal Stemmer
janhoy merged pull request #136: URL: https://github.com/apache/lucene/pull/136 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-9589) Swedish Minimal Stemmer
[ https://issues.apache.org/jira/browse/LUCENE-9589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Høydahl resolved LUCENE-9589. - Fix Version/s: main (9.0) Resolution: Fixed > Swedish Minimal Stemmer > --- > > Key: LUCENE-9589 > URL: https://issues.apache.org/jira/browse/LUCENE-9589 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Reporter: Jan Høydahl >Assignee: Jan Høydahl >Priority: Major > Fix For: main (9.0) > > Time Spent: 3h > Remaining Estimate: 0h > > Swedish has a {{SwedishLightStemmer}} but lacks a Minimal stemmer that would > only stem singular/plural. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9589) Swedish Minimal Stemmer
[ https://issues.apache.org/jira/browse/LUCENE-9589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353316#comment-17353316 ] ASF subversion and git services commented on LUCENE-9589: - Commit 5fdff6eabb1e6452320800805355938ce8b903ff in lucene's branch refs/heads/main from Jan Høydahl [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=5fdff6e ] LUCENE-9589 Swedish Minimal Stemmer (#136) > Swedish Minimal Stemmer > --- > > Key: LUCENE-9589 > URL: https://issues.apache.org/jira/browse/LUCENE-9589 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Reporter: Jan Høydahl >Assignee: Jan Høydahl >Priority: Major > Time Spent: 3h > Remaining Estimate: 0h > > Swedish has a {{SwedishLightStemmer}} but lacks a Minimal stemmer that would > only stem singular/plural. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] gerlowskija merged pull request #2501: SOLR-15090: Allow backup storage in GCS
gerlowskija merged pull request #2501: URL: https://github.com/apache/lucene-solr/pull/2501 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] dweiss commented on pull request #2499: LUCENE-9687: Hunspell support improvements
dweiss commented on pull request #2499: URL: https://github.com/apache/lucene-solr/pull/2499#issuecomment-850444097 Hi @donnerpeter ! I think it's a change that is beneficial for folks who use Hunspell and if you've already put the effort to backport, why not apply it? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] donnerpeter merged pull request #2499: LUCENE-9687: Hunspell support improvements
donnerpeter merged pull request #2499: URL: https://github.com/apache/lucene-solr/pull/2499 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9687) Hunspell support improvements
[ https://issues.apache.org/jira/browse/LUCENE-9687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353510#comment-17353510 ] ASF subversion and git services commented on LUCENE-9687: - Commit c6b964a11576af65091e5f9a3bc819cd34020422 in lucene-solr's branch refs/heads/branch_8x from Peter Gromov [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c6b964a ] LUCENE-9687: Hunspell support improvements (#2499) add API for spell-checking and suggestions, support compound words, fix various behavior differences between Java and C++ implementations, improve performance backported from "main" branch > Hunspell support improvements > - > > Key: LUCENE-9687 > URL: https://issues.apache.org/jira/browse/LUCENE-9687 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Peter Gromov >Priority: Major > Fix For: main (9.0) > > Time Spent: 1h 10m > Remaining Estimate: 0h > > I'd like Lucene's Hunspell support to be on a par with the native C++ > Hunspell for spellchecking and suggestions, at least for some languages. So I > propose to: > * support the affix rules necessary for English, German, French, Spanish and > Russian dictionaries, possibly more languages later > * mirror Hunspell's suggestion algorithm in Lucene > * provide public APIs for spellchecking, suggestion, stemming, > morphological data > * check corpora for specific languages to find and fix > spellchecking/suggestion discrepancies between Lucene's implementation and > Hunspell/C++ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
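For readers who have not followed LUCENE-9687, the new spell-checking and suggestion entry points look roughly like the sketch below. The class and constructor shapes are as I understand them from the change, and the dictionary file names are placeholders; check the org.apache.lucene.analysis.hunspell javadocs for the exact signatures:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.lucene.analysis.hunspell.Dictionary;
import org.apache.lucene.analysis.hunspell.Hunspell;
import org.apache.lucene.store.ByteBuffersDirectory;

public class SpellCheckSketch {
  public static void main(String[] args) throws Exception {
    try (InputStream aff = Files.newInputStream(Paths.get("en_US.aff"));   // placeholder paths
        InputStream dic = Files.newInputStream(Paths.get("en_US.dic"))) {
      // Dictionary needs a scratch Directory to build its internal data structures.
      Dictionary dictionary = new Dictionary(new ByteBuffersDirectory(), "hunspell", aff, dic);
      Hunspell speller = new Hunspell(dictionary);
      System.out.println(speller.spell("recieve"));   // expected: false for a misspelling
      System.out.println(speller.suggest("recieve")); // expected to include "receive"
    }
  }
}
```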
[jira] [Created] (LUCENE-9980) Do not expose deleted commits in IndexDeletionPolicy#onCommit
Nhat Nguyen created LUCENE-9980: --- Summary: Do not expose deleted commits in IndexDeletionPolicy#onCommit Key: LUCENE-9980 URL: https://issues.apache.org/jira/browse/LUCENE-9980 Project: Lucene - Core Issue Type: Bug Components: core/index Affects Versions: 8.8.1 Reporter: Nhat Nguyen If we fail to delete files that belong to a commit point, then we will expose that deleted commit in the next calls of IndexDeletionPolicy#onCommit(). I think we should never expose those deleted commit points as some of their files might have been deleted already. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dnhatn opened a new pull request #158: LUCENE-9980: Do not expose deleted commit points in IndexDeletionPolicy
dnhatn opened a new pull request #158: URL: https://github.com/apache/lucene/pull/158 If we fail to delete files that belong to a commit point, then we will expose that deleted commit in the next calls of `IndexDeletionPolicy#onCommit()`. I think we should never expose those deleted commit points as some of their files might have been deleted already. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
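For context, an IndexDeletionPolicy only ever sees commit points through onInit/onCommit and decides which of them to delete. A minimal keep-only-the-last-commit sketch (mirroring what KeepOnlyLastCommitDeletionPolicy does) illustrates why handing such a policy an already-deleted commit point is problematic:

```java
import java.util.List;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexDeletionPolicy;

public class KeepLastOnlyPolicy extends IndexDeletionPolicy {
  @Override
  public void onInit(List<? extends IndexCommit> commits) {
    onCommit(commits);
  }

  @Override
  public void onCommit(List<? extends IndexCommit> commits) {
    // Commits are passed oldest-first; keep only the most recent one.
    // If IndexWriter exposed a commit whose files were already deleted, a
    // policy inspecting that commit (or calling delete() on it) could fail
    // because the underlying files may no longer exist.
    for (int i = 0; i < commits.size() - 1; i++) {
      commits.get(i).delete();
    }
  }
}
```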
[GitHub] [lucene] dnhatn commented on pull request #158: LUCENE-9980: Do not expose deleted commit points in IndexDeletionPolicy
dnhatn commented on pull request #158: URL: https://github.com/apache/lucene/pull/158#issuecomment-850650666 @mikemccand Would you mind taking a look? Thank you! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #138: LUCENE-9956: Make getBaseQuery, getDrillDownQueries API from DrillDownQuery public
gautamworah96 commented on a change in pull request #138: URL: https://github.com/apache/lucene/pull/138#discussion_r641822305 ## File path: lucene/facet/src/java/org/apache/lucene/facet/DrillDownQuery.java ## @@ -170,15 +184,43 @@ private BooleanQuery getBooleanQuery() { return bq.build(); } - Query getBaseQuery() { + /** + * Returns the internal baseQuery of the DrillDownQuery + * + * @return The baseQuery used on initialization of DrillDownQuery + */ + public Query getBaseQuery() { return baseQuery; } - Query[] getDrillDownQueries() { + /** + * Returns the dimension queries added either via {@link #add(String, Query)} or {@link + * #add(String, String...)} + * + * @return The array of dimQueries + */ + public Query[] getDrillDownQueries() { +if (isAnyDimQueryDirty == false) { + // returns previously built dimQueries + Query[] builtDimQueriesCopy = new Query[builtDimQueries.size()]; + return builtDimQueries.toArray(builtDimQueriesCopy); +} Query[] dimQueries = new Query[this.dimQueries.size()]; for (int i = 0; i < dimQueries.length; ++i) { - dimQueries[i] = this.dimQueries.get(i).build(); + if (isDimQueryDirty.get(i) == true) { Review comment: Hmmm. That makes sense. I've implemented a `Set ` approach in the next commit -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] rmuir commented on pull request #2499: LUCENE-9687: Hunspell support improvements
rmuir commented on pull request #2499: URL: https://github.com/apache/lucene-solr/pull/2499#issuecomment-850699447 I'm sorry I've been nonresponsive; I've been crazy busy and just came up for air now. Thank you for backporting this to 8.x! Great improvements to get out there to the users now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-9945) Extend DrillSideways to support exposing FacetCollectors directly
[ https://issues.apache.org/jira/browse/LUCENE-9945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sejal Pawar updated LUCENE-9945: Attachment: LUCENE-9945.patch > Extend DrillSideways to support exposing FacetCollectors directly > - > > Key: LUCENE-9945 > URL: https://issues.apache.org/jira/browse/LUCENE-9945 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Affects Versions: main (9.0) >Reporter: Greg Miller >Priority: Minor > Attachments: LUCENE-9945.patch > > > The {{DrillSideways}} logic currently encapsulates, 1) the creation of > multiple {{FacetsCollector}} instances, and 2) the processing of those > {{FacetsCollectors into a single Facets}} instance. While I suspect this > works well for most common cases, and is simple to understand, it's difficult > to extend to more advanced cases. > I propose extending {{DrillSideways}} to support exposing the underlying > {{FacetsCollector}} instances if the user needs them, in addition to > maintaining the current functionality for all of the more common cases. > Specifically, I'd like to add both the "drill down" {{FacetsCollector}} and > map of dim -> {{FacetsCollector}} for "drill sideways" to the > {{DrillSidewaysResult}} and {{ConcurrentDrillSidewaysResult}} classes. While > it's true that a user can extend {{DrillSideways}} and override > {{buildFacetsResult}} to keep track of these, it seems reasonable to provide > this in {{DrillSideways}} itself so users don't need to sub-class for only > this purpose. > Here are two use-cases illustrating the desire for this: > # For certain cases, instead of actually creating a {{Facets}} instance from > the {{FacetsCollector}}, I'd like to intersect the {{FacetsCollector}} > matching docs with a {{TermEnum}} to do determine whether-or-not at least one > match exists. This is useful for a "facet" that can only take on the values > "true"/"false", and I want to make sure at least one hit has the value "true". > # In another case, I only care about some aggregate statistics for a given > facet field. For example, I want to find the min and max values. For this, I > want to intersect some doc value field with a {{FacetsCollector}} and only > track the min/max values I observe while iterating. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-9945) Extend DrillSideways to support exposing FacetCollectors directly
[ https://issues.apache.org/jira/browse/LUCENE-9945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sejal Pawar updated LUCENE-9945: Attachment: (was: LUCENE-9945.patch) > Extend DrillSideways to support exposing FacetCollectors directly > - > > Key: LUCENE-9945 > URL: https://issues.apache.org/jira/browse/LUCENE-9945 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Affects Versions: main (9.0) >Reporter: Greg Miller >Priority: Minor > > The {{DrillSideways}} logic currently encapsulates, 1) the creation of > multiple {{FacetsCollector}} instances, and 2) the processing of those > {{FacetsCollectors into a single Facets}} instance. While I suspect this > works well for most common cases, and is simple to understand, it's difficult > to extend to more advanced cases. > I propose extending {{DrillSideways}} to support exposing the underlying > {{FacetsCollector}} instances if the user needs them, in addition to > maintaining the current functionality for all of the more common cases. > Specifically, I'd like to add both the "drill down" {{FacetsCollector}} and > map of dim -> {{FacetsCollector}} for "drill sideways" to the > {{DrillSidewaysResult}} and {{ConcurrentDrillSidewaysResult}} classes. While > it's true that a user can extend {{DrillSideways}} and override > {{buildFacetsResult}} to keep track of these, it seems reasonable to provide > this in {{DrillSideways}} itself so users don't need to sub-class for only > this purpose. > Here are two use-cases illustrating the desire for this: > # For certain cases, instead of actually creating a {{Facets}} instance from > the {{FacetsCollector}}, I'd like to intersect the {{FacetsCollector}} > matching docs with a {{TermEnum}} to do determine whether-or-not at least one > match exists. This is useful for a "facet" that can only take on the values > "true"/"false", and I want to make sure at least one hit has the value "true". > # In another case, I only care about some aggregate statistics for a given > facet field. For example, I want to find the min and max values. For this, I > want to intersect some doc value field with a {{FacetsCollector}} and only > track the min/max values I observe while iterating. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-9964) FacetResult.labelValues.value is not accurate for duplicate labels in a document
[ https://issues.apache.org/jira/browse/LUCENE-9964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gautam Worah updated LUCENE-9964: - Issue Type: Bug (was: Improvement) > FacetResult.labelValues.value is not accurate for duplicate labels in a > document > > > Key: LUCENE-9964 > URL: https://issues.apache.org/jira/browse/LUCENE-9964 > Project: Lucene - Core > Issue Type: Bug > Components: modules/facet >Affects Versions: 8.8.1 >Reporter: Gautam Worah >Priority: Minor > > As part of a separate [bug|https://github.com/apache/lucene/pull/131] in > FacetResult#value we discovered that FacetResult.labelValues.value is not > accurate for duplicate labels in a document that uses > SortedNumericDocValuesFields. > In theory, each label should only be counted once from a document when > returning the labelValues, but today, each duplicate label in a document is > counted uniquely. > A test case showing the current (inaccurate) output is > [here|https://github.com/gautamworah96/lucene/commit/042878117308f76629a27b0bcf83e25f074dc8b1#diff-6fda2d7520edb7e11d4af3f5b6e80c073a0e20790ec42692573137959837d742] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] sejal-pawar opened a new pull request #159: extend DrillSideways to expose FacetCollector and drill-down dimensions
sejal-pawar opened a new pull request #159: URL: https://github.com/apache/lucene/pull/159

# Description

(From [LUCENE-9945](https://issues.apache.org/jira/browse/LUCENE-9945)) The DrillSideways logic currently encapsulates, 1) the creation of multiple FacetsCollector instances, and 2) the processing of those FacetsCollectors into a single Facets instance. While I suspect this works well for most common cases, and is simple to understand, it's difficult to extend to more advanced cases. LUCENE-9945 proposes extending DrillSideways to support exposing the underlying FacetsCollector instances if the user needs them, in addition to maintaining the current functionality for all of the more common cases. Specifically, I'd like to add both the "drill down" FacetsCollector and map of dim -> FacetsCollector for "drill sideways" to the DrillSidewaysResult and ConcurrentDrillSidewaysResult classes. While it's possible that a user can extend DrillSideways and override buildFacetsResult to keep track of these, it seems reasonable to provide this in DrillSideways itself so users don't need to sub-class for only this purpose.

# Solution

Modified DrillSideways itself to expose FacetCollector map of dimensions.

# Tests

./gradlew test

# Checklist

Please review the following and check all that apply:

- [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code conforms to the standards described there to the best of my ability.
- [x] I have created a Jira issue and added the issue ID to my pull request title.
- [x] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [x] I have developed this patch against the `main` branch.
- [x] I have run `./gradlew check`.
- [ ] I have added tests for my changes.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on a change in pull request #138: LUCENE-9956: Make getBaseQuery, getDrillDownQueries API from DrillDownQuery public
gsmiller commented on a change in pull request #138: URL: https://github.com/apache/lucene/pull/138#discussion_r641848546 ## File path: lucene/facet/src/java/org/apache/lucene/facet/DrillDownQuery.java ## @@ -170,16 +185,36 @@ private BooleanQuery getBooleanQuery() { return bq.build(); } - Query getBaseQuery() { + /** + * Returns the internal baseQuery of the DrillDownQuery + * + * @return The baseQuery used on initialization of DrillDownQuery + */ + public Query getBaseQuery() { return baseQuery; } - Query[] getDrillDownQueries() { -Query[] dimQueries = new Query[this.dimQueries.size()]; -for (int i = 0; i < dimQueries.length; ++i) { - dimQueries[i] = this.dimQueries.get(i).build(); + /** + * Returns the dimension queries added either via {@link #add(String, Query)} or {@link + * #add(String, String...)} + * + * @return The array of dimQueries + */ + public Query[] getDrillDownQueries() { +if (dirtyDimQueryIndex.isEmpty()) { + // returns previously built dimQueries + Query[] builtDimQueriesCopy = new Query[builtDimQueries.size()]; + return builtDimQueries.toArray(builtDimQueriesCopy); +} +for (int i = 0; i < this.dimQueries.size(); ++i) { + if (dirtyDimQueryIndex.contains(i)) { +builtDimQueries.set(i, this.dimQueries.get(i).build()); +dirtyDimQueryIndex.remove(i); + } } -return dimQueries; +assert dirtyDimQueryIndex.isEmpty(); +// by this time builtDimQueries has all the built queries and dirtyDimQueryIndex is empty +return getDrillDownQueries(); Review comment: What about something like: ``` public Query[] getDrillDownQueries() { for (Integer dirtyIndex : dirtyDimQueryIndex) { builtDimQueries.set(dirtyIndex, dimQueries.get(dirtyIndex).build()); } dirtyDimQueryIndex.clear(); return builtDimQueries.toArray(new Query[0]); } ``` ## File path: lucene/facet/src/java/org/apache/lucene/facet/DrillDownQuery.java ## @@ -170,16 +185,36 @@ private BooleanQuery getBooleanQuery() { return bq.build(); } - Query getBaseQuery() { + /** + * Returns the internal baseQuery of the DrillDownQuery + * + * @return The baseQuery used on initialization of DrillDownQuery + */ + public Query getBaseQuery() { return baseQuery; } - Query[] getDrillDownQueries() { -Query[] dimQueries = new Query[this.dimQueries.size()]; -for (int i = 0; i < dimQueries.length; ++i) { - dimQueries[i] = this.dimQueries.get(i).build(); + /** + * Returns the dimension queries added either via {@link #add(String, Query)} or {@link + * #add(String, String...)} + * + * @return The array of dimQueries + */ + public Query[] getDrillDownQueries() { Review comment: Let's take advantage of the built query caching in `getBooleanQuery` as well! (Sorry for missing this earlier). We should be able to replace the explicit building of the drill down queries with something like: ``` for (Query query : getDrillDownQueries()) { bq.add(query, Occur.FILTER); } ``` ## File path: lucene/facet/src/java/org/apache/lucene/facet/DrillDownQuery.java ## @@ -53,6 +55,8 @@ public static Term term(String field, String dim, String... path) { private final Query baseQuery; private final List dimQueries = new ArrayList<>(); private final Map drillDownDims = new LinkedHashMap<>(); + private List builtDimQueries = new ArrayList<>(); + private Set dirtyDimQueryIndex = new HashSet<>(); Review comment: Can you make these final please? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
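As a usage illustration of what this PR exposes, here is a short sketch written against the accessors as they appear in the diff above; the dimension names and base query are made up, and the final API shape may differ:

```java
import org.apache.lucene.facet.DrillDownQuery;
import org.apache.lucene.facet.FacetsConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class DrillDownAccessExample {
  public static void inspect(FacetsConfig config) {
    Query base = new TermQuery(new Term("contents", "lucene"));
    DrillDownQuery ddq = new DrillDownQuery(config, base);
    ddq.add("Author", "Bob");
    ddq.add("Publish Year", "2020");

    // The newly public accessors under discussion: the base query plus one
    // built Query per drill-down dimension added above.
    Query baseQuery = ddq.getBaseQuery();
    Query[] dimQueries = ddq.getDrillDownQueries();
    System.out.println(baseQuery + " with " + dimQueries.length + " drill-down dimensions");
  }
}
```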
[GitHub] [lucene] gsmiller commented on a change in pull request #159: LUCENE-9945: extend DrillSideways to expose FacetCollector and drill-down dimensions
gsmiller commented on a change in pull request #159: URL: https://github.com/apache/lucene/pull/159#discussion_r641852899 ## File path: lucene/facet/src/java/org/apache/lucene/facet/DrillSideways.java ## @@ -335,10 +342,22 @@ protected boolean scoreSubDocsAtOnce() { /** Hits. */ public final TopDocs hits; +/** Collector to compute Facets */ +FacetsCollector drillDowns; + +/** drill-down dimensions */ +Map drillDownDims; Review comment: The idea here is actually a little bit different. What we'd like to do is expose the same inputs available to `buildFacetsResult` inside of `DrillSidewaysResult`. So the idea is that calling code has direct access to the `FacetsCollector` instances, and also knows which dimensions they belong to. So what you did here for the `drillDowns` collector is exactly what I was thinking (except make it `public` so it has visibility to the calling code outside the package). But for the "sideways" collectors, I think we want a `Map` that maps each drill down dimension to the `FacetsCollector` associated with it. Essentially we want calling code to get all the same information that `buildFacetsResult` gets, but not force the user to produce an instance of `Facets` as a result. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] sejal-pawar commented on a change in pull request #159: LUCENE-9945: extend DrillSideways to expose FacetCollector and drill-down dimensions
sejal-pawar commented on a change in pull request #159: URL: https://github.com/apache/lucene/pull/159#discussion_r641856799 ## File path: lucene/facet/src/java/org/apache/lucene/facet/DrillSideways.java ## @@ -335,10 +342,22 @@ protected boolean scoreSubDocsAtOnce() { /** Hits. */ public final TopDocs hits; +/** Collector to compute Facets */ +FacetsCollector drillDowns; + +/** drill-down dimensions */ +Map drillDownDims; Review comment: Oh okay, understood. In that case why not keep `DrillSidewaysResult` consistent with the signature of `buildFacetResult` and have `FacetsCollector drillDowns, FacetsCollector[] drillSideways`, String[] drillSidewaysDims` exposed as is instead of `Map`? Or alternatively is it possible to consider revising signature of `buildFacetResult`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9945) Extend DrillSideways to support exposing FacetCollectors directly
[ https://issues.apache.org/jira/browse/LUCENE-9945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353635#comment-17353635 ] Sejal Pawar commented on LUCENE-9945: - Pull request: https://github.com/apache/lucene/pull/159 > Extend DrillSideways to support exposing FacetCollectors directly > - > > Key: LUCENE-9945 > URL: https://issues.apache.org/jira/browse/LUCENE-9945 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Affects Versions: main (9.0) >Reporter: Greg Miller >Priority: Minor > Time Spent: 20m > Remaining Estimate: 0h > > The {{DrillSideways}} logic currently encapsulates, 1) the creation of > multiple {{FacetsCollector}} instances, and 2) the processing of those > {{FacetsCollectors into a single Facets}} instance. While I suspect this > works well for most common cases, and is simple to understand, it's difficult > to extend to more advanced cases. > I propose extending {{DrillSideways}} to support exposing the underlying > {{FacetsCollector}} instances if the user needs them, in addition to > maintaining the current functionality for all of the more common cases. > Specifically, I'd like to add both the "drill down" {{FacetsCollector}} and > map of dim -> {{FacetsCollector}} for "drill sideways" to the > {{DrillSidewaysResult}} and {{ConcurrentDrillSidewaysResult}} classes. While > it's true that a user can extend {{DrillSideways}} and override > {{buildFacetsResult}} to keep track of these, it seems reasonable to provide > this in {{DrillSideways}} itself so users don't need to sub-class for only > this purpose. > Here are two use-cases illustrating the desire for this: > # For certain cases, instead of actually creating a {{Facets}} instance from > the {{FacetsCollector}}, I'd like to intersect the {{FacetsCollector}} > matching docs with a {{TermEnum}} to do determine whether-or-not at least one > match exists. This is useful for a "facet" that can only take on the values > "true"/"false", and I want to make sure at least one hit has the value "true". > # In another case, I only care about some aggregate statistics for a given > facet field. For example, I want to find the min and max values. For this, I > want to intersect some doc value field with a {{FacetsCollector}} and only > track the min/max values I observe while iterating. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] sejal-pawar commented on a change in pull request #159: LUCENE-9945: extend DrillSideways to expose FacetCollector and drill-down dimensions
sejal-pawar commented on a change in pull request #159: URL: https://github.com/apache/lucene/pull/159#discussion_r641856799 ## File path: lucene/facet/src/java/org/apache/lucene/facet/DrillSideways.java ## @@ -335,10 +342,22 @@ protected boolean scoreSubDocsAtOnce() { /** Hits. */ public final TopDocs hits; +/** Collector to compute Facets */ +FacetsCollector drillDowns; + +/** drill-down dimensions */ +Map drillDownDims; Review comment: Oh okay, understood. In that case why not keep `DrillSidewaysResult` consistent with the signature of `buildFacetResult` and have `FacetsCollector drillDowns, FacetsCollector[] drillSideways, String[] drillSidewaysDims` exposed as is instead of `Map`? Or alternatively is it possible to consider revising signature of `buildFacetResult`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-9981) CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with default maxDeterminizedStates limit
Robert Muir created LUCENE-9981: --- Summary: CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with default maxDeterminizedStates limit Key: LUCENE-9981 URL: https://issues.apache.org/jira/browse/LUCENE-9981 Project: Lucene - Core Issue Type: Task Reporter: Robert Muir

We have a {{maxDeterminizedStates = 10000}} limit designed to keep regexp-type queries from blowing up. But we have an adversary that will run for 268s on my laptop before hitting the exception, first reported here: https://github.com/opensearch-project/OpenSearch/issues/687 When I run the test and jstack the threads, this is what I see:

{noformat}
"TEST-TestOpensearch687.testInteresting-seed#[4B9C20A027A9850C]" #15 prio=5 os_prio=0 cpu=56960.04ms elapsed=57.49s tid=0x7fff7006ca80 nid=0x231c8 runnable [0x7fff8b7f]
   java.lang.Thread.State: RUNNABLE
	at org.apache.lucene.util.automaton.SortedIntSet.decr(SortedIntSet.java:106)
	at org.apache.lucene.util.automaton.Operations.determinize(Operations.java:769)
	at org.apache.lucene.util.automaton.Operations.getCommonSuffixBytesRef(Operations.java:1155)
	at org.apache.lucene.util.automaton.CompiledAutomaton.<init>(CompiledAutomaton.java:247)
	at org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:104)
	at org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:82)
	at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:138)
	at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:114)
	at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:72)
	at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:62)
	at org.apache.lucene.TestOpensearch687.testInteresting(TestOpensearch687.java:42)
{noformat}

This is really sad, as {{getCommonSuffixBytesRef()}} is only supposed to be an "up-front" optimization to make the actual subsequent terms-intensive part of the query faster. But it makes the whole query run for nearly 5 minutes before it does anything. So I definitely think we should improve {{getCommonSuffixBytesRef}} to be more "best-effort". For example, it can reduce the lower bound to {{1000}} and catch the exception like so:

{code}
try {
  // this is slow, and just an opto anyway, so don't burn cycles on it for some crazy worst-case.
  // if we don't set this common suffix, the query will just run a bit slower, that's all.
  int limit = Math.min(1000, maxDeterminizedStates);
  BytesRef suffix = Operations.getCommonSuffixBytesRef(binary, limit);
  ... (setting commonSuffixRef)
} catch (TooComplexToDeterminizeException notWorthIt) {
  commonSuffixRef = null;
}
{code}

Another, maybe simpler option is to just check that the input state/transition counts don't exceed some low limit N. Basically this opto is geared at stuff like a leading wildcard query of "*foo". By computing that the common suffix is "foo" we can spend less CPU in the terms dictionary because we can first do a memcmp before having to run data thru any finite state machine. It's really a microopt and we shouldn't be spending whole seconds of cpu on it, ever. But I still don't quite understand how the current limits are giving the behavior today; maybe there is a bigger issue and I don't want to shove something under the rug.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on a change in pull request #159: LUCENE-9945: extend DrillSideways to expose FacetCollector and drill-down dimensions
gsmiller commented on a change in pull request #159: URL: https://github.com/apache/lucene/pull/159#discussion_r641859710 ## File path: lucene/facet/src/java/org/apache/lucene/facet/DrillSideways.java ## @@ -335,10 +342,22 @@ protected boolean scoreSubDocsAtOnce() { /** Hits. */ public final TopDocs hits; +/** Collector to compute Facets */ +FacetsCollector drillDowns; + +/** drill-down dimensions */ +Map drillDownDims; Review comment: Great question! Both of these are reasonable suggestions with different "ripple effects" I think. Changing the method signature of `buildFacetResult` is possible, but is non-backwards compatible. So there would be some users impacted when upgrading if they've subclassed `DrillSideways` with this method overridden. I thought of putting a `Map` in the response since it seemed like a "friendlier" API for users to work with, but maybe that's a silly thought since `buildFacetsResult` is also effectively a public API. I think it's totally valid to be consistent with the `buildFacetResult` method signature here and provide parallel arrays for the dimensions and associated `FacetsCollector`s. I would suggest adding some nice javadoc to those fields describing them (which I would recommend regardless of the approach I suppose). Thanks again! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
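For clarity, the parallel-arrays shape being discussed might look like the sketch below; this is an illustration of the proposed DrillSidewaysResult fields (names are mine), not the committed API:

```java
import org.apache.lucene.facet.Facets;
import org.apache.lucene.facet.FacetsCollector;
import org.apache.lucene.search.TopDocs;

/** Hypothetical shape of DrillSideways.DrillSidewaysResult after the proposed change. */
public class DrillSidewaysResultSketch {
  /** Combined drill-down and drill-sideways facet results. */
  public final Facets facets;

  /** Hits matching the full drill-down query. */
  public final TopDocs hits;

  /** Collector populated with all hits matching the full drill-down query. */
  public final FacetsCollector drillDownsCollector;

  /** One "sideways" collector per drill-down dimension, parallel to drillSidewaysDims. */
  public final FacetsCollector[] drillSidewaysCollectors;

  /** Dimension names, parallel to drillSidewaysCollectors. */
  public final String[] drillSidewaysDims;

  public DrillSidewaysResultSketch(
      Facets facets,
      TopDocs hits,
      FacetsCollector drillDownsCollector,
      FacetsCollector[] drillSidewaysCollectors,
      String[] drillSidewaysDims) {
    this.facets = facets;
    this.hits = hits;
    this.drillDownsCollector = drillDownsCollector;
    this.drillSidewaysCollectors = drillSidewaysCollectors;
    this.drillSidewaysDims = drillSidewaysDims;
  }
}
```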
[jira] [Updated] (LUCENE-9981) CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with default maxDeterminizedStates limit
[ https://issues.apache.org/jira/browse/LUCENE-9981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-9981: Attachment: LUCENE-9981_test.patch > CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with > default maxDeterminizedStates limit > > > Key: LUCENE-9981 > URL: https://issues.apache.org/jira/browse/LUCENE-9981 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > Attachments: LUCENE-9981_test.patch > > > We have a {{maxDeterminizedStates = 1}} limit designed to keep > regexp-type queries from blowing up. > But we have an adversary that will run for 268s on my laptop before hitting > exception, first reported here: > https://github.com/opensearch-project/OpenSearch/issues/687 > When I run the test and jstack the threads, this what I see: > {noformat} > "TEST-TestOpensearch687.testInteresting-seed#[4B9C20A027A9850C]" #15 prio=5 > os_prio=0 cpu=56960.04ms elapsed=57.49s tid=0x7fff7006ca80 nid=0x231c8 > runnable [0x7fff8b7f] >java.lang.Thread.State: RUNNABLE > at > org.apache.lucene.util.automaton.SortedIntSet.decr(SortedIntSet.java:106) > at > org.apache.lucene.util.automaton.Operations.determinize(Operations.java:769) > at > org.apache.lucene.util.automaton.Operations.getCommonSuffixBytesRef(Operations.java:1155) > at > org.apache.lucene.util.automaton.CompiledAutomaton.(CompiledAutomaton.java:247) > at > org.apache.lucene.search.AutomatonQuery.(AutomatonQuery.java:104) > at > org.apache.lucene.search.AutomatonQuery.(AutomatonQuery.java:82) > at org.apache.lucene.search.RegexpQuery.(RegexpQuery.java:138) > at org.apache.lucene.search.RegexpQuery.(RegexpQuery.java:114) > at org.apache.lucene.search.RegexpQuery.(RegexpQuery.java:72) > at org.apache.lucene.search.RegexpQuery.(RegexpQuery.java:62) > at > org.apache.lucene.TestOpensearch687.testInteresting(TestOpensearch687.java:42) > {noformat} > This is really sad, as {{getCommonSuffixBytesRef()}} is only supposed to be > an "up-front" optimization to make the actual subsequent terms-intensive part > of the query faster. But it makes the whole query run for nearly 5 minutes > before it does anything. > So I definitely think we should improve {{getCommonSuffixBytesRef}} to be > more "best-effort". For example, it can reduce the lower bound to {{1000}} > and catch the exception like such: > {code} > try { >// this is slow, and just an opto anyway, so don't burn cycles on it for > some crazy worst-case. >// if we don't set this common suffix, the query will just run a bit > slower, that's all. >int limit = Math.min(1000, maxDeterminizedStates); >BytesRef suffix = Operations.getCommonSuffixBytesRef(binary, limit); >... (setting commonSuffixRef) > } catch (TooComplexTooDeterminizeException notWorthIt) { > commonSuffixRef = null; > } > {code} > Another, maybe simpler option, is to just check that input state/transitions > accounts don't exceed some low limit N. > Basically this opto is geared at stuff like leading wildcard query of "*foo". > By computing that the common suffix is "foo" we can spend less CPU in the > terms dictionary because we can first do a memcmp before having to run data > thru any finite state machine. It's really a microopt and we shouldn't be > spending whole seconds of cpu on it, ever. > But I still don't quite understand how the current limits are giving the > behavior today, maybe there is a bigger issue and I don't want to shove > something under the rug. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9981) CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with default maxDeterminizedStates limit
[ https://issues.apache.org/jira/browse/LUCENE-9981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353637#comment-17353637 ] Robert Muir commented on LUCENE-9981: - Before we {{det(reverse())}} this automaton to compute the common suffix for the optimization, note that it has: * 52000 states * 84000 transitions I think we should just not even bother to do this optimization if the automaton is "big" (say more than 1,000 states or more than 1,000 transitions). Something like this is a 1-line change: {noformat} --- a/lucene/core/src/java/org/apache/lucene/util/automaton/CompiledAutomaton.java +++ b/lucene/core/src/java/org/apache/lucene/util/automaton/CompiledAutomaton.java @@ -237,7 +237,7 @@ public class CompiledAutomaton implements Accountable { binary = new UTF32ToUTF8().convert(automaton); } -if (this.finite) { +if (this.finite || binary.getNumStates() > 1000 || binary.getNumTransitions() > 1000) { commonSuffixRef = null; } else { // NOTE: this is a very costly operation! We should test if it's really warranted in {noformat} This means the test passes in 5 seconds instead of failing with exception after 268 seconds. I think its a good general fix for this specific method. But still, I think there might be more interesting problems here: 5 seconds still sucks, especially on a 0-document index. And you can be sure I don't want to add a unit test taking 5 seconds of cpu time :) > CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with > default maxDeterminizedStates limit > > > Key: LUCENE-9981 > URL: https://issues.apache.org/jira/browse/LUCENE-9981 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > Attachments: LUCENE-9981_test.patch > > > We have a {{maxDeterminizedStates = 1}} limit designed to keep > regexp-type queries from blowing up. > But we have an adversary that will run for 268s on my laptop before hitting > exception, first reported here: > https://github.com/opensearch-project/OpenSearch/issues/687 > When I run the test and jstack the threads, this what I see: > {noformat} > "TEST-TestOpensearch687.testInteresting-seed#[4B9C20A027A9850C]" #15 prio=5 > os_prio=0 cpu=56960.04ms elapsed=57.49s tid=0x7fff7006ca80 nid=0x231c8 > runnable [0x7fff8b7f] >java.lang.Thread.State: RUNNABLE > at > org.apache.lucene.util.automaton.SortedIntSet.decr(SortedIntSet.java:106) > at > org.apache.lucene.util.automaton.Operations.determinize(Operations.java:769) > at > org.apache.lucene.util.automaton.Operations.getCommonSuffixBytesRef(Operations.java:1155) > at > org.apache.lucene.util.automaton.CompiledAutomaton.(CompiledAutomaton.java:247) > at > org.apache.lucene.search.AutomatonQuery.(AutomatonQuery.java:104) > at > org.apache.lucene.search.AutomatonQuery.(AutomatonQuery.java:82) > at org.apache.lucene.search.RegexpQuery.(RegexpQuery.java:138) > at org.apache.lucene.search.RegexpQuery.(RegexpQuery.java:114) > at org.apache.lucene.search.RegexpQuery.(RegexpQuery.java:72) > at org.apache.lucene.search.RegexpQuery.(RegexpQuery.java:62) > at > org.apache.lucene.TestOpensearch687.testInteresting(TestOpensearch687.java:42) > {noformat} > This is really sad, as {{getCommonSuffixBytesRef()}} is only supposed to be > an "up-front" optimization to make the actual subsequent terms-intensive part > of the query faster. But it makes the whole query run for nearly 5 minutes > before it does anything. > So I definitely think we should improve {{getCommonSuffixBytesRef}} to be > more "best-effort". 
For example, it can reduce the lower bound to {{1000}} > and catch the exception like such: > {code} > try { >// this is slow, and just an opto anyway, so don't burn cycles on it for > some crazy worst-case. >// if we don't set this common suffix, the query will just run a bit > slower, that's all. >int limit = Math.min(1000, maxDeterminizedStates); >BytesRef suffix = Operations.getCommonSuffixBytesRef(binary, limit); >... (setting commonSuffixRef) > } catch (TooComplexTooDeterminizeException notWorthIt) { > commonSuffixRef = null; > } > {code} > Another, maybe simpler option, is to just check that input state/transitions > accounts don't exceed some low limit N. > Basically this opto is geared at stuff like leading wildcard query of "*foo". > By computing that the common suffix is "foo" we can spend less CPU in the > terms dictionary because we can first do a memcmp before having to run data > thru any finite state machine. It's really a microopt and we shouldn't be > spending whole seconds of cpu on it, ever. > B
[jira] [Commented] (LUCENE-9379) Directory based approach for index encryption
[ https://issues.apache.org/jira/browse/LUCENE-9379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353644#comment-17353644 ] Robert Muir commented on LUCENE-9379: - Sorry, the above comment is really wrong. Please see my comments on linked issues. You can definitely manage encryption at multiple levels in the os: * block level * filesystem level Please understand the options available and be educated about this, see: https://www.kernel.org/doc/html/latest/filesystems/fscrypt.html This FS-level crypto subsystem is usable with e.g. ext4 and f2fs filesystems, among others. So you can definitely do different stuff per-directory, which makes multitenant use-cases easily possible (and from my understanding, was the intent of the changes in the first place) I won't drop my {{-1}} vote on this because folks won't read the documentation for their operating system. > Directory based approach for index encryption > - > > Key: LUCENE-9379 > URL: https://issues.apache.org/jira/browse/LUCENE-9379 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Bruno Roustant >Assignee: Bruno Roustant >Priority: Major > Time Spent: 2.5h > Remaining Estimate: 0h > > +Important+: This Lucene Directory wrapper approach is to be considered only > if an OS level encryption is not possible. OS level encryption better fits > Lucene usage of OS cache, and thus is more performant. > But there are some use-case where OS level encryption is not possible. This > Jira issue was created to address those. > > > The goal is to provide optional encryption of the index, with a scope limited > to an encryptable Lucene Directory wrapper. > Encryption is at rest on disk, not in memory. > This simple approach should fit any Codec as it would be orthogonal, without > modifying APIs as much as possible. > Use a standard encryption method. Limit perf/memory impact as much as > possible. > Determine how callers provide encryption keys. They must not be stored on > disk. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9379) Directory based approach for index encryption
[ https://issues.apache.org/jira/browse/LUCENE-9379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353645#comment-17353645 ] Robert Muir commented on LUCENE-9379: - As always, you can count on arch to have some good user-level wiki docs on how to do this: https://wiki.archlinux.org/title/Fscrypt > Directory based approach for index encryption > - > > Key: LUCENE-9379 > URL: https://issues.apache.org/jira/browse/LUCENE-9379 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Bruno Roustant >Assignee: Bruno Roustant >Priority: Major > Time Spent: 2.5h > Remaining Estimate: 0h > > +Important+: This Lucene Directory wrapper approach is to be considered only > if an OS level encryption is not possible. OS level encryption better fits > Lucene usage of OS cache, and thus is more performant. > But there are some use-case where OS level encryption is not possible. This > Jira issue was created to address those. > > > The goal is to provide optional encryption of the index, with a scope limited > to an encryptable Lucene Directory wrapper. > Encryption is at rest on disk, not in memory. > This simple approach should fit any Codec as it would be orthogonal, without > modifying APIs as much as possible. > Use a standard encryption method. Limit perf/memory impact as much as > possible. > Determine how callers provide encryption keys. They must not be stored on > disk. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org