[jira] [Commented] (LUCENE-9204) Move span queries to the queries module

2021-05-28 Thread Jim Ferenczi (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353119#comment-17353119
 ] 

Jim Ferenczi commented on LUCENE-9204:
--

The rewriting of disjunctions that contains duplicate is simple to explain and 
to reason about. It is slower indeed but there's nothing else to compare with 
since it's the key for correctness. Taking aside the difference between spans 
and intervals, I think this example demonstrates how we can approach this kind 
of issue. We have a complex semantic based on minimum intervals that comes with 
limitations. We can try to break these limitations but most of the time that 
will change another layer that expects correctness
So in this case, we apply a simple transformation that doesn't require to 
change the semantic.
I don't think we can/should allow for leniency here, it is correct or it's not. 
We can evaluate the cost to keep correctness in some situations but I don't 
think there's any trade-offs to find. If it's not correct, it's a bug, not a 
feature ;).



> Move span queries to the queries module
> ---
>
> Key: LUCENE-9204
> URL: https://issues.apache.org/jira/browse/LUCENE-9204
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: main (9.0)
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> We have a slightly odd situation currently, with two parallel query 
> structures for building complex positional queries: the long-standing span 
> queries, in core; and interval queries, in the queries module.  Given that 
> interval queries solve at least some of the problems we've had with Spans, I 
> think we should be pushing users more towards these implementations.  It's 
> counter-intuitive to do that when Spans are in core though.  I've opened this 
> issue to discuss moving the spans package as a whole to the queries module.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss commented on pull request #156: LUCENE-9975: don't require signing of 'unsignedJars' publication

2021-05-28 Thread GitBox


dweiss commented on pull request #156:
URL: https://github.com/apache/lucene/pull/156#issuecomment-850298459


   bq. I don't know how this works
   
   It's simple but not straightforward. The two entries under "publications" 
(jars and unsignedJars) cause gradle to create a bunch of convention-named 
additional tasks actually connecting these publications to their target. So, 
when you run 'gradlew tasks' you'll see them and figure out the pattern:
   ```
   generateMetadataFileForJarsPublication - Generates the Gradle metadata file 
for publication 'jars'.
   generateMetadataFileForUnsignedJarsPublication - Generates the Gradle 
metadata file for publication 'unsignedJars'.
   generatePomFileForJarsPublication - Generates the Maven POM file for 
publication 'jars'.
   generatePomFileForUnsignedJarsPublication - Generates the Maven POM file for 
publication 'unsignedJars'.
   mavenToApacheSnapshots - Publish Maven JARs and POMs to Apache Snapshots 
repository: https://repository.apache.org/content/repositories/snapshots
   mavenToLocalFolder - Publish Maven JARs and POMs locally to 
C:\Work\apache\lucene\main\build\maven-local
   mavenToLocalRepo - Publish Maven JARs and POMs to current user's local maven 
repository.
   publish - Publishes all publications produced by this project.
   publishAllPublicationsToApacheSnapshotsRepository - Publishes all Maven 
publications produced by this project to the ApacheSnapshots repository.
   publishAllPublicationsToBuildRepository - Publishes all Maven publications 
produced by this project to the build repository.
   publishJarsPublicationToApacheSnapshotsRepository - Publishes Maven 
publication 'jars' to Maven repository 'ApacheSnapshots'.
   publishJarsPublicationToBuildRepository - Publishes Maven publication 'jars' 
to Maven repository 'build'.
   publishJarsPublicationToMavenLocal - Publishes Maven publication 'jars' to 
the local Maven repository.
   publishToMavenLocal - Publishes all Maven publications produced by this 
project to the local Maven cache.
   publishUnsignedJarsPublicationToApacheSnapshotsRepository - Publishes Maven 
publication 'unsignedJars' to Maven repository 'ApacheSnapshots'.
   publishUnsignedJarsPublicationToBuildRepository - Publishes Maven 
publication 'unsignedJars' to Maven repository 'build'.
   publishUnsignedJarsPublicationToMavenLocal - Publishes Maven publication 
'unsignedJars' to the local Maven repository.
   ```
   
   I personally don't like typing so much so the rest of the code adds links 
between shorter aliases (mavenToLocalFolder, mavenToLocalRepo) and the right 
"convention" task. There is also a bit that configures both publications using 
the same closure to avoid some repetition.
   
   I agree this isn't super-clean but it's how gradle works. Perhaps there is a 
way to make it nicer, I'm not sure.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss merged pull request #156: LUCENE-9975: don't require signing of 'unsignedJars' publication

2021-05-28 Thread GitBox


dweiss merged pull request #156:
URL: https://github.com/apache/lucene/pull/156


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-9975) Don't require artifact signing for local maven artifact publishing

2021-05-28 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss resolved LUCENE-9975.
-
Fix Version/s: main (9.0)
   Resolution: Fixed

> Don't require artifact signing for local maven artifact publishing
> --
>
> Key: LUCENE-9975
> URL: https://issues.apache.org/jira/browse/LUCENE-9975
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Minor
> Fix For: main (9.0)
>
> Attachments: image-2021-05-26-11-58-45-713.png
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9975) Don't require artifact signing for local maven artifact publishing

2021-05-28 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353136#comment-17353136
 ] 

ASF subversion and git services commented on LUCENE-9975:
-

Commit 0a316b24950645cb5aca66b18d59a042ab1819e8 in lucene's branch 
refs/heads/main from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=0a316b2 ]

LUCENE-9975: don't require signing of 'unsignedJars' publication (maven 
artifacts published to the user's maven local repository, build folder and 
apache nexus). (#156)



> Don't require artifact signing for local maven artifact publishing
> --
>
> Key: LUCENE-9975
> URL: https://issues.apache.org/jira/browse/LUCENE-9975
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Minor
> Attachments: image-2021-05-26-11-58-45-713.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9975) Don't require artifact signing for local maven artifact publishing

2021-05-28 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353231#comment-17353231
 ] 

Uwe Schindler commented on LUCENE-9975:
---

Hi,
I removed the {{-x signJarsPublication}} from the command line on Jenkins.
Thanks, Uwe

> Don't require artifact signing for local maven artifact publishing
> --
>
> Key: LUCENE-9975
> URL: https://issues.apache.org/jira/browse/LUCENE-9975
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Minor
> Fix For: main (9.0)
>
> Attachments: image-2021-05-26-11-58-45-713.png
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #128: LUCENE-9662: CheckIndex should be concurrent - parallelizing index parts check within each segment

2021-05-28 Thread GitBox


rmuir commented on pull request #128:
URL: https://github.com/apache/lucene/pull/128#issuecomment-850330270


   > I was finally able to rebuild my local index with wikibigall generating 15 
segments, and performed 2 test runs with different threadCount. With 11 
threadCount, it took 359.293 sec in total to finish, and with 1 threadCount it 
took 378.583 sec in total, so about 5% time saving. I feel faster machine will 
have better time saving, but in general the speed up seems to be limited given 
the skewed distribution of checking speed of different segment parts (e.g. 
posting check can account for around 85% ~ 90% of the total time spent in the 
first segment check).
   
   I really don't think we should add this complexity to checkindex if it only 
improves it 5%. Let's take a step back and do this issue differently (e.g. just 
check whole segments concurrently with different threads). This should also 
require less complexity and be less invasive to the checkindex tool.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #128: LUCENE-9662: CheckIndex should be concurrent - parallelizing index parts check within each segment

2021-05-28 Thread GitBox


rmuir commented on pull request #128:
URL: https://github.com/apache/lucene/pull/128#issuecomment-850335925


   > The python tool looks very cool and thanks for testing it! One issue 
though is that this bit flipping is causing checksum integrity check failures 
before the concurrent segment part checks kick in, so it may not test the 
changes here? I think we may actually need to write a semantically buggy 
segment file with good checksum verification to see the error still gets 
detected and propagated correctly?
   
   This is also why checkindex has its `-fast` option, to only verify file 
checksums, which is probably what most end-users actually want when trying to 
verify that their index is intact and correct. The current (historical) slow 
defaults are really only useful for debugging a bug in lucene itself. And there 
is even a `-slow` that can do even more of that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #128: LUCENE-9662: CheckIndex should be concurrent - parallelizing index parts check within each segment

2021-05-28 Thread GitBox


rmuir commented on pull request #128:
URL: https://github.com/apache/lucene/pull/128#issuecomment-850340290


   I'm gonna throw out the crazy idea to make `-fast` the new default. The 
previous `-slow` could be moved to `-slower` and the previous current behavior 
could be activated by `-slow`.
   
   I think the tool's defaults are unnecessarily slow just for historical 
reasons? (not having checksums originally)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] janhoy merged pull request #136: LUCENE-9589 Swedish Minimal Stemmer

2021-05-28 Thread GitBox


janhoy merged pull request #136:
URL: https://github.com/apache/lucene/pull/136


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-9589) Swedish Minimal Stemmer

2021-05-28 Thread Jira


 [ 
https://issues.apache.org/jira/browse/LUCENE-9589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl resolved LUCENE-9589.
-
Fix Version/s: main (9.0)
   Resolution: Fixed

> Swedish Minimal Stemmer
> ---
>
> Key: LUCENE-9589
> URL: https://issues.apache.org/jira/browse/LUCENE-9589
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Jan Høydahl
>Assignee: Jan Høydahl
>Priority: Major
> Fix For: main (9.0)
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> Swedish has a {{SwedishLightStemmer}} but lacks a Minimal stemmer that would 
> only stem singular/plural.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9589) Swedish Minimal Stemmer

2021-05-28 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353316#comment-17353316
 ] 

ASF subversion and git services commented on LUCENE-9589:
-

Commit 5fdff6eabb1e6452320800805355938ce8b903ff in lucene's branch 
refs/heads/main from Jan Høydahl
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=5fdff6e ]

LUCENE-9589 Swedish Minimal Stemmer (#136)



> Swedish Minimal Stemmer
> ---
>
> Key: LUCENE-9589
> URL: https://issues.apache.org/jira/browse/LUCENE-9589
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Jan Høydahl
>Assignee: Jan Høydahl
>Priority: Major
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> Swedish has a {{SwedishLightStemmer}} but lacks a Minimal stemmer that would 
> only stem singular/plural.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] gerlowskija merged pull request #2501: SOLR-15090: Allow backup storage in GCS

2021-05-28 Thread GitBox


gerlowskija merged pull request #2501:
URL: https://github.com/apache/lucene-solr/pull/2501


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] dweiss commented on pull request #2499: LUCENE-9687: Hunspell support improvements

2021-05-28 Thread GitBox


dweiss commented on pull request #2499:
URL: https://github.com/apache/lucene-solr/pull/2499#issuecomment-850444097


   Hi @donnerpeter ! I think it's a change that is beneficial for folks who use 
Hunspell and if you've already put the effort to backport, why not apply it? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] donnerpeter merged pull request #2499: LUCENE-9687: Hunspell support improvements

2021-05-28 Thread GitBox


donnerpeter merged pull request #2499:
URL: https://github.com/apache/lucene-solr/pull/2499


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9687) Hunspell support improvements

2021-05-28 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353510#comment-17353510
 ] 

ASF subversion and git services commented on LUCENE-9687:
-

Commit c6b964a11576af65091e5f9a3bc819cd34020422 in lucene-solr's branch 
refs/heads/branch_8x from Peter Gromov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c6b964a ]

LUCENE-9687: Hunspell support improvements (#2499)

add API for spell-checking and suggestions, support compound words,
  fix various behavior differences between Java and C++ implementations, 
improve performance

backported from "main" branch

> Hunspell support improvements
> -
>
> Key: LUCENE-9687
> URL: https://issues.apache.org/jira/browse/LUCENE-9687
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Peter Gromov
>Priority: Major
> Fix For: main (9.0)
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> I'd like Lucene's Hunspell support to be on a par with the native C++ 
> Hunspell for spellchecking and suggestions, at least for some languages. So I 
> propose to:
> * support the affix rules necessary for English, German, French, Spanish and
> Russian dictionaries, possibly more languages later
> * mirror Hunspell's suggestion algorithm in Lucene
> * provide a public APIs for spellchecking, suggestion, stemming, 
> morphological data
> * check corpora for specific languages to find and fix 
> spellchecking/suggestion discrepancices between Lucene's implementation and 
> Hunspell/C++



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-9980) Do not expose deleted commits in IndexDeletionPolicy#onCommit

2021-05-28 Thread Nhat Nguyen (Jira)
Nhat Nguyen created LUCENE-9980:
---

 Summary: Do not expose deleted commits in 
IndexDeletionPolicy#onCommit
 Key: LUCENE-9980
 URL: https://issues.apache.org/jira/browse/LUCENE-9980
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/index
Affects Versions: 8.8.1
Reporter: Nhat Nguyen


If we fail to delete files that belong to a commit point, then we will expose 
that deleted commit in the next calls of IndexDeletionPolicy#onCommit(). I 
think we should never expose those deleted commit points as some of their files 
might have been deleted already.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dnhatn opened a new pull request #158: LUCENE-9980: Do not expose deleted commit points in IndexDeletionPolicy

2021-05-28 Thread GitBox


dnhatn opened a new pull request #158:
URL: https://github.com/apache/lucene/pull/158


   If we fail to delete files that belong to a commit point, then we will 
expose that deleted commit in the next calls of 
`IndexDeletionPolicy#onCommit()`. I think we should never expose those deleted 
commit points as some of their files might have been deleted already.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dnhatn commented on pull request #158: LUCENE-9980: Do not expose deleted commit points in IndexDeletionPolicy

2021-05-28 Thread GitBox


dnhatn commented on pull request #158:
URL: https://github.com/apache/lucene/pull/158#issuecomment-850650666


   @mikemccand Would you mind taking a look? Thank you!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gautamworah96 commented on a change in pull request #138: LUCENE-9956: Make getBaseQuery, getDrillDownQueries API from DrillDownQuery public

2021-05-28 Thread GitBox


gautamworah96 commented on a change in pull request #138:
URL: https://github.com/apache/lucene/pull/138#discussion_r641822305



##
File path: lucene/facet/src/java/org/apache/lucene/facet/DrillDownQuery.java
##
@@ -170,15 +184,43 @@ private BooleanQuery getBooleanQuery() {
 return bq.build();
   }
 
-  Query getBaseQuery() {
+  /**
+   * Returns the internal baseQuery of the DrillDownQuery
+   *
+   * @return The baseQuery used on initialization of DrillDownQuery
+   */
+  public Query getBaseQuery() {
 return baseQuery;
   }
 
-  Query[] getDrillDownQueries() {
+  /**
+   * Returns the dimension queries added either via {@link #add(String, 
Query)} or {@link
+   * #add(String, String...)}
+   *
+   * @return The array of dimQueries
+   */
+  public Query[] getDrillDownQueries() {
+if (isAnyDimQueryDirty == false) {
+  // returns previously built dimQueries
+  Query[] builtDimQueriesCopy = new Query[builtDimQueries.size()];
+  return builtDimQueries.toArray(builtDimQueriesCopy);
+}
 Query[] dimQueries = new Query[this.dimQueries.size()];
 for (int i = 0; i < dimQueries.length; ++i) {
-  dimQueries[i] = this.dimQueries.get(i).build();
+  if (isDimQueryDirty.get(i) == true) {

Review comment:
   Hmmm. That makes sense. I've implemented a `Set ` approach in the next commit




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] rmuir commented on pull request #2499: LUCENE-9687: Hunspell support improvements

2021-05-28 Thread GitBox


rmuir commented on pull request #2499:
URL: https://github.com/apache/lucene-solr/pull/2499#issuecomment-850699447


   I'm sorry i've been nonresponsive, been crazy busy, just came up for air 
now. Thank you for backporting this to 8.x! Great improvements to get out there 
to the users now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9945) Extend DrillSideways to support exposing FacetCollectors directly

2021-05-28 Thread Sejal Pawar (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sejal Pawar updated LUCENE-9945:

Attachment: LUCENE-9945.patch

> Extend DrillSideways to support exposing FacetCollectors directly
> -
>
> Key: LUCENE-9945
> URL: https://issues.apache.org/jira/browse/LUCENE-9945
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: main (9.0)
>Reporter: Greg Miller
>Priority: Minor
> Attachments: LUCENE-9945.patch
>
>
> The {{DrillSideways}} logic currently encapsulates, 1) the creation of 
> multiple {{FacetsCollector}} instances, and 2) the processing of those 
> {{FacetsCollectors into a single Facets}} instance. While I suspect this 
> works well for most common cases, and is simple to understand, it's difficult 
> to extend to more advanced cases.
> I propose extending {{DrillSideways}} to support exposing the underlying 
> {{FacetsCollector}} instances if the user needs them, in addition to 
> maintaining the current functionality for all of the more common cases. 
> Specifically, I'd like to add both the "drill down" {{FacetsCollector}} and 
> map of dim -> {{FacetsCollector}} for "drill sideways" to the 
> {{DrillSidewaysResult}} and {{ConcurrentDrillSidewaysResult}} classes. While 
> it's true that a user can extend {{DrillSideways}} and override 
> {{buildFacetsResult}} to keep track of these, it seems reasonable to provide 
> this in {{DrillSideways}} itself so users don't need to sub-class for only 
> this purpose.
> Here are two use-cases illustrating the desire for this:
>  # For certain cases, instead of actually creating a {{Facets}} instance from 
> the {{FacetsCollector}}, I'd like to intersect the {{FacetsCollector}} 
> matching docs with a {{TermEnum}} to do determine whether-or-not at least one 
> match exists. This is useful for a "facet" that can only take on the values 
> "true"/"false", and I want to make sure at least one hit has the value "true".
>  # In another case, I only care about some aggregate statistics for a given 
> facet field. For example, I want to find the min and max values. For this, I 
> want to intersect some doc value field with a {{FacetsCollector}} and only 
> track the min/max values I observe while iterating.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9945) Extend DrillSideways to support exposing FacetCollectors directly

2021-05-28 Thread Sejal Pawar (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sejal Pawar updated LUCENE-9945:

Attachment: (was: LUCENE-9945.patch)

> Extend DrillSideways to support exposing FacetCollectors directly
> -
>
> Key: LUCENE-9945
> URL: https://issues.apache.org/jira/browse/LUCENE-9945
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: main (9.0)
>Reporter: Greg Miller
>Priority: Minor
>
> The {{DrillSideways}} logic currently encapsulates, 1) the creation of 
> multiple {{FacetsCollector}} instances, and 2) the processing of those 
> {{FacetsCollectors into a single Facets}} instance. While I suspect this 
> works well for most common cases, and is simple to understand, it's difficult 
> to extend to more advanced cases.
> I propose extending {{DrillSideways}} to support exposing the underlying 
> {{FacetsCollector}} instances if the user needs them, in addition to 
> maintaining the current functionality for all of the more common cases. 
> Specifically, I'd like to add both the "drill down" {{FacetsCollector}} and 
> map of dim -> {{FacetsCollector}} for "drill sideways" to the 
> {{DrillSidewaysResult}} and {{ConcurrentDrillSidewaysResult}} classes. While 
> it's true that a user can extend {{DrillSideways}} and override 
> {{buildFacetsResult}} to keep track of these, it seems reasonable to provide 
> this in {{DrillSideways}} itself so users don't need to sub-class for only 
> this purpose.
> Here are two use-cases illustrating the desire for this:
>  # For certain cases, instead of actually creating a {{Facets}} instance from 
> the {{FacetsCollector}}, I'd like to intersect the {{FacetsCollector}} 
> matching docs with a {{TermEnum}} to do determine whether-or-not at least one 
> match exists. This is useful for a "facet" that can only take on the values 
> "true"/"false", and I want to make sure at least one hit has the value "true".
>  # In another case, I only care about some aggregate statistics for a given 
> facet field. For example, I want to find the min and max values. For this, I 
> want to intersect some doc value field with a {{FacetsCollector}} and only 
> track the min/max values I observe while iterating.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9964) FacetResult.labelValues.value is not accurate for duplicate labels in a document

2021-05-28 Thread Gautam Worah (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gautam Worah updated LUCENE-9964:
-
Issue Type: Bug  (was: Improvement)

> FacetResult.labelValues.value is not accurate for duplicate labels in a 
> document
> 
>
> Key: LUCENE-9964
> URL: https://issues.apache.org/jira/browse/LUCENE-9964
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/facet
>Affects Versions: 8.8.1
>Reporter: Gautam Worah
>Priority: Minor
>
> As part of a separate [bug|https://github.com/apache/lucene/pull/131] in 
> FacetResult#value we discovered that FacetResult.labelValues.value is not 
> accurate for duplicate labels in a document that uses 
> SortedNumericDocValuesFields.
> In theory, each label should only be counted once from a document when 
> returning the labelValues, but today, each duplicate label in a document is 
> counted uniquely.
> A test case showing the current (inaccurate) output is 
> [here|https://github.com/gautamworah96/lucene/commit/042878117308f76629a27b0bcf83e25f074dc8b1#diff-6fda2d7520edb7e11d4af3f5b6e80c073a0e20790ec42692573137959837d742]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] sejal-pawar opened a new pull request #159: extend DrillSideways to expose FacetCollector and drill-down dimensions

2021-05-28 Thread GitBox


sejal-pawar opened a new pull request #159:
URL: https://github.com/apache/lucene/pull/159


   
   
   # Description
   
   (From [LUCENE-9945](https://issues.apache.org/jira/browse/LUCENE-9945))
   The DrillSideways logic currently encapsulates, 1) the creation of multiple 
FacetsCollector instances, and 2) the processing of those FacetsCollectors into 
a single Facets instance. While I suspect this works well for most common 
cases, and is simple to understand, it's difficult to extend to more advanced 
cases.
   
   LUCENE-9945 proposes extending DrillSideways to support exposing the 
underlying FacetsCollector instances if the user needs them, in addition to 
maintaining the current functionality for all of the more common cases. 
Specifically, I'd like to add both the "drill down" FacetsCollector and map of 
dim -> FacetsCollector for "drill sideways" to the DrillSidewaysResult and 
ConcurrentDrillSidewaysResult classes. While it's possible that a user can 
extend DrillSideways and override buildFacetsResult to keep track of these, it 
seems reasonable to provide this in DrillSideways itself so users don't need to 
sub-class for only this purpose.
   
   # Solution
   
   Modified DrillSideways itself to expose FacetCollector map of dimensions.
   
   # Tests
   
   ./gradlew test
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to 
Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code 
conforms to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request 
title.
   - [x] I have given Lucene maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `main` branch.
   - [x] I have run `./gradlew check`.
   - [ ] I have added tests for my changes.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on a change in pull request #138: LUCENE-9956: Make getBaseQuery, getDrillDownQueries API from DrillDownQuery public

2021-05-28 Thread GitBox


gsmiller commented on a change in pull request #138:
URL: https://github.com/apache/lucene/pull/138#discussion_r641848546



##
File path: lucene/facet/src/java/org/apache/lucene/facet/DrillDownQuery.java
##
@@ -170,16 +185,36 @@ private BooleanQuery getBooleanQuery() {
 return bq.build();
   }
 
-  Query getBaseQuery() {
+  /**
+   * Returns the internal baseQuery of the DrillDownQuery
+   *
+   * @return The baseQuery used on initialization of DrillDownQuery
+   */
+  public Query getBaseQuery() {
 return baseQuery;
   }
 
-  Query[] getDrillDownQueries() {
-Query[] dimQueries = new Query[this.dimQueries.size()];
-for (int i = 0; i < dimQueries.length; ++i) {
-  dimQueries[i] = this.dimQueries.get(i).build();
+  /**
+   * Returns the dimension queries added either via {@link #add(String, 
Query)} or {@link
+   * #add(String, String...)}
+   *
+   * @return The array of dimQueries
+   */
+  public Query[] getDrillDownQueries() {
+if (dirtyDimQueryIndex.isEmpty()) {
+  // returns previously built dimQueries
+  Query[] builtDimQueriesCopy = new Query[builtDimQueries.size()];
+  return builtDimQueries.toArray(builtDimQueriesCopy);
+}
+for (int i = 0; i < this.dimQueries.size(); ++i) {
+  if (dirtyDimQueryIndex.contains(i)) {
+builtDimQueries.set(i, this.dimQueries.get(i).build());
+dirtyDimQueryIndex.remove(i);
+  }
 }
-return dimQueries;
+assert dirtyDimQueryIndex.isEmpty();
+// by this time builtDimQueries has all the built queries and 
dirtyDimQueryIndex is empty
+return getDrillDownQueries();

Review comment:
   What about something like:
   
   ```
 public Query[] getDrillDownQueries() {
   for (Integer dirtyIndex : dirtyDimQueryIndex) {
 builtDimQueries.set(dirtyIndex, dimQueries.get(dirtyIndex).build());
   }
   dirtyDimQueryIndex.clear();
   
   return builtDimQueries.toArray(new Query[0]);
 }
   ```

##
File path: lucene/facet/src/java/org/apache/lucene/facet/DrillDownQuery.java
##
@@ -170,16 +185,36 @@ private BooleanQuery getBooleanQuery() {
 return bq.build();
   }
 
-  Query getBaseQuery() {
+  /**
+   * Returns the internal baseQuery of the DrillDownQuery
+   *
+   * @return The baseQuery used on initialization of DrillDownQuery
+   */
+  public Query getBaseQuery() {
 return baseQuery;
   }
 
-  Query[] getDrillDownQueries() {
-Query[] dimQueries = new Query[this.dimQueries.size()];
-for (int i = 0; i < dimQueries.length; ++i) {
-  dimQueries[i] = this.dimQueries.get(i).build();
+  /**
+   * Returns the dimension queries added either via {@link #add(String, 
Query)} or {@link
+   * #add(String, String...)}
+   *
+   * @return The array of dimQueries
+   */
+  public Query[] getDrillDownQueries() {

Review comment:
   Let's take advantage of the built query caching in `getBooleanQuery` as 
well! (Sorry for missing this earlier). We should be able to replace the 
explicit building of the drill down queries with something like:
   
   ```
   for (Query query : getDrillDownQueries()) {
 bq.add(query, Occur.FILTER);
   }
   ```

##
File path: lucene/facet/src/java/org/apache/lucene/facet/DrillDownQuery.java
##
@@ -53,6 +55,8 @@ public static Term term(String field, String dim, String... 
path) {
   private final Query baseQuery;
   private final List dimQueries = new ArrayList<>();
   private final Map drillDownDims = new LinkedHashMap<>();
+  private List builtDimQueries = new ArrayList<>();
+  private Set dirtyDimQueryIndex = new HashSet<>();

Review comment:
   Can you make these final please?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on a change in pull request #159: LUCENE-9945: extend DrillSideways to expose FacetCollector and drill-down dimensions

2021-05-28 Thread GitBox


gsmiller commented on a change in pull request #159:
URL: https://github.com/apache/lucene/pull/159#discussion_r641852899



##
File path: lucene/facet/src/java/org/apache/lucene/facet/DrillSideways.java
##
@@ -335,10 +342,22 @@ protected boolean scoreSubDocsAtOnce() {
 /** Hits. */
 public final TopDocs hits;
 
+/** Collector to compute Facets */
+FacetsCollector drillDowns;
+
+/** drill-down dimensions */
+Map drillDownDims;

Review comment:
   The idea here is actually a little bit different. What we'd like to do 
is expose the same inputs available to `buildFacetsResult` inside of 
`DrillSidewaysResult`. So the idea is that calling code has direct access to 
the `FacetsCollector` instances, and also knows which dimensions they belong 
to. So what you did here for the `drillDowns` collector is exactly what I was 
thinking (except make it `public` so it has visibility to the calling code 
outside the package). But for the "sideways" collectors, I think we want a 
`Map` that maps each drill down dimension to the 
`FacetsCollector` associated with it. Essentially we want calling code to get 
all the same information that `buildFacetsResult` gets, but not force the user 
to produce an instance of `Facets` as a result.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] sejal-pawar commented on a change in pull request #159: LUCENE-9945: extend DrillSideways to expose FacetCollector and drill-down dimensions

2021-05-28 Thread GitBox


sejal-pawar commented on a change in pull request #159:
URL: https://github.com/apache/lucene/pull/159#discussion_r641856799



##
File path: lucene/facet/src/java/org/apache/lucene/facet/DrillSideways.java
##
@@ -335,10 +342,22 @@ protected boolean scoreSubDocsAtOnce() {
 /** Hits. */
 public final TopDocs hits;
 
+/** Collector to compute Facets */
+FacetsCollector drillDowns;
+
+/** drill-down dimensions */
+Map drillDownDims;

Review comment:
   Oh okay, understood. In that case why not keep `DrillSidewaysResult` 
consistent with the signature of `buildFacetResult` and have `FacetsCollector 
drillDowns, FacetsCollector[] drillSideways`, String[] drillSidewaysDims` 
exposed as is instead of `Map`? Or alternatively is it 
possible to consider revising signature of `buildFacetResult`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9945) Extend DrillSideways to support exposing FacetCollectors directly

2021-05-28 Thread Sejal Pawar (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353635#comment-17353635
 ] 

Sejal Pawar commented on LUCENE-9945:
-

Pull request: https://github.com/apache/lucene/pull/159

> Extend DrillSideways to support exposing FacetCollectors directly
> -
>
> Key: LUCENE-9945
> URL: https://issues.apache.org/jira/browse/LUCENE-9945
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: main (9.0)
>Reporter: Greg Miller
>Priority: Minor
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The {{DrillSideways}} logic currently encapsulates, 1) the creation of 
> multiple {{FacetsCollector}} instances, and 2) the processing of those 
> {{FacetsCollectors into a single Facets}} instance. While I suspect this 
> works well for most common cases, and is simple to understand, it's difficult 
> to extend to more advanced cases.
> I propose extending {{DrillSideways}} to support exposing the underlying 
> {{FacetsCollector}} instances if the user needs them, in addition to 
> maintaining the current functionality for all of the more common cases. 
> Specifically, I'd like to add both the "drill down" {{FacetsCollector}} and 
> map of dim -> {{FacetsCollector}} for "drill sideways" to the 
> {{DrillSidewaysResult}} and {{ConcurrentDrillSidewaysResult}} classes. While 
> it's true that a user can extend {{DrillSideways}} and override 
> {{buildFacetsResult}} to keep track of these, it seems reasonable to provide 
> this in {{DrillSideways}} itself so users don't need to sub-class for only 
> this purpose.
> Here are two use-cases illustrating the desire for this:
>  # For certain cases, instead of actually creating a {{Facets}} instance from 
> the {{FacetsCollector}}, I'd like to intersect the {{FacetsCollector}} 
> matching docs with a {{TermEnum}} to do determine whether-or-not at least one 
> match exists. This is useful for a "facet" that can only take on the values 
> "true"/"false", and I want to make sure at least one hit has the value "true".
>  # In another case, I only care about some aggregate statistics for a given 
> facet field. For example, I want to find the min and max values. For this, I 
> want to intersect some doc value field with a {{FacetsCollector}} and only 
> track the min/max values I observe while iterating.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] sejal-pawar commented on a change in pull request #159: LUCENE-9945: extend DrillSideways to expose FacetCollector and drill-down dimensions

2021-05-28 Thread GitBox


sejal-pawar commented on a change in pull request #159:
URL: https://github.com/apache/lucene/pull/159#discussion_r641856799



##
File path: lucene/facet/src/java/org/apache/lucene/facet/DrillSideways.java
##
@@ -335,10 +342,22 @@ protected boolean scoreSubDocsAtOnce() {
 /** Hits. */
 public final TopDocs hits;
 
+/** Collector to compute Facets */
+FacetsCollector drillDowns;
+
+/** drill-down dimensions */
+Map drillDownDims;

Review comment:
   Oh okay, understood. In that case why not keep `DrillSidewaysResult` 
consistent with the signature of `buildFacetResult` and have `FacetsCollector 
drillDowns, FacetsCollector[] drillSideways, String[] drillSidewaysDims` 
exposed as is instead of `Map`? Or alternatively is it 
possible to consider revising signature of `buildFacetResult`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-9981) CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with default maxDeterminizedStates limit

2021-05-28 Thread Robert Muir (Jira)
Robert Muir created LUCENE-9981:
---

 Summary: CompiledAutomaton.getCommonSuffix can be extraordinarily 
slow, even with default maxDeterminizedStates limit
 Key: LUCENE-9981
 URL: https://issues.apache.org/jira/browse/LUCENE-9981
 Project: Lucene - Core
  Issue Type: Task
Reporter: Robert Muir


We have a {{maxDeterminizedStates = 1}} limit designed to keep regexp-type 
queries from blowing up. 

But we have an adversary that will run for 268s on my laptop before hitting 
exception, first reported here: 
https://github.com/opensearch-project/OpenSearch/issues/687

When I run the test and jstack the threads, this what I see:
{noformat}
"TEST-TestOpensearch687.testInteresting-seed#[4B9C20A027A9850C]" #15 prio=5 
os_prio=0 cpu=56960.04ms elapsed=57.49s tid=0x7fff7006ca80 nid=0x231c8 
runnable  [0x7fff8b7f]
   java.lang.Thread.State: RUNNABLE
at 
org.apache.lucene.util.automaton.SortedIntSet.decr(SortedIntSet.java:106)
at 
org.apache.lucene.util.automaton.Operations.determinize(Operations.java:769)
at 
org.apache.lucene.util.automaton.Operations.getCommonSuffixBytesRef(Operations.java:1155)
at 
org.apache.lucene.util.automaton.CompiledAutomaton.(CompiledAutomaton.java:247)
at 
org.apache.lucene.search.AutomatonQuery.(AutomatonQuery.java:104)
at 
org.apache.lucene.search.AutomatonQuery.(AutomatonQuery.java:82)
at org.apache.lucene.search.RegexpQuery.(RegexpQuery.java:138)
at org.apache.lucene.search.RegexpQuery.(RegexpQuery.java:114)
at org.apache.lucene.search.RegexpQuery.(RegexpQuery.java:72)
at org.apache.lucene.search.RegexpQuery.(RegexpQuery.java:62)
at 
org.apache.lucene.TestOpensearch687.testInteresting(TestOpensearch687.java:42)
{noformat}

This is really sad, as {{getCommonSuffixBytesRef()}} is only supposed to be an 
"up-front" optimization to make the actual subsequent terms-intensive part of 
the query faster. But it makes the whole query run for nearly 5 minutes before 
it does anything.

So I definitely think we should improve {{getCommonSuffixBytesRef}} to be more 
"best-effort". For example, it can reduce the lower bound to {{1000}} and catch 
the exception like such:

{code}
try {
   // this is slow, and just an opto anyway, so don't burn cycles on it for 
some crazy worst-case.
   // if we don't set this common suffix, the query will just run a bit slower, 
that's all.
   int limit = Math.min(1000, maxDeterminizedStates);
   BytesRef suffix = Operations.getCommonSuffixBytesRef(binary, limit);
   ... (setting commonSuffixRef)
} catch (TooComplexTooDeterminizeException notWorthIt) {
  commonSuffixRef = null;
}
{code}

Another, maybe simpler option, is to just check that input state/transitions 
accounts don't exceed some low limit N.

Basically this opto is geared at stuff like leading wildcard query of "*foo". 
By computing that the common suffix is "foo" we can spend less CPU in the terms 
dictionary because we can first do a memcmp before having to run data thru any 
finite state machine. It's really a microopt and we shouldn't be spending whole 
seconds of cpu on it, ever.

But I still don't quite understand how the current limits are giving the 
behavior today, maybe there is a bigger issue and I don't want to shove 
something under the rug.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on a change in pull request #159: LUCENE-9945: extend DrillSideways to expose FacetCollector and drill-down dimensions

2021-05-28 Thread GitBox


gsmiller commented on a change in pull request #159:
URL: https://github.com/apache/lucene/pull/159#discussion_r641859710



##
File path: lucene/facet/src/java/org/apache/lucene/facet/DrillSideways.java
##
@@ -335,10 +342,22 @@ protected boolean scoreSubDocsAtOnce() {
 /** Hits. */
 public final TopDocs hits;
 
+/** Collector to compute Facets */
+FacetsCollector drillDowns;
+
+/** drill-down dimensions */
+Map drillDownDims;

Review comment:
   Great question! Both of these are reasonable suggestions with different 
"ripple effects" I think. Changing the method signature of `buildFacetResult` 
is possible, but is non-backwards compatible. So there would be some users 
impacted when upgrading if they've subclassed `DrillSideways` with this method 
overridden. I thought of putting a `Map` in the response since it seemed like a 
"friendlier" API for users to work with, but maybe that's a silly thought since 
`buildFacetsResult` is also effectively a public API. I think it's totally 
valid to be consistent with the `buildFacetResult` method signature here and 
provide parallel arrays for the dimensions and associated `FacetsCollector`s. I 
would suggest adding some nice javadoc to those fields describing them (which I 
would recommend regardless of the approach I suppose). Thanks again!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9981) CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with default maxDeterminizedStates limit

2021-05-28 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-9981:

Attachment: LUCENE-9981_test.patch

> CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with 
> default maxDeterminizedStates limit
> 
>
> Key: LUCENE-9981
> URL: https://issues.apache.org/jira/browse/LUCENE-9981
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-9981_test.patch
>
>
> We have a {{maxDeterminizedStates = 1}} limit designed to keep 
> regexp-type queries from blowing up. 
> But we have an adversary that will run for 268s on my laptop before hitting 
> exception, first reported here: 
> https://github.com/opensearch-project/OpenSearch/issues/687
> When I run the test and jstack the threads, this what I see:
> {noformat}
> "TEST-TestOpensearch687.testInteresting-seed#[4B9C20A027A9850C]" #15 prio=5 
> os_prio=0 cpu=56960.04ms elapsed=57.49s tid=0x7fff7006ca80 nid=0x231c8 
> runnable  [0x7fff8b7f]
>java.lang.Thread.State: RUNNABLE
>   at 
> org.apache.lucene.util.automaton.SortedIntSet.decr(SortedIntSet.java:106)
>   at 
> org.apache.lucene.util.automaton.Operations.determinize(Operations.java:769)
>   at 
> org.apache.lucene.util.automaton.Operations.getCommonSuffixBytesRef(Operations.java:1155)
>   at 
> org.apache.lucene.util.automaton.CompiledAutomaton.(CompiledAutomaton.java:247)
>   at 
> org.apache.lucene.search.AutomatonQuery.(AutomatonQuery.java:104)
>   at 
> org.apache.lucene.search.AutomatonQuery.(AutomatonQuery.java:82)
>   at org.apache.lucene.search.RegexpQuery.(RegexpQuery.java:138)
>   at org.apache.lucene.search.RegexpQuery.(RegexpQuery.java:114)
>   at org.apache.lucene.search.RegexpQuery.(RegexpQuery.java:72)
>   at org.apache.lucene.search.RegexpQuery.(RegexpQuery.java:62)
>   at 
> org.apache.lucene.TestOpensearch687.testInteresting(TestOpensearch687.java:42)
> {noformat}
> This is really sad, as {{getCommonSuffixBytesRef()}} is only supposed to be 
> an "up-front" optimization to make the actual subsequent terms-intensive part 
> of the query faster. But it makes the whole query run for nearly 5 minutes 
> before it does anything.
> So I definitely think we should improve {{getCommonSuffixBytesRef}} to be 
> more "best-effort". For example, it can reduce the lower bound to {{1000}} 
> and catch the exception like such:
> {code}
> try {
>// this is slow, and just an opto anyway, so don't burn cycles on it for 
> some crazy worst-case.
>// if we don't set this common suffix, the query will just run a bit 
> slower, that's all.
>int limit = Math.min(1000, maxDeterminizedStates);
>BytesRef suffix = Operations.getCommonSuffixBytesRef(binary, limit);
>... (setting commonSuffixRef)
> } catch (TooComplexTooDeterminizeException notWorthIt) {
>   commonSuffixRef = null;
> }
> {code}
> Another, maybe simpler option, is to just check that input state/transitions 
> accounts don't exceed some low limit N.
> Basically this opto is geared at stuff like leading wildcard query of "*foo". 
> By computing that the common suffix is "foo" we can spend less CPU in the 
> terms dictionary because we can first do a memcmp before having to run data 
> thru any finite state machine. It's really a microopt and we shouldn't be 
> spending whole seconds of cpu on it, ever.
> But I still don't quite understand how the current limits are giving the 
> behavior today, maybe there is a bigger issue and I don't want to shove 
> something under the rug.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9981) CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with default maxDeterminizedStates limit

2021-05-28 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353637#comment-17353637
 ] 

Robert Muir commented on LUCENE-9981:
-

Before we {{det(reverse())}} this automaton to compute the common suffix for 
the optimization, note that it has:
* 52000 states
* 84000 transitions

I think we should just not even bother to do this optimization if the automaton 
is "big" (say more than 1,000 states or more than 1,000 transitions). 

Something like this is a 1-line change:
{noformat}
--- 
a/lucene/core/src/java/org/apache/lucene/util/automaton/CompiledAutomaton.java
+++ 
b/lucene/core/src/java/org/apache/lucene/util/automaton/CompiledAutomaton.java
@@ -237,7 +237,7 @@ public class CompiledAutomaton implements Accountable {
   binary = new UTF32ToUTF8().convert(automaton);
 }

-if (this.finite) {
+if (this.finite || binary.getNumStates() > 1000 || 
binary.getNumTransitions() > 1000) {
   commonSuffixRef = null;
 } else {
   // NOTE: this is a very costly operation!  We should test if it's really 
warranted in
{noformat}

This means the test passes in 5 seconds instead of failing with exception after 
268 seconds. I think its a good general fix for this specific method.

But still, I think there might be more interesting problems here: 5 seconds 
still sucks, especially on a 0-document index. And you can be sure I don't want 
to add a unit test taking 5 seconds of cpu time :)



> CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with 
> default maxDeterminizedStates limit
> 
>
> Key: LUCENE-9981
> URL: https://issues.apache.org/jira/browse/LUCENE-9981
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-9981_test.patch
>
>
> We have a {{maxDeterminizedStates = 1}} limit designed to keep 
> regexp-type queries from blowing up. 
> But we have an adversary that will run for 268s on my laptop before hitting 
> exception, first reported here: 
> https://github.com/opensearch-project/OpenSearch/issues/687
> When I run the test and jstack the threads, this what I see:
> {noformat}
> "TEST-TestOpensearch687.testInteresting-seed#[4B9C20A027A9850C]" #15 prio=5 
> os_prio=0 cpu=56960.04ms elapsed=57.49s tid=0x7fff7006ca80 nid=0x231c8 
> runnable  [0x7fff8b7f]
>java.lang.Thread.State: RUNNABLE
>   at 
> org.apache.lucene.util.automaton.SortedIntSet.decr(SortedIntSet.java:106)
>   at 
> org.apache.lucene.util.automaton.Operations.determinize(Operations.java:769)
>   at 
> org.apache.lucene.util.automaton.Operations.getCommonSuffixBytesRef(Operations.java:1155)
>   at 
> org.apache.lucene.util.automaton.CompiledAutomaton.(CompiledAutomaton.java:247)
>   at 
> org.apache.lucene.search.AutomatonQuery.(AutomatonQuery.java:104)
>   at 
> org.apache.lucene.search.AutomatonQuery.(AutomatonQuery.java:82)
>   at org.apache.lucene.search.RegexpQuery.(RegexpQuery.java:138)
>   at org.apache.lucene.search.RegexpQuery.(RegexpQuery.java:114)
>   at org.apache.lucene.search.RegexpQuery.(RegexpQuery.java:72)
>   at org.apache.lucene.search.RegexpQuery.(RegexpQuery.java:62)
>   at 
> org.apache.lucene.TestOpensearch687.testInteresting(TestOpensearch687.java:42)
> {noformat}
> This is really sad, as {{getCommonSuffixBytesRef()}} is only supposed to be 
> an "up-front" optimization to make the actual subsequent terms-intensive part 
> of the query faster. But it makes the whole query run for nearly 5 minutes 
> before it does anything.
> So I definitely think we should improve {{getCommonSuffixBytesRef}} to be 
> more "best-effort". For example, it can reduce the lower bound to {{1000}} 
> and catch the exception like such:
> {code}
> try {
>// this is slow, and just an opto anyway, so don't burn cycles on it for 
> some crazy worst-case.
>// if we don't set this common suffix, the query will just run a bit 
> slower, that's all.
>int limit = Math.min(1000, maxDeterminizedStates);
>BytesRef suffix = Operations.getCommonSuffixBytesRef(binary, limit);
>... (setting commonSuffixRef)
> } catch (TooComplexTooDeterminizeException notWorthIt) {
>   commonSuffixRef = null;
> }
> {code}
> Another, maybe simpler option, is to just check that input state/transitions 
> accounts don't exceed some low limit N.
> Basically this opto is geared at stuff like leading wildcard query of "*foo". 
> By computing that the common suffix is "foo" we can spend less CPU in the 
> terms dictionary because we can first do a memcmp before having to run data 
> thru any finite state machine. It's really a microopt and we shouldn't be 
> spending whole seconds of cpu on it, ever.
> B

[jira] [Commented] (LUCENE-9379) Directory based approach for index encryption

2021-05-28 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353644#comment-17353644
 ] 

Robert Muir commented on LUCENE-9379:
-

Sorry, the above comment is really wrong. Please see my comments on linked 
issues.

You can definitely manage encryption at multiple levels in the os:
* block level
* filesystem level

Please understand the options available and be educated about this, see: 
https://www.kernel.org/doc/html/latest/filesystems/fscrypt.html
This FS-level crypto subsystem is usable with e.g. ext4 and f2fs filesystems, 
among others. So you can definitely do different stuff per-directory, which 
makes multitenant use-cases easily possible (and from my understanding, was the 
intent of the changes in the first place)

I won't drop my {{-1}} vote on this because folks won't read the documentation 
for their operating system.


> Directory based approach for index encryption
> -
>
> Key: LUCENE-9379
> URL: https://issues.apache.org/jira/browse/LUCENE-9379
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Bruno Roustant
>Assignee: Bruno Roustant
>Priority: Major
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> +Important+: This Lucene Directory wrapper approach is to be considered only 
> if an OS level encryption is not possible. OS level encryption better fits 
> Lucene usage of OS cache, and thus is more performant.
> But there are some use-case where OS level encryption is not possible. This 
> Jira issue was created to address those.
> 
>  
> The goal is to provide optional encryption of the index, with a scope limited 
> to an encryptable Lucene Directory wrapper.
> Encryption is at rest on disk, not in memory.
> This simple approach should fit any Codec as it would be orthogonal, without 
> modifying APIs as much as possible.
> Use a standard encryption method. Limit perf/memory impact as much as 
> possible.
> Determine how callers provide encryption keys. They must not be stored on 
> disk.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9379) Directory based approach for index encryption

2021-05-28 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353645#comment-17353645
 ] 

Robert Muir commented on LUCENE-9379:
-

As always, you can count on arch to have some good user-level wiki docs on how 
to do this: https://wiki.archlinux.org/title/Fscrypt


> Directory based approach for index encryption
> -
>
> Key: LUCENE-9379
> URL: https://issues.apache.org/jira/browse/LUCENE-9379
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Bruno Roustant
>Assignee: Bruno Roustant
>Priority: Major
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> +Important+: This Lucene Directory wrapper approach is to be considered only 
> if an OS level encryption is not possible. OS level encryption better fits 
> Lucene usage of OS cache, and thus is more performant.
> But there are some use-case where OS level encryption is not possible. This 
> Jira issue was created to address those.
> 
>  
> The goal is to provide optional encryption of the index, with a scope limited 
> to an encryptable Lucene Directory wrapper.
> Encryption is at rest on disk, not in memory.
> This simple approach should fit any Codec as it would be orthogonal, without 
> modifying APIs as much as possible.
> Use a standard encryption method. Limit perf/memory impact as much as 
> possible.
> Determine how callers provide encryption keys. They must not be stored on 
> disk.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org