[GitHub] [lucene] jpountz commented on a change in pull request #413: LUCENE-9614: Fix KnnVectorQuery failure when numDocs is 0

2021-10-27 Thread GitBox


jpountz commented on a change in pull request #413:
URL: https://github.com/apache/lucene/pull/413#discussion_r737165184



##
File path: lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java
##
@@ -25,18 +25,10 @@
 import java.io.IOException;
 import java.util.HashSet;
 import java.util.Set;
-import org.apache.lucene.document.Document;
-import org.apache.lucene.document.Field;
-import org.apache.lucene.document.KnnVectorField;
-import org.apache.lucene.document.StringField;
-import org.apache.lucene.index.DirectoryReader;
-import org.apache.lucene.index.IndexReader;
-import org.apache.lucene.index.IndexWriter;
-import org.apache.lucene.index.IndexWriterConfig;
-import org.apache.lucene.index.RandomIndexWriter;
-import org.apache.lucene.index.Term;
-import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.document.*;
+import org.apache.lucene.index.*;

Review comment:
   Oh, I thought we failed the build on wildcard imports, but apparently we 
don't. Maybe still use explicit imports to reduce line changes of this PR?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz merged pull request #415: LUCENE-10206 Implement O(1) count on query cache

2021-10-27 Thread GitBox


jpountz merged pull request #415:
URL: https://github.com/apache/lucene/pull/415


   





[jira] [Commented] (LUCENE-10206) Implement O(1) count on query cache

2021-10-27 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17434733#comment-17434733
 ] 

ASF subversion and git services commented on LUCENE-10206:
--

Commit 941df98c3f718371af4702c92bf6537739120064 in lucene's branch 
refs/heads/main from Nik Everett
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=941df98 ]

LUCENE-10206 Implement O(1) count on query cache (#415)

When we load a query into the query cache we always calculate the count
of matching documents. This uses that count to power the new `O(1)`
`Weight#count` method.

> Implement O(1) count on query cache
> ---
>
> Key: LUCENE-10206
> URL: https://issues.apache.org/jira/browse/LUCENE-10206
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Nik Everett
>Priority: Minor
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> I'd like to implement the `Weight#count` method in `LRUQueryCache` so cached 
> queries can quickly return their counts. We already have a count on all of 
> the bit sets we use for the query cache; we just have to store it and "plug it 
> in".
>  
> I got here because we frequently end up wanting to get counts and I saw 
> `RoaringDocIdSet`'s iterator showing up as a hot spot. I don't think it's slow or 
> anything, but when the collector is just `count++` the iterator's cost is 
> substantial. It seems like we could frequently avoid the whole thing by 
> implementing `count` in the query cache.
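
The idea in the description above can be sketched roughly as follows. This is a minimal illustration in plain Java, not Lucene's `LRUQueryCache`: the names `CachedMatches`, `CountingQueryCache` and the string cache key are hypothetical, and returning -1 for "not cached" is just this sketch's convention. The point is only that the cardinality is computed once when the entry is loaded, so a later count is a field read rather than an iteration.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for one cached query result; not a Lucene class.
final class CachedMatches {
  final BitSet docs; // matching doc IDs
  final int count;   // cardinality, computed once when the entry is created

  CachedMatches(BitSet docs) {
    // Copy so later changes to the caller's BitSet cannot make the count stale.
    this.docs = (BitSet) docs.clone();
    this.count = this.docs.cardinality();
  }
}

// Hypothetical cache keyed by a query string, for illustration only.
final class CountingQueryCache {
  private final Map<String, CachedMatches> cache = new HashMap<>();

  void put(String queryKey, BitSet matches) {
    cache.put(queryKey, new CachedMatches(matches));
  }

  /** Returns the stored count in O(1), or -1 if the query is not cached. */
  int count(String queryKey) {
    CachedMatches entry = cache.get(queryKey);
    return entry == null ? -1 : entry.count;
  }
}
```

A caller that gets -1 back would fall back to iterating the matches as before.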



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[GitHub] [lucene-solr] janhoy commented on a change in pull request #2594: SOLR-14726: Initial draft of a new quickstart guide

2021-10-27 Thread GitBox


janhoy commented on a change in pull request #2594:
URL: https://github.com/apache/lucene-solr/pull/2594#discussion_r737229976



##
File path: solr/solr-ref-guide/src/quickstart.adoc
##
@@ -0,0 +1,140 @@
+= Quickstart Guide
+:experimental:
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+Here's a quickstart guide to start Solr, add some documents and perform some 
searches.
+
+== Starting Solr
+
+Start a Solr node in cluster mode (SolrCloud mode)
+
+[source,subs="verbatim,attributes+"]
+
+$ bin/solr -c
+
+Waiting up to 180 seconds to see Solr running on port 8983 [\]
+Started Solr server on port 8983 (pid=34942). Happy searching!
+
+
+To start another Solr node and have it join the cluster alongside the first 
node,
+
+[source,subs="verbatim,attributes+"]
+
+$ bin/solr -c -z localhost:9983 -p 8984
+
+
+An instance of the cluster coordination service, i.e. Zookeeper, was started 
on port 9983 when the first node was started. To start Zookeeper separately, 
please refer to .
+
+== Creating a collection
+
+Like a database system holds data in tables, Solr holds data in collections. A 
collection can be created as follows:
+
+[source,subs="verbatim,attributes+"]
+
+$ curl --request POST \
+  --url http://localhost:8983/api/collections \
+  --header 'Content-Type: application/json' \
+  --data '{
+   "create": {
+   "name": "techproducts",
+   "numShards": 1,
+   "replicationFactor": 1

Review comment:
   I thought the same. If the consensus is that we're moving away from field 
guessing, then we should not promote the current _default config, but rather be 
explicit and reference the bundled `techproducts` configset. Or better, show 
them how to use Schema Designer to set up a configset for a certain dataset?







[jira] [Resolved] (LUCENE-10206) Implement O(1) count on query cache

2021-10-27 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10206.
---
Fix Version/s: main (9.0)
   Resolution: Fixed

> Implement O(1) count on query cache
> ---
>
> Key: LUCENE-10206
> URL: https://issues.apache.org/jira/browse/LUCENE-10206
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Nik Everett
>Priority: Minor
> Fix For: main (9.0)
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> I'd like to implement the `Weight#count` method in `LRUQueryCache` so cached 
> queries can quickly return their counts. We already have a count on all of 
> the bit sets we use for the query cache; we just have to store it and "plug it 
> in".
>  
> I got here because we frequently end up wanting to get counts and I saw 
> `RoaringDocIdSet`'s iterator showing up as a hot spot. I don't think it's slow or 
> anything, but when the collector is just `count++` the iterator's cost is 
> substantial. It seems like we could frequently avoid the whole thing by 
> implementing `count` in the query cache.






[jira] [Commented] (LUCENE-10207) Make TermInSetQuery usable with IndexOrDocValuesQuery

2021-10-27 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17434769#comment-17434769
 ] 

Michael McCandless commented on LUCENE-10207:
-

I love this idea!  Using the aggregate term statistics already in the index to 
efficiently guesstimate the cost on the index side of things.  The user can 
always override the decision if they know something is unusual about their 
index?  (Hmm, maybe not – looks like the logic is hardcoded deep inside an 
anonymous {{ScorerSupplier}} in {{IoDVQ}}).

Should we try to take deletions into account at all?  Because a PK field with 
deletions will look like it is not "precisely" PK based on the aggregate stats. 
 Though I suppose even with e.g. 50% deletions in the index, this proposed cost 
metric is close enough.

> Make TermInSetQuery usable with IndexOrDocValuesQuery
> -
>
> Key: LUCENE-10207
> URL: https://issues.apache.org/jira/browse/LUCENE-10207
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> IndexOrDocValuesQuery is very useful to pick the right execution mode for a 
> query depending on other bits of the query tree.
> We would like to be able to use it to optimize execution of TermInSetQuery. 
> However IndexOrDocValuesQuery only works well if the "index" query can give 
> an estimation of the cost of the query without doing anything expensive (like 
> looking up all terms of the TermInSetQuery in the terms dict). Maybe we could 
> implement it for primary keys (terms.size() == sumDocFreq) by returning the 
> number of terms of the query? Another idea is to multiply the number of terms 
> by the average postings length, though this could be dangerous if the field 
> has a zipfian distribution and some terms have a much higher doc frequency 
> than the average.
> [~romseygeek] and I were discussing this a few weeks ago, and more recently 
> [~mikemccand] and [~gsmiller] again independently. So it looks like there is 
> interest in this. Here is an email thread where this was recently discussed: 
> https://lists.apache.org/thread.html/re3b20a486c9a4e66b2ca4a2646e2d3be48535a90cdd95911a8445183%40%3Cdev.lucene.apache.org%3E.
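
The two estimation ideas in the description above can be written down as a small sketch. This is plain Java under stated assumptions, not code from Lucene: `TermInSetCostSketch` and `estimateCost` are hypothetical names, and the inputs are assumed to be the field-level statistics the description mentions (`terms.size()` and `sumDocFreq`). It does not account for deletions or for skewed (zipfian) term distributions, which the discussion flags as open concerns.

```java
// Hypothetical helper illustrating the two cost heuristics; not a Lucene API.
final class TermInSetCostSketch {

  /**
   * @param queryTermCount number of terms in the query
   * @param fieldTermCount number of unique terms in the field (terms.size())
   * @param sumDocFreq     sum of document frequencies over all terms of the field
   * @return an estimated number of matching documents
   */
  static long estimateCost(long queryTermCount, long fieldTermCount, long sumDocFreq) {
    if (fieldTermCount <= 0 || sumDocFreq < fieldTermCount) {
      throw new IllegalArgumentException("field statistics are missing or inconsistent");
    }
    if (fieldTermCount == sumDocFreq) {
      // Primary-key-like field: every term matches exactly one document.
      return queryTermCount;
    }
    // Otherwise multiply by the average postings length, rounded up.
    long avgPostingsLength = (sumDocFreq + fieldTermCount - 1) / fieldTermCount;
    return queryTermCount * avgPostingsLength;
  }
}
```

For example, `estimateCost(3, 100, 100)` treats the field as a primary key and returns 3, while `estimateCost(3, 100, 250)` uses an average postings length of 3 (2.5 rounded up) and returns 9.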






[jira] [Commented] (LUCENE-9660) gradle task cache should not cache --tests

2021-10-27 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17434803#comment-17434803
 ] 

ASF subversion and git services commented on LUCENE-9660:
-

Commit 486141f0eb01c892dbeeed67060b5b4adc77d38d in lucene's branch 
refs/heads/hnsw from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=486141f ]

LUCENE-9660: correct help/tests.txt.


> gradle task cache should not cache --tests
> --
>
> Key: LUCENE-9660
> URL: https://issues.apache.org/jira/browse/LUCENE-9660
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: general/build
>Reporter: David Smiley
>Assignee: Dawid Weiss
>Priority: Minor
> Fix For: main (9.0)
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I recently ran a specific test at the CLI via gradle to see if a particular 
> build failure repeats.  It includes the {{--tests}} command line option to 
> specify the test.  The test passed.  Later I wanted to run it again; I 
> suspected it might be flakey.  Gradle completed in 10 seconds, and I'm 
> certain it didn't actually run the test. There was no printout and the 
> build/test-results/test/outputs/...  from the test run still had not changed 
> from previously.
> Mike Drob informed me of "gradlew cleanTest" but I'd prefer to not have to 
> know about that, at least not for the specific case of wanting to execute a 
> specific test.
> CC [~dweiss]






[jira] [Commented] (LUCENE-10163) Review top-level *.txt and *.md files

2021-10-27 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17434808#comment-17434808
 ] 

ASF subversion and git services commented on LUCENE-10163:
--

Commit 1613355149e5fc11d0804b457742f5862e843ae2 in lucene's branch 
refs/heads/hnsw from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=1613355 ]

LUCENE-10163: update smoke tester - README inside lucene/ is no longer there in 
the source release.


> Review top-level *.txt and *.md files
> -
>
> Key: LUCENE-10163
> URL: https://issues.apache.org/jira/browse/LUCENE-10163
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Dawid Weiss
>Priority: Major
> Fix For: main (9.0)
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Some of them contain obsolete pointers and information 
> (SYSTEM_REQUIREMENTS.md, etc.).
> Also, move the files that are distribution-specific (lucene/README.md) to the 
> distribution project. Otherwise they
> give odd, incorrect information like:
> {code}
> To review the documentation, read the main documentation page, located at: 
> `docs/index.html` 
> {code}






[jira] [Commented] (LUCENE-9660) gradle task cache should not cache --tests

2021-10-27 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17434801#comment-17434801
 ] 

ASF subversion and git services commented on LUCENE-9660:
-

Commit 81f5b4d6423958890876bd755e4ed68c73fbb612 in lucene's branch 
refs/heads/hnsw from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=81f5b4d ]

LUCENE-9660: add tests.neverUpToDate=true option which, by default, makes test 
tasks always execute. (#410)



> gradle task cache should not cache --tests
> --
>
> Key: LUCENE-9660
> URL: https://issues.apache.org/jira/browse/LUCENE-9660
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: general/build
>Reporter: David Smiley
>Assignee: Dawid Weiss
>Priority: Minor
> Fix For: main (9.0)
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I recently ran a specific test at the CLI via gradle to see if a particular 
> build failure repeats.  It includes the {{--tests}} command line option to 
> specify the test.  The test passed.  Later I wanted to run it again; I 
> suspected it might be flakey.  Gradle completed in 10 seconds, and I'm 
> certain it didn't actually run the test. There was no printout and the 
> build/test-results/test/outputs/...  from the test run still had not changed 
> from previously.
> Mike Drob informed me of "gradlew cleanTest" but I'd prefer to not have to 
> know about that, at least not for the specific case of wanting to execute a 
> specific test.
> CC [~dweiss]






[jira] [Commented] (LUCENE-10198) Allow external JAVA_OPTS in gradlew scripts; use sane defaults (heap, stack and system proxies)

2021-10-27 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17434807#comment-17434807
 ] 

ASF subversion and git services commented on LUCENE-10198:
--

Commit 4329450392f11303fdd8ed5352d9cfffca8dc8c1 in lucene's branch 
refs/heads/hnsw from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=4329450 ]

LUCENE-10198: remove debug statement that crept in.


> Allow external JAVA_OPTS in gradlew scripts; use sane defaults (heap, stack 
> and system proxies)
> ---
>
> Key: LUCENE-10198
> URL: https://issues.apache.org/jira/browse/LUCENE-10198
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Major
> Fix For: main (9.0)
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>







[jira] [Commented] (LUCENE-10154) NumericLeafComparator to define getPointValues

2021-10-27 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17434802#comment-17434802
 ] 

ASF subversion and git services commented on LUCENE-10154:
--

Commit 2ed6e4aa78eb6d1fbb90c21c9723313ab5077e83 in lucene's branch 
refs/heads/hnsw from Mayya Sharipova
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=2ed6e4a ]

LUCENE-10154 NumericLeafComparator to define getPointValues (#364)

This patch adds getPointValues to NumericLeafComparator, similar to how it
has getNumericDocValues.

Numeric Sort optimization with points relies on the assumption that
points and doc values record the same information, as we substitute
iterator over doc_values with one over points.

If we override getNumericDocValues it almost certainly means that whatever
PointValues NumericComparator is going to look at shouldn't be used to
skip non-competitive documents. Returning null for pointValues in this
case will force comparator NOT to use sort optimization with points,
and continue with a traditional way of iterating over doc values.

> NumericLeafComparator to define getPointValues
> --
>
> Key: LUCENE-10154
> URL: https://issues.apache.org/jira/browse/LUCENE-10154
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Mayya Sharipova
>Priority: Minor
> Fix For: main (9.0), 8.11
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> NumericLeafComparator must have a method getPointValues, similar to how it has 
> getNumericDocValues.
> Numeric Sort optimization with points relies on the assumption that points 
> and doc values record the same information, as we substitute iterator over 
> doc_values with one over points.
> If we extend {{getNumericDocValues}} it almost certainly means that whatever 
> {{PointValues}} NumericComparator is going to look at shouldn't be used to 
> skip non-competitive documents. Returning null for pointValues in this case 
> will force comparator NOT to use sort optimization with points, and continue 
> with a traditional way of iterating over doc values.
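
The contract described above can be illustrated with a simplified sketch. The classes below are hypothetical stand-ins written for this example and are not Lucene's `NumericComparator` or `PointValues` types; arrays replace the real doc-values and points structures. The sketch only shows the shape of the rule: a subclass that changes what the doc values return also returns null for the point values, which switches the points-based skipping off.

```java
import java.util.function.IntToLongFunction;

// Hypothetical base comparator: doc values and points record the same data.
class SketchLeafComparator {
  protected final long[] indexed; // one value per document

  SketchLeafComparator(long[] indexed) {
    this.indexed = indexed;
  }

  /** Per-document values used for the actual comparison. */
  IntToLongFunction getDocValues() {
    return doc -> indexed[doc];
  }

  /** Values usable for skipping; null means "do not use the points optimization". */
  long[] getPointValues() {
    return indexed;
  }

  final boolean usesPointsOptimization() {
    return getPointValues() != null;
  }
}

// Hypothetical subclass that sorts on transformed values, so points no longer agree.
class NegatedLeafComparator extends SketchLeafComparator {
  NegatedLeafComparator(long[] indexed) {
    super(indexed);
  }

  @Override
  IntToLongFunction getDocValues() {
    return doc -> -indexed[doc];
  }

  @Override
  long[] getPointValues() {
    return null; // points hold the untransformed values, so skipping would be wrong
  }
}
```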






[jira] [Commented] (LUCENE-10198) Allow external JAVA_OPTS in gradlew scripts; use sane defaults (heap, stack and system proxies)

2021-10-27 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17434804#comment-17434804
 ] 

ASF subversion and git services commented on LUCENE-10198:
--

Commit 780846a732b9c3f9c8b0abeae7d1d2c19df524e4 in lucene's branch 
refs/heads/hnsw from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=780846a ]

LUCENE-10198: Allow external JAVA_OPTS in gradlew scripts; use sane defaults 
(heap, stack and system proxies) (#405)

Co-authored-by: balmukundblr 

> Allow external JAVA_OPTS in gradlew scripts; use sane defaults (heap, stack 
> and system proxies)
> ---
>
> Key: LUCENE-10198
> URL: https://issues.apache.org/jira/browse/LUCENE-10198
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Major
> Fix For: main (9.0)
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>







[jira] [Commented] (LUCENE-10199) Drop ZIP binary distribution from release artifacts

2021-10-27 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17434806#comment-17434806
 ] 

ASF subversion and git services commented on LUCENE-10199:
--

Commit fb6aaa7b2c28749c93553c7ffb7e5f5a372ad9b3 in lucene's branch 
refs/heads/hnsw from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=fb6aaa7 ]

LUCENE-10199: drop binary .zip artifact. (#407)



> Drop ZIP binary distribution from release artifacts
> ---
>
> Key: LUCENE-10199
> URL: https://issues.apache.org/jira/browse/LUCENE-10199
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Minor
> Fix For: main (9.0)
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>







[jira] [Commented] (LUCENE-10206) Implement O(1) count on query cache

2021-10-27 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17434809#comment-17434809
 ] 

ASF subversion and git services commented on LUCENE-10206:
--

Commit 941df98c3f718371af4702c92bf6537739120064 in lucene's branch 
refs/heads/hnsw from Nik Everett
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=941df98 ]

LUCENE-10206 Implement O(1) count on query cache (#415)

When we load a query into the query cache we always calculate the count
of matching documents. This uses that count to power the new `O(1)`
`Weight#count` method.

> Implement O(1) count on query cache
> ---
>
> Key: LUCENE-10206
> URL: https://issues.apache.org/jira/browse/LUCENE-10206
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Nik Everett
>Priority: Minor
> Fix For: main (9.0)
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> I'd like to implement the `Weight#count` method in `LRUQueryCache` so cached 
> queries can quickly return their counts. We already have a count on all of 
> the bit sets we use for the query cache; we just have to store it and "plug it 
> in".
>  
> I got here because we frequently end up wanting to get counts and I saw 
> `RoaringDocIdSet`'s iterator showing up as a hot spot. I don't think it's slow or 
> anything, but when the collector is just `count++` the iterator's cost is 
> substantial. It seems like we could frequently avoid the whole thing by 
> implementing `count` in the query cache.






[jira] [Commented] (LUCENE-10163) Review top-level *.txt and *.md files

2021-10-27 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17434805#comment-17434805
 ] 

ASF subversion and git services commented on LUCENE-10163:
--

Commit 08c03566648c0b024b8160869b3d694c3cebaabd in lucene's branch 
refs/heads/hnsw from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=08c0356 ]

LUCENE-10163: clean up and remove some old cruft in readme files. Move binary 
release only README.md to the distribution project so that it doesn't look 
weird in the source tree. (#406)



> Review top-level *.txt and *.md files
> -
>
> Key: LUCENE-10163
> URL: https://issues.apache.org/jira/browse/LUCENE-10163
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Dawid Weiss
>Priority: Major
> Fix For: main (9.0)
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Some of them contain obsolete pointers and information 
> (SYSTEM_REQUIREMENTS.md, etc.).
> Also, move the files that are distribution-specific (lucene/README.md) to the 
> distribution project. Otherwise they
> give odd, incorrect information like:
> {code}
> To review the documentation, read the main documentation page, located at: 
> `docs/index.html` 
> {code}






[GitHub] [lucene] nik9000 commented on pull request #415: LUCENE-10206 Implement O(1) count on query cache

2021-10-27 Thread GitBox


nik9000 commented on pull request #415:
URL: https://github.com/apache/lucene/pull/415#issuecomment-952923025


   > jpountz merged commit 941df98 into apache:main 5 hours ago
   
   Thanks!





[jira] [Created] (LUCENE-10208) Minimum score can decrease in concurrent search

2021-10-27 Thread Jim Ferenczi (Jira)
Jim Ferenczi created LUCENE-10208:
-

 Summary: Minimum score can decrease in concurrent search
 Key: LUCENE-10208
 URL: https://issues.apache.org/jira/browse/LUCENE-10208
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Jim Ferenczi


TestLatLonPointDistanceFeatureQuery#testCompareSorting started to fail 
sporadically after https://github.com/apache/lucene/pull/331. 
The test change added in that PR exposes an existing bug in the top docs 
collectors: they reset the minimum score multiple times per segment when a bulk 
scorer is used.
In practice this is not a problem because the local minimum score cannot 
decrease. 
However, when concurrent search is used, the global minimum score is updated 
after the local one, which breaks the assertion.
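
The invariant at stake (the minimum competitive score must never go down) can be sketched as below. This is an illustration of the invariant only, under assumed names (`MonotonicMinScore`, `update`), and is not the collector code or necessarily the fix adopted for this issue: whichever order the local and global minimums arrive in, the threshold is only ever raised.

```java
// Hypothetical holder for a per-collector threshold; not a Lucene class.
final class MonotonicMinScore {
  private float minCompetitiveScore = 0f;

  /**
   * Offers the latest local and global minimum scores.
   * Returns true only if the threshold was raised; it is never lowered.
   */
  boolean update(float localMinScore, float globalMinScore) {
    float candidate = Math.max(localMinScore, globalMinScore);
    if (candidate > minCompetitiveScore) {
      minCompetitiveScore = candidate;
      return true;
    }
    return false;
  }

  float get() {
    return minCompetitiveScore;
  }
}
```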







[GitHub] [lucene-solr] chatman commented on a change in pull request #2594: SOLR-14726: Initial draft of a new quickstart guide

2021-10-27 Thread GitBox


chatman commented on a change in pull request #2594:
URL: https://github.com/apache/lucene-solr/pull/2594#discussion_r737538697



##
File path: solr/solr-ref-guide/src/quickstart.adoc
##
@@ -0,0 +1,140 @@
+= Quickstart Guide
+:experimental:
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+Here's a quickstart guide to start Solr, add some documents and perform some 
searches.
+
+== Starting Solr
+
+Start a Solr node in cluster mode (SolrCloud mode)
+
+[source,subs="verbatim,attributes+"]
+
+$ bin/solr -c
+
+Waiting up to 180 seconds to see Solr running on port 8983 [\]
+Started Solr server on port 8983 (pid=34942). Happy searching!
+
+
+To start another Solr node and have it join the cluster alongside the first 
node,
+
+[source,subs="verbatim,attributes+"]
+
+$ bin/solr -c -z localhost:9983 -p 8984
+
+
+An instance of the cluster coordination service, i.e. Zookeeper, was started 
on port 9983 when the first node was started. To start Zookeeper separately, 
please refer to .
+
+== Creating a collection
+
+Like a database system holds data in tables, Solr holds data in collections. A 
collection can be created as follows:
+
+[source,subs="verbatim,attributes+"]
+
+$ curl --request POST \
+  --url http://localhost:8983/api/collections \
+  --header 'Content-Type: application/json' \
+  --data '{
+   "create": {
+   "name": "techproducts",
+   "numShards": 1,
+   "replicationFactor": 1

Review comment:
   For quickstart examples, we don't need the user to use their own 
configsets. They can start with the default configset, add fields (schema API) 
and do their indexing/searching.
   
   > If the consensus is that we're going away from field guessing, then we 
should not promote the current _default config, but rather be explicit and 
reference the bundled techproducts configset.
   
   I'm more inclined to remove the techproducts configset. It can be 
downloaded from some web resource by those who need it.
   
   > Or better, show them how to use Schema Designer to setup a configset for a 
certain dataset?
   
   +1







[GitHub] [lucene] apanimesh061 commented on a change in pull request #412: LUCENE-10197: UnifiedHighlighter should use builders for thread-safety

2021-10-27 Thread GitBox


apanimesh061 commented on a change in pull request #412:
URL: https://github.com/apache/lucene/pull/412#discussion_r737588612



##
File path: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/UnifiedHighlighter.java
##
@@ -143,6 +143,106 @@
 
   private int cacheFieldValCharsThreshold = DEFAULT_CACHE_CHARS_THRESHOLD;
 
+  /** Builder for UnifiedHighlighter. */
+  public abstract static class Builder<T extends Builder<T>> {
+private IndexSearcher searcher;
+private Analyzer indexAnalyzer;
+private boolean handleMultiTermQuery = true;
+private boolean highlightPhrasesStrictly = true;
+private boolean passageRelevancyOverSpeed = true;
+private int maxLength = DEFAULT_MAX_LENGTH;
+private Supplier<BreakIterator> breakIterator =
+() -> BreakIterator.getSentenceInstance(Locale.ROOT);
+private Predicate<String> fieldMatcher;
+private PassageScorer scorer = new PassageScorer();
+private PassageFormatter formatter = new DefaultPassageFormatter();
+private int maxNoHighlightPassages = -1;
+private int cacheFieldValCharsThreshold = DEFAULT_CACHE_CHARS_THRESHOLD;
+
+public T withSearcher(IndexSearcher value) {
+  this.searcher = value;
+  return self();
+}
+
+public T withIndexAnalyzer(Analyzer value) {
+  this.indexAnalyzer = value;
+  return self();
+}
+
+public T withHandleMultiTermQuery(boolean value) {
+  this.handleMultiTermQuery = value;
+  return self();
+}
+
+public T withHighlightPhrasesStrictly(boolean value) {
+  this.highlightPhrasesStrictly = value;
+  return self();
+}
+
+public T withPassageRelevancyOverSpeed(boolean value) {
+  this.passageRelevancyOverSpeed = value;
+  return self();
+}
+
+public T withMaxLength(int value) {
+  if (value < 0 || value == Integer.MAX_VALUE) {
+// two reasons: no overflow problems in BreakIterator.preceding(offset+1),
+// our sentinel in the offsets queue uses this value to terminate.
+throw new IllegalArgumentException("maxLength must be < Integer.MAX_VALUE");
+  }
+  this.maxLength = value;
+  return self();
+}
+
+public T withBreakIterator(Supplier<BreakIterator> value) {
+  this.breakIterator = value;
+  return self();
+}
+
+public T withFieldMatcher(Predicate<String> value) {
+  this.fieldMatcher = value;
+  return self();
+}
+
+public T withScorer(PassageScorer value) {
+  this.scorer = value;
+  return self();
+}
+
+public T withFormatter(PassageFormatter value) {
+  this.formatter = value;
+  return self();
+}
+
+public T withMaxNoHighlightPassages(int value) {
+  this.maxNoHighlightPassages = value;
+  return self();
+}
+
+public T withCacheFieldValCharsThreshold(int value) {
+  this.cacheFieldValCharsThreshold = value;
+  return self();
+}
+
+protected abstract T self();
+
+public UnifiedHighlighter build() {
+  return new UnifiedHighlighter(this);
+}
+  }
+
+  // Why? https://web.archive.org/web/20150920054846/https://weblogs.java.net/node/642849

Review comment:
   @dsmiley Is there a way to run the checks again on the code? I see that 
1/3 checks failed. The failure was due to `socket hang up`. I wonder if 
retrying might work.







[GitHub] [lucene] jtibshirani commented on a change in pull request #413: LUCENE-9614: Fix KnnVectorQuery failure when numDocs is 0

2021-10-27 Thread GitBox


jtibshirani commented on a change in pull request #413:
URL: https://github.com/apache/lucene/pull/413#discussion_r737599366



##
File path: lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java
##
@@ -25,18 +25,10 @@
 import java.io.IOException;
 import java.util.HashSet;
 import java.util.Set;
-import org.apache.lucene.document.Document;
-import org.apache.lucene.document.Field;
-import org.apache.lucene.document.KnnVectorField;
-import org.apache.lucene.document.StringField;
-import org.apache.lucene.index.DirectoryReader;
-import org.apache.lucene.index.IndexReader;
-import org.apache.lucene.index.IndexWriter;
-import org.apache.lucene.index.IndexWriterConfig;
-import org.apache.lucene.index.RandomIndexWriter;
-import org.apache.lucene.index.Term;
-import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.document.*;
+import org.apache.lucene.index.*;

Review comment:
   I also noticed our static analysis is totally fine with it 
(surprisingly?). I'll need to fix my IntelliJ setup :)







[GitHub] [lucene] mayya-sharipova opened a new pull request #416: LUCENE-10054 Make HnswGraph hierarchical

2021-10-27 Thread GitBox


mayya-sharipova opened a new pull request #416:
URL: https://github.com/apache/lucene/pull/416


   Currently HNSW has only a single layer.
   This patch attempts to make it multi-layered.
   
   





[GitHub] [lucene] mayya-sharipova commented on pull request #416: LUCENE-10054 Make HnswGraph hierarchical

2021-10-27 Thread GitBox


mayya-sharipova commented on pull request #416:
URL: https://github.com/apache/lucene/pull/416#issuecomment-953142944


   Benchmarking based on @jtibshirani 
[setup](https://github.com/jtibshirani/lucene/pull/1)
   
   baseline: main branch
   candidate: this PR
   
   **glove-25-angular**
   
   |              | baseline recall | baseline QPS | candidate recall | candidate QPS |
   | ------------ | --------------: | -----------: | ---------------: | ------------: |
   | n_cands=10   |           0.626 |    10962.821 |            0.631 |      8869.807 |
   | n_cands=50   |           0.888 |     4409.952 |            0.889 |      4111.685 |
   | n_cands=100  |           0.946 |     2621.846 |            0.947 |      2734.787 |
   | n_cands=500  |           0.994 |      661.253 |            0.994 |       686.700 |
   | n_cands=800  |           0.997 |      430.172 |            0.997 |       459.356 |
   | n_cands=1000 |           0.998 |      342.915 |            0.998 |       355.238 |
   
   
   **sift-128-euclidean**
   
   |              | baseline recall | baseline QPS | candidate recall | candidate QPS |
   | ------------ | --------------: | -----------: | ---------------: | ------------: |
   | n_cands=10   |           0.601 |     6948.736 |            0.607 |      6677.931 |
   | n_cands=50   |           0.889 |     3003.781 |            0.892 |      3202.925 |
   | n_cands=100  |           0.952 |     1622.276 |            0.953 |      1996.992 |
   | n_cands=500  |           0.996 |      444.135 |            0.996 |       540.368 |
   | n_cands=800  |           0.998 |      296.835 |            0.998 |       367.316 |
   | n_cands=1000 |           0.999 |      245.498 |            0.999 |       311.339 |
   
   





[GitHub] [lucene] mayya-sharipova edited a comment on pull request #416: LUCENE-10054 Make HnswGraph hierarchical

2021-10-27 Thread GitBox


mayya-sharipova edited a comment on pull request #416:
URL: https://github.com/apache/lucene/pull/416#issuecomment-953142944


   Benchmarking based on @jtibshirani 
[setup](https://github.com/jtibshirani/lucene/pull/1)
   
   baseline: main branch
   candidate: this PR
   
   **glove-25-angular**
   
   |              | baseline recall | baseline QPS | candidate recall | candidate QPS |
   | ------------ | --------------: | -----------: | ---------------: | ------------: |
   | n_cands=10   |           0.626 |    10962.821 |            0.631 |      8869.807 |
   | n_cands=50   |           0.888 |     4409.952 |            0.889 |      4111.685 |
   | n_cands=100  |           0.946 |     2621.846 |            0.947 |      2734.787 |
   | n_cands=500  |           0.994 |      661.253 |            0.994 |       686.700 |
   | n_cands=800  |           0.997 |      430.172 |            0.997 |       459.356 |
   | n_cands=1000 |           0.998 |      342.915 |            0.998 |       355.238 |
   
   
   **sift-128-euclidean**
   
   |              | baseline recall | baseline QPS | candidate recall | candidate QPS |
   | ------------ | --------------: | -----------: | ---------------: | ------------: |
   | n_cands=10   |           0.601 |     6948.736 |            0.607 |      6677.931 |
   | n_cands=50   |           0.889 |     3003.781 |            0.892 |      3202.925 |
   | n_cands=100  |           0.952 |     1622.276 |            0.953 |      1996.992 |
   | n_cands=500  |           0.996 |      444.135 |            0.996 |       540.368 |
   | n_cands=800  |           0.998 |      296.835 |            0.998 |       367.316 |
   | n_cands=1000 |           0.999 |      245.498 |            0.999 |       311.339 |
   
   
   As can be seen from the comparison, the hierarchy brings only a very slight 
change: a small increase in recall at the expense of lower QPS.
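
To put numbers on the comparison, the relative QPS change for any row can be computed directly from the tables. The Java sketch below is only an illustration: `QpsDeltaSketch` and `percentChange` are made-up names, and the inputs are two rows copied from the tables above.

```java
// Hypothetical helper for reading the benchmark tables; not part of the PR.
final class QpsDeltaSketch {
  /** Relative change of candidate QPS versus baseline QPS, in percent. */
  static double percentChange(double baselineQps, double candidateQps) {
    return (candidateQps - baselineQps) / baselineQps * 100.0;
  }

  public static void main(String[] args) {
    // glove-25-angular, n_cands=10 (baseline QPS, candidate QPS)
    System.out.printf("glove n_cands=10: %+.1f%%%n", percentChange(10962.821, 8869.807));
    // sift-128-euclidean, n_cands=100 (baseline QPS, candidate QPS)
    System.out.printf("sift n_cands=100: %+.1f%%%n", percentChange(1622.276, 1996.992));
  }
}
```

A positive result means the candidate was faster than the baseline for that row; a negative result means it was slower.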





[GitHub] [lucene] jtibshirani merged pull request #413: LUCENE-9614: Fix KnnVectorQuery failure when numDocs is 0

2021-10-27 Thread GitBox


jtibshirani merged pull request #413:
URL: https://github.com/apache/lucene/pull/413


   





[jira] [Commented] (LUCENE-9614) Implement KNN Query

2021-10-27 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17434996#comment-17434996
 ] 

ASF subversion and git services commented on LUCENE-9614:
-

Commit abd5ec4ff0b56b1abfc2883e47e75871e60d3cad in lucene's branch 
refs/heads/main from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=abd5ec4 ]

LUCENE-9614: Fix KnnVectorQuery failure when numDocs is 0 (#413)

When the reader has no live docs, `KnnVectorQuery` can error out. This happens
because `IndexReader#numDocs` is 0, and we end up passing an illegal value of
`k = 0` to the search method.

This commit removes the problematic optimization in `KnnVectorQuery` and
replaces it with a lower-level one based on the total number of vectors in the
segment.
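
A minimal sketch of the kind of guard described above, assuming the per-segment vector count is known to the caller. `KnnSearchGuard` and `effectiveK` are hypothetical names used for illustration; this is not the code from the commit.

```java
// Hypothetical helper, not Lucene's implementation: it bounds k by the number
// of vectors in the segment instead of by the reader's live-document count.
final class KnnSearchGuard {
  /**
   * Returns how many neighbors to request from one segment.
   * A result of 0 means the segment holds no vectors and should be skipped,
   * so an illegal k = 0 is never handed to the search method.
   */
  static int effectiveK(int requestedK, int vectorsInSegment) {
    if (requestedK < 1) {
      throw new IllegalArgumentException("k must be at least 1, got: " + requestedK);
    }
    return Math.min(requestedK, Math.max(vectorsInSegment, 0));
  }

  public static void main(String[] args) {
    int k = effectiveK(10, 0);
    System.out.println(k == 0 ? "skip segment: no vectors" : "search with k=" + k);
  }
}
```

Because the bound comes from the vector count rather than from `IndexReader#numDocs`, a reader whose documents are all deleted no longer drives `k` to 0 on its own.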

> Implement KNN Query
> ---
>
> Key: LUCENE-9614
> URL: https://issues.apache.org/jira/browse/LUCENE-9614
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
>  Time Spent: 5h 10m
>  Remaining Estimate: 0h
>
> Now we have a vector index format, and one vector indexing/KNN search 
> implementation, but the interface is low-level: you can search across a 
> single segment only. We would like to expose a Query implementation. 
> Initially, we want to support a usage where the KnnVectorQuery selects the 
> k-nearest neighbors without regard to any other constraints, and these can 
> then be filtered as part of an enclosing Boolean or other query.
> Later we will want to explore some kind of filtering *while* performing 
> vector search, or a re-entrant search process that can yield further results. 
> Because of the nature of knn search (all documents having any vector value 
> match), it is more like a ranking than a filtering operation, and it doesn't 
> really make sense to provide an iterator interface that can be merged in the 
> usual way, in docid order, skipping ahead. It's not yet clear how to satisfy 
> a query that is "k nearest neighbors satisfying some arbitrary Query", at 
> least not without realizing a complete bitset for the Query. But this is for 
> a later issue; *this* issue is just about performing the knn search in 
> isolation, computing a set of (some given) K nearest neighbors, and providing 
> an iterator over those.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jtibshirani commented on pull request #416: LUCENE-10054 Make HnswGraph hierarchical

2021-10-27 Thread GitBox


jtibshirani commented on pull request #416:
URL: https://github.com/apache/lucene/pull/416#issuecomment-953198759


   > As can be seen from the comparison, the hierarchy brings only a very slight 
change: a small increase in recall at the expense of lower QPS
   
   It looks like QPS is sometimes worse, but often better (as in all the 
sift-128-euclidean runs with n_cands >= 50). I wonder if the first runs are 
affected by a lack of warm-up? My original set-up you linked to didn't include 
a warm-up, and in LUCENE-9937 we found that this can have a big impact on the 
first runs.
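
For illustration, a warm-up phase can be kept separate from the timed phase along these lines. This is only a sketch: `WarmupBenchmarkSketch`, `measureQps`, and the iteration counts are assumptions, not part of the benchmark set-up discussed here.

```java
// Hypothetical micro-benchmark loop: untimed warm-up runs first, then timed runs.
final class WarmupBenchmarkSketch {
  /** Runs the query warmupIterations times untimed, then reports QPS over measuredIterations. */
  static double measureQps(Runnable query, int warmupIterations, int measuredIterations) {
    for (int i = 0; i < warmupIterations; i++) {
      query.run(); // lets the JIT compile hot paths and the OS cache index files
    }
    long start = System.nanoTime();
    for (int i = 0; i < measuredIterations; i++) {
      query.run();
    }
    long elapsedNanos = Math.max(1L, System.nanoTime() - start);
    return measuredIterations / (elapsedNanos / 1_000_000_000.0);
  }

  public static void main(String[] args) {
    // Stand-in workload; a real run would execute a vector query here.
    Runnable fakeQuery = () -> Math.sqrt(System.nanoTime());
    System.out.println("QPS: " + measureQps(fakeQuery, 1_000, 10_000));
  }
}
```

Without the first loop, the earliest measured configuration (here, the smallest `n_cands`) absorbs the start-up cost, which could explain a QPS drop that appears only in the first rows.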





[jira] [Updated] (LUCENE-10200) Restructure and modernize the release artifacts

2021-10-27 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated LUCENE-10200:
-
Description: 
This is an umbrella issue for various sub-tasks as per my e-mail [1].
 [1] [https://markmail.org/thread/f7yrggnynq2ijgmy]

In this order, perhaps:
 * Apply small text file changes (LUCENE-10163)
 * Simplify artifacts (LUCENE-10199 drop ZIP binary), (LUCENE-10192 drop third 
party JARs).
 * Create an additional binary artifact for Luke (LUCENE-9978).
 * Review the content of licenses/ - there are some entries there that relate 
to tests only (jetty).
 * Test everything with the smoke tester.

  was:
This is an umbrella issue for various sub-tasks as per my e-mail [1].
[1] https://markmail.org/thread/f7yrggnynq2ijgmy

In this order, perhaps:

* Apply small text file changes (LUCENE-10163)
* Simplify artifacts (LUCENE-10199 drop ZIP binary), (LUCENE-10192 drop third 
party JARs).
* Create an additional binary artifact for Luke (LUCENE-9978).
* Review the content of licenses/ - there are some entries there that relate to 
tests only (jetty) or oddballs like elegant-icon-font or a stray pddl*.txt.
* Test everything with the smoke tester.


> Restructure and modernize the release artifacts
> ---
>
> Key: LUCENE-10200
> URL: https://issues.apache.org/jira/browse/LUCENE-10200
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Major
>
> This is an umbrella issue for various sub-tasks as per my e-mail [1].
>  [1] [https://markmail.org/thread/f7yrggnynq2ijgmy]
> In this order, perhaps:
>  * Apply small text file changes (LUCENE-10163)
>  * Simplify artifacts (LUCENE-10199 drop ZIP binary), (LUCENE-10192 drop 
> third party JARs).
>  * Create an additional binary artifact for Luke (LUCENE-9978).
>  * Review the content of licenses/ - there are some entries there that relate 
> to tests only (jetty).
>  * Test everything with the smoke tester.






[jira] [Commented] (LUCENE-10200) Restructure and modernize the release artifacts

2021-10-27 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17435007#comment-17435007
 ] 

ASF subversion and git services commented on LUCENE-10200:
--

Commit 62eb9a809e8e6327df0006efd342b980b2d18bd9 in lucene's branch 
refs/heads/main from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=62eb9a8 ]

LUCENE-10200: remove unused dangling license exclusions. Add references to the 
remaining ones.


> Restructure and modernize the release artifacts
> ---
>
> Key: LUCENE-10200
> URL: https://issues.apache.org/jira/browse/LUCENE-10200
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Major
>
> This is an umbrella issue for various sub-tasks as per my e-mail [1].
> [1] https://markmail.org/thread/f7yrggnynq2ijgmy
> In this order, perhaps:
> * Apply small text file changes (LUCENE-10163)
> * Simplify artifacts (LUCENE-10199 drop ZIP binary), (LUCENE-10192 drop third 
> party JARs).
> * Create an additional binary artifact for Luke (LUCENE-9978).
> * Review the content of licenses/ - there are some entries there that relate 
> to tests only (jetty) or oddballs like elegant-icon-font or a stray pddl*.txt.
> * Test everything with the smoke tester.






[jira] [Commented] (LUCENE-10209) Gradle wrapper validation gh workflow step fails with odd messages

2021-10-27 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17435016#comment-17435016
 ] 

ASF subversion and git services commented on LUCENE-10209:
--

Commit 727c6b1e0b1429bc521174ab5c60bebf0e0178e1 in lucene's branch 
refs/heads/main from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=727c6b1 ]

LUCENE-10209: Temporarily comment out gradle validation.


> Gradle wrapper validation gh workflow step fails with odd messages
> --
>
> Key: LUCENE-10209
> URL: https://issues.apache.org/jira/browse/LUCENE-10209
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Priority: Minor
>
> I will comment it out for the time being. Don't know what's causing it.
> https://github.com/gradle/wrapper-validation-action/issues/46






[jira] [Created] (LUCENE-10209) Gradle wrapper validation gh workflow step fails with odd messages

2021-10-27 Thread Dawid Weiss (Jira)
Dawid Weiss created LUCENE-10209:


 Summary: Gradle wrapper validation gh workflow step fails with odd 
messages
 Key: LUCENE-10209
 URL: https://issues.apache.org/jira/browse/LUCENE-10209
 Project: Lucene - Core
  Issue Type: Task
Reporter: Dawid Weiss


I will comment it out for the time being. Don't know what's causing it.

https://github.com/gradle/wrapper-validation-action/issues/46






[jira] [Assigned] (LUCENE-10209) Gradle wrapper validation gh workflow step fails with odd messages

2021-10-27 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss reassigned LUCENE-10209:


Assignee: Dawid Weiss

> Gradle wrapper validation gh workflow step fails with odd messages
> --
>
> Key: LUCENE-10209
> URL: https://issues.apache.org/jira/browse/LUCENE-10209
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Minor
>
> I will comment it out for the time being. Don't know what's causing it.
> https://github.com/gradle/wrapper-validation-action/issues/46






[GitHub] [lucene] dsmiley commented on a change in pull request #412: LUCENE-10197: UnifiedHighlighter should use builders for thread-safety

2021-10-27 Thread GitBox


dsmiley commented on a change in pull request #412:
URL: https://github.com/apache/lucene/pull/412#discussion_r737770485



##
File path: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/UnifiedHighlighter.java
##
@@ -143,6 +143,106 @@
 
   private int cacheFieldValCharsThreshold = DEFAULT_CACHE_CHARS_THRESHOLD;
 
+  /** Builder for UnifiedHighlighter. */
+  public abstract static class Builder<T extends Builder<T>> {
+private IndexSearcher searcher;
+private Analyzer indexAnalyzer;
+private boolean handleMultiTermQuery = true;
+private boolean highlightPhrasesStrictly = true;
+private boolean passageRelevancyOverSpeed = true;
+private int maxLength = DEFAULT_MAX_LENGTH;
+private Supplier<BreakIterator> breakIterator =
+() -> BreakIterator.getSentenceInstance(Locale.ROOT);
+private Predicate<String> fieldMatcher;
+private PassageScorer scorer = new PassageScorer();
+private PassageFormatter formatter = new DefaultPassageFormatter();
+private int maxNoHighlightPassages = -1;
+private int cacheFieldValCharsThreshold = DEFAULT_CACHE_CHARS_THRESHOLD;
+
+public T withSearcher(IndexSearcher value) {
+  this.searcher = value;
+  return self();
+}
+
+public T withIndexAnalyzer(Analyzer value) {
+  this.indexAnalyzer = value;
+  return self();
+}
+
+public T withHandleMultiTermQuery(boolean value) {
+  this.handleMultiTermQuery = value;
+  return self();
+}
+
+public T withHighlightPhrasesStrictly(boolean value) {
+  this.highlightPhrasesStrictly = value;
+  return self();
+}
+
+public T withPassageRelevancyOverSpeed(boolean value) {
+  this.passageRelevancyOverSpeed = value;
+  return self();
+}
+
+public T withMaxLength(int value) {
+  if (value < 0 || value == Integer.MAX_VALUE) {
+// two reasons: no overflow problems in BreakIterator.preceding(offset+1),
+// our sentinel in the offsets queue uses this value to terminate.
+throw new IllegalArgumentException("maxLength must be < Integer.MAX_VALUE");
+  }
+  this.maxLength = value;
+  return self();
+}
+
+public T withBreakIterator(Supplier<BreakIterator> value) {
+  this.breakIterator = value;
+  return self();
+}
+
+public T withFieldMatcher(Predicate<String> value) {
+  this.fieldMatcher = value;
+  return self();
+}
+
+public T withScorer(PassageScorer value) {
+  this.scorer = value;
+  return self();
+}
+
+public T withFormatter(PassageFormatter value) {
+  this.formatter = value;
+  return self();
+}
+
+public T withMaxNoHighlightPassages(int value) {
+  this.maxNoHighlightPassages = value;
+  return self();
+}
+
+public T withCacheFieldValCharsThreshold(int value) {
+  this.cacheFieldValCharsThreshold = value;
+  return self();
+}
+
+protected abstract T self();
+
+public UnifiedHighlighter build() {
+  return new UnifiedHighlighter(this);
+}
+  }
+
+  // Why? https://web.archive.org/web/20150920054846/https://weblogs.java.net/node/642849

Review comment:
   I suspect you intended a top-level comment but responded to a previous 
thread about builder subclassing.  Anyway, I re-ran them and they passed.  I 
wouldn't worry about this; just run locally.







[GitHub] [lucene] mayya-sharipova edited a comment on pull request #416: LUCENE-10054 Make HnswGraph hierarchical

2021-10-27 Thread GitBox


mayya-sharipova edited a comment on pull request #416:
URL: https://github.com/apache/lucene/pull/416#issuecomment-953142944


   Benchmarking based on @jtibshirani [setup](https://github.com/jtibshirani/lucene/pull/1)
   
   baseline: main branch
   candidate: this PR
   
   **glove-25-angular**
   
   |              | baseline recall | baseline QPS | candidate recall | candidate QPS |
   | ------------ | --------------: | -----------: | ---------------: | ------------: |
   | n_cands=10   |           0.626 |    10962.821 |            0.631 |      8869.807 |
   | n_cands=50   |           0.888 |     4409.952 |            0.889 |      4111.685 |
   | n_cands=100  |           0.946 |     2621.846 |            0.947 |      2734.787 |
   | n_cands=500  |           0.994 |      661.253 |            0.994 |       686.700 |
   | n_cands=800  |           0.997 |      430.172 |            0.997 |       459.356 |
   | n_cands=1000 |           0.998 |      342.915 |            0.998 |       355.238 |
   
   
   **sift-128-euclidean**
   
   |              | baseline recall | baseline QPS | candidate recall | candidate QPS |
   | ------------ | --------------: | -----------: | ---------------: | ------------: |
   | n_cands=10   |           0.601 |     6948.736 |            0.607 |      6677.931 |
   | n_cands=50   |           0.889 |     3003.781 |            0.892 |      3202.925 |
   | n_cands=100  |           0.952 |     1622.276 |            0.953 |      1996.992 |
   | n_cands=500  |           0.996 |      444.135 |            0.996 |       540.368 |
   | n_cands=800  |           0.998 |      296.835 |            0.998 |       367.316 |
   | n_cands=1000 |           0.999 |      245.498 |            0.999 |       311.339 |
   
   
   As can be seen from the comparison, the hierarchy brings only a very slight 
change: a small increase in recall at the expense of lower QPS.





[GitHub] [lucene] mayya-sharipova commented on pull request #416: LUCENE-10054 Make HnswGraph hierarchical

2021-10-27 Thread GitBox


mayya-sharipova commented on pull request #416:
URL: https://github.com/apache/lucene/pull/416#issuecomment-953265461


   @jtibshirani  Thanks for the comment.
   
   >  I wonder if the first runs are affected by a lack of warm-up?
   
   I've added a warmup stage as well,  but starting with bogus query args in 
ann benchmarking algorithm. 
   
   





[GitHub] [lucene] mayya-sharipova edited a comment on pull request #416: LUCENE-10054 Make HnswGraph hierarchical

2021-10-27 Thread GitBox


mayya-sharipova edited a comment on pull request #416:
URL: https://github.com/apache/lucene/pull/416#issuecomment-953265461


   @jtibshirani  Thanks for the comment.
   
   >  I wonder if the first runs are affected by a lack of warm-up?
   
   I've added a warmup stage as well, but starting with bogus query args in 
[ann benchmarking algorithm](https://github.com/jtibshirani/ann-benchmarks/blob/lucene-hnsw/algos.yaml#L70).
   
   





[GitHub] [lucene] apanimesh061 commented on a change in pull request #412: LUCENE-10197: UnifiedHighlighter should use builders for thread-safety

2021-10-27 Thread GitBox


apanimesh061 commented on a change in pull request #412:
URL: https://github.com/apache/lucene/pull/412#discussion_r737860855



##
File path: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/UnifiedHighlighter.java
##
@@ -143,6 +143,106 @@
 
   private int cacheFieldValCharsThreshold = DEFAULT_CACHE_CHARS_THRESHOLD;
 
+  /** Builder for UnifiedHighlighter. */
+  public abstract static class Builder<T extends Builder<T>> {
+private IndexSearcher searcher;
+private Analyzer indexAnalyzer;
+private boolean handleMultiTermQuery = true;
+private boolean highlightPhrasesStrictly = true;
+private boolean passageRelevancyOverSpeed = true;
+private int maxLength = DEFAULT_MAX_LENGTH;
+private Supplier<BreakIterator> breakIterator =
+() -> BreakIterator.getSentenceInstance(Locale.ROOT);
+private Predicate<String> fieldMatcher;
+private PassageScorer scorer = new PassageScorer();
+private PassageFormatter formatter = new DefaultPassageFormatter();
+private int maxNoHighlightPassages = -1;
+private int cacheFieldValCharsThreshold = DEFAULT_CACHE_CHARS_THRESHOLD;
+
+public T withSearcher(IndexSearcher value) {
+  this.searcher = value;
+  return self();
+}
+
+public T withIndexAnalyzer(Analyzer value) {
+  this.indexAnalyzer = value;
+  return self();
+}
+
+public T withHandleMultiTermQuery(boolean value) {
+  this.handleMultiTermQuery = value;
+  return self();
+}
+
+public T withHighlightPhrasesStrictly(boolean value) {
+  this.highlightPhrasesStrictly = value;
+  return self();
+}
+
+public T withPassageRelevancyOverSpeed(boolean value) {
+  this.passageRelevancyOverSpeed = value;
+  return self();
+}
+
+public T withMaxLength(int value) {
+  if (value < 0 || value == Integer.MAX_VALUE) {
+// two reasons: no overflow problems in BreakIterator.preceding(offset+1),
+// our sentinel in the offsets queue uses this value to terminate.
+throw new IllegalArgumentException("maxLength must be < Integer.MAX_VALUE");
+  }
+  this.maxLength = value;
+  return self();
+}
+
+public T withBreakIterator(Supplier<BreakIterator> value) {
+  this.breakIterator = value;
+  return self();
+}
+
+public T withFieldMatcher(Predicate<String> value) {
+  this.fieldMatcher = value;
+  return self();
+}
+
+public T withScorer(PassageScorer value) {
+  this.scorer = value;
+  return self();
+}
+
+public T withFormatter(PassageFormatter value) {
+  this.formatter = value;
+  return self();
+}
+
+public T withMaxNoHighlightPassages(int value) {
+  this.maxNoHighlightPassages = value;
+  return self();
+}
+
+public T withCacheFieldValCharsThreshold(int value) {
+  this.cacheFieldValCharsThreshold = value;
+  return self();
+}
+
+protected abstract T self();
+
+public UnifiedHighlighter build() {
+  return new UnifiedHighlighter(this);
+}
+  }
+
+  // Why? https://web.archive.org/web/20150920054846/https://weblogs.java.net/node/642849

Review comment:
   @dsmiley ah yes you are right. Thanks for rerunning the tests.
   
   Does the new builder class look right to you? Is it expected to remove all 
setters from this class? This would mean I'll have to modify all their 
references in other classes and unit tests and replace them with builders.







[GitHub] [lucene] dsmiley commented on pull request #412: LUCENE-10197: UnifiedHighlighter should use builders for thread-safety

2021-10-27 Thread GitBox


dsmiley commented on pull request #412:
URL: https://github.com/apache/lucene/pull/412#issuecomment-953329902


   > Does the new builder class look right to you? Is it expected to remove all 
setters from this class? This would mean I'll have to modify all their 
references in other classes and unit tests and replace them with builders.
   
   It looks good at a glance... you/I will see better if you update one of the 
clients that might want to subclass with extra configuration.  Is there any or 
is this builder subclassing issue entirely hypothetical at this point?  I 
suspect only hypothetical.  We'll want nice Javadocs on the builder setters 
since this is where consumers/clients will see it.  We can merely move the docs 
there from the existing locations, and add javadoc references pointing to the 
builder from the existing fields/enum values as desired.
   
   > This would mean I'll have to modify all their references in other classes 
and unit tests and replace them with builders.
   
   Yes, that's the point of this issue.  You might try updating just one/two 
source files (presumably tests) and see how it goes.  If there's some ugliness 
that brings doubt then maybe stop and share, otherwise continue.
   
   RE 9.0.  If this doesn't make 9.0, then the actual removal of the setters 
would happen in 10 but the rest of it (the builder) could arrive in 9.1.





[GitHub] [lucene] apanimesh061 commented on pull request #412: LUCENE-10197: UnifiedHighlighter should use builders for thread-safety

2021-10-27 Thread GitBox


apanimesh061 commented on pull request #412:
URL: https://github.com/apache/lucene/pull/412#issuecomment-95323


   > It looks good at a glance... you/I will see better if you update one of 
the clients that might want to subclass with extra configuration. Is there any 
or is this builder subclassing issue entirely hypothetical at this point? I 
suspect only hypothetical. We'll want nice Javadocs on the builder setters 
since this is where consumers/clients will see it. We can merely move the docs 
there from the existing locations, and add javadoc references pointing to the 
builder from the existing fields/enum values as desired.
   
   Great. I added a unit test (just for demo) and a class 
`SubUnifiedHighlighter` in `TestUnifiedHighlighter.java` where I've added a new 
test field and also tested it. It does look right to me since it is able to use 
the new field and also fields from parent class.
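
As a rough illustration of the subclassing question discussed above, here is a minimal standalone sketch of a self-typed builder whose subclass adds one extra setting. It is not the code from the PR: `Highlighter`, `SubHighlighter`, `DefaultBuilder`, `SubBuilder`, and `withExtraFlag` are hypothetical stand-ins for `UnifiedHighlighter`, `SubUnifiedHighlighter`, and their builders, and only two of the parent settings are modelled.

```java
import java.util.function.Predicate;

final class BuilderSketch {

  /** Hypothetical stand-in for the highlighter; configured only through its builder. */
  static class Highlighter {
    final int maxNoHighlightPassages;
    final Predicate<String> fieldMatcher;

    Highlighter(Builder<?> builder) {
      this.maxNoHighlightPassages = builder.maxNoHighlightPassages;
      this.fieldMatcher = builder.fieldMatcher;
    }

    /** Self-typed builder: T is the concrete builder type that every setter returns. */
    abstract static class Builder<T extends Builder<T>> {
      int maxNoHighlightPassages = -1;
      Predicate<String> fieldMatcher = field -> true;

      public T withMaxNoHighlightPassages(int value) {
        this.maxNoHighlightPassages = value;
        return self();
      }

      public T withFieldMatcher(Predicate<String> value) {
        this.fieldMatcher = value;
        return self();
      }

      protected abstract T self();

      public Highlighter build() {
        return new Highlighter(this);
      }
    }

    /** Concrete builder for the base class. */
    static class DefaultBuilder extends Builder<DefaultBuilder> {
      @Override
      protected DefaultBuilder self() {
        return this;
      }
    }
  }

  /** Hypothetical subclass that carries one additional setting. */
  static class SubHighlighter extends Highlighter {
    final boolean extraFlag;

    SubHighlighter(SubBuilder builder) {
      super(builder);
      this.extraFlag = builder.extraFlag;
    }

    static class SubBuilder extends Highlighter.Builder<SubBuilder> {
      boolean extraFlag;

      public SubBuilder withExtraFlag(boolean value) {
        this.extraFlag = value;
        return self();
      }

      @Override
      protected SubBuilder self() {
        return this;
      }

      @Override
      public SubHighlighter build() {
        return new SubHighlighter(this);
      }
    }
  }

  public static void main(String[] args) {
    SubHighlighter h =
        new SubHighlighter.SubBuilder()
            .withMaxNoHighlightPassages(3) // parent setter, still returns SubBuilder
            .withExtraFlag(true) // so the subclass setter stays reachable in the chain
            .build();
    System.out.println(h.maxNoHighlightPassages + " " + h.extraFlag);
  }
}
```

Because `self()` returns the concrete builder type, parent and subclass setters can be chained in any order without casts, which is the property the demo test is meant to exercise.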
   





[jira] [Updated] (LUCENE-10207) Make TermInSetQuery usable with IndexOrDocValuesQuery

2021-10-27 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-10207:
-
Attachment: LUCENE-10207_multitermquery.patch

> Make TermInSetQuery usable with IndexOrDocValuesQuery
> -
>
> Key: LUCENE-10207
> URL: https://issues.apache.org/jira/browse/LUCENE-10207
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-10207_multitermquery.patch
>
>
> IndexOrDocValuesQuery is very useful to pick the right execution mode for a 
> query depending on other bits of the query tree.
> We would like to be able to use it to optimize execution of TermInSetQuery. 
> However IndexOrDocValuesQuery only works well if the "index" query can give 
> an estimation of the cost of the query without doing anything expensive (like 
> looking up all terms of the TermInSetQuery in the terms dict). Maybe we could 
> implement it for primary keys (terms.size() == sumDocFreq) by returning the 
> number of terms of the query? Another idea is to multiply the number of terms 
> by the average postings length, though this could be dangerous if the field 
> has a zipfian distribution and some terms have a much higher doc frequency 
> than the average.
> [~romseygeek] and I were discussing this a few weeks ago, and more recently 
> [~mikemccand] and [~gsmiller] again independently. So it looks like there is 
> interest in this. Here is an email thread where this was recently discussed: 
> https://lists.apache.org/thread.html/re3b20a486c9a4e66b2ca4a2646e2d3be48535a90cdd95911a8445183%40%3Cdev.lucene.apache.org%3E.
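
The two heuristics floated in the description could look roughly like the following, assuming the field-level statistics are already available as plain numbers. This is only a sketch: `estimateCost` and its parameters are hypothetical names for illustration, not Lucene API, and rounding the average postings length up is an arbitrary choice made here.

```java
final class TermInSetCostSketch {

  /**
   * Estimates how many documents a query over queryTermCount terms could match,
   * using only field-level statistics and no terms-dictionary lookups.
   *
   * @param termCount number of unique terms in the field (hypothetical input)
   * @param sumDocFreq total number of postings in the field (hypothetical input)
   * @param queryTermCount number of terms in the query
   */
  static long estimateCost(long termCount, long sumDocFreq, long queryTermCount) {
    if (termCount == 0) {
      return 0; // empty field: nothing can match
    }
    if (termCount == sumDocFreq) {
      // Primary-key-like field: every term has exactly one posting,
      // so the query matches at most one document per query term.
      return queryTermCount;
    }
    // Otherwise scale by the average postings length, rounded up.
    // This can be far off when a few terms are much more frequent than average.
    long avgPostingsLength = (sumDocFreq + termCount - 1) / termCount;
    return queryTermCount * avgPostingsLength;
  }

  public static void main(String[] args) {
    System.out.println(estimateCost(1_000, 1_000, 50)); // primary-key case
    System.out.println(estimateCost(1_000, 20_000, 50)); // average-length case
  }
}
```

With the inputs in `main`, the first call returns 50 (one document per term) and the second returns 1000 (50 terms times an average of 20 postings).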



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Commented] (LUCENE-10207) Make TermInSetQuery usable with IndexOrDocValuesQuery

2021-10-27 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17435109#comment-17435109
 ] 

Robert Muir commented on LUCENE-10207:
--

I attached a patch that refactors {{TermInSetQuery}} to extend 
{{MultiTermQuery}}.

Instead of {{seekExact}}'ing to e.g. thousands of terms like the current query, 
it acts more like AutomatonQuery: ping-pong intersects the {{PrefixCodedTerms}} 
against the terms dictionary.

With the change, if you want it to run against DV instead of terms/postings, you 
can just call {{termInSetQuery.setRewriteMethod(new DocValuesRewriteMethod())}} 
and it should work to provide a "slow" implementation.
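
The ping-pong intersection mentioned above can be illustrated outside Lucene with two sorted sets. This is a sketch of the general technique only, not the attached patch: a `TreeSet` stands in for the terms dictionary, a second sorted set stands in for the `PrefixCodedTerms`, and `intersect` is a hypothetical helper name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

final class PingPongIntersectSketch {

  /** Returns the query terms that are present in the dictionary, in sorted order. */
  static List<String> intersect(NavigableSet<String> dictionary, NavigableSet<String> queryTerms) {
    List<String> matches = new ArrayList<>();
    String target = queryTerms.isEmpty() ? null : queryTerms.first();
    while (target != null) {
      // "Ping": seek the dictionary to the first term >= the current query term.
      String found = dictionary.ceiling(target);
      if (found == null) {
        break; // dictionary exhausted, no further query term can match
      }
      if (found.equals(target)) {
        matches.add(found);
        target = queryTerms.higher(target); // move on to the next query term
      } else {
        // "Pong": skip every query term that sorts before the dictionary position.
        target = queryTerms.ceiling(found);
      }
    }
    return matches;
  }

  public static void main(String[] args) {
    NavigableSet<String> dictionary =
        new TreeSet<>(List.of("apple", "banana", "cherry", "kiwi", "mango"));
    NavigableSet<String> queryTerms =
        new TreeSet<>(List.of("apricot", "banana", "grape", "kiwi", "zebra"));
    System.out.println(intersect(dictionary, queryTerms));
  }
}
```

Each side only ever moves forward, so runs of absent terms on either side are skipped with a single seek rather than one lookup per term.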







[jira] [Commented] (LUCENE-10207) Make TermInSetQuery usable with IndexOrDocValuesQuery

2021-10-27 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17435115#comment-17435115
 ] 

Robert Muir commented on LUCENE-10207:
--

{quote}
Should we try to take deletions into account at all?  Because a PK field with 
deletions will look like it is not "precisely" PK based on the aggregate stats. 
 Though I suppose even with e.g. 50% deletions in the index, this proposed cost 
metric is close enough.
{quote}

Deletions are irrelevant: term statistics don't reflect deletions. If the same 
term is in segmentM (with its doc deleted) and is also in segmentN (with 
the updated doc), it causes no issue for the proposed estimation here because 
the stats are per-segment.
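
A toy model of this argument, assuming each segment's term statistics can be represented as a map from term to doc frequency. The names `looksLikePrimaryKey`, `segmentM`, and `segmentN` are hypothetical and nothing here is Lucene API; the point is only that the check is evaluated one segment at a time.

```java
import java.util.Map;

final class PerSegmentStatsSketch {

  /** True when every term in the segment has exactly one posting (terms.size() == sumDocFreq). */
  static boolean looksLikePrimaryKey(Map<String, Integer> docFreqByTerm) {
    long sumDocFreq = 0;
    for (int docFreq : docFreqByTerm.values()) {
      sumDocFreq += docFreq;
    }
    return docFreqByTerm.size() == sumDocFreq;
  }

  public static void main(String[] args) {
    // Document "id1" was updated: its old copy in segmentM is deleted but is still
    // counted in that segment's statistics; the updated copy lives in segmentN.
    Map<String, Integer> segmentM = Map.of("id1", 1, "id2", 1);
    Map<String, Integer> segmentN = Map.of("id1", 1, "id3", 1);

    // The check never combines segments, so the repeated "id1" is invisible to it.
    System.out.println(looksLikePrimaryKey(segmentM));
    System.out.println(looksLikePrimaryKey(segmentN));
  }
}
```

Both segments pass the check on their own, even though "id1" appears in each of them.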







[jira] [Commented] (LUCENE-10207) Make TermInSetQuery usable with IndexOrDocValuesQuery

2021-10-27 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17435116#comment-17435116
 ] 

Robert Muir commented on LUCENE-10207:
--

cc [~uschindler] if you get a chance to look at the MultiTermQuery patch. It 
has been a long time since I tried to subclass FilteredTermsEnum. The assert 
statements in this subclass were extremely helpful.







[GitHub] [lucene] mayya-sharipova edited a comment on pull request #416: LUCENE-10054 Make HnswGraph hierarchical

2021-10-27 Thread GitBox


mayya-sharipova edited a comment on pull request #416:
URL: https://github.com/apache/lucene/pull/416#issuecomment-953142944


   Benchmarking based on @jtibshirani 
[setup](https://github.com/jtibshirani/lucene/pull/1)
   
   baseline: main branch
   candidate: this PR
   
   **glove-25-angular**
   |              | baseline recall | baseline QPS | candidate recall | candidate QPS |
   | ------------ | --------------: | -----------: | ---------------: | ------------: |
   | n_cands=10   |           0.626 |    10962.821 |            0.631 |      8869.807 |
   | n_cands=50   |           0.888 |     4409.952 |            0.889 |      4111.685 |
   | n_cands=100  |           0.946 |     2621.846 |            0.947 |      2734.787 |
   | n_cands=500  |           0.994 |      661.253 |            0.994 |       686.700 |
   | n_cands=800  |           0.997 |      430.172 |            0.997 |       459.356 |
   | n_cands=1000 |           0.998 |      342.915 |            0.998 |       355.238 |
   
   
   **glove-200-angular**
   |              | baseline recall | baseline QPS | candidate recall | candidate QPS |
   | ------------ | --------------: | -----------: | ---------------: | ------------: |
   | n_cands=10   |           0.285 |     4843.028 |            0.312 |      5208.453 |
   | n_cands=50   |           0.556 |     2119.933 |            0.558 |      2250.213 |
   | n_cands=100  |           0.655 |     1399.261 |            0.648 |      1454.996 |
   | n_cands=500  |           0.806 |      379.745 |            0.806 |       410.553 |
   | n_cands=800  |           0.836 |      252.796 |            0.836 |       276.456 |
   | n_cands=1000 |           0.849 |      201.012 |            0.849 |       220.739 |
   
   
   
   
   **sift-128-euclidean**
   
   |              | baseline recall | baseline QPS | candidate recall | candidate QPS |
   | ------------ | --------------: | -----------: | ---------------: | ------------: |
   | n_cands=10   |           0.601 |     6948.736 |            0.607 |      6677.931 |
   | n_cands=50   |           0.889 |     3003.781 |            0.892 |      3202.925 |
   | n_cands=100  |           0.952 |     1622.276 |            0.953 |      1996.992 |
   | n_cands=500  |           0.996 |      444.135 |            0.996 |       540.368 |
   | n_cands=800  |           0.998 |      296.835 |            0.998 |       367.316 |
   | n_cands=1000 |           0.999 |      245.498 |            0.999 |       311.339 |
   
   
   As the comparison shows, the hierarchy brings only a very slight change: a 
small increase in recall, at the expense of lower QPS in some cases (for example 
n_cands=10 on glove-25-angular and sift-128-euclidean).





[GitHub] [lucene] apanimesh061 edited a comment on pull request #412: LUCENE-10197: UnifiedHighlighter should use builders for thread-safety

2021-10-27 Thread GitBox


apanimesh061 edited a comment on pull request #412:
URL: https://github.com/apache/lucene/pull/412#issuecomment-95323


   > It looks good at a glance... you/I will see better if you update one of 
the clients that might want to subclass with extra configuration. Is there any 
or is this builder subclassing issue entirely hypothetical at this point? I 
suspect only hypothetical. We'll want nice Javadocs on the builder setters 
since this is where consumers/clients will see it. We can merely move the docs 
there from the existing locations, and add javadoc references pointing to the 
builder from the existing fields/enum values as desired.
   
   @dsmiley Great. I added a unit test (just for demo) and a class 
`SubUnifiedHighlighter` in `TestUnifiedHighlighter.java` where I've added a new 
test field and also tested it. It does look right to me since it is able to use 
the new field and also fields from parent class.
   
   I can add some unit tests to test the new builder, then I can focus on 
modifying the javadocs in order to introduce the builder.
   





[GitHub] [lucene-solr] noblepaul merged pull request #2596: SOLR-15722: Delete Replica does not delete the Per replica state

2021-10-27 Thread GitBox


noblepaul merged pull request #2596:
URL: https://github.com/apache/lucene-solr/pull/2596


   





[jira] [Commented] (LUCENE-10207) Make TermInSetQuery usable with IndexOrDocValuesQuery

2021-10-27 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17435190#comment-17435190
 ] 

Adrien Grand commented on LUCENE-10207:
---

I have vague memories of playing with the MultiTermQuery approach in the past 
and it wasn't an obvious win due to the fact that seekExact could return false 
by just looking at the terms index while the MultiTermQuery approach would 
always advance to the next term after the target, which would in turn always 
decode a frame of the terms dictionary. (It's been a very long time though, so 
I might be misremembering, or maybe other changes have been made since then so 
that this is no longer a problem.)



