[GitHub] [lucene-solr] jpountz commented on a change in pull request #1912: LUCENE-9535: Try to do larger flushes.

2020-11-19 Thread GitBox


jpountz commented on a change in pull request #1912:
URL: https://github.com/apache/lucene-solr/pull/1912#discussion_r526737612



##
File path: 
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThreadPool.java
##
@@ -112,19 +110,12 @@ private synchronized DocumentsWriterPerThread newWriter() {
   DocumentsWriterPerThread getAndLock() {
     synchronized (this) {
       ensureOpen();
-      // Important that we are LIFO here! This way if number of concurrent indexing threads was once high,
-      // but has now reduced, we only use a limited number of DWPTs. This also guarantees that if we have suddenly
-      // a single thread indexing
-      final Iterator<DocumentsWriterPerThread> descendingIterator = freeList.descendingIterator();
-      while (descendingIterator.hasNext()) {
-        DocumentsWriterPerThread perThread = descendingIterator.next();
-        if (perThread.tryLock()) {
-          descendingIterator.remove();
-          return perThread;
-        }
+      DocumentsWriterPerThread dwpt = freeList.poll(DocumentsWriterPerThread::tryLock);
+      if (dwpt == null) {
+        // DWPT is already locked before return by this method:

Review comment:
   > making me think the "allocate a new DWPT" case has something to do 
with the locking semantics.
   
   Hmm, this is exactly what my understanding is. :) To me the comment was 
about highlighting that `newWriter()` implicitly takes the lock on the DWPT it 
creates.
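The pattern under discussion, polling a free list with a predicate (tryLock) and falling back to a freshly created writer that is already locked, can be sketched roughly as follows. Class and method names are illustrative, not Lucene's actual DocumentsWriterPerThreadPool:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Predicate;

// Rough sketch of the pooling pattern discussed above (illustrative names).
// getAndLock() upholds the invariant the review comment highlights: the
// returned writer is always locked, whether it came from the free list via
// tryLock or was freshly created (like newWriter(), which creates it locked).
class WriterPool {
  static class Writer extends ReentrantLock {}

  private final Deque<Writer> freeList = new ArrayDeque<>();

  // Remove and return the first free writer the predicate accepts, or null.
  synchronized Writer poll(Predicate<Writer> predicate) {
    for (Writer w : freeList) {
      if (predicate.test(w)) {
        freeList.remove(w);
        return w;
      }
    }
    return null;
  }

  Writer getAndLock() {
    Writer w = poll(Writer::tryLock);
    if (w == null) {
      w = new Writer();
      w.lock(); // new writers are handed out already locked
    }
    return w;
  }

  synchronized void release(Writer w) {
    w.unlock();
    freeList.addFirst(w); // LIFO keeps the set of active writers small
  }
}
```

The LIFO release mirrors the comment deleted in the diff: if concurrency drops, only a small prefix of the free list keeps getting reused.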





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] jpountz commented on pull request #1912: LUCENE-9535: Try to do larger flushes.

2020-11-19 Thread GitBox


jpountz commented on pull request #1912:
URL: https://github.com/apache/lucene-solr/pull/1912#issuecomment-730272727


   I'm planning to merge this change to see how it plays with nightly 
benchmarks, especially now that it moved to a ThreadRipper 3990X. I'll revert 
if it makes things slower.






[GitHub] [lucene-solr] dweiss commented on pull request #1912: LUCENE-9535: Try to do larger flushes.

2020-11-19 Thread GitBox


dweiss commented on pull request #1912:
URL: https://github.com/apache/lucene-solr/pull/1912#issuecomment-730275557


   bq. especially now that it moved to a ThreadRipper 3990X. I'll revert if it 
makes things slower.
   
   Who's 'it'? :)
   
   I've been playing with TR 3970X and I can cause internal JVM warnings on GC 
not being able to catch up while all the threads are busy... it's fun to watch.






[GitHub] [lucene-solr] jpountz edited a comment on pull request #1912: LUCENE-9535: Try to do larger flushes.

2020-11-19 Thread GitBox


jpountz edited a comment on pull request #1912:
URL: https://github.com/apache/lucene-solr/pull/1912#issuecomment-730272727


   I'm planning to merge this change to see how it plays with nightly 
benchmarks, especially now that ~~it~~ they moved to a ThreadRipper 3990X. I'll 
revert if it makes things slower.






[GitHub] [lucene-solr] jpountz commented on pull request #1912: LUCENE-9535: Try to do larger flushes.

2020-11-19 Thread GitBox


jpountz commented on pull request #1912:
URL: https://github.com/apache/lucene-solr/pull/1912#issuecomment-730276978


   Whoops, I meant the nightly benchmarks; I've edited my message above.






[jira] [Created] (SOLR-15008) Avoid building OrdinalMap for each facet

2020-11-19 Thread Radu Gheorghe (Jira)
Radu Gheorghe created SOLR-15008:


 Summary: Avoid building OrdinalMap for each facet
 Key: SOLR-15008
 URL: https://issues.apache.org/jira/browse/SOLR-15008
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
  Components: Facet Module
Affects Versions: 8.7
Reporter: Radu Gheorghe
 Attachments: Screenshot 2020-11-19 at 12.01.55.png

I'm running against the following scenario:
 * [JSON] faceting on a high cardinality field
 * few matching documents => few unique values

Yet the query almost always takes a long time. Here's an example taking almost 
4s for ~300 documents and unique values (edited a bit):

 
{code:java}
"QTime":3869,
"params":{
  "json":"{\"query\": \"*:*\",
  \"filter\": [\"type:test_type\", \"date:[1603670360 TO 1604361599]\", \"unique_id:49866\"]
  \"facet\": {\"keywords\":{\"type\":\"terms\",\"field\":\"keywords\",\"limit\":20,\"mincount\":20}}}",
  "rows":"0"}},

"response":{"numFound":333,"start":0,"maxScore":1.0,"numFoundExact":true,"docs":[]
  },
  "facets":{
    "count":333,
    "keywords":{
      "buckets":[{
          "val":"value1",
          "count":124},
          ...
{code}
I did some [profiling with our Sematext Monitoring|https://sematext.com/docs/monitoring/on-demand-profiling/] and it points me to OrdinalMap building. If I read the code right, an OrdinalMap is built with every facet. And it's expensive, since there are many unique values in the shard (previously there were more, smaller shards, which kept latency lower, but that approach doesn't scale for this particular use-case).

If I'm right up to this point, I see a couple of potential improvements, [inspired by Elasticsearch|https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-execution-hint]:
 # Keep the OrdinalMap cached until the next softCommit, so that only the first 
query takes the penalty
 # Allow faceting on actual values (a Map) rather than ordinals, for situations 
like the one above where we have few matching documents. We could potentially 
auto-detect this scenario (e.g. by configuring a threshold) and use a Map when 
there are few documents

I'm curious about what you're thinking:
 * would a PR/patch be welcome for any of the two ideas above?
 * do you see better options? am I missing something?
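
Idea #1 above (build once, reuse until the searcher is replaced) can be sketched as a small per-searcher cache. This is purely illustrative; the names are made up and Solr's real doc-values caching lives elsewhere:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Illustrative sketch: build the expensive per-field structure (standing in
// for an OrdinalMap) at most once per searcher, and drop the whole cache when
// the searcher is replaced, e.g. after a softCommit.
class PerSearcherCache<V> {
  private final Map<String, V> byField = new ConcurrentHashMap<>();

  // Only the first facet on a field pays the build cost.
  V getOrBuild(String field, Supplier<V> expensiveBuild) {
    return byField.computeIfAbsent(field, f -> expensiveBuild.get());
  }

  // Invoked when a new searcher is opened, invalidating cached ordinals.
  void onNewSearcher() {
    byField.clear();
  }
}
```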

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Updated] (SOLR-15008) Avoid building OrdinalMap for each facet

2020-11-19 Thread Radu Gheorghe (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-15008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Radu Gheorghe updated SOLR-15008:
-
Description: 
I'm running against the following scenario:
 * [JSON] faceting on a high cardinality field
 * few matching documents => few unique values

Yet the query almost always takes a long time. Here's an example taking almost 
4s for ~300 documents and unique values (edited a bit):

 
{code:java}
"QTime":3869,
"params":{
  "json":"{\"query\": \"*:*\",
  \"filter\": [\"type:test_type\", \"date:[1603670360 TO 1604361599]\", 
\"unique_id:49866\"]
  \"facet\": 
{\"keywords\":{\"type\":\"terms\",\"field\":\"keywords\",\"limit\":20,\"mincount\":20}}}",
  "rows":"0"}},
  
"response":{"numFound":333,"start":0,"maxScore":1.0,"numFoundExact":true,"docs":[]
  },
  "facets":{
"count":333,
"keywords":{
  "buckets":[{
  "val":"value1",
  "count":124},
  ...
{code}
I did some [profiling with our Sematext Monitoring|https://sematext.com/docs/monitoring/on-demand-profiling/] and it points me to OrdinalMap building (see attached screenshot). If I read the code right, an OrdinalMap is built with every facet. And it's expensive, since there are many unique values in the shard (previously there were more, smaller shards, which kept latency lower, but that approach doesn't scale for this particular use-case).

If I'm right up to this point, I see a couple of potential improvements, [inspired by Elasticsearch|https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-execution-hint]:
 # Keep the OrdinalMap cached until the next softCommit, so that only the first 
query takes the penalty
 # Allow faceting on actual values (a Map) rather than ordinals, for situations 
like the one above where we have few matching documents. We could potentially 
auto-detect this scenario (e.g. by configuring a threshold) and use a Map when 
there are few documents

I'm curious about what you're thinking:
 * would a PR/patch be welcome for any of the two ideas above?
 * do you see better options? am I missing something?

 

  was:
I'm running against the following scenario:
 * [JSON] faceting on a high cardinality field
 * few matching documents => few unique values

Yet the query almost always takes a long time. Here's an example taking almost 
4s for ~300 documents and unique values (edited a bit):

 
{code:java}
"QTime":3869,
"params":{
  "json":"{\"query\": \"*:*\",
  \"filter\": [\"type:test_type\", \"date:[1603670360 TO 1604361599]\", 
\"unique_id:49866\"]
  \"facet\": 
{\"keywords\":{\"type\":\"terms\",\"field\":\"keywords\",\"limit\":20,\"mincount\":20}}}",
  "rows":"0"}},
  
"response":{"numFound":333,"start":0,"maxScore":1.0,"numFoundExact":true,"docs":[]
  },
  "facets":{
"count":333,
"keywords":{
  "buckets":[{
  "val":"value1",
  "count":124},
  ...
{code}
I did some [profiling with our Sematext Monitoring|https://sematext.com/docs/monitoring/on-demand-profiling/] and it points me to OrdinalMap building (see attached screenshot). If I read the code right, an OrdinalMap is built with every facet. And it's expensive, since there are many unique values in the shard (previously there were more, smaller shards, which kept latency lower, but that approach doesn't scale for this particular use-case).

If I'm right up to this point, I see a couple of potential improvements, [inspired by Elasticsearch|https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-execution-hint]:
 # Keep the OrdinalMap cached until the next softCommit, so that only the first 
query takes the penalty
 # Allow faceting on actual values (a Map) rather than ordinals, for situations 
like the one above where we have few matching documents. We could potentially 
auto-detect this scenario (e.g. by configuring a threshold) and use a Map when 
there are few documents

I'm curious about what you're thinking:
 * would a PR/patch be welcome for any of the two ideas above?
 * do you see better options? am I missing something?

 


> Avoid building OrdinalMap for each facet
> 
>
> Key: SOLR-15008
> URL: https://issues.apache.org/jira/browse/SOLR-15008
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module
>Affects Versions: 8.7
>Reporter: Radu Gheorghe
>Priority: Major
>  Labels: performance
> Attachments: Screenshot 2020-11-19 at 12.01.55.png
>
>
> I'm running against the following scenario:
>  * [JSON] faceting on a high cardinality field
>  * few matching documents => few unique values
> Yet the query almost always takes a long time. Here's an example taking 
> almost 4s for ~300 documents and unique values (edited a bit):
>  
> {code:java}
> "QTime"

[jira] [Updated] (SOLR-15008) Avoid building OrdinalMap for each facet

2020-11-19 Thread Radu Gheorghe (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-15008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Radu Gheorghe updated SOLR-15008:
-
Description: 
I'm running against the following scenario:
 * [JSON] faceting on a high cardinality field
 * few matching documents => few unique values

Yet the query almost always takes a long time. Here's an example taking almost 
4s for ~300 documents and unique values (edited a bit):

 
{code:java}
"QTime":3869,
"params":{
  "json":"{\"query\": \"*:*\",
  \"filter\": [\"type:test_type\", \"date:[1603670360 TO 1604361599]\", 
\"unique_id:49866\"]
  \"facet\": 
{\"keywords\":{\"type\":\"terms\",\"field\":\"keywords\",\"limit\":20,\"mincount\":20}}}",
  "rows":"0"}},
  
"response":{"numFound":333,"start":0,"maxScore":1.0,"numFoundExact":true,"docs":[]
  },
  "facets":{
"count":333,
"keywords":{
  "buckets":[{
  "val":"value1",
  "count":124},
  ...
{code}
I did some [profiling with our Sematext Monitoring|https://sematext.com/docs/monitoring/on-demand-profiling/] and it points me to OrdinalMap building (see attached screenshot). If I read the code right, an OrdinalMap is built with every facet. And it's expensive, since there are many unique values in the shard (previously there were more, smaller shards, which kept latency lower, but that approach doesn't scale for this particular use-case).

If I'm right up to this point, I see a couple of potential improvements, [inspired by Elasticsearch|https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-execution-hint]:
 # Keep the OrdinalMap cached until the next softCommit, so that only the first 
query takes the penalty
 # Allow faceting on actual values (a Map) rather than ordinals, for situations 
like the one above where we have few matching documents. We could potentially 
auto-detect this scenario (e.g. by configuring a threshold) and use a Map when 
there are few documents

I'm curious about what you're thinking:
 * would a PR/patch be welcome for any of the two ideas above?
 * do you see better options? am I missing something?

 

  was:
I'm running against the following scenario:
 * [JSON] faceting on a high cardinality field
 * few matching documents => few unique values

Yet the query almost always takes a long time. Here's an example taking almost 
4s for ~300 documents and unique values (edited a bit):

 
{code:java}
"QTime":3869,
"params":{
  "json":"{\"query\": \"*:*\",
  \"filter\": [\"type:test_type\", \"date:[1603670360 TO 1604361599]\", 
\"unique_id:49866\"]
  \"facet\": 
{\"keywords\":{\"type\":\"terms\",\"field\":\"keywords\",\"limit\":20,\"mincount\":20}}}",
  "rows":"0"}},
  
"response":{"numFound":333,"start":0,"maxScore":1.0,"numFoundExact":true,"docs":[]
  },
  "facets":{
"count":333,
"keywords":{
  "buckets":[{
  "val":"value1",
  "count":124},
  ...
{code}
I did some [profiling with our Sematext Monitoring|https://sematext.com/docs/monitoring/on-demand-profiling/] and it points me to OrdinalMap building. If I read the code right, an OrdinalMap is built with every facet. And it's expensive, since there are many unique values in the shard (previously there were more, smaller shards, which kept latency lower, but that approach doesn't scale for this particular use-case).

If I'm right up to this point, I see a couple of potential improvements, [inspired by Elasticsearch|https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-execution-hint]:
 # Keep the OrdinalMap cached until the next softCommit, so that only the first 
query takes the penalty
 # Allow faceting on actual values (a Map) rather than ordinals, for situations 
like the one above where we have few matching documents. We could potentially 
auto-detect this scenario (e.g. by configuring a threshold) and use a Map when 
there are few documents

I'm curious about what you're thinking:
 * would a PR/patch be welcome for any of the two ideas above?
 * do you see better options? am I missing something?

 


> Avoid building OrdinalMap for each facet
> 
>
> Key: SOLR-15008
> URL: https://issues.apache.org/jira/browse/SOLR-15008
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module
>Affects Versions: 8.7
>Reporter: Radu Gheorghe
>Priority: Major
>  Labels: performance
> Attachments: Screenshot 2020-11-19 at 12.01.55.png
>
>
> I'm running against the following scenario:
>  * [JSON] faceting on a high cardinality field
>  * few matching documents => few unique values
> Yet the query almost always takes a long time. Here's an example taking 
> almost

[jira] [Updated] (SOLR-15008) Avoid building OrdinalMap for each facet

2020-11-19 Thread Radu Gheorghe (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-15008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Radu Gheorghe updated SOLR-15008:
-
Description: 
I'm running against the following scenario:
 * [JSON] faceting on a high cardinality field
 * few matching documents => few unique values

Yet the query almost always takes a long time. Here's an example taking almost 
4s for ~300 documents and unique values (edited a bit):

 
{code:java}
"QTime":3869,
"params":{
  "json":"{\"query\": \"*:*\",
  \"filter\": [\"type:test_type\", \"date:[1603670360 TO 1604361599]\", 
\"unique_id:49866\"]
  \"facet\": 
{\"keywords\":{\"type\":\"terms\",\"field\":\"keywords\",\"limit\":20,\"mincount\":20}}}",
  "rows":"0"}},
  
"response":{"numFound":333,"start":0,"maxScore":1.0,"numFoundExact":true,"docs":[]
  },
  "facets":{
"count":333,
"keywords":{
  "buckets":[{
  "val":"value1",
  "count":124},
  ...
{code}
I did some [profiling with our Sematext Monitoring|https://sematext.com/docs/monitoring/on-demand-profiling/] and it points me to OrdinalMap building (see attached screenshot). If I read the code right, an OrdinalMap is built with every facet. And it's expensive, since there are many unique values in the shard (previously there were more, smaller shards, which kept latency lower, but that approach doesn't scale for this particular use-case).

If I'm right up to this point, I see a couple of potential improvements, [inspired by Elasticsearch|https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-execution-hint]:
 # *Keep the OrdinalMap cached until the next softCommit*, so that only the 
first query takes the penalty
 # *Allow faceting on actual values (a Map) rather than ordinals*, for 
situations like the one above where we have few matching documents. We could 
potentially auto-detect this scenario (e.g. by configuring a threshold) and use 
a Map when there are few documents

I'm curious about what you're thinking:
 * would a PR/patch be welcome for any of the two ideas above?
 * do you see better options? am I missing something?

 

  was:
I'm running against the following scenario:
 * [JSON] faceting on a high cardinality field
 * few matching documents => few unique values

Yet the query almost always takes a long time. Here's an example taking almost 
4s for ~300 documents and unique values (edited a bit):

 
{code:java}
"QTime":3869,
"params":{
  "json":"{\"query\": \"*:*\",
  \"filter\": [\"type:test_type\", \"date:[1603670360 TO 1604361599]\", 
\"unique_id:49866\"]
  \"facet\": 
{\"keywords\":{\"type\":\"terms\",\"field\":\"keywords\",\"limit\":20,\"mincount\":20}}}",
  "rows":"0"}},
  
"response":{"numFound":333,"start":0,"maxScore":1.0,"numFoundExact":true,"docs":[]
  },
  "facets":{
"count":333,
"keywords":{
  "buckets":[{
  "val":"value1",
  "count":124},
  ...
{code}
I did some [profiling with our Sematext Monitoring|https://sematext.com/docs/monitoring/on-demand-profiling/] and it points me to OrdinalMap building (see attached screenshot). If I read the code right, an OrdinalMap is built with every facet. And it's expensive, since there are many unique values in the shard (previously there were more, smaller shards, which kept latency lower, but that approach doesn't scale for this particular use-case).

If I'm right up to this point, I see a couple of potential improvements, [inspired by Elasticsearch|https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-execution-hint]:
 # Keep the OrdinalMap cached until the next softCommit, so that only the first 
query takes the penalty
 # Allow faceting on actual values (a Map) rather than ordinals, for situations 
like the one above where we have few matching documents. We could potentially 
auto-detect this scenario (e.g. by configuring a threshold) and use a Map when 
there are few documents

I'm curious about what you're thinking:
 * would a PR/patch be welcome for any of the two ideas above?
 * do you see better options? am I missing something?

 


> Avoid building OrdinalMap for each facet
> 
>
> Key: SOLR-15008
> URL: https://issues.apache.org/jira/browse/SOLR-15008
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module
>Affects Versions: 8.7
>Reporter: Radu Gheorghe
>Priority: Major
>  Labels: performance
> Attachments: Screenshot 2020-11-19 at 12.01.55.png
>
>
> I'm running against the following scenario:
>  * [JSON] faceting on a high cardinality field
>  * few matching documents => few unique values
> Yet the query almost always takes a long time. Here's an example taking 
> almost 4s for ~300 documents and unique values (edited a bit):
>  
> {code:java}
> "QTi

[jira] [Commented] (LUCENE-9431) UnifiedHighlighter: Make WEIGHT_MATCHES the default

2020-11-19 Thread Yury Hohin (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235360#comment-17235360
 ] 

Yury Hohin commented on LUCENE-9431:


Hi, I'd like to help solve this issue. Could you please assign this task to me?

> UnifiedHighlighter: Make WEIGHT_MATCHES the default
> ---
>
> Key: LUCENE-9431
> URL: https://issues.apache.org/jira/browse/LUCENE-9431
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Reporter: David Smiley
>Priority: Blocker
> Fix For: master (9.0)
>
>
> This mode uses Lucene's modern mechanism of exposing information that 
> previously required complicated highlighting machinery.  It's also likely to 
> generally work better out-of-the-box and with custom queries.






[jira] [Commented] (LUCENE-9431) UnifiedHighlighter: Make WEIGHT_MATCHES the default

2020-11-19 Thread Erick Erickson (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235385#comment-17235385
 ] 

Erick Erickson commented on LUCENE-9431:


Yury:

The Jira system only allows assigning issues to committers; telling us that you're 
working on it is enough.

When you're ready, create a pull request (preferred) or attach a patch, 
whichever you're more comfortable with.

Then, assuming all is well, a committer can pick it up and push it to the repo. 
You may have to nudge us a bit if it languishes...

And thanks!

> UnifiedHighlighter: Make WEIGHT_MATCHES the default
> ---
>
> Key: LUCENE-9431
> URL: https://issues.apache.org/jira/browse/LUCENE-9431
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Reporter: David Smiley
>Priority: Blocker
> Fix For: master (9.0)
>
>
> This mode uses Lucene's modern mechanism of exposing information that 
> previously required complicated highlighting machinery.  It's also likely to 
> generally work better out-of-the-box and with custom queries.






[jira] [Commented] (LUCENE-9614) Implement KNN Query

2020-11-19 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235437#comment-17235437
 ] 

Adrien Grand commented on LUCENE-9614:
--

I wonder if we should use the Query API at all for nearest-neighbor search. 
Today the Query API assumes that you can figure out whether a document matches 
in isolation, regardless of other matches in the index/segment. Maybe we should 
have a new top-level API on IndexSearcher, something like 
`IndexSearcher#nearestNeighbors(String field, float[] target)`, which we could 
later expand into `IndexSearcher#nearestNeighbors(String field, float[] target, 
Query filter)` to add support for filtering?
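
As a toy illustration of what such a top-level operation computes, here is a brute-force scan standing in for Lucene's graph-based search. The names are speculative, not Lucene's actual API; the point is that, unlike a Query, the result for one document depends on all the other documents:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Toy illustration of a top-level nearestNeighbors operation: given a target
// vector, return the k nearest documents by squared Euclidean distance.
// A brute-force scan stands in for the real HNSW-backed implementation.
class BruteForceKnn {
  private final List<float[]> vectors = new ArrayList<>(); // index = doc id

  void add(float[] vector) {
    vectors.add(vector);
  }

  // A document "matches" only relative to its competitors, which is why this
  // fits awkwardly into an isolation-based Query API.
  int[] nearestNeighbors(float[] target, int k) {
    Integer[] docs = new Integer[vectors.size()];
    for (int i = 0; i < docs.length; i++) docs[i] = i;
    Arrays.sort(docs,
        Comparator.comparingDouble((Integer d) -> squaredDistance(vectors.get(d), target)));
    int[] top = new int[Math.min(k, docs.length)];
    for (int i = 0; i < top.length; i++) top[i] = docs[i];
    return top;
  }

  private static double squaredDistance(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff; // squared distance preserves the nearest-first order
    }
    return sum;
  }
}
```

A `Query filter` variant would simply restrict the candidate doc ids before the sort.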

> Implement KNN Query
> ---
>
> Key: LUCENE-9614
> URL: https://issues.apache.org/jira/browse/LUCENE-9614
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
>
> Now we have a vector index format, and one vector indexing/KNN search 
> implementation, but the interface is low-level: you can search across a 
> single segment only. We would like to expose a Query implementation. 
> Initially, we want to support a usage where the KnnVectorQuery selects the 
> k-nearest neighbors without regard to any other constraints, and these can 
> then be filtered as part of an enclosing Boolean or other query.
> Later we will want to explore some kind of filtering *while* performing 
> vector search, or a re-entrant search process that can yield further results. 
> Because of the nature of knn search (all documents having any vector value 
> match), it is more like a ranking than a filtering operation, and it doesn't 
> really make sense to provide an iterator interface that can be merged in the 
> usual way, in docid order, skipping ahead. It's not yet clear how to satisfy 
> a query that is "k nearest neighbors satsifying some arbitrary Query", at 
> least not without realizing a complete bitset for the Query. But this is for 
> a later issue; *this* issue is just about performing the knn search in 
> isolation, computing a set of (some given) K nearest neighbors, and providing 
> an iterator over those.






[GitHub] [lucene-solr] msfroh commented on pull request #2088: LUCENE-9617: Reset lowestUnassignedFieldNumber in FieldNumbers.clear()

2020-11-19 Thread GitBox


msfroh commented on pull request #2088:
URL: https://github.com/apache/lucene-solr/pull/2088#issuecomment-730361695


   > I'm suspicious that this is safe to do. What if another thread is calling 
addDocument at the same time?
   
   As long as `FieldNumbers.clear()` is only called from 
`IndexWriter.deleteAll()`, my understanding is that the safety is provided by 
the `try (Closeable finalizer = docWriter.lockAndAbortAll()) {` block, which (I 
think) guarantees that any concurrent indexing is blocked until the lock is 
released.
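
The guarantee described here, holding a lock for the duration of a try-with-resources block via a Closeable, can be sketched as follows. The names are illustrative, not IndexWriter's real implementation:

```java
import java.io.Closeable;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the lockAndAbortAll() pattern: the method returns a Closeable
// whose close() releases the lock, so a try-with-resources block keeps
// concurrent indexing blocked for the whole critical section.
class IndexingGate {
  // Closeable whose close() does not throw, for convenient try-with-resources.
  interface Release extends Closeable {
    @Override
    void close();
  }

  private final ReentrantLock lock = new ReentrantLock();

  Release lockAndAbortAll() {
    lock.lock();
    return lock::unlock; // released exactly when the try block exits
  }

  boolean isLocked() {
    return lock.isLocked();
  }
}
```

Any code that can only run inside the try block (like FieldNumbers.clear() in deleteAll) therefore cannot race with indexing threads that respect the same lock.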






[GitHub] [lucene-solr] dsmiley commented on a change in pull request #2088: LUCENE-9617: Reset lowestUnassignedFieldNumber in FieldNumbers.clear()

2020-11-19 Thread GitBox


dsmiley commented on a change in pull request #2088:
URL: https://github.com/apache/lucene-solr/pull/2088#discussion_r526893867



##
File path: lucene/CHANGES.txt
##
@@ -184,6 +184,9 @@ Bug fixes
 * LUCENE-9365: FuzzyQuery was missing matches when prefix length was equal to the term length
   (Mark Harwood, Mike Drob)
 
+* LUCENE-9617: Reset lowestUnassignedFieldNumber on FieldNumbers.clear(), to avoid leaking

Review comment:
   Can you rewrite in terms of what a user might understand?  e.g.
   `IndexWriter.deleteAll now resets internal field numbers; prevents 
ever-increasing numbers in unusual use-cases`
   The latter part of what you wrote isn't bad but the first part is technical 
mumbo-jumbo that only Lucene deep divers would even recognize.

##
File path: lucene/core/src/test/org/apache/lucene/index/TestFieldInfos.java
##
@@ -187,4 +187,23 @@ public void testMergedFieldInfos_singleLeaf() throws IOException {
     writer.close();
     dir.close();
   }
+
+  public void testFieldNumbersAutoIncrement() {
+    FieldInfos.FieldNumbers fieldNumbers = new FieldInfos.FieldNumbers("softDeletes");
+    for (int i = 0; i < 10; i++) {
+      fieldNumbers.addOrGet("field" + i, -1, IndexOptions.NONE, DocValuesType.NONE,
+          0, 0, 0, 0, VectorValues.SearchStrategy.NONE, false);
+    }
+    int idx = fieldNumbers.addOrGet("EleventhField", -1, IndexOptions.NONE, DocValuesType.NONE,
+        0, 0, 0, 0, VectorValues.SearchStrategy.NONE, false);
+    assertEquals("Field numbers 0 through 9 were allocated", 10, idx);
+
+    fieldNumbers.clear();

Review comment:
   My only problem with unit tests like this is that they don't test what 
we _really_ want to know: that when IW.deleteAll() is called (a user-level 
operation, which fieldNumbers.clear() is not), the field numbers get reset.








[GitHub] [lucene-solr] rmuir commented on pull request #2088: LUCENE-9617: Reset lowestUnassignedFieldNumber in FieldNumbers.clear()

2020-11-19 Thread GitBox


rmuir commented on pull request #2088:
URL: https://github.com/apache/lucene-solr/pull/2088#issuecomment-730387967


   > As long as FieldNumbers.clear() is only called from 
IndexWriter.deleteAll(), my understanding is that the safety is provided by the 
try (Closeable finalizer = docWriter.lockAndAbortAll()) { block, which (I 
think) guarantees that any concurrent indexing is blocked until the lock is 
released.
   
   Thanks, maybe there is a way to improve the testing of `deleteAll` to better 
enforce this? Lucene does not test this method much, but I know some users 
(e.g. Solr) use it often. My concern is some race condition that ultimately 
creates segments with unaligned field numbers. That would be a disaster and 
would definitely result in index corruption (think stored-fields merging, etc., 
which copies binary/compressed data directly).






[jira] [Commented] (SOLR-15008) Avoid building OrdinalMap for each facet

2020-11-19 Thread Michael Gibney (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-15008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235549#comment-17235549
 ] 

Michael Gibney commented on SOLR-15008:
---

Interesting; I'm surprised that profiling indicated {{OrdinalMap}} building, 
since I'm pretty sure the {{OrdinalMap}} instances (as accessed via 
{{FacetFieldProcessorByArrayDV}}) are already cached in the way you're 
suggesting:
# in 
[FacetFieldProcessorByArrayDV.findStartAndEndOrds(...)|https://github.com/apache/lucene-solr/blob/40e2122b5a5b89f446e51692ef0d72e48c7b71e5/solr/core/src/java/org/apache/solr/search/facet/FacetFieldProcessorByArrayDV.java#L60]
# in 
[FieldUtil.getSortedSetDocValues(...)|https://github.com/apache/lucene-solr/blob/1d85cd783863f75cea133fb9c452302214165a4d/solr/core/src/java/org/apache/solr/search/facet/FieldUtil.java#L55]
# in 
[SlowCompositeReaderWrapper.getSortedSetDocValues(...)|https://github.com/apache/lucene-solr/blob/c02f07f2d5db5c983c2eedf71febf9516189595d/solr/core/src/java/org/apache/solr/index/SlowCompositeReaderWrapper.java#L197-L211]

Do you have more information about the total numbers involved (high-cardinality 
field -- specifically how high per core? how many documents overall per core? 
how many cores? does the latency manifest even across a single indexSearcher -- 
i.e., no intervening updates?). A couple of things that might be worth doing in 
the meantime, just as a sanity check:
# disable refinement for the facet field ({{"refinement":"none"}}) -- among 
other things, this would take the {{filterCache}} out of the equation
# if possible, try optimizing each replica to a single segment, which should 
take {{OrdinalMap}} out of the equation (this is of course strictly diagnostic, 
not a "workaround" suggestion).

{quote}Allow faceting on actual values (a Map) rather than ordinals
{quote}
Interesting -- even if {{OrdinalMap}} is already getting cached (as I think it 
is?), this would be one way to avoid the overhead of allocating a 
{{CountSlotArrAcc}} backed by an int array whose size matches the field 
cardinality (which is why I asked more specifically about the cardinality of the 
field involved). I'm not sure how big a problem this is in practice, but I 
imagine a value-Map-based faceting implementation would probably perform better 
for this type of use case ... not 100% sure though, and not sure how _much_ 
better ... (I think {{FacetFieldProcessorByHashDV}} was designed to meet a 
similar use case, but it only works for single-valued fields).
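The caching described above -- build the expensive structure once per reader and reuse it until the cache key changes -- can be sketched generically. `PerReaderCache` and its names are illustrative, not Solr's actual API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Generic once-per-key cache, standing in for the per-reader OrdinalMap cache.
class PerReaderCache<K, V> {
  private final Map<K, V> cache = new ConcurrentHashMap<>();
  final AtomicInteger builds = new AtomicInteger(); // how often we pay the build cost

  V getOrBuild(K readerKey, Function<K, V> expensiveBuilder) {
    return cache.computeIfAbsent(readerKey, k -> {
      builds.incrementAndGet();
      return expensiveBuilder.apply(k);
    });
  }

  public static void main(String[] args) {
    PerReaderCache<String, int[]> cache = new PerReaderCache<>();
    for (int i = 0; i < 5; i++) {
      // Same key (same reader generation) => the builder runs only once.
      cache.getOrBuild("segment-gen-42", k -> new int[1000]);
    }
    System.out.println(cache.builds.get()); // prints 1
  }
}
```

If profiling still shows repeated builds despite such a cache, the key is likely changing between requests, e.g. because a new searcher was opened by intervening updates.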

> Avoid building OrdinalMap for each facet
> 
>
> Key: SOLR-15008
> URL: https://issues.apache.org/jira/browse/SOLR-15008
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module
>Affects Versions: 8.7
>Reporter: Radu Gheorghe
>Priority: Major
>  Labels: performance
> Attachments: Screenshot 2020-11-19 at 12.01.55.png
>
>
> I'm running against the following scenario:
>  * [JSON] faceting on a high cardinality field
>  * few matching documents => few unique values
> Yet the query almost always takes a long time. Here's an example taking 
> almost 4s for ~300 documents and unique values (edited a bit):
>  
> {code:java}
> "QTime":3869,
> "params":{
>   "json":"{\"query\": \"*:*\",
>   \"filter\": [\"type:test_type\", \"date:[1603670360 TO 1604361599]\", 
> \"unique_id:49866\"]
>   \"facet\": 
> {\"keywords\":{\"type\":\"terms\",\"field\":\"keywords\",\"limit\":20,\"mincount\":20}}}",
>   "rows":"0"}},
>   
> "response":{"numFound":333,"start":0,"maxScore":1.0,"numFoundExact":true,"docs":[]
>   },
>   "facets":{
> "count":333,
> "keywords":{
>   "buckets":[{
>   "val":"value1",
>   "count":124},
>   ...
> {code}
> I did some [profiling with our Sematext 
> Monitoring|https://sematext.com/docs/monitoring/on-demand-profiling/] and it 
> points me to OrdinalMap building (see attached screenshot). If I read the 
> code right, an OrdinalMap is built with every facet. And it's expensive since 
> there are many unique values in the shard (previously, there we more smaller 
> shards, making latency better, but this approach doesn't scale for this 
> particular use-case).
> If I'm right up to this point, I see a couple of potential improvements, 
> [inspired from 
> Elasticsearch|#search-aggregations-bucket-terms-aggregation-execution-hint]:
>  # *Keep the OrdinalMap cached until the next softCommit*, so that only the 
> first query takes the penalty
>  # *Allow faceting on actual values (a Map) rather than ordinals*, for 
> situations like the one above where we have few matching documents. We could 
> potentially auto-detect this scenario (e.g. by configuring a threshold) and 
> use a M

[jira] [Commented] (SOLR-14560) Learning To Rank Interleaving

2020-11-19 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235554#comment-17235554
 ] 

ASF subversion and git services commented on SOLR-14560:


Commit 85297846419c626585dd26efe70d6eb031a4b3c9 in lucene-solr's branch 
refs/heads/branch_8x from Christine Poerschke
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=8529784 ]

SOLR-14560: javadocs link tweak in Solr Ref Guide (branch_8x only)


> Learning To Rank Interleaving
> -
>
> Key: SOLR-14560
> URL: https://issues.apache.org/jira/browse/SOLR-14560
> Project: Solr
>  Issue Type: New Feature
>  Components: contrib - LTR
>Affects Versions: 8.5.2
>Reporter: Alessandro Benedetti
>Priority: Minor
> Fix For: master (9.0), 8.8
>
>  Time Spent: 10h 10m
>  Remaining Estimate: 0h
>
> Interleaving is an approach to Online Search Quality evaluation that can be 
> very useful for Learning To Rank models:
> [https://sease.io/2020/05/online-testing-for-learning-to-rank-interleaving.html|https://sease.io/2020/05/online-testing-for-learning-to-rank-interleaving.html]
> Scope of this issue is to introduce the ability to the LTR query parser of 
> accepting multiple models (2 to start with).
> If one model is passed, normal reranking happens.
> If two models are passed, reranking happens for both models and the final 
> reranked list is the interleaved sequence of results coming from the two 
> models lists.
> As a first step it is going to be implemented through:
> TeamDraft Interleaving with two models in input.
> In the future, we can expand the functionality adding the interleaving 
> algorithm as a parameter.
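Team-draft interleaving itself is easy to sketch. The code below is illustrative only, not the LTR contrib implementation: each "team" (model) drafts its highest-ranked not-yet-picked result, with ties in pick counts broken by a coin flip.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;

class TeamDraft {
  // Interleaves two ranked lists; each team drafts its best unpicked result in turn.
  static <T> List<T> interleave(List<T> a, List<T> b, Random rnd) {
    LinkedHashSet<T> result = new LinkedHashSet<>();
    int ia = 0, ib = 0;         // next candidate position in each list
    int countA = 0, countB = 0; // picks credited to each team
    while (ia < a.size() || ib < b.size()) {
      boolean aTurn;
      if (ia >= a.size()) {
        aTurn = false;
      } else if (ib >= b.size()) {
        aTurn = true;
      } else {
        // The team with fewer picks drafts next; ties broken by coin flip.
        aTurn = countA < countB || (countA == countB && rnd.nextBoolean());
      }
      if (aTurn) {
        while (ia < a.size() && result.contains(a.get(ia))) ia++; // skip already-picked
        if (ia < a.size()) { result.add(a.get(ia++)); countA++; }
      } else {
        while (ib < b.size() && result.contains(b.get(ib))) ib++;
        if (ib < b.size()) { result.add(b.get(ib++)); countB++; }
      }
    }
    return new ArrayList<>(result);
  }

  public static void main(String[] args) {
    List<String> modelA = List.of("docA", "docB", "docC");
    List<String> modelB = List.of("docB", "docD", "docA");
    System.out.println(interleave(modelA, modelB, new Random(42)));
  }
}
```

Because the candidate set is shared, a document ranked by both models appears only once in the interleaved result.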



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14788) Solr: The Next Big Thing

2020-11-19 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235576#comment-17235576
 ] 

Mark Robert Miller commented on SOLR-14788:
---

{quote}Mark, I have something of extremely narrow focus that I would like to 
bring up.
{quote}
 

Hey Shawn, sorry, I only recently saw this.

 

So the real deal with the Overseer is that it ended up a disaster.

 

I won't get into it again, but when I agreed we needed a node (or nodes) with 
these types of capabilities, it was to fulfill a part of the design that I 
intended to work toward. Essentially, I changed jobs as SolrCloud was released, 
and my new job was HDFS and related work, not SolrCloud design, implementation, 
and finishing. Sure, I took time to try to make the thing float, but it was not 
my workload or direct task.

So the Overseer just doesn't work at all like I envisioned, it doesn't solve 
the problem I envisioned, in fact, it made things worse in pretty much every 
regard compared to what we had. It was intended to be an optimization and 
coordination point that enhanced the system vs the naive path.

That panned out pretty much 0. So when you talk about all these state updates 
and ZooKeeper queues, and slow restarts, and lost overseers, and scalability 
and all that, it really hardly applies. We hired the Overseer to be a farmer 
and instead he was a tractor.

Trying to solve for those silly looping threads and crazy number of state 
updates and blocking/locking/slow behavior has been 100% the wrong approach. 
Instead, we hire a farmer and this time make sure he is a farmer first.

> Solr: The Next Big Thing
> 
>
> Key: SOLR-14788
> URL: https://issues.apache.org/jira/browse/SOLR-14788
> Project: Solr
>  Issue Type: Task
>Reporter: Mark Robert Miller
>Assignee: Mark Robert Miller
>Priority: Critical
>
> h3. 
> [!https://www.unicode.org/consortium/aacimg/1F46E.png!|https://www.unicode.org/consortium/adopted-characters.html#b1F46E]{color:#00875a}*The
>  Policeman is on duty!*{color}
> {quote}_{color:#de350b}*When The Policeman is on duty, sit back, relax, and 
> have some fun. Try to make some progress. Don't stress too much about the 
> impact of your changes or maintaining stability and performance and 
> correctness so much. Until the end of phase 1, I've got your back. I have a 
> variety of tools and contraptions I have been building over the years and I 
> will continue training them on this branch. I will review your changes and 
> peer out across the land and course correct where needed. As Mike D will be 
> thinking, "Sounds like a bottleneck Mark." And indeed it will be to some 
> extent. Which is why once stage one is completed, I will flip The Policeman 
> to off duty. When off duty, I'm always* {color:#de350b}*occasionally*{color} 
> *down for some vigilante justice, but I won't be walking the beat, all that 
> stuff about sit back and relax goes out the window.*{color}_
> {quote}
>  
> I have stolen this title from Ishan or Noble and Ishan.
> This issue is meant to capture the work of a small team that is forming to 
> push Solr and SolrCloud to the next phase.
> I have kicked off the work with an effort to create a very fast and solid 
> base. That work is not 100% done, but it's ready to join the fight.
> Tim Potter has started giving me a tremendous hand in finishing up. Ishan and 
> Noble have already contributed support and testing and have plans for 
> additional work to shore up some of our current shortcomings.
> Others have expressed an interest in helping and hopefully they will pop up 
> here as well.
> Let's organize and discuss our efforts here and in various sub issues.






[GitHub] [lucene-solr] thelabdude commented on pull request #2067: SOLR-14987: Reuse HttpSolrClient per node vs. one per Solr core when using CloudSolrStream

2020-11-19 Thread GitBox


thelabdude commented on pull request #2067:
URL: https://github.com/apache/lucene-solr/pull/2067#issuecomment-730490848


   Hi @joel-bernstein wasn't able to assign you as a reviewer on this, but 
would love for you to take a look when convenient.






[GitHub] [lucene-solr] thelabdude commented on a change in pull request #2067: SOLR-14987: Reuse HttpSolrClient per node vs. one per Solr core when using CloudSolrStream

2020-11-19 Thread GitBox


thelabdude commented on a change in pull request #2067:
URL: https://github.com/apache/lucene-solr/pull/2067#discussion_r527028199



##
File path: 
solr/solrj/src/java/org/apache/solr/client/solrj/io/stream/CloudSolrStream.java
##
@@ -334,88 +330,76 @@ private StreamComparator parseComp(String sort, String 
fl) throws IOException {
   public static Slice[] getSlices(String collectionName, ZkStateReader 
zkStateReader, boolean checkAlias) throws IOException {
 ClusterState clusterState = zkStateReader.getClusterState();
 
-Map collectionsMap = 
clusterState.getCollectionsMap();
-
-//TODO we should probably split collection by comma to query more than one
-//  which is something already supported in other parts of Solr
-
 // check for alias or collection
 
 List allCollections = new ArrayList<>();
 String[] collectionNames = collectionName.split(",");
+Aliases aliases = checkAlias ? zkStateReader.getAliases() : null;
+
 for(String col : collectionNames) {
-  List collections = checkAlias
-  ? zkStateReader.getAliases().resolveAliases(col)  // if not an 
alias, returns collectionName
+  List collections = (aliases != null)
+  ? aliases.resolveAliases(col)  // if not an alias, returns 
collectionName
   : Collections.singletonList(collectionName);
   allCollections.addAll(collections);
 }
 
 // Lookup all actives slices for these collections
 List slices = allCollections.stream()
-.map(collectionsMap::get)
+.map(c -> clusterState.getCollectionOrNull(c, true))
 .filter(Objects::nonNull)
 .flatMap(docCol -> Arrays.stream(docCol.getActiveSlicesArr()))
 .collect(Collectors.toList());
 if (!slices.isEmpty()) {
-  return slices.toArray(new Slice[slices.size()]);
-}
-
-// Check collection case insensitive
-for(Entry entry : collectionsMap.entrySet()) {
-  if(entry.getKey().equalsIgnoreCase(collectionName)) {
-return entry.getValue().getActiveSlicesArr();
-  }
+  return slices.toArray(new Slice[0]);
 }
 
 throw new IOException("Slices not found for " + collectionName);
   }
 
   protected void constructStreams() throws IOException {
+final ModifiableSolrParams mParams = adjustParams(new 
ModifiableSolrParams(params));
+mParams.set(DISTRIB, "false"); // We are the aggregator.
 try {
+  final Stream streamOfSolrStream;
+  if (streamContext != null && streamContext.get("shards") != null) {
+// stream of shard url with core
+streamOfSolrStream = getShards(this.zkHost, this.collection, 
this.streamContext, mParams).stream()
+.map(s -> new SolrStream(s, mParams));
+  } else {
+// stream of replicas to reuse the same SolrHttpClient per baseUrl
+// avoids re-parsing data we already have in the replicas
+streamOfSolrStream = getReplicas(this.zkHost, this.collection, 
this.streamContext, mParams).stream()

Review comment:
   Here we're keeping the Replica so we have direct access to its baseUrl 
and core name instead of parsing those out of the shardUrl








[GitHub] [lucene-solr] thelabdude commented on a change in pull request #2067: SOLR-14987: Reuse HttpSolrClient per node vs. one per Solr core when using CloudSolrStream

2020-11-19 Thread GitBox


thelabdude commented on a change in pull request #2067:
URL: https://github.com/apache/lucene-solr/pull/2067#discussion_r527029201



##
File path: 
solr/solrj/src/java/org/apache/solr/client/solrj/io/stream/SolrStream.java
##
@@ -268,8 +275,7 @@ private Map mapFields(Map fields, Map 
mappings) {
 return fields;
   }
 
-  // temporary...
-  public TupleStreamParser constructParser(SolrClient server, SolrParams 
requestParams) throws IOException, SolrServerException {
+  private TupleStreamParser constructParser(SolrParams requestParams) throws 
IOException, SolrServerException {

Review comment:
   Didn't seem like this method needed to be public, and we already get a 
SolrClient in the open method, so there's no need to pass it. However, this 
breaks a public method signature, so it is only for Solr 9.x and shouldn't be 
back-ported to 8.x








[GitHub] [lucene-solr] thelabdude commented on a change in pull request #2067: SOLR-14987: Reuse HttpSolrClient per node vs. one per Solr core when using CloudSolrStream

2020-11-19 Thread GitBox


thelabdude commented on a change in pull request #2067:
URL: https://github.com/apache/lucene-solr/pull/2067#discussion_r527027234



##
File path: 
solr/solrj/src/java/org/apache/solr/client/solrj/io/stream/CloudSolrStream.java
##
@@ -334,88 +330,76 @@ private StreamComparator parseComp(String sort, String 
fl) throws IOException {
   public static Slice[] getSlices(String collectionName, ZkStateReader 
zkStateReader, boolean checkAlias) throws IOException {
 ClusterState clusterState = zkStateReader.getClusterState();
 
-Map collectionsMap = 
clusterState.getCollectionsMap();
-
-//TODO we should probably split collection by comma to query more than one
-//  which is something already supported in other parts of Solr
-
 // check for alias or collection
 
 List allCollections = new ArrayList<>();
 String[] collectionNames = collectionName.split(",");
+Aliases aliases = checkAlias ? zkStateReader.getAliases() : null;
+
 for(String col : collectionNames) {
-  List collections = checkAlias
-  ? zkStateReader.getAliases().resolveAliases(col)  // if not an 
alias, returns collectionName
+  List collections = (aliases != null)
+  ? aliases.resolveAliases(col)  // if not an alias, returns 
collectionName
   : Collections.singletonList(collectionName);
   allCollections.addAll(collections);
 }
 
 // Lookup all actives slices for these collections
 List slices = allCollections.stream()
-.map(collectionsMap::get)
+.map(c -> clusterState.getCollectionOrNull(c, true))
 .filter(Objects::nonNull)
 .flatMap(docCol -> Arrays.stream(docCol.getActiveSlicesArr()))
 .collect(Collectors.toList());
 if (!slices.isEmpty()) {
-  return slices.toArray(new Slice[slices.size()]);
-}
-
-// Check collection case insensitive
-for(Entry entry : collectionsMap.entrySet()) {

Review comment:
   I removed this b/c I don't think we should try to accommodate improperly 
cased collection names. No tests broke, but let me know if we need this for 
some reason I don't understand 








[GitHub] [lucene-solr] thelabdude commented on a change in pull request #2067: SOLR-14987: Reuse HttpSolrClient per node vs. one per Solr core when using CloudSolrStream

2020-11-19 Thread GitBox


thelabdude commented on a change in pull request #2067:
URL: https://github.com/apache/lucene-solr/pull/2067#discussion_r527027557



##
File path: 
solr/solrj/src/java/org/apache/solr/client/solrj/io/stream/CloudSolrStream.java
##
@@ -334,11 +334,6 @@ private StreamComparator parseComp(String sort, String fl) 
throws IOException {
   public static Slice[] getSlices(String collectionName, ZkStateReader 
zkStateReader, boolean checkAlias) throws IOException {
 ClusterState clusterState = zkStateReader.getClusterState();
 
-Map collectionsMap = 
clusterState.getCollectionsMap();
-
-//TODO we should probably split collection by comma to query more than one
-//  which is something already supported in other parts of Solr
-
 // check for alias or collection

Review comment:
   Moved the call to getAliases out of the for loop








[jira] [Commented] (SOLR-14035) remove deprecated preferLocalShards references

2020-11-19 Thread Christine Poerschke (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235629#comment-17235629
 ] 

Christine Poerschke commented on SOLR-14035:


Hello [~Alexey Bulygin], welcome! Thank you for the attached patch, it looks 
good to me and I'll proceed to commit it to the repo shortly.

> remove deprecated preferLocalShards references
> --
>
> Key: SOLR-14035
> URL: https://issues.apache.org/jira/browse/SOLR-14035
> Project: Solr
>  Issue Type: Task
>Reporter: Christine Poerschke
>Priority: Blocker
> Fix For: master (9.0)
>
> Attachments: SOLR-14035.patch
>
>
> {{preferLocalShards}} support was added under SOLR-6832 in version 5.1 
> (https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.1.0/solr/solrj/src/java/org/apache/solr/common/params/CommonParams.java#L223-L226)
>  and deprecated under SOLR-11982 in version 7.4 
> (https://github.com/apache/lucene-solr/blob/releases/lucene-solr/7.4.0/solr/solrj/src/java/org/apache/solr/common/params/CommonParams.java#L265-L269)
> This ticket is to fully remove {{preferLocalShards}} references in code, 
> tests and documentation.






[jira] [Assigned] (SOLR-14035) remove deprecated preferLocalShards references

2020-11-19 Thread Christine Poerschke (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christine Poerschke reassigned SOLR-14035:
--

Assignee: Christine Poerschke

> remove deprecated preferLocalShards references
> --
>
> Key: SOLR-14035
> URL: https://issues.apache.org/jira/browse/SOLR-14035
> Project: Solr
>  Issue Type: Task
>Reporter: Christine Poerschke
>Assignee: Christine Poerschke
>Priority: Blocker
> Fix For: master (9.0)
>
> Attachments: SOLR-14035.patch
>
>
> {{preferLocalShards}} support was added under SOLR-6832 in version 5.1 
> (https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.1.0/solr/solrj/src/java/org/apache/solr/common/params/CommonParams.java#L223-L226)
>  and deprecated under SOLR-11982 in version 7.4 
> (https://github.com/apache/lucene-solr/blob/releases/lucene-solr/7.4.0/solr/solrj/src/java/org/apache/solr/common/params/CommonParams.java#L265-L269)
> This ticket is to fully remove {{preferLocalShards}} references in code, 
> tests and documentation.






[jira] [Commented] (SOLR-14035) remove deprecated preferLocalShards references

2020-11-19 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235638#comment-17235638
 ] 

ASF subversion and git services commented on SOLR-14035:


Commit c4d4767bca196ad358b72156889effd27fdfcc9b in lucene-solr's branch 
refs/heads/master from Christine Poerschke
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c4d4767 ]

SOLR-14035: Remove deprecated preferLocalShards=true support in favour of the 
shards.preference=replica.location:local alternative.
(Alex Bulygin via Christine Poerschke)


> remove deprecated preferLocalShards references
> --
>
> Key: SOLR-14035
> URL: https://issues.apache.org/jira/browse/SOLR-14035
> Project: Solr
>  Issue Type: Task
>Reporter: Christine Poerschke
>Assignee: Christine Poerschke
>Priority: Blocker
> Fix For: master (9.0)
>
> Attachments: SOLR-14035.patch
>
>
> {{preferLocalShards}} support was added under SOLR-6832 in version 5.1 
> (https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.1.0/solr/solrj/src/java/org/apache/solr/common/params/CommonParams.java#L223-L226)
>  and deprecated under SOLR-11982 in version 7.4 
> (https://github.com/apache/lucene-solr/blob/releases/lucene-solr/7.4.0/solr/solrj/src/java/org/apache/solr/common/params/CommonParams.java#L265-L269)
> This ticket is to fully remove {{preferLocalShards}} references in code, 
> tests and documentation.






[jira] [Resolved] (SOLR-14035) remove deprecated preferLocalShards references

2020-11-19 Thread Christine Poerschke (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christine Poerschke resolved SOLR-14035.

Resolution: Fixed

> remove deprecated preferLocalShards references
> --
>
> Key: SOLR-14035
> URL: https://issues.apache.org/jira/browse/SOLR-14035
> Project: Solr
>  Issue Type: Task
>Reporter: Christine Poerschke
>Assignee: Christine Poerschke
>Priority: Blocker
> Fix For: master (9.0)
>
> Attachments: SOLR-14035.patch
>
>
> {{preferLocalShards}} support was added under SOLR-6832 in version 5.1 
> (https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.1.0/solr/solrj/src/java/org/apache/solr/common/params/CommonParams.java#L223-L226)
>  and deprecated under SOLR-11982 in version 7.4 
> (https://github.com/apache/lucene-solr/blob/releases/lucene-solr/7.4.0/solr/solrj/src/java/org/apache/solr/common/params/CommonParams.java#L265-L269)
> This ticket is to fully remove {{preferLocalShards}} references in code, 
> tests and documentation.






[jira] [Commented] (LUCENE-9616) Improve test coverage for internal format versions

2020-11-19 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235643#comment-17235643
 ] 

Adrien Grand commented on LUCENE-9616:
--

I had not thought about naming the class Lucene88DocValuesConsumer. I like the 
simpler naming scheme but wonder whether it might be confusing for the name of 
the consumer to diverge from the name of the format (the string that is passed 
to "super" in the DocValuesFormat constructor). And absolutely, these should be 
in backward codecs. For this particular change I'm even considering creating a 
new Lucene88DocValuesFormat, given that it changes the file format quite 
significantly.

> Improve test coverage for internal format versions
> --
>
> Key: LUCENE-9616
> URL: https://issues.apache.org/jira/browse/LUCENE-9616
> Project: Lucene - Core
>  Issue Type: Test
>Reporter: Julie Tibshirani
>Priority: Minor
>
> Some formats use an internal versioning system -- for example 
> {{CompressingStoredFieldsFormat}} maintains older logic for reading an 
> on-heap fields index. Because we always allow reading segments from the 
> current + previous major version, some users still rely on the read-side 
> logic of older internal versions.
> Although the older version logic is covered by 
> {{TestBackwardsCompatibility}}, it looks like it's not exercised in unit 
> tests. Older versions aren't "in rotation" when choosing a random codec for 
> tests. They also don't have dedicated unit tests as we have for separate 
> older formats, for example {{TestLucene60PointsFormat}}.
> It could be good to improve unit test coverage for the older versions, since 
> they're in active use. A downside is that it's not straightforward to add 
> unit tests, since we tend to just change/ delete the old write-side logic as 
> we bump internal versions.






[jira] [Created] (LUCENE-9618) Improve IntervalIterator.nextInterval's behavior/documentation/test

2020-11-19 Thread Haoyu Zhai (Jira)
Haoyu Zhai created LUCENE-9618:
--

 Summary: Improve IntervalIterator.nextInterval's 
behavior/documentation/test
 Key: LUCENE-9618
 URL: https://issues.apache.org/jira/browse/LUCENE-9618
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/query
Reporter: Haoyu Zhai


I'm trying to play around with my own {{IntervalSource}} and found that the 
{{nextInterval}} method of IntervalIterator will sometimes be called even after 
the {{nextDoc}}/ {{docID}}/ {{advance}} method returns NO_MORE_DOCS.
 
After I dug a bit more I found that {{FilteringIntervalIterator.reset}} calls 
the inner iterator's {{nextInterval}} regardless of the result of {{nextDoc}}, 
and also that most (if not all) existing {{IntervalIterator}} implementations 
do consider the case where {{nextInterval}} is called after {{nextDoc}} returns 
NO_MORE_DOCS.
 
We should probably update the javadoc and tests if the behavior is necessary, 
or change the current implementation to avoid this behavior.
original email discussion thread:

https://markmail.org/message/7itbwk6ts3bo3gdh
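One way to surface such out-of-contract calls is a guard that fails loudly once the iterator is exhausted. The sketch below uses hypothetical stand-in types, not Lucene's real IntervalIterator API:

```java
// Hypothetical doc/interval iterator that enforces the contract under discussion:
// nextInterval() must not be called after nextDoc() has returned NO_MORE_DOCS.
class CheckedIntervalIterator {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;
  private final int[] docs;
  private int idx = -1;

  CheckedIntervalIterator(int... docs) { this.docs = docs; }

  int nextDoc() {
    idx++;
    return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
  }

  int nextInterval() {
    if (idx < 0 || idx >= docs.length) {
      // Fail loudly instead of silently tolerating out-of-contract calls.
      throw new IllegalStateException("nextInterval() called while unpositioned");
    }
    return 0; // a real iterator would return the next interval start here
  }

  public static void main(String[] args) {
    CheckedIntervalIterator it = new CheckedIntervalIterator(1, 5);
    while (it.nextDoc() != NO_MORE_DOCS) {
      it.nextInterval(); // legal while positioned on a doc
    }
    try {
      it.nextInterval();
    } catch (IllegalStateException expected) {
      System.out.println("caught out-of-contract call");
    }
  }
}
```

Wrapping the inner iterators used by FilteringIntervalIterator-style code in a guard like this would make the test in the linked PR fail at the exact call site.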






[GitHub] [lucene-solr] zhaih opened a new pull request #2090: LUCENE-9618: demo unit test

2020-11-19 Thread GitBox


zhaih opened a new pull request #2090:
URL: https://github.com/apache/lucene-solr/pull/2090


   
   
   
   # Description
   
   This PR is not intended to be merged. It's just for demonstration of issues 
mentioned in [LUCENE-9618](https://issues.apache.org/jira/browse/LUCENE-9618)
   
   






[jira] [Commented] (LUCENE-9618) Improve IntervalIterator.nextInterval's behavior/documentation/test

2020-11-19 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235652#comment-17235652
 ] 

Haoyu Zhai commented on LUCENE-9618:


I created a [PR|https://github.com/apache/lucene-solr/pull/2090] with a simple 
test case to demonstrate the issue mentioned.

> Improve IntervalIterator.nextInterval's behavior/documentation/test
> ---
>
> Key: LUCENE-9618
> URL: https://issues.apache.org/jira/browse/LUCENE-9618
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/query
>Reporter: Haoyu Zhai
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I'm trying to play around with my own {{IntervalSource}} and found that the 
> {{nextInterval}} method of IntervalIterator will sometimes be called even 
> after the {{nextDoc}}/ {{docID}}/ {{advance}} method returns NO_MORE_DOCS.
>  
> After I dug a bit more I found that {{FilteringIntervalIterator.reset}} calls 
> the inner iterator's {{nextInterval}} regardless of the result of 
> {{nextDoc}}, and also that most (if not all) existing {{IntervalIterator}} 
> implementations do consider the case where {{nextInterval}} is called after 
> {{nextDoc}} returns NO_MORE_DOCS.
>  
> We should probably update the javadoc and tests if the behavior is necessary, 
> or change the current implementation to avoid this behavior.
> original email discussion thread:
> https://markmail.org/message/7itbwk6ts3bo3gdh



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Commented] (SOLR-13671) Remove check for bare "var" declarations in validate-source-patterns in before releasing Solr 9.0

2020-11-19 Thread Christine Poerschke (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235656#comment-17235656
 ] 

Christine Poerschke commented on SOLR-13671:


bq. ... lucene/tools/src/groovy/check-source-patterns.groovy ...

I see the file got renamed in the 
https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=def82ab commit on 
{{master}} branch but remains present on {{branch_8x}} branch.

[~erickerickson], do you perhaps recall if the changes in this JIRA task here 
were intended for {{master}} branch or {{branch_8x}} or both? Usually -- 
https://cwiki.apache.org/confluence/display/SOLR/HowToContribute -- our 
development is on master branch and then gets backported, but perhaps this 
scenario here is different (I haven't looked yet at the discussion in the 
linked JIRA) and I note that [~Alexey Bulygin]'s patch can be {{cd lucene ; git 
apply}} applied to branch_8x, hence asking. Hope that helps.

> Remove check for bare "var" declarations in validate-source-patterns in 
> before releasing Solr 9.0
> -
>
> Key: SOLR-13671
> URL: https://issues.apache.org/jira/browse/SOLR-13671
> Project: Solr
>  Issue Type: Improvement
>Reporter: Erick Erickson
>Priority: Blocker
> Fix For: master (9.0)
>
> Attachments: SOLR-13671.patch
>
>
> See the discussion in the linked JIRA.
> Remove the line:
> (~$/\n\s*var\s+/$) : 'var is not allowed in until we stop development on the 
> 8x code line'
> in
> invalidJavaOnlyPatterns
> from lucene/tools/src/groovy/check-source-patterns.groovy






[jira] [Commented] (LUCENE-9618) Improve IntervalIterator.nextInterval's behavior/documentation/test

2020-11-19 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235675#comment-17235675
 ] 

Michael McCandless commented on LUCENE-9618:


Hmm it is weird that these queries call {{nextInterval}} even after {{nextDoc}} 
returned {{NO_MORE_DOCS}}?

Normally for Lucene DISI iterators, once {{NO_MORE_DOCS}} is returned, the 
iterator is done (in an undefined state) and you cannot call further methods on 
it.
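For illustration, that contract can be sketched with a minimal stand-in iterator (hypothetical class and names, not Lucene's real {{DocIdSetIterator}} API): the caller drives {{nextDoc()}} and must stop touching the iterator the moment it returns NO_MORE_DOCS.

```java
// Minimal stand-in for the DISI contract under discussion. Once nextDoc()
// returns NO_MORE_DOCS, the iterator is exhausted and callers must not
// invoke further methods on it. (Hypothetical classes, not the Lucene API.)
class SimpleDocIdIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    private final int[] docs;
    private int index = -1;

    SimpleDocIdIterator(int... docs) { this.docs = docs; }

    int nextDoc() {
        index++;
        return index < docs.length ? docs[index] : NO_MORE_DOCS;
    }
}

public class DisiContractDemo {
    public static void main(String[] args) {
        SimpleDocIdIterator it = new SimpleDocIdIterator(3, 7, 42);
        int count = 0;
        // Correct consumption pattern: stop as soon as NO_MORE_DOCS appears.
        for (int doc = it.nextDoc(); doc != SimpleDocIdIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
            count++;
        }
        System.out.println("matched docs: " + count); // prints "matched docs: 3"
    }
}
```

Calling {{nextInterval}} after the loop above has already ended is exactly the pattern this issue flags as an error.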

> Improve IntervalIterator.nextInterval's behavior/documentation/test
> ---
>
> Key: LUCENE-9618
> URL: https://issues.apache.org/jira/browse/LUCENE-9618
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/query
>Reporter: Haoyu Zhai
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I'm playing around with my own {{IntervalSource}} and found that the 
> {{nextInterval}} method of IntervalIterator is sometimes called even 
> after the {{nextDoc}}/ {{docID}}/ {{advance}} method returns NO_MORE_DOCS.
>  
> After digging a bit more, I found that {{FilteringIntervalIterator.reset}} 
> calls the inner iterator's {{nextInterval}} regardless of the result of 
> {{nextDoc}}, and most (if not all) existing {{IntervalIterator}} 
> implementations do handle the case where {{nextInterval}} is called after 
> {{nextDoc}} returns NO_MORE_DOCS.
>  
> We should probably update the javadoc and test whether the behavior is 
> necessary, or change the current implementation to avoid it.
> original email discussion thread:
> https://markmail.org/message/7itbwk6ts3bo3gdh






[jira] [Commented] (LUCENE-9618) Improve IntervalIterator.nextInterval's behavior/documentation/test

2020-11-19 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235677#comment-17235677
 ] 

Michael McCandless commented on LUCENE-9618:


And thank you [~zhai7631] for the PR showing the issue!

> Improve IntervalIterator.nextInterval's behavior/documentation/test
> ---
>
> Key: LUCENE-9618
> URL: https://issues.apache.org/jira/browse/LUCENE-9618
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/query
>Reporter: Haoyu Zhai
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I'm playing around with my own {{IntervalSource}} and found that the 
> {{nextInterval}} method of IntervalIterator is sometimes called even 
> after the {{nextDoc}}/ {{docID}}/ {{advance}} method returns NO_MORE_DOCS.
>  
> After digging a bit more, I found that {{FilteringIntervalIterator.reset}} 
> calls the inner iterator's {{nextInterval}} regardless of the result of 
> {{nextDoc}}, and most (if not all) existing {{IntervalIterator}} 
> implementations do handle the case where {{nextInterval}} is called after 
> {{nextDoc}} returns NO_MORE_DOCS.
>  
> We should probably update the javadoc and test whether the behavior is 
> necessary, or change the current implementation to avoid it.
>  original email discussion thread:
> https://markmail.org/thread/aytal77bgzl2zafm






[jira] [Updated] (LUCENE-9618) Improve IntervalIterator.nextInterval's behavior/documentation/test

2020-11-19 Thread Haoyu Zhai (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haoyu Zhai updated LUCENE-9618:
---
Description: 
I'm trying to play around with my own {{IntervalSource}} and found out that 
{{nextInterval}} method of IntervalIterator will be called sometimes even after 
{{nextDoc}}/ {{docID}}/ {{advance}} method returns NO_MORE_DOCS.
  
 After I dug a bit more I found that {{FilteringIntervalIterator.reset}} is 
calling an inner iterator's {{nextInterval}} regardless of what the result of 
{{nextDoc}}, and also most (if not all) existing {{IntervalIterator}}'s 
implementation do considered the case where {{nextInterval}} is called after 
{{nextDoc}} returns NO_MORE_DOCS.
  
 We should probably update the javadoc and test if the behavior is necessary. 
Or we should change the current implementation to avoid this behavior
 original email discussion thread:

https://markmail.org/thread/aytal77bgzl2zafm

  was:
I'm trying to play around with my own {{IntervalSource}} and found out that 
{{nextInterval}} method of IntervalIterator will be called sometimes even after 
{{nextDoc}}/ {{docID}}/ {{advance}} method returns NO_MORE_DOCS.
 
After I dug a bit more I found that {{FilteringIntervalIterator.reset}} is 
calling an inner iterator's {{nextInterval}} regardless of what the result of 
{{nextDoc}}, and also most (if not all) existing {{IntervalIterator}}'s 
implementation do considered the case where {{nextInterval}} is called after 
{{nextDoc}} returns NO_MORE_DOCS.
 
We should probably update the javadoc and test if the behavior is necessary. Or 
we should change the current implementation to avoid this behavior
original email discussion thread:

https://markmail.org/message/7itbwk6ts3bo3gdh


> Improve IntervalIterator.nextInterval's behavior/documentation/test
> ---
>
> Key: LUCENE-9618
> URL: https://issues.apache.org/jira/browse/LUCENE-9618
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/query
>Reporter: Haoyu Zhai
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I'm playing around with my own {{IntervalSource}} and found that the 
> {{nextInterval}} method of IntervalIterator is sometimes called even 
> after the {{nextDoc}}/ {{docID}}/ {{advance}} method returns NO_MORE_DOCS.
>  
> After digging a bit more, I found that {{FilteringIntervalIterator.reset}} 
> calls the inner iterator's {{nextInterval}} regardless of the result of 
> {{nextDoc}}, and most (if not all) existing {{IntervalIterator}} 
> implementations do handle the case where {{nextInterval}} is called after 
> {{nextDoc}} returns NO_MORE_DOCS.
>  
> We should probably update the javadoc and test whether the behavior is 
> necessary, or change the current implementation to avoid it.
>  original email discussion thread:
> https://markmail.org/thread/aytal77bgzl2zafm






[jira] [Commented] (LUCENE-9617) FieldNumbers.clear() should reset lowestUnassignedFieldNumber

2020-11-19 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235680#comment-17235680
 ] 

Michael McCandless commented on LUCENE-9617:


Whoa, good catch [~msfroh]!  I'll try to review your PR, thank you.

> FieldNumbers.clear() should reset lowestUnassignedFieldNumber
> -
>
> Key: LUCENE-9617
> URL: https://issues.apache.org/jira/browse/LUCENE-9617
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 8.7
>Reporter: Michael Froh
>Priority: Minor
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> A call to IndexWriter.deleteAll() should completely reset the state of the 
> index. Part of that is a call to globalFieldNumbersMap.clear(), which purges 
> all knowledge of fields by clearing name -> number and number -> name maps. 
> However, it does not reset lowestUnassignedFieldNumber.
> If we have a loop that adds some documents, calls deleteAll(), adds documents, 
> etc. lowestUnassignedFieldNumber keeps counting up. Since FieldInfos 
> allocates an array for number -> FieldInfo, this array will get larger and 
> larger, effectively leaking memory.
> We can fix this by resetting lowestUnassignedFieldNumber to -1 in 
> FieldNumbers.clear().
> I'll write a unit test and attach a patch.
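The growth pattern is easy to model with a toy registry (hypothetical names, not the real {{FieldNumbers}} class): a monotonically increasing counter that {{clear()}} forgets to reset makes field numbers, and any array sized by the largest number, keep growing across {{deleteAll()}} cycles.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the reported bug: clear() purges the name -> number map but
// (in the buggy variant) leaves the unassigned-number counter alone, so the
// same field gets an ever-larger number after each reset.
public class FieldNumbersSketch {
    private final Map<String, Integer> nameToNumber = new HashMap<>();
    private int lowestUnassigned = -1;
    private final boolean resetCounterOnClear;

    FieldNumbersSketch(boolean resetCounterOnClear) {
        this.resetCounterOnClear = resetCounterOnClear;
    }

    int addOrGet(String name) {
        return nameToNumber.computeIfAbsent(name, n -> ++lowestUnassigned);
    }

    void clear() {
        nameToNumber.clear();
        if (resetCounterOnClear) {
            lowestUnassigned = -1; // the fix proposed in this issue
        }
    }

    public static void main(String[] args) {
        FieldNumbersSketch buggy = new FieldNumbersSketch(false);
        FieldNumbersSketch fixed = new FieldNumbersSketch(true);
        for (int cycle = 0; cycle < 3; cycle++) { // add docs, deleteAll(), repeat
            buggy.addOrGet("title");
            fixed.addOrGet("title");
            buggy.clear();
            fixed.clear();
        }
        System.out.println(buggy.addOrGet("title")); // prints 3: numbers kept climbing
        System.out.println(fixed.addOrGet("title")); // prints 0: number space reused
    }
}
```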






[jira] [Commented] (SOLR-6733) Umbrella issue - Solr as a standalone application

2020-11-19 Thread Houston Putman (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-6733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235711#comment-17235711
 ] 

Houston Putman commented on SOLR-6733:
--

Is there still interest in this idea? If so, I'd volunteer to help take it 
forward.

> Umbrella issue - Solr as a standalone application
> -
>
> Key: SOLR-6733
> URL: https://issues.apache.org/jira/browse/SOLR-6733
> Project: Solr
>  Issue Type: New Feature
>Reporter: Shawn Heisey
>Priority: Major
>
> Umbrella issue.
> Solr should be a standalone application, where the main method is provided by 
> Solr source code.
> Here are the major tasks I envision, if we choose to embed Jetty:
>  * Create org.apache.solr.start.Main (and possibly other classes in the same 
> package), to be placed in solr-start.jar.  The Main class will contain the 
> main method that starts the embedded Jetty and Solr.  I do not know how to 
> adjust the build system to do this successfully.
>  * Handle central configurations in code -- TCP port, SSL, and things like 
> web.xml.
>  * For each of these steps, clean up any test fallout.
>  * Handle cloud-related configurations in code -- port, hostname, protocol, 
> etc.  Use the same information as the central configurations.
>  * Consider whether things like authentication need changes.
>  * Handle any remaining container configurations.
> I am currently imagining this work happening in a new branch and ultimately 
> being applied only to master, not the stable branch.






[jira] [Commented] (LUCENE-9618) Improve IntervalIterator.nextInterval's behavior/documentation/test

2020-11-19 Thread Alan Woodward (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235738#comment-17235738
 ] 

Alan Woodward commented on LUCENE-9618:
---

Thank you for opening this [~zhai7631].  As you said on the mailing list, I 
misunderstood what you were saying.  Calling `nextInterval()` after `nextDoc()` 
has returned NO_MORE_DOCS is definitely an error and we should fix that in 
FilteringIntervalIterator.

> Improve IntervalIterator.nextInterval's behavior/documentation/test
> ---
>
> Key: LUCENE-9618
> URL: https://issues.apache.org/jira/browse/LUCENE-9618
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/query
>Reporter: Haoyu Zhai
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I'm playing around with my own {{IntervalSource}} and found that the 
> {{nextInterval}} method of IntervalIterator is sometimes called even 
> after the {{nextDoc}}/ {{docID}}/ {{advance}} method returns NO_MORE_DOCS.
>  
> After digging a bit more, I found that {{FilteringIntervalIterator.reset}} 
> calls the inner iterator's {{nextInterval}} regardless of the result of 
> {{nextDoc}}, and most (if not all) existing {{IntervalIterator}} 
> implementations do handle the case where {{nextInterval}} is called after 
> {{nextDoc}} returns NO_MORE_DOCS.
>  
> We should probably update the javadoc and test whether the behavior is 
> necessary, or change the current implementation to avoid it.
>  original email discussion thread:
> https://markmail.org/thread/aytal77bgzl2zafm






[jira] [Commented] (SOLR-6733) Umbrella issue - Solr as a standalone application

2020-11-19 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-6733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235777#comment-17235777
 ] 

David Smiley commented on SOLR-6733:


In my mind, this is a controversial topic.  We'd give up easy configuration of 
CORS or whatever... and I'm dubious about what benefit we'd gain.  A strength 
of Solr is customizability.  One might argue too much?  But it's a trade-off 
that distinguishes Solr with differentiating advantages.

> Umbrella issue - Solr as a standalone application
> -
>
> Key: SOLR-6733
> URL: https://issues.apache.org/jira/browse/SOLR-6733
> Project: Solr
>  Issue Type: New Feature
>Reporter: Shawn Heisey
>Priority: Major
>
> Umbrella issue.
> Solr should be a standalone application, where the main method is provided by 
> Solr source code.
> Here are the major tasks I envision, if we choose to embed Jetty:
>  * Create org.apache.solr.start.Main (and possibly other classes in the same 
> package), to be placed in solr-start.jar.  The Main class will contain the 
> main method that starts the embedded Jetty and Solr.  I do not know how to 
> adjust the build system to do this successfully.
>  * Handle central configurations in code -- TCP port, SSL, and things like 
> web.xml.
>  * For each of these steps, clean up any test fallout.
>  * Handle cloud-related configurations in code -- port, hostname, protocol, 
> etc.  Use the same information as the central configurations.
>  * Consider whether things like authentication need changes.
>  * Handle any remaining container configurations.
> I am currently imagining this work happening in a new branch and ultimately 
> being applied only to master, not the stable branch.






[jira] [Commented] (SOLR-6733) Umbrella issue - Solr as a standalone application

2020-11-19 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-6733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235779#comment-17235779
 ] 

David Smiley commented on SOLR-6733:


For example my colleagues added some specialized mTLS stuff without having to 
hack Solr.  That was possible because of Jetty's configurability, which we 
leave exposed.

> Umbrella issue - Solr as a standalone application
> -
>
> Key: SOLR-6733
> URL: https://issues.apache.org/jira/browse/SOLR-6733
> Project: Solr
>  Issue Type: New Feature
>Reporter: Shawn Heisey
>Priority: Major
>
> Umbrella issue.
> Solr should be a standalone application, where the main method is provided by 
> Solr source code.
> Here are the major tasks I envision, if we choose to embed Jetty:
>  * Create org.apache.solr.start.Main (and possibly other classes in the same 
> package), to be placed in solr-start.jar.  The Main class will contain the 
> main method that starts the embedded Jetty and Solr.  I do not know how to 
> adjust the build system to do this successfully.
>  * Handle central configurations in code -- TCP port, SSL, and things like 
> web.xml.
>  * For each of these steps, clean up any test fallout.
>  * Handle cloud-related configurations in code -- port, hostname, protocol, 
> etc.  Use the same information as the central configurations.
>  * Consider whether things like authentication need changes.
>  * Handle any remaining container configurations.
> I am currently imagining this work happening in a new branch and ultimately 
> being applied only to master, not the stable branch.






[jira] [Commented] (SOLR-14788) Solr: The Next Big Thing

2020-11-19 Thread Ilan Ginzburg (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235783#comment-17235783
 ] 

Ilan Ginzburg commented on SOLR-14788:
--

I believe with a central {{/clusterstate.json}}, having a central server batch 
updates made sense. Each separate node (or thread) trying to do its own direct 
update to a shared (ZooKeeper) file likely creates too much contention. I 
believe that in its original context, Overseer does make sense.

Now that state is distributed per collection ({{/state.json}}), it's possible 
to start thinking about distributing these updates without the help of a 
central server.

The high number of replica state updates when a node goes up or down (which I 
suspect is the main reason the Overseer cluster state change ZooKeeper queue 
saturates) can likely be greatly reduced by considering a replica as down if 
its node is down, regardless of the state the replica last broadcast.

> Solr: The Next Big Thing
> 
>
> Key: SOLR-14788
> URL: https://issues.apache.org/jira/browse/SOLR-14788
> Project: Solr
>  Issue Type: Task
>Reporter: Mark Robert Miller
>Assignee: Mark Robert Miller
>Priority: Critical
>
> h3. 
> [!https://www.unicode.org/consortium/aacimg/1F46E.png!|https://www.unicode.org/consortium/adopted-characters.html#b1F46E]{color:#00875a}*The
>  Policeman is on duty!*{color}
> {quote}_{color:#de350b}*When The Policeman is on duty, sit back, relax, and 
> have some fun. Try to make some progress. Don't stress too much about the 
> impact of your changes or maintaining stability and performance and 
> correctness so much. Until the end of phase 1, I've got your back. I have a 
> variety of tools and contraptions I have been building over the years and I 
> will continue training them on this branch. I will review your changes and 
> peer out across the land and course correct where needed. As Mike D will be 
> thinking, "Sounds like a bottleneck Mark." And indeed it will be to some 
> extent. Which is why once stage one is completed, I will flip The Policeman 
> to off duty. When off duty, I'm always* {color:#de350b}*occasionally*{color} 
> *down for some vigilante justice, but I won't be walking the beat, all that 
> stuff about sit back and relax goes out the window.*{color}_
> {quote}
>  
> I have stolen this title from Ishan or Noble and Ishan.
> This issue is meant to capture the work of a small team that is forming to 
> push Solr and SolrCloud to the next phase.
> I have kicked off the work with an effort to create a very fast and solid 
> base. That work is not 100% done, but it's ready to join the fight.
> Tim Potter has started giving me a tremendous hand in finishing up. Ishan and 
> Noble have already contributed support and testing and have plans for 
> additional work to shore up some of our current shortcomings.
> Others have expressed an interest in helping and hopefully they will pop up 
> here as well.
> Let's organize and discuss our efforts here and in various sub issues.






[jira] [Commented] (LUCENE-9616) Improve test coverage for internal format versions

2020-11-19 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235801#comment-17235801
 ] 

Robert Muir commented on LUCENE-9616:
-

+1 to more aggressively copying the format classes when changing the underlying 
file format, and to only using internal versioning for truly minor/bugfix 
changes?

I think internal formats only had that use-case in mind, and old versions 
should not all be tested this way, because they are buggy. It should be 
possible to fix some bad bugs in the codec (in a backwards-compatible way), yet 
not be annoyed by backwards tests for the rest of a major release.

> Improve test coverage for internal format versions
> --
>
> Key: LUCENE-9616
> URL: https://issues.apache.org/jira/browse/LUCENE-9616
> Project: Lucene - Core
>  Issue Type: Test
>Reporter: Julie Tibshirani
>Priority: Minor
>
> Some formats use an internal versioning system -- for example 
> {{CompressingStoredFieldsFormat}} maintains older logic for reading an 
> on-heap fields index. Because we always allow reading segments from the 
> current + previous major version, some users still rely on the read-side 
> logic of older internal versions.
> Although the older version logic is covered by 
> {{TestBackwardsCompatibility}}, it looks like it's not exercised in unit 
> tests. Older versions aren't "in rotation" when choosing a random codec for 
> tests. They also don't have dedicated unit tests as we have for separate 
> older formats, for example {{TestLucene60PointsFormat}}.
> It could be good to improve unit test coverage for the older versions, since 
> they're in active use. A downside is that it's not straightforward to add 
> unit tests, since we tend to just change/ delete the old write-side logic as 
> we bump internal versions.






[jira] [Commented] (SOLR-14788) Solr: The Next Big Thing

2020-11-19 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235852#comment-17235852
 ] 

Mark Robert Miller commented on SOLR-14788:
---

My initial implementation only really focused on a single collection - even 
that was far, far from completed. Now I was not involved in Overseer 
implementation, but it was not introduced to batch updates, to one state.json 
or several - at least nothing like what it was doing. If that was even the 
drive for it (it wasn't from my memory and knowledge) it would have been very 
silly to try and handle our absurd state.json update load via an Overseer node 
before making all the other nodes try and behave even remotely sane.

The Overseer gained nothing by batching between collections - that was like 
adding a bucket of water to a fire truck that is out of water. That is mostly 
what has happened unfortunately - bandaids and workarounds. I never implemented 
the SolrCloud Yonik and I worked out. I started it. We had the design. I put a 
foot in that direction. Since then, things have mostly gone down that foot hole 
instead of forward. Likely, as was the case for me, for many it was not their 
job to finish implementing SolrCloud, it was a huge task, few understood what 
the actual design was, and you could do quite well riding on what was there for 
little effort vs a lot of effort and who knows where you end up.

The Overseer as implemented was not in line with the design. This is an event 
driven design. A light weight, low cost, simple design. Building it on an 
existing and non Cloud oriented design made it very difficult to decipher what 
the plan actually was or even how/if you could get there on these building 
blocks while keeping them stable and active and non cloud mode, etc.

So when I talk about the benefits the Overseer type nodes can bring, they 
hardly apply to master. It's a common problem I've run into. I'll talk about 
how slow something is, or how much better things can be if do X, and someone 
might take a little look and come back with, meh, didn't seem like what you 
were saying to me. And often, there are so many layers that you can't see much 
benefit or any when you play around with some isolated change in the current 
world. 10 other things will eat you first.

Anyway, the system started by distributing updates without the help of a 
central server :) The Overseer was not created to deal with clusterstate.json, 
because we did not have state.json, that would be crazy :) It literally serves 
no practical purpose at this point, other than a huge amount of problems and 
slowness and bad behavior.

Now, I'm excited for any competition on what direction to go here. Don't take 
any of this negatively. If your CAS system can run the gauntlet, I'll 
congratulate you and be thankful. But your responses and the details in the 
remove-the-Overseer issue seem (as is common enough) overly caught up in the 
current nonsensical SolrCloud world. I wish you the best of luck making this 
system and what it does and supports hum without a central server(s). It was 
what I tried to keep in the design at the start. But it loses when you run the 
mind simulations and ignore the current SolrCloud baggage and it almost 
certainly loses when you implement it. You will have to shoot for the moon 
though, not the current Overseer implementation, because my challenger is 
almost to the ring and is in a different weight class / league / world than 
what you have evaluated in 8x/master.

> Solr: The Next Big Thing
> 
>
> Key: SOLR-14788
> URL: https://issues.apache.org/jira/browse/SOLR-14788
> Project: Solr
>  Issue Type: Task
>Reporter: Mark Robert Miller
>Assignee: Mark Robert Miller
>Priority: Critical
>
> h3. 
> [!https://www.unicode.org/consortium/aacimg/1F46E.png!|https://www.unicode.org/consortium/adopted-characters.html#b1F46E]{color:#00875a}*The
>  Policeman is on duty!*{color}
> {quote}_{color:#de350b}*When The Policeman is on duty, sit back, relax, and 
> have some fun. Try to make some progress. Don't stress too much about the 
> impact of your changes or maintaining stability and performance and 
> correctness so much. Until the end of phase 1, I've got your back. I have a 
> variety of tools and contraptions I have been building over the years and I 
> will continue training them on this branch. I will review your changes and 
> peer out across the land and course correct where needed. As Mike D will be 
> thinking, "Sounds like a bottleneck Mark." And indeed it will be to some 
> extent. Which is why once stage one is completed, I will flip The Policeman 
> to off duty. When off duty, I'm always* {color:#de350b}*occasionally*{color} 
> *down for some vigilante justice, but I won't be walking the beat, all that 
> stuff about si

[GitHub] [lucene-solr] jtibshirani merged pull request #2084: LUCENE-9592: Loosen equality checks in TestVectorUtil.

2020-11-19 Thread GitBox


jtibshirani merged pull request #2084:
URL: https://github.com/apache/lucene-solr/pull/2084


   






[jira] [Commented] (LUCENE-9592) TestVectorUtil can fail with assertion error

2020-11-19 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235858#comment-17235858
 ] 

ASF subversion and git services commented on LUCENE-9592:
-

Commit 8c7b709c08662d396bd12b1e352db99bb489a7da in lucene-solr's branch 
refs/heads/master from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=8c7b709 ]

LUCENE-9592: Loosen equality checks in TestVectorUtil. (#2084)

TestVectorUtil occasionally fails because of floating point errors. This
change slightly increases the epsilon in equality checks -- testing shows that
this will greatly decrease the chance of failure.
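As a generic illustration of why exact float equality is fragile here (a self-contained sketch, not the actual VectorUtil code): accumulating the same products with a float accumulator versus a double accumulator gives different sums, so results must be compared within a tolerance.

```java
// Generic demonstration of float accumulation error: a float-accumulated
// dot product can drift away from the double-accumulated one, so tests
// must compare with an epsilon rather than exact equality.
public class FloatEpsilonDemo {
    static float dotFloat(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
        return sum;
    }

    static double dotDouble(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (double) a[i] * b[i];
        return sum;
    }

    public static void main(String[] args) {
        // A deliberately bad-ordered vector: the float accumulator absorbs the
        // small terms next to the huge ones; the double accumulator does not.
        float[] a = {1e8f, 1f, -1e8f, 1f};
        float[] ones = {1f, 1f, 1f, 1f};
        System.out.println(dotFloat(a, ones));  // prints 1.0
        System.out.println(dotDouble(a, ones)); // prints 2.0
    }
}
```

Random vectors of realistic length show much smaller but still nonzero drift, which is what the test occasionally tripped over.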

> TestVectorUtil can fail with assertion error
> 
>
> Key: LUCENE-9592
> URL: https://issues.apache.org/jira/browse/LUCENE-9592
> Project: Lucene - Core
>  Issue Type: Test
>Reporter: Julie Tibshirani
>Priority: Minor
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Example failure:
> {code:java}
>  java.lang.AssertionError: expected:<35.699527740478516> but 
> was:<35.69953918457031>java.lang.AssertionError: 
> expected:<35.699527740478516> but was:<35.69953918457031> at 
> __randomizedtesting.SeedInfo.seed([305701410F76FAD0:4797D77886281D68]:0) at 
> org.junit.Assert.fail(Assert.java:89) at 
> org.junit.Assert.failNotEquals(Assert.java:835) at 
> org.junit.Assert.assertEquals(Assert.java:555) at 
> org.junit.Assert.assertEquals(Assert.java:685) at 
> org.apache.lucene.util.TestVectorUtil.testSelfDotProduct(TestVectorUtil.java:28)
>  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method) at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.base/java.lang.reflect.Method.invoke(Method.java:567){code}
> Reproduce line: 
> {code:java}
> gradlew test --tests TestVectorUtil.testSelfDotProduct 
> -Dtests.seed=305701410F76FAD0 -Dtests.slow=true -Dtests.badapples=true 
> -Dtests.locale=ar-AE -Dtests.timezone=SystemV/MST7 -Dtests.asserts=true 
> -Dtests.file.encoding=UTF-8 {code}
> Perhaps the vector utility methods should work with doubles instead of floats 
> to avoid loss of precision.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-9592) TestVectorUtil can fail with assertion error

2020-11-19 Thread Julie Tibshirani (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julie Tibshirani resolved LUCENE-9592.
--
Resolution: Fixed

> TestVectorUtil can fail with assertion error
> 
>
> Key: LUCENE-9592
> URL: https://issues.apache.org/jira/browse/LUCENE-9592
> Project: Lucene - Core
>  Issue Type: Test
>Reporter: Julie Tibshirani
>Priority: Minor
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Example failure:
> {code:java}
>  java.lang.AssertionError: expected:<35.699527740478516> but 
> was:<35.69953918457031>java.lang.AssertionError: 
> expected:<35.699527740478516> but was:<35.69953918457031> at 
> __randomizedtesting.SeedInfo.seed([305701410F76FAD0:4797D77886281D68]:0) at 
> org.junit.Assert.fail(Assert.java:89) at 
> org.junit.Assert.failNotEquals(Assert.java:835) at 
> org.junit.Assert.assertEquals(Assert.java:555) at 
> org.junit.Assert.assertEquals(Assert.java:685) at 
> org.apache.lucene.util.TestVectorUtil.testSelfDotProduct(TestVectorUtil.java:28)
>  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method) at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.base/java.lang.reflect.Method.invoke(Method.java:567){code}
> Reproduce line: 
> {code:java}
> gradlew test --tests TestVectorUtil.testSelfDotProduct 
> -Dtests.seed=305701410F76FAD0 -Dtests.slow=true -Dtests.badapples=true 
> -Dtests.locale=ar-AE -Dtests.timezone=SystemV/MST7 -Dtests.asserts=true 
> -Dtests.file.encoding=UTF-8 {code}
> Perhaps the vector utility methods should work with doubles instead of floats 
> to avoid loss of precision.






[GitHub] [lucene-solr] zacharymorn commented on a change in pull request #2052: LUCENE-8982: Make NativeUnixDirectory pure java with FileChannel direct IO flag, and rename to DirectIODirectory

2020-11-19 Thread GitBox


zacharymorn commented on a change in pull request #2052:
URL: https://github.com/apache/lucene-solr/pull/2052#discussion_r527373046



##
File path: 
lucene/misc/src/test/org/apache/lucene/misc/store/TestDirectIODirectory.java
##
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.misc.store;
+
+import com.carrotsearch.randomizedtesting.LifecycleScope;
+import com.carrotsearch.randomizedtesting.RandomizedTest;
+import org.apache.lucene.store.*;
+
+import java.io.IOException;
+import java.nio.file.Files;
+import java.nio.file.Path;
+
+import static 
org.apache.lucene.misc.store.DirectIODirectory.DEFAULT_MIN_BYTES_DIRECT;
+
+public class TestDirectIODirectory extends BaseDirectoryTestCase {
+  public void testWriteReadWithDirectIO() throws IOException {
+try(Directory dir = 
getDirectory(RandomizedTest.newTempDir(LifecycleScope.TEST))) {
+  final long blockSize = 
Files.getFileStore(createTempFile()).getBlockSize();
+  final long minBytesDirect = 
Double.valueOf(Math.ceil(DEFAULT_MIN_BYTES_DIRECT / blockSize)).longValue() *
+blockSize;
+  // Need to worry about overflows here?
+  final int writtenByteLength = Math.toIntExact(minBytesDirect);
+
+  MergeInfo mergeInfo = new MergeInfo(1000, Integer.MAX_VALUE, true, 1);
+  final IOContext context = new IOContext(mergeInfo);
+
+  IndexOutput indexOutput = dir.createOutput("test", context);
+  indexOutput.writeBytes(new byte[writtenByteLength], 0, 
writtenByteLength);
+  IndexInput indexInput = dir.openInput("test", context);
+
+  assertEquals("The length of bytes read should equal to written", 
writtenByteLength, indexInput.length());
+
+  indexOutput.close();
+  indexInput.close();
+}
+  }
+
+  @Override
+  protected Directory getDirectory(Path path) throws IOException {
+Directory delegate = FSDirectory.open(path);

Review comment:
   I've figured it out. Looks like more methods in `DirectIODirectory` need 
to be delegated. Could you please take a look at the latest commit, and let me 
know if it looks good?
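As an aside, the block-size alignment that the quoted test performs can be done with integer arithmetic alone; note that in the quoted diff the division happens in `long` before `Math.ceil`, so the value is truncated rather than rounded up. A sketch, not part of the PR — the 4096-byte block size is an assumption for illustration, where real code would query `FileStore.getBlockSize()`:

```java
// Rounding a byte count up to a whole number of filesystem blocks using
// integer arithmetic only, avoiding the long-division-then-ceil round-trip.
public class BlockAlign {

  // Smallest multiple of blockSize that is >= bytes (requires blockSize > 0).
  static long roundUpToBlock(long bytes, long blockSize) {
    return ((bytes + blockSize - 1) / blockSize) * blockSize;
  }

  public static void main(String[] args) {
    long blockSize = 4096; // illustrative; typically FileStore.getBlockSize()
    System.out.println(roundUpToBlock(10, blockSize));    // 4096
    System.out.println(roundUpToBlock(8192, blockSize));  // 8192
    System.out.println(roundUpToBlock(8193, blockSize));  // 12288
  }
}
```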





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org






[jira] [Commented] (SOLR-14788) Solr: The Next Big Thing

2020-11-19 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235952#comment-17235952
 ] 

Mark Robert Miller commented on SOLR-14788:
---

And let me just say again, I don’t mean to offend you in anything in there. It 
looks to me like you came in and looked things over and also basically said 
“this overseer has no practical benefit, let’s rip it.”  That’s intelligent, 
that’s outside agitation, +1.  Our move from CAS to the Overseer was a huge 
loss in the position we were in, introducing an unnecessary layer completely 
for unrealized future pipe dreams. If you come in and look at that thing and 
say WTF, my hats off to you. 

> Solr: The Next Big Thing
> 
>
> Key: SOLR-14788
> URL: https://issues.apache.org/jira/browse/SOLR-14788
> Project: Solr
>  Issue Type: Task
>Reporter: Mark Robert Miller
>Assignee: Mark Robert Miller
>Priority: Critical
>
> h3. 
> [!https://www.unicode.org/consortium/aacimg/1F46E.png!|https://www.unicode.org/consortium/adopted-characters.html#b1F46E]{color:#00875a}*The
>  Policeman is on duty!*{color}
> {quote}_{color:#de350b}*When The Policeman is on duty, sit back, relax, and 
> have some fun. Try to make some progress. Don't stress too much about the 
> impact of your changes or maintaining stability and performance and 
> correctness so much. Until the end of phase 1, I've got your back. I have a 
> variety of tools and contraptions I have been building over the years and I 
> will continue training them on this branch. I will review your changes and 
> peer out across the land and course correct where needed. As Mike D will be 
> thinking, "Sounds like a bottleneck Mark." And indeed it will be to some 
> extent. Which is why once stage one is completed, I will flip The Policeman 
> to off duty. When off duty, I'm always* {color:#de350b}*occasionally*{color} 
> *down for some vigilante justice, but I won't be walking the beat, all that 
> stuff about sit back and relax goes out the window.*{color}_
> {quote}
>  
> I have stolen this title from Ishan or Noble and Ishan.
> This issue is meant to capture the work of a small team that is forming to 
> push Solr and SolrCloud to the next phase.
> I have kicked off the work with an effort to create a very fast and solid 
> base. That work is not 100% done, but it's ready to join the fight.
> Tim Potter has started giving me a tremendous hand in finishing up. Ishan and 
> Noble have already contributed support and testing and have plans for 
> additional work to shore up some of our current shortcomings.
> Others have expressed an interest in helping and hopefully they will pop up 
> here as well.
> Let's organize and discuss our efforts here and in various sub issues.






[jira] [Comment Edited] (SOLR-14788) Solr: The Next Big Thing

2020-11-19 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235952#comment-17235952
 ] 

Mark Robert Miller edited comment on SOLR-14788 at 11/20/20, 7:26 AM:
--

And let me just say again, I don’t mean to offend you in anything in there. 
It looks to me like you came in and looked things over and also basically 
said “this overseer has no practical benefit, let’s rip it.”  That’s 
intelligent, that’s outside agitation, +1.  Our move from CAS to the Overseer 
was a huge loss in the position we were in, introducing an unnecessary layer 
completely for unrealized future pipe dreams. If you come in and look at that 
thing and say WTF, my hats off to you. 


was (Author: markrmiller):
And let me just say again, I don’t mean to offend me n anything in there. It’s 
looks to me like you came in and looked things over and also basically said 
“this overseer has no practical benefit, let’s rip it.”  That’s intelligent, 
that’s outside agitation, +1.  Our move from CAS to the Overseer was a huge 
loss in the position we were, given introducing an unnecessary layer completely 
for unrealized future pipe dreams. If you come in and look at that thing and 
say WTF, my hats off to you. 

> Solr: The Next Big Thing
> 
>
> Key: SOLR-14788
> URL: https://issues.apache.org/jira/browse/SOLR-14788
> Project: Solr
>  Issue Type: Task
>Reporter: Mark Robert Miller
>Assignee: Mark Robert Miller
>Priority: Critical
>
> h3. 
> [!https://www.unicode.org/consortium/aacimg/1F46E.png!|https://www.unicode.org/consortium/adopted-characters.html#b1F46E]{color:#00875a}*The
>  Policeman is on duty!*{color}
> {quote}_{color:#de350b}*When The Policeman is on duty, sit back, relax, and 
> have some fun. Try to make some progress. Don't stress too much about the 
> impact of your changes or maintaining stability and performance and 
> correctness so much. Until the end of phase 1, I've got your back. I have a 
> variety of tools and contraptions I have been building over the years and I 
> will continue training them on this branch. I will review your changes and 
> peer out across the land and course correct where needed. As Mike D will be 
> thinking, "Sounds like a bottleneck Mark." And indeed it will be to some 
> extent. Which is why once stage one is completed, I will flip The Policeman 
> to off duty. When off duty, I'm always* {color:#de350b}*occasionally*{color} 
> *down for some vigilante justice, but I won't be walking the beat, all that 
> stuff about sit back and relax goes out the window.*{color}_
> {quote}
>  
> I have stolen this title from Ishan or Noble and Ishan.
> This issue is meant to capture the work of a small team that is forming to 
> push Solr and SolrCloud to the next phase.
> I have kicked off the work with an effort to create a very fast and solid 
> base. That work is not 100% done, but it's ready to join the fight.
> Tim Potter has started giving me a tremendous hand in finishing up. Ishan and 
> Noble have already contributed support and testing and have plans for 
> additional work to shore up some of our current shortcomings.
> Others have expressed an interest in helping and hopefully they will pop up 
> here as well.
> Let's organize and discuss our efforts here and in various sub issues.






[jira] [Updated] (SOLR-15008) Avoid building OrdinalMap for each facet

2020-11-19 Thread Radu Gheorghe (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-15008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Radu Gheorghe updated SOLR-15008:
-
Attachment: writes_commits.png

> Avoid building OrdinalMap for each facet
> 
>
> Key: SOLR-15008
> URL: https://issues.apache.org/jira/browse/SOLR-15008
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module
>Affects Versions: 8.7
>Reporter: Radu Gheorghe
>Priority: Major
>  Labels: performance
> Attachments: Screenshot 2020-11-19 at 12.01.55.png, writes_commits.png
>
>
> I'm running against the following scenario:
>  * [JSON] faceting on a high cardinality field
>  * few matching documents => few unique values
> Yet the query almost always takes a long time. Here's an example taking 
> almost 4s for ~300 matching documents and about as many unique values (edited a bit):
>  
> {code:java}
> "QTime":3869,
> "params":{
>   "json":"{\"query\": \"*:*\",
>   \"filter\": [\"type:test_type\", \"date:[1603670360 TO 1604361599]\", 
> \"unique_id:49866\"]
>   \"facet\": 
> {\"keywords\":{\"type\":\"terms\",\"field\":\"keywords\",\"limit\":20,\"mincount\":20}}}",
>   "rows":"0"}},
>   
> "response":{"numFound":333,"start":0,"maxScore":1.0,"numFoundExact":true,"docs":[]
>   },
>   "facets":{
> "count":333,
> "keywords":{
>   "buckets":[{
>   "val":"value1",
>   "count":124},
>   ...
> {code}
> I did some [profiling with our Sematext 
> Monitoring|https://sematext.com/docs/monitoring/on-demand-profiling/] and it 
> points me to OrdinalMap building (see attached screenshot). If I read the 
> code right, an OrdinalMap is built with every facet. And it's expensive since 
> there are many unique values in the shard (previously, there were more, smaller 
> shards, making latency better, but that approach doesn't scale for this 
> particular use-case).
> If I'm right up to this point, I see a couple of potential improvements, 
> [inspired from 
> Elasticsearch|#search-aggregations-bucket-terms-aggregation-execution-hint]:
>  # *Keep the OrdinalMap cached until the next softCommit*, so that only the 
> first query takes the penalty
>  # *Allow faceting on actual values (a Map) rather than ordinals*, for 
> situations like the one above where we have few matching documents. We could 
> potentially auto-detect this scenario (e.g. by configuring a threshold) and 
> use a Map when there are few documents
> I'm curious about what you're thinking:
>  * would a PR/patch be welcome for any of the two ideas above?
>  * do you see better options? am I missing something?
>  
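Idea #2 above — counting actual term values in a map when only a handful of documents match — can be sketched as follows. All names here are illustrative; this is not a Solr API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of value-based faceting: when few documents match, count the actual
// term values in a hash map instead of building an OrdinalMap over the shard.
public class MapFacetSketch {

  // docTerms holds the terms of each *matching* document only.
  static Map<String, Integer> facetByValue(String[][] docTerms) {
    Map<String, Integer> counts = new HashMap<>();
    for (String[] terms : docTerms) {
      for (String term : terms) {
        counts.merge(term, 1, Integer::sum); // increment, starting at 1
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    String[][] matches = {{"value1", "value2"}, {"value1"}, {"value1", "value3"}};
    System.out.println(facetByValue(matches).get("value1")); // 3
  }
}
```

The cost here scales with the number of matching documents and their terms, not with the shard's total cardinality — which is why a threshold on the match count could decide when to take this path instead of the ordinal one.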


