[jira] [Updated] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query

2022-05-09 Thread Henrik Hertel (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Henrik Hertel updated LUCENE-10562:
---
Description: 
I use Solr and have a large system with 1TB in one core and about 5 million 
documents; the textual content of large PDF files is indexed there. My query 
becomes extremely slow (more than 30 seconds) as soon as I use wildcards, e.g.
{code:java}
*searchvalue*
{code}
even though I put a filter query in front of it that reduces the result set to 
fewer than 20 documents.

searchvalue -> less than 1 second
searchvalue* -> less than 1 second

My query:
{code:java}
select?defType=lucene&q=content_t:*searchvalue*&fq=metadataitemids_is:20950&fl=id&rows=50&start=0
 {code}
I've tried everything imaginable. It doesn't make sense to me why a search over 
a small subset should take so long. If I omit the filter query 
metadataitemids_is:20950 and search the entire inventory, it takes the same 
amount of time. Therefore, I suspect that despite the filter query, the main 
query runs over the entire index.

  was:
I use Solr and have a large system with 1TB in one core and about 5 million 
documents. The textual content of large PDF files is indexed there. My query is 
extremely slow (more than 30 seconds)  as soon as I use wildcards e.g. 
{code:java}
*searchvalue*
{code}
, even though I put a filter query in front of it that reduces to less than 20 
documents.

searchvalue -> less than 1 second
searchvalue* -> less than 1 second

My query:
{code:java}
select?defType=lucene&q=content_t:*searchvalue*&fq=metadataitemids_is:20950&fq=renditions_ss%3A&fl=id&rows=50&start=0
 {code}
I've tried everything imaginable. It doesn't make sense to me why a search over 
a small subset should take so long. If I omit the filter query 
metadataitemids_is:20950, so search the entire inventory, then it also takes 
the same amount of time. Therefore, I suspect that despite the filter query, 
the main query runs over the entire index.


> Large system: Wildcard search leads to full index scan despite filter query
> ---
>
> Key: LUCENE-10562
> URL: https://issues.apache.org/jira/browse/LUCENE-10562
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 8.11.1
>Reporter: Henrik Hertel
>Priority: Major
>  Labels: performance
>
> I use Solr and have a large system with 1TB in one core and about 5 million 
> documents. The textual content of large PDF files is indexed there. My query 
> is extremely slow (more than 30 seconds)  as soon as I use wildcards e.g. 
> {code:java}
> *searchvalue*
> {code}
> , even though I put a filter query in front of it that reduces to less than 
> 20 documents.
> searchvalue -> less than 1 second
> searchvalue* -> less than 1 second
> My query:
> {code:java}
> select?defType=lucene&q=content_t:*searchvalue*&fq=metadataitemids_is:20950&fl=id&rows=50&start=0
>  {code}
> I've tried everything imaginable. It doesn't make sense to me why a search 
> over a small subset should take so long. If I omit the filter query 
> metadataitemids_is:20950, so search the entire inventory, then it also takes 
> the same amount of time. Therefore, I suspect that despite the filter query, 
> the main query runs over the entire index.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query

2022-05-09 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533735#comment-17533735
 ] 

Tomoko Uchida commented on LUCENE-10562:


Infix and suffix wildcard queries are extremely slow by nature and are not 
recommended - see the documentation:
https://lucene.apache.org/core/8_11_0/core/org/apache/lucene/search/WildcardQuery.html
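For intuition, here is a toy sketch of why this is so (a made-up in-memory term dictionary, not Lucene's actual terms-dictionary API): a trailing wildcard like {{searchvalue*}} can seek to a contiguous range of the sorted term dictionary, while a leading or infix wildcard has no usable lower bound and must examine every term.

```java
import java.util.List;
import java.util.TreeSet;

// Toy model of a sorted term dictionary (hypothetical terms, not Lucene's API).
// A prefix pattern seeks straight to the matching range; an infix/suffix
// pattern has no usable seek point, so every term must be inspected.
public class WildcardCostSketch {
    static final TreeSet<String> TERMS = new TreeSet<>(List.of(
        "apple", "searchvalue", "searchvalues", "valuesearch", "zebra"));

    // prefix* : visit only the terms in the range [prefix, prefix + maxChar)
    static int termsVisitedForPrefix(String prefix) {
        return TERMS.subSet(prefix, prefix + Character.MAX_VALUE).size();
    }

    // *infix* : no seek possible, the whole dictionary is scanned
    static int termsVisitedForInfix(String infix) {
        int visited = 0;
        for (String t : TERMS) {
            visited++;           // every term is examined...
            t.contains(infix);   // ...whether or not it matches
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println("prefix 'search': " + termsVisitedForPrefix("search") + " terms visited");
        System.out.println("infix  'value' : " + termsVisitedForInfix("value") + " terms visited");
    }
}
```

With a real index of millions of documents the term dictionary is huge, so this per-term scan dominates the query time regardless of how few documents eventually match.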




[jira] [Commented] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query

2022-05-09 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533739#comment-17533739
 ] 

Tomoko Uchida commented on LUCENE-10562:


As for "despite the filter query" - sorry, but why do you assume filters are 
executed before wildcard queries?




[jira] [Commented] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query

2022-05-09 Thread Henrik Hertel (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533743#comment-17533743
 ] 

Henrik Hertel commented on LUCENE-10562:


Thanks for your answer.

Well, from my naive point of view, I would expect the textual search to be 
performed only over the subset. But that is probably not possible, is it? I 
have read various sources, and there have been conflicting statements about 
this.




[jira] [Commented] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query

2022-05-09 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533747#comment-17533747
 ] 

Tomoko Uchida commented on LUCENE-10562:


I'm not fully sure how filters are implemented in Solr, but at least in recent 
Lucene there is no substantial difference between filters and queries in 
implementation (a filter is actually a normal query that skips score 
calculation), and as far as I know there is no way to control query execution 
order.
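To illustrate (a self-contained sketch with made-up doc IDs, not Lucene internals): in a conjunction, the main query and the filter each produce a sorted doc-ID stream, and the two streams are advanced in leapfrog fashion. The filter can skip documents during this intersection, but it cannot skip the term-dictionary expansion a wildcard query performs before its stream exists.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of how a main query and a filter are combined in a conjunction:
// both produce sorted doc-ID iterators that are advanced in leapfrog
// (two-pointer) fashion. The doc IDs below are made up. Note what the filter
// can and cannot save: it skips non-matching docs during intersection, but
// the up-front cost of expanding *searchvalue* over the term dictionary has
// already been paid by the time this loop runs.
public class LeapfrogSketch {
    static List<Integer> intersect(int[] mainDocs, int[] filterDocs) {
        List<Integer> hits = new ArrayList<>();
        int i = 0, j = 0;
        while (i < mainDocs.length && j < filterDocs.length) {
            if (mainDocs[i] == filterDocs[j]) {        // doc matches both clauses
                hits.add(mainDocs[i]);
                i++; j++;
            } else if (mainDocs[i] < filterDocs[j]) {  // advance whichever lags
                i++;
            } else {
                j++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] wildcardMatches = {2, 5, 8, 13, 21};  // docs matching the main query
        int[] filterMatches = {5, 13, 40};          // docs matching the fq
        System.out.println(intersect(wildcardMatches, filterMatches)); // [5, 13]
    }
}
```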




[jira] [Commented] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query

2022-05-09 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533751#comment-17533751
 ] 

Tomoko Uchida commented on LUCENE-10562:


One thing I can recommend: instead of using regex queries for suffix match, 
you could "reverse" each term and convert a suffix match into a prefix match. 
ReverseStringFilter does the trick:
https://lucene.apache.org/core/7_4_0/analyzers-common/org/apache/lucene/analysis/reverse/ReverseStringFilter.html
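The idea can be sketched without Lucene (the terms below are made up; this is the principle behind ReverseStringFilter, not its API): reverse each term at index time, and a suffix match becomes a prefix match that can seek in the sorted term dictionary instead of scanning it.

```java
import java.util.List;
import java.util.TreeSet;

// Toy sketch of the "reversed terms" trick.
public class ReversedTermsSketch {
    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // How many indexed terms end with the given suffix, answered as a
    // prefix lookup over reversed terms (a range scan, not a full scan).
    static int countWithSuffix(List<String> terms, String suffix) {
        TreeSet<String> reversed = new TreeSet<>();
        for (String t : terms) {
            reversed.add(reverse(t));            // index time: store reversed term
        }
        String prefix = reverse(suffix);         // query time: *value -> "eulav"
        return reversed.subSet(prefix, prefix + Character.MAX_VALUE).size();
    }

    public static void main(String[] args) {
        List<String> terms = List.of("searchvalue", "othervalue", "search");
        System.out.println(countWithSuffix(terms, "value") + " terms end with 'value'");
    }
}
```

The trade-off is extra index size, since each term is stored twice (forward and reversed).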




[jira] [Comment Edited] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query

2022-05-09 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533751#comment-17533751
 ] 

Tomoko Uchida edited comment on LUCENE-10562 at 5/9/22 10:57 AM:
-

One thing I can recommend: instead of using wildcard queries for suffix match, 
you could "reverse" each term and convert a suffix match into a prefix match. 
ReverseStringFilter does the trick:
https://lucene.apache.org/core/7_4_0/analyzers-common/org/apache/lucene/analysis/reverse/ReverseStringFilter.html


was (Author: tomoko uchida):
One thing I could recommend is, that instead of using regex queries for suffix 
match,  you could "reverse" each term and convert suffix match into prefix 
match. ReverseStringFilter does the trick.
https://lucene.apache.org/core/7_4_0/analyzers-common/org/apache/lucene/analysis/reverse/ReverseStringFilter.html




[jira] [Commented] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query

2022-05-09 Thread Henrik Hertel (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533762#comment-17533762
 ] 

Henrik Hertel commented on LUCENE-10562:


Sure, that could help, but I guess it would increase my index size by some 
factor. I might evaluate that option - thanks a lot.




[jira] [Commented] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query

2022-05-09 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533765#comment-17533765
 ] 

Tomoko Uchida commented on LUCENE-10562:


bq. I guess that would increase my index size by some factor

You're right - that is the pain point of the "reversed terms" strategy.




[GitHub] [lucene] mayya-sharipova commented on pull request #872: LUCENE-10527 Use 2*maxConn for last layer in HNSW

2022-05-09 Thread GitBox


mayya-sharipova commented on PR #872:
URL: https://github.com/apache/lucene/pull/872#issuecomment-1121144832

@jtibshirani Thanks for the comment. I've rerun the benchmarks as you 
suggested, and here are the new results:

```txt
k     Approach                                             Recall        QPS
10    luceneknn dim=100 {'M': 32, 'efConstruction': 100}    0.571   1874.073
50    luceneknn dim=100 {'M': 32, 'efConstruction': 100}    0.801    752.443
100   luceneknn dim=100 {'M': 32, 'efConstruction': 100}    0.865    463.214
500   luceneknn dim=100 {'M': 32, 'efConstruction': 100}    0.959    129.944
800   luceneknn dim=100 {'M': 32, 'efConstruction': 100}    0.974     87.815
1000  luceneknn dim=100 {'M': 32, 'efConstruction': 100}    0.980     73.514

10    hnswlib ({'M': 32, 'efConstruction': 100})            0.552  16745.433
50    hnswlib ({'M': 32, 'efConstruction': 100})            0.794   5738.468
100   hnswlib ({'M': 32, 'efConstruction': 100})            0.860   3336.386
500   hnswlib ({'M': 32, 'efConstruction': 100})            0.956    832.982
800   hnswlib ({'M': 32, 'efConstruction': 100})            0.973    541.097
1000  hnswlib ({'M': 32, 'efConstruction': 100})            0.979    442.163
```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mayya-sharipova opened a new pull request, #874: LUCENE-10471 Increse max dims for vectors to 2048

2022-05-09 Thread GitBox


mayya-sharipova opened a new pull request, #874:
URL: https://github.com/apache/lucene/pull/874

   Increase the maximum number of dimensions for KNN vectors to 2048.

   The current maximum allowed number of dimensions is 1024, but in practice we
   see a number of models that produce vectors with more than 1024 dimensions,
   especially for image encoding (e.g. mobilenet_v2 uses 1280d vectors, and
   OpenAI / GPT-3 Babbage uses 2048d vectors). Increasing the maximum to `2048`
   will satisfy these use cases.

   We do not recommend increasing vector dimensions further.
   





[jira] [Commented] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query

2022-05-09 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533856#comment-17533856
 ] 

Uwe Schindler commented on LUCENE-10562:


Hi,
I think these questions do not relate to Lucene and are not bugs at all. They 
should be asked on the Solr mailing list: us...@solr.apache.org.

This is not a bug, and there is no way to improve this situation inside Lucene. 
Some additional hints:
- Consider using the reverse wildcard filter in Solr (there is documentation 
about this). It won't help, though, if you need a wildcard on both sides of the 
term.
- Consider disabling wildcards for end users in your case (the flexible or 
dismax query parser in Solr can do this).

In general, using wildcards in a full-text search engine is a sign that text 
analysis is not working correctly. Based on your name and profile, this looks 
like a typical "German language problem". Compounds are common in German 
("Donaudampfschiffahrtskapitän", the captain of a steam-powered ship on the 
German river Donau), and wildcard use is usually a sign of missing 
decompounding. Decompounding can be done with the hyphenation-compound token 
filter in combination with dictionaries. An example with minimal data files 
for German is here: https://github.com/uschindler/german-decompounder

Once you do decompounding, wildcards should not be needed.
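As a rough illustration of dictionary-based decompounding (a toy dictionary and a greedy strategy, not the hyphenation-based filter linked above): once compounds are split into their parts at index time, each part is findable with a plain term query and no wildcard is needed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Minimal decompounding sketch with a made-up dictionary. Real decompounding
// (e.g. Lucene's hyphenation-compound filter) is more sophisticated; this
// only shows why split parts make wildcards unnecessary.
public class DecompoundSketch {
    static final Set<String> DICT = Set.of("donau", "dampf", "schiff", "fahrt", "kapitaen");

    // Greedy left-to-right split, longest dictionary match first;
    // returns the original token unchanged if it cannot be fully split.
    static List<String> split(String token) {
        List<String> parts = new ArrayList<>();
        int pos = 0;
        while (pos < token.length()) {
            int end = -1;
            for (int e = token.length(); e > pos; e--) {
                if (DICT.contains(token.substring(pos, e))) { end = e; break; }
            }
            if (end < 0) return List.of(token);   // unsplittable: keep as-is
            parts.add(token.substring(pos, end));
            pos = end;
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("donaudampfschiff")); // [donau, dampf, schiff]
    }
}
```

If queries are analyzed the same way, a search for "dampfschiff" matches the indexed parts term-by-term instead of requiring *dampfschiff*.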




[jira] [Resolved] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query

2022-05-09 Thread Uwe Schindler (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler resolved LUCENE-10562.

Resolution: Won't Fix




[jira] [Comment Edited] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query

2022-05-09 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533856#comment-17533856
 ] 

Uwe Schindler edited comment on LUCENE-10562 at 5/9/22 3:30 PM:


Hi,
I think these questions do not relate to Lucene and are not bugs at all. They 
should be asked on the Solr mailing list: us...@solr.apache.org.

This is not a bug, and there is no way to improve this situation inside Lucene. 
Some additional hints:
- Consider using the reverse wildcard filter in Solr (there is documentation 
about this). It won't help, though, if you need a wildcard on both sides of the 
term.
- Consider disabling wildcards for end users in your case (the flexible or 
dismax query parser in Solr can do this).
- Decompounding may be needed (see below).

In general, using wildcards in a full-text search engine is a sign that text 
analysis is not working correctly. Based on your name and profile, this looks 
like a typical "German language problem". Compounds are common in German 
("Donaudampfschiffahrtskapitän", the captain of a steam-powered ship on the 
German river Donau), and wildcard use is usually a sign of missing 
decompounding. Decompounding can be done with the hyphenation-compound token 
filter in combination with dictionaries. An example with minimal data files 
for German is here: https://github.com/uschindler/german-decompounder

Once you do decompounding, wildcards should not be needed.


was (Author: thetaphi):
Hi,
I think those question do not relate to Lucene and are no issues at all.

I think those quetsions should be asked on the Solr mailing list: 
us...@solr.apache.org.

This is not a bug and there is no way to improve this situation inside Lucene. 
Some additional hints:
- Consider using the reverse wildcard filter in Solr (there's documentation 
about this). But this won't help if you need a wildcard on both sides of the 
star
- Consider to disable wildcards for end-users in your case (the flexible or 
dismax query parser in Solr can do this)

In general, using wildcards in a full text search engine is showing that text 
analysis works wrong. Based on your name and profile, it looks like this is a 
typical "German language problem". In Germany, compounds are usual 
("Donaudampschiffahrtskapitän", the captain of a steam powered ship on the 
German river Donau) and then people using wildcards is always a sign for 
missing decompounding. This can be done with hyphenation-compound token filter 
in combination with dictionaries. An example and minimalized data files for 
German language is here: https://github.com/uschindler/german-decompounder

When you do decompounding, wildcards should not be needed.




[jira] [Comment Edited] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query

2022-05-09 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533856#comment-17533856
 ] 

Uwe Schindler edited comment on LUCENE-10562 at 5/9/22 3:31 PM:


Hi,
I think these questions do not relate to Lucene and are not bugs at all. They 
should be asked on the Solr mailing list: us...@solr.apache.org.

This is not a bug, and there is no way to improve this situation inside Lucene. 
Some additional hints:
- Consider using the reverse wildcard filter in Solr (there's documentation 
about this). But this won't help if you need a wildcard on both sides of the 
term.
- Consider disabling wildcards for end users in your case (the flexible or 
dismax query parser in Solr can do this).
- Decompounding may be needed (see below).

In general, needing wildcards in a full-text search engine is a sign that text 
analysis is not working correctly. Based on your name and profile, it looks like 
this is a typical "German language problem". In German, compounds are common 
("Donaudampfschiffahrtskapitän", the captain of a steam-powered ship on the 
German river Donau), and users reaching for wildcards is usually a sign of 
missing decompounding. This can be done with the hyphenation-compound token 
filter in combination with dictionaries. An example with minimized data files 
for German is here: https://github.com/uschindler/german-decompounder

When you do decompounding, wildcards should not be needed.
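
As a sketch of the reverse-wildcard hint above (attribute names and values should be checked against the Solr reference guide for your version; the thresholds here are illustrative), a field type using Solr's reversed-wildcard support might look like:

```xml
<fieldType name="text_rev" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- indexes each token twice, as-is and reversed, so the query
         parser can rewrite a leading wildcard into a prefix query
         against the reversed form -->
    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
            maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Note the trade-off: the indexed field roughly doubles in size, and as stated above it does not help `*term*` queries with wildcards on both sides.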


was (Author: thetaphi):
Hi,
I think these questions do not relate to Lucene and are not bugs at all. They 
should be asked on the Solr mailing list: us...@solr.apache.org.

This is not a bug, and there is no way to improve this situation inside Lucene. 
Some additional hints:
- Consider using the reverse wildcard filter in Solr (there's documentation 
about this). But this won't help if you need a wildcard on both sides of the 
star.
- Consider disabling wildcards for end users in your case (the flexible or 
dismax query parser in Solr can do this).
- Decompounding may be needed (see below).

In general, needing wildcards in a full-text search engine is a sign that text 
analysis is not working correctly. Based on your name and profile, it looks like 
this is a typical "German language problem". In German, compounds are common 
("Donaudampfschiffahrtskapitän", the captain of a steam-powered ship on the 
German river Donau), and users reaching for wildcards is usually a sign of 
missing decompounding. This can be done with the hyphenation-compound token 
filter in combination with dictionaries. An example with minimized data files 
for German is here: https://github.com/uschindler/german-decompounder

When you do decompounding, wildcards should not be needed.







[jira] [Commented] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query

2022-05-09 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533857#comment-17533857
 ] 

Uwe Schindler commented on LUCENE-10562:


As an explanation of why this is slow: it has nothing to do with filters or 
which query runs first. The problem occurs before that: wildcard queries are 
expanded to filter bitsets / large OR queries during query preprocessing (the 
rewrite phase). This happens before the actual query is executed. So as soon as 
you have a wildcard with many matching terms, the preprocessing takes a 
significant amount of time. The actual query execution is fast and can be 
optimized. Due to the way an inverted index is built, there is no way to use 
another query to limit the amount of preprocessing work.
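
To illustrate the point, here is a toy Python sketch (a model of the idea, not Lucene code) of why the leading wildcard is the expensive part: a sorted term dictionary lets a prefix query seek directly to its matches, while `*infix*` must inspect every term in the field during rewrite, before any filter query can help.

```python
import bisect

# Toy term dictionary: Lucene stores a field's terms in sorted order.
terms = sorted(["research", "search", "searchvalue",
                "searchvalues", "value", "valuesearch"])

def expand_prefix(prefix):
    """Prefix query: seek to the first match, stop at the first
    non-match. Cost is proportional to the number of matching terms."""
    start = bisect.bisect_left(terms, prefix)
    matches = []
    for t in terms[start:]:
        if not t.startswith(prefix):
            break
        matches.append(t)
    return matches

def expand_infix(infix):
    """Leading-wildcard query (*infix*): no seek is possible, so the
    rewrite phase walks the entire term dictionary of the field."""
    return [t for t in terms if infix in t]

print(expand_prefix("search"))      # seeks, then scans only matches
print(expand_infix("searchvalue"))  # scans all terms in the field
```

This is why the filter query makes no difference: the term-dictionary walk happens per field, before document-level filtering is even consulted.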







[jira] [Comment Edited] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query

2022-05-09 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533857#comment-17533857
 ] 

Uwe Schindler edited comment on LUCENE-10562 at 5/9/22 3:38 PM:


As an explanation of why this is slow: it has nothing to do with filters or 
which query runs first. The problem occurs before that: wildcard queries are 
expanded to filter bitsets / large OR queries during query preprocessing (the 
rewrite phase). This happens before the actual query is executed. So as soon as 
you have a wildcard with many matching terms, the preprocessing takes a 
significant amount of time. The actual query execution is fast and can be 
optimized. Due to the way an inverted index is built, there is no way to use 
another query to limit the amount of preprocessing work. The preprocessing time 
is linear in the total number of terms in the field, not the size of the index 
or the number of documents.


was (Author: thetaphi):
As an explanation of why this is slow: it has nothing to do with filters or 
which query runs first. The problem occurs before that: wildcard queries are 
expanded to filter bitsets / large OR queries during query preprocessing (the 
rewrite phase). This happens before the actual query is executed. So as soon as 
you have a wildcard with many matching terms, the preprocessing takes a 
significant amount of time. The actual query execution is fast and can be 
optimized. Due to the way an inverted index is built, there is no way to use 
another query to limit the amount of preprocessing work.







[GitHub] [lucene] jpountz opened a new pull request, #875: LUCENE-10560: Speed up OrdinalMap construction a bit.

2022-05-09 Thread GitBox


jpountz opened a new pull request, #875:
URL: https://github.com/apache/lucene/pull/875

I benchmarked OrdinalMap construction over high-cardinality fields, and lots of 
time gets spent in `PriorityQueue#downHeap` due to entry comparisons. I added a 
small hack that speeds up these comparisons a bit by extracting the first 8 
bytes of the terms as a comparable unsigned long, and using this long whenever 
possible for comparisons.

On a dataset that consists of 100M documents and 10M unique values of 16 random 
bytes each, OrdinalMap construction went from 9.4s to 6.0s. On the same number 
of docs/values where values consist of the same 8-byte prefix followed by 8 
random bytes, to simulate a worst-case scenario for this change, OrdinalMap 
construction went from 9.6s to 10.1s. So this looks like it can yield a 
significant speedup in some scenarios, while the slowdown is contained in the 
worst-case scenario?

Unfortunately, this worst-case scenario is not exactly unlikely, e.g. this is 
what you would get with a dataset of IPv4-mapped IPv6 addresses, where all 
values share the same 12-byte prefix.
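
The trick can be sketched in a few lines (a language-agnostic model with illustrative names, not the actual PR code): pack the first 8 bytes of each term into an unsigned 64-bit integer, compare those integers first, and only fall back to a full byte-wise comparison when the packed prefixes are equal, which is exactly the worst case described above.

```python
def prefix_key(term: bytes) -> int:
    """Pack the first 8 bytes into an unsigned 64-bit integer.

    Zero-padding short terms preserves unsigned byte-wise order, so
    comparing keys agrees with comparing the raw 8-byte prefixes."""
    return int.from_bytes(term[:8].ljust(8, b"\x00"), "big")

def compare(a: bytes, b: bytes) -> int:
    """Compare two terms, using the cheap long comparison when possible."""
    ka, kb = prefix_key(a), prefix_key(b)
    if ka != kb:
        # fast path: a single integer comparison decides the order
        return -1 if ka < kb else 1
    # slow path: the terms share their first 8 bytes (the worst case
    # from the PR description), so compare the full byte strings
    return (a > b) - (a < b)

print(compare(b"apple", b"banana"))               # fast path: -1
print(compare(b"sameprefix-a", b"sameprefix-b"))  # slow path: -1
```

The slowdown in the shared-prefix benchmark comes from paying for both paths: the long comparison never discriminates, and the full comparison still runs.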


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query

2022-05-09 Thread Henrik Hertel (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533874#comment-17533874
 ] 

Henrik Hertel commented on LUCENE-10562:


[~uschindler] you are correct, it is indeed a german problem :) Thank you for 
the explanations and the tips and I will definitely take a look at everything!







[jira] [Commented] (LUCENE-9625) Benchmark KNN search with ann-benchmarks

2022-05-09 Thread Balmukund Mandal (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533911#comment-17533911
 ] 

Balmukund Mandal commented on LUCENE-9625:
--

I was trying to run the benchmark and have a couple of questions. Indexing takes 
a long time, so is there a way to configure the benchmark to use an already 
existing index for search? Also, is there a way to configure the benchmark to 
use multiple threads for indexing (it looks to me like indexing is 
single-threaded)?

> Benchmark KNN search with ann-benchmarks
> 
>
> Key: LUCENE-9625
> URL: https://issues.apache.org/jira/browse/LUCENE-9625
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In addition to benchmarking with luceneutil, it would be good to be able to 
> make use of ann-benchmarks, which is publishing results from many approximate 
> knn algorithms, including the hnsw implementation from its authors. We don't 
> expect to challenge the performance of these native code libraries, however 
> it would be good to know just how far off we are.
> I started looking into this and posted a fork of ann-benchmarks that uses 
> KnnGraphTester  class to run these: 
> https://github.com/msokolov/ann-benchmarks. It's still a WIP; you have to 
> manually copy jars and the KnnGraphTester.class to the test host machine 
> rather than downloading from a distribution. KnnGraphTester needs some 
> modifications in order to support this process - this issue is mostly about 
> that.
> One thing I noticed is that some of the index builds with higher fanout 
> (efConstruction) settings time out at 2h (on an AWS c5 instance), so this is 
> concerning and I'll open a separate issue for trying to improve that.






[GitHub] [lucene] gsmiller commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren in RangeFacetCounts

2022-05-09 Thread GitBox


gsmiller commented on PR #843:
URL: https://github.com/apache/lucene/pull/843#issuecomment-1121351525

   @Yuti-G would it make sense to close out this PR since I don't think we plan 
to merge this as it is?





[jira] [Commented] (LUCENE-10538) TopN is not being used in getTopChildren()

2022-05-09 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533912#comment-17533912
 ] 

Greg Miller commented on LUCENE-10538:
--

So I think the order of operations here is:
1. Deliver [LUCENE-10550|https://issues.apache.org/jira/browse/LUCENE-10550], 
which would effectively _copy_ the current "top children" functionality of 
range faceting to a new API method for getting all children (which is what it's 
really doing).
2. Fix the existing "top children" functionality of range faceting to actually 
return top children (and honor the top-n parameter).

I think this issue now effectively captures #2, and is blocked until 
LUCENE-10550 is delivered. Does that sound right [~yutinggan]?

> TopN is not being used in getTopChildren()
> --
>
> Key: LUCENE-10538
> URL: https://issues.apache.org/jira/browse/LUCENE-10538
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/facet
>Reporter: Yuting Gan
>Priority: Minor
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> When looking at the overridden implementation getTopChildren(int topN, String 
> dim, String... path) in RangeFacetCounts, I found that the topN parameter is 
> not being used in the code, and the unit tests did not test this function 
> properly. I will create a PR to fix this, and will look into other overridden 
> implementations and see if they have the same issue. Please let me know if 
> there is any question. Thanks!






[GitHub] [lucene] jtibshirani commented on a diff in pull request #870: LUCENE-10502: Refactor hnswVectors format

2022-05-09 Thread GitBox


jtibshirani commented on code in PR #870:
URL: https://github.com/apache/lucene/pull/870#discussion_r868240961


##
lucene/core/src/java/org/apache/lucene/codecs/lucene92/Lucene92HnswVectorsFormat.java:
##
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene92;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.KnnVectorsFormat;
+import org.apache.lucene.codecs.KnnVectorsReader;
+import org.apache.lucene.codecs.KnnVectorsWriter;
+import org.apache.lucene.codecs.lucene90.IndexedDISI;
+import org.apache.lucene.index.SegmentReadState;
+import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.util.hnsw.HnswGraph;
+
+/**
+ * Lucene 9.2 vector format, which encodes numeric vector values and an 
optional associated graph
+ * connecting the documents having values. The graph is used to power HNSW 
search. The format
+ * consists of three files:
+ *
+ * .vec (vector data) file

Review Comment:
   Thanks! Could you please list them in order like we do for the `.vex` file 
below? I think that makes it more precise and easier to read.






[GitHub] [lucene] jtibshirani commented on pull request #872: LUCENE-10527 Use 2*maxConn for last layer in HNSW

2022-05-09 Thread GitBox


jtibshirani commented on PR #872:
URL: https://github.com/apache/lucene/pull/872#issuecomment-1121379649

   Thanks, this matches what I was seeing! It's good motivation to add Lucene 
to ann-benchmarks so we can stop using a custom local benchmark set-up!





[GitHub] [lucene] jtibshirani commented on a diff in pull request #872: LUCENE-10527 Use 2*maxConn for last layer in HNSW

2022-05-09 Thread GitBox


jtibshirani commented on code in PR #872:
URL: https://github.com/apache/lucene/pull/872#discussion_r868255093


##
lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsWriter.java:
##
@@ -53,12 +53,14 @@ public final class Lucene91HnswVectorsWriter extends 
KnnVectorsWriter {
   private final int maxDoc;
 
   private final int maxConn;
+  private final int maxConn0;
   private final int beamWidth;
   private boolean finished;
 
-  Lucene91HnswVectorsWriter(SegmentWriteState state, int maxConn, int 
beamWidth)
+  Lucene91HnswVectorsWriter(SegmentWriteState state, int maxConn, int 
maxConn0, int beamWidth)

Review Comment:
   I was thinking we could just keep a single configuration parameter here, and 
internally calculate `maxConn0 = 2 * M`. If we allow it to be passed as a 
parameter, it implies that it's important to be able to configure it, but 
that's not the case (it is not something users will change, and it should 
always be set to `2 * M`). That way we could also avoid writing a new value 
`maxConn0` into the format, which doesn't seem necessary?
   
   If we are worried about naming, we could rename `maxConn` to `M`. From my 
perspective, it's okay to use single-letter variable names (with a clear 
comment!) when they directly correspond to a paper's algorithm.






[GitHub] [lucene] mayya-sharipova commented on pull request #870: LUCENE-10502: Refactor hnswVectors format

2022-05-09 Thread GitBox


mayya-sharipova commented on PR #870:
URL: https://github.com/apache/lucene/pull/870#issuecomment-1121489041

   @msokolov Thanks for your feedback on this PR. I am wondering if you have 
any further feedback for this work. It would be nice to get it merged for 9.2 
Lucene release.





[jira] [Commented] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query

2022-05-09 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533982#comment-17533982
 ] 

Tomoko Uchida commented on LUCENE-10562:


Yes, it's not about "query execution (retrieving the inverted index)" but about 
"which postings you are traversing"; there's no way to optimize when so many 
postings match the wildcard term. As a practical tip, I usually enforce two or 
three leading characters for wildcard queries when the search space needs to be 
reduced.

bq. Consider using the reverse wildcard filter in Solr (there's documentation 
about this). But this won't help if you need a wildcard on both sides of the 
term

I think you could run conjunction queries over two fields (one for the normal 
text field, one for the reversed field) to support the infix wildcard query - 
not sure it is worth adding another text field when the index is already large. 
Anyway, dictionary-based decomposition looks promising to me in the case of 
German (though I have little knowledge of it beyond the basics I learned in 
university lectures).








[jira] [Comment Edited] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query

2022-05-09 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533982#comment-17533982
 ] 

Tomoko Uchida edited comment on LUCENE-10562 at 5/9/22 7:43 PM:


Yes, it's not about "query execution (retrieving the inverted index)" but about 
"which postings you are traversing"; there's no way to optimize when so many 
postings match the wildcard term. As a practical tip, I usually enforce two or 
three leading characters for wildcard queries when the search space needs to be 
reduced.
{quote}Consider using the reverse wildcard filter in Solr (there's 
documentation about this). But this won't help if you need a wildcard on both 
sides of the term
{quote}
-I think you could run conjunction queries over two fields (one for the normal 
text field, one for the reversed field) to support the infix wildcard query- - 
not sure it is worth adding another text field when the index is already large. 
Anyway, dictionary-based decomposition looks promising to me in the case of 
German (though I have little knowledge of it beyond the basics I learned in 
university lectures).

Correction: a conjunction query does not work in this situation - sorry. nGram 
or more sophisticated term decomposition will be needed.


was (Author: tomoko uchida):
Yes, it's not about "query execution (retrieving the inverted index)" but about 
"which postings you are traversing"; there's no way to optimize when so many 
postings match the wildcard term. As a practical tip, I usually enforce two or 
three leading characters for wildcard queries when the search space needs to be 
reduced.

bq. Consider using the reverse wildcard filter in Solr (there's documentation 
about this). But this won't help if you need a wildcard on both sides of the 
term

I think you could run conjunction queries over two fields (one for the normal 
text field, one for the reversed field) to support the infix wildcard query - 
not sure it is worth adding another text field when the index is already large. 
Anyway, dictionary-based decomposition looks promising to me in the case of 
German (though I have little knowledge of it beyond the basics I learned in 
university lectures).








[jira] [Commented] (LUCENE-10551) LowercaseAsciiCompression should return false when it's unable to compress

2022-05-09 Thread Peixin Li (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533993#comment-17533993
 ] 

Peixin Li commented on LUCENE-10551:


We have identified that the issue is not related to Lucene's code. We are using 
GraalVM, which applies some optimization to this part of the code, and that 
optimization caused the issue.

We changed 
[https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/compress/LowercaseAsciiCompression.java#L55]
 from
{code:java}
while (i - previousExceptionIndex > 0xFF) {
  ++numExceptions;
  previousExceptionIndex += 0xFF;
} {code}
to
{code:java}
while (i - previousExceptionIndex > 0xFF) {
  log.trace("{}", previousExceptionIndex);
  ++numExceptions;
  previousExceptionIndex += 0xFF;
} {code}
and after that change the IllegalStateException no longer shows up.

> LowercaseAsciiCompression should return false when it's unable to compress
> --
>
> Key: LUCENE-10551
> URL: https://issues.apache.org/jira/browse/LUCENE-10551
> Project: Lucene - Core
>  Issue Type: Bug
> Environment: Lucene version 8.11.1
>Reporter: Peixin Li
>Priority: Major
> Attachments: LUCENE-10551-test.patch
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> {code:java}
>  Failed to commit..
> java.lang.IllegalStateException: 10 <> 5 
> cion1cion_desarrollociones_oraclecionesnaturacionesnatura2tedppsa-integrationdemotiontion
>  cloud gen2tion instance - dev1tion instance - 
> testtion-devbtion-instancetion-prdtion-promerication-qation064533tion535217tion697401tion761348tion892818tion_matrationcauto_simmonsintgic_testtioncloudprodictioncloudservicetiongateway10tioninstance-jtsundatamartprd??o
>         at 
> org.apache.lucene.util.compress.LowercaseAsciiCompression.compress(LowercaseAsciiCompression.java:115)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlock(BlockTreeTermsWriter.java:834)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlocks(BlockTreeTermsWriter.java:628)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.pushTerm(BlockTreeTermsWriter.java:947)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:912)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:318)
>         at 
> org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.write(PerFieldPostingsFormat.java:170)
>         at 
> org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:120)
>         at 
> org.apache.lucene.index.DefaultIndexingChain.flush(DefaultIndexingChain.java:267)
>         at 
> org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:350)
>         at 
> org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:476)
>         at 
> org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:656)
>         at 
> org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3364)
>         at 
> org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3770)
>         at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3728)
>        {code}
> {code:java}
> key=och-live--WorkResource.renewAssignmentToken.ResourceTime[namespace=workflow,
>  resourceGroup=workflow-service-overlay]{availabilityDomain=iad-ad-1, 
> domainId=och-live, host=workflow-service-overlay-01341.node.ad1.us-ashburn-1})
> java.lang.IllegalStateException: 29 <> 16 
> analytics-platform-test/koala/cluster-tool:1.0-20220310151438.492,mesh_istio_examples-bookinfo-details-v1:1.16.2mesh_istio_examples-bookinfo-reviews-v3:1.16.2oce-clamav:1.0.219oce-tesseract:1.0.7oce-traefik:2.5.1oci-opensearch:1.2.4.8.103oda-digital-assistant-control-plane-train-pool-workflow-v6:22.02.14oke-coresvcs-k8s-dns-dnsmasq-nanny-amd64@sha256:41aa9160ceeaf712369ddb660d02e5ec06d1679965e6930351967c8cf5ed62d4oke-coresvcs-k8s-dns-kube-dns-amd64@sha256:2cf34b04106974952996c6ef1313f165ce65b4ad68a3051f51b1b8f91ba5f838oke-coresvcs-k8s-dns-sidecar-amd64@sha256:8a82c7288725cb4de9c7cd8d5a78279208e379f35751539b406077f9a3163dcdoke-coresvcs-node-problem-detector@sha256:9d54df11804a862c54276648702a45a6a0027a9d930a86becd69c34cc84bf510oke-coresvcs-oke-fluentd-lumberjack@sha256:5f3f10b187eb804ce4e84bc3672de1cf318c0f793f00dac01cd7da8beea8f269oke-etcd-operator@sha256:4353a2e5ef02bb0f6b046a8d6219b1af359a2c1141c358ff110e395f29d0bfc8oke-oke-hyperkube-amd64@sha256:3c734f46099400507f938090eb9a874338fa25cde425ac9409df4c885759752foke-public-busybox@sha256:4cee1979ba0bf7

[jira] [Comment Edited] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query

2022-05-09 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533982#comment-17533982
 ] 

Tomoko Uchida edited comment on LUCENE-10562 at 5/9/22 8:08 PM:


Yes, it's not about "query execution (retrieving inverted index)" but "what 
postings you are traversing"; there's no way to optimize when you have so many 
postings that match the wildcard term. Just for a practical tip, I usually 
enforce two or three leading characters for wildcard queries when it's needed 
to reduce the search space.
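That tip can be sketched as a tiny pre-check in plain Java. This is a hypothetical guard applied before query construction, not an existing Solr or Lucene API; the class and method names are made up:

```java
// Hypothetical guard (not a Solr/Lucene API): reject wildcard terms whose
// leading literal prefix is shorter than a configured minimum, so queries
// like "*searchvalue*" or "s*" never reach the term dictionary.
public class WildcardGuard {

    // Counts the literal characters before the first wildcard character.
    static int leadingLiteralLength(String term) {
        int i = 0;
        while (i < term.length() && term.charAt(i) != '*' && term.charAt(i) != '?') {
            i++;
        }
        return i;
    }

    // Plain terms are always allowed; wildcard terms only with a long enough prefix.
    static boolean isAllowed(String term, int minPrefix) {
        if (term.indexOf('*') < 0 && term.indexOf('?') < 0) {
            return true;
        }
        return leadingLiteralLength(term) >= minPrefix;
    }

    public static void main(String[] args) {
        check(!isAllowed("*searchvalue*", 3)); // leading wildcard: rejected
        check(!isAllowed("se*", 3));           // two leading chars: too short
        check(isAllowed("sea*", 3));           // three leading chars: allowed
        check(isAllowed("searchvalue", 3));    // no wildcard: always allowed
        System.out.println("ok");
    }

    private static void check(boolean condition) {
        if (!condition) throw new AssertionError();
    }
}
```

With minPrefix=3, `*searchvalue*` and `se*` are rejected up front, while `sea*` and plain terms pass through.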
{quote}Consider using the reverse wildcard filter in Solr (there's 
documentation about this). But this won't help if you need a wildcard on both 
sides of the term
{quote}
-I think you could run conjunction queries of two fields (one for the normal 
text field, one for the reversed field) to support the infix wildcard query- - 
not sure it is worth adding another text field when the index is already large. 
Anyway, dictionary-based decomposition looks promising to me in the case of 
German (though I have little knowledge of it beyond the basics I learned in 
university lectures).

Correction: a conjunction query does not work in this situation - sorry. nGram or 
more sophisticated term decomposition will be needed. For example, in my 
language (Japanese, which does not even have spaces between terms), the 
combination of an off-the-shelf nGram filter and phrase search often works well.
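As a minimal illustration of the nGram idea, here is a sketch in plain Java that models the decomposition (it is not the Lucene NGramTokenFilter API, and the names are made up): every 3-gram of the indexed text becomes an exact term, so an infix pattern is answered by looking up its own 3-grams rather than enumerating the whole term dictionary.

```java
import java.util.HashSet;
import java.util.Set;

// Models n-gram decomposition: index every n-gram of a field value, then
// answer an infix pattern by requiring all of the pattern's n-grams to be
// present. This is only a candidate check: a phrase/verification pass is
// needed to remove false positives where the grams occur non-contiguously,
// and patterns shorter than n need separate handling.
public class NGramSketch {

    static Set<String> ngrams(String text, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    static boolean candidateMatch(Set<String> indexedGrams, String infix, int n) {
        return indexedGrams.containsAll(ngrams(infix, n));
    }

    public static void main(String[] args) {
        Set<String> doc = ngrams("searchvalue", 3);
        if (!candidateMatch(doc, "archval", 3)) throw new AssertionError();
        if (candidateMatch(doc, "zzz", 3)) throw new AssertionError();
        System.out.println("ok");
    }
}
```

Each lookup here is an exact-term lookup, which is why the cost no longer depends on the size of the term dictionary the way `*searchvalue*` does.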


was (Author: tomoko uchida):
Yes, it's not about "query execution (retrieving inverted index)" but "what 
postings you are traversing"; there's no way to optimize when you have so many 
postings that match the wildcard term. Just for a practical tip, I usually 
enforce two or three leading characters for wildcard queries when it's needed 
to reduce the search space.
{quote}Consider using the reverse wildcard filter in Solr (there's 
documentation about this). But this won't help if you need a wildcard on both 
sides of the term
{quote}
-I think you could run conjunction queries of two fields (one for the normal 
text field, one for the reversed field) to support the infix wildcard query- - 
not sure it is worth adding another text field when the index is already large. 
Anyway, dictionary-based decomposition looks promising to me in the case of 
German (though I have little knowledge of it beyond the basics I learned in 
university lectures).

Correction: conjunction query does not work in this situation - sorry. nGram or 
more sophisticated term decomposition will be needed.

> Large system: Wildcard search leads to full index scan despite filter query
> ---
>
> Key: LUCENE-10562
> URL: https://issues.apache.org/jira/browse/LUCENE-10562
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 8.11.1
>Reporter: Henrik Hertel
>Priority: Major
>  Labels: performance
>
> I use Solr and have a large system with 1TB in one core and about 5 million 
> documents. The textual content of large PDF files is indexed there. My query 
> is extremely slow (more than 30 seconds)  as soon as I use wildcards e.g. 
> {code:java}
> *searchvalue*
> {code}
> , even though I put a filter query in front of it that reduces to less than 
> 20 documents.
> searchvalue -> less than 1 second
> searchvalue* -> less than 1 second
> My query:
> {code:java}
> select?defType=lucene&q=content_t:*searchvalue*&fq=metadataitemids_is:20950&fl=id&rows=50&start=0
>  {code}
> I've tried everything imaginable. It doesn't make sense to me why a search 
> over a small subset should take so long. If I omit the filter query 
> metadataitemids_is:20950, i.e. search the entire inventory, it takes the 
> same amount of time. Therefore, I suspect that despite the filter query, 
> the main query runs over the entire index.






[jira] [Commented] (LUCENE-10471) Increase the number of dims for KNN vectors to 2048

2022-05-09 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534018#comment-17534018
 ] 

Julie Tibshirani commented on LUCENE-10471:
---

I also don't have an objection to increasing it a bit. But along the same lines 
as Robert's point, it'd be good to think about our decision-making process; 
otherwise we'd be tempted to continuously increase it. I've already heard users 
requesting 12288 dims (to handle OpenAI DaVinci embeddings).

Two possible approaches I could see:
1. We do more research on the literature and decide on a reasonable max 
dimension. If a user wants to go beyond that, they should reconsider the model 
or perform dimensionality reduction. This would encourage users to think 
through their embedding strategy to optimize for performance. The improvements 
can be significant, since search time scales with vector dimensionality.
2. Or we take a flexible approach where we bump the limit to a high upper 
bound. This upper bound would be based on how much memory usage is reasonable 
for one vector (similar to the max term size?)

I feel a bit better about approach 2 because I'm not confident I could come up 
with a statement about a "reasonable max dimension", especially given the 
fast-moving research.
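For approach 2, the per-vector cost is simple arithmetic. A hedged sketch assuming uncompressed float32 storage at 4 bytes per dimension and ignoring any per-vector codec or graph overhead; `bytesPerVector` is an illustrative helper, not a Lucene API:

```java
// Per-vector memory at 4 bytes per float32 dimension; codec overhead,
// HNSW graph links, and quantization are deliberately ignored here.
public class VectorMemory {

    static long bytesPerVector(int dims) {
        return 4L * dims;
    }

    public static void main(String[] args) {
        if (bytesPerVector(1024) != 4_096) throw new AssertionError();   // current limit: 4 KiB
        if (bytesPerVector(2048) != 8_192) throw new AssertionError();   // proposed: 8 KiB
        if (bytesPerVector(12288) != 49_152) throw new AssertionError(); // DaVinci-sized: 48 KiB
        System.out.println("ok");
    }
}
```

A bound framed this way ("no single vector may exceed N KiB") scales mechanically to whatever limit is chosen, much like a max term size.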

> Increase the number of dims for KNN vectors to 2048
> ---
>
> Key: LUCENE-10471
> URL: https://issues.apache.org/jira/browse/LUCENE-10471
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Mayya Sharipova
>Priority: Trivial
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The current maximum allowed number of dimensions is equal to 1024. But we see 
> in practice a couple well-known models that produce vectors with > 1024 
> dimensions (e.g 
> [mobilenet_v2|https://tfhub.dev/google/imagenet/mobilenet_v2_035_224/feature_vector/1]
>  uses 1280d vectors, OpenAI / GPT-3 Babbage uses 2048d vectors). Increasing 
> max dims to `2048` will satisfy these use cases.
> I am wondering if anybody has strong objections against this.






[jira] [Commented] (LUCENE-10538) TopN is not being used in getTopChildren()

2022-05-09 Thread Yuting Gan (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534019#comment-17534019
 ] 

Yuting Gan commented on LUCENE-10538:
-

Yes, thanks [~gsmiller]! I am working on LUCENE-10550 and should have a PR out 
soon; then I will revisit this issue and return the real top-n children.

> TopN is not being used in getTopChildren()
> --
>
> Key: LUCENE-10538
> URL: https://issues.apache.org/jira/browse/LUCENE-10538
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/facet
>Reporter: Yuting Gan
>Priority: Minor
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> When looking at the overridden implementation getTopChildren(int topN, String 
> dim, String... path) in RangeFacetCounts, I found that the topN parameter is 
> not being used in the code, and the unit tests did not test this function 
> properly. I will create a PR to fix this, and will look into other overridden 
> implementations and see if they have the same issue. Please let me know if 
> there is any question. Thanks!






[GitHub] [lucene] Yuti-G closed pull request #843: LUCENE-10538: TopN is not being used in getTopChildren in RangeFacetCounts

2022-05-09 Thread GitBox


Yuti-G closed pull request #843: LUCENE-10538: TopN is not being used in 
getTopChildren in RangeFacetCounts
URL: https://github.com/apache/lucene/pull/843


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on pull request #777: LUCENE-10488: Optimize Facets#getTopDims in ConcurrentSortedSetDocValuesFacetCounts

2022-05-09 Thread GitBox


gsmiller commented on PR #777:
URL: https://github.com/apache/lucene/pull/777#issuecomment-1121602253

   This looks great! Thanks @Yuti-G! It would be nice if we could create a 
common abstract class to hold some of the common logic between this and the 
non-concurrent implementation. Seems like a lot of copy/paste going on. This 
isn't a new problem though, so let's not do that work as part of this PR. What 
do you think of opening a separate issue to see if we can consolidate some 
common logic? You probably have a better idea of how feasible this is after 
working on these changes, so I'm curious what you think. Thanks again for 
taking this on!





[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #872: LUCENE-10527 Use 2*maxConn for last layer in HNSW

2022-05-09 Thread GitBox


mayya-sharipova commented on code in PR #872:
URL: https://github.com/apache/lucene/pull/872#discussion_r868594873


##
lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsWriter.java:
##
@@ -53,12 +53,14 @@ public final class Lucene91HnswVectorsWriter extends 
KnnVectorsWriter {
   private final int maxDoc;
 
   private final int maxConn;
+  private final int maxConn0;
   private final int beamWidth;
   private boolean finished;
 
-  Lucene91HnswVectorsWriter(SegmentWriteState state, int maxConn, int 
beamWidth)
+  Lucene91HnswVectorsWriter(SegmentWriteState state, int maxConn, int 
maxConn0, int beamWidth)

Review Comment:
   @jtibshirani Thanks for the comment, this is a great suggestion! Addressed 
in b1a6394402d8d535b092de73b227fb5a40015c65






[GitHub] [lucene] rmuir commented on pull request #875: LUCENE-10560: Speed up OrdinalMap construction a bit.

2022-05-09 Thread GitBox


rmuir commented on PR #875:
URL: https://github.com/apache/lucene/pull/875#issuecomment-1121772167

   I kinda feel like in this case we are trying to outsmart the JIT compiler 
with optimizations it has for `Arrays.equals()`. 
   
   I understand the idea that we could be smarter based on the data ... maybe 
... but I don't think we "know enough". 
   
   I'd rather we know exactly what that threshold is and write appropriate 
metadata (e.g. min/maxTermLength) if we can figure it out, rather than causing 
regressions for some use cases. Bonus if any additional metadata can be 
utilized by checkindex.





[jira] [Commented] (LUCENE-10551) LowercaseAsciiCompression should return false when it's unable to compress

2022-05-09 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534074#comment-17534074
 ] 

Robert Muir commented on LUCENE-10551:
--

Thanks [~irislpx] for solving the mystery. 

Maybe there is some cheaper tweak/workaround we can apply to dodge the GraalVM 
bug? If you happen to find one, can you please let us know?

I'd rather not spam users with logs, but if we can just prevent the problem by 
re-arranging the code without a performance hit, I think it is compelling?
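One possible rearrangement, offered purely as an untested sketch (whether it actually dodges the GraalVM bug is unknown): compute the step count in closed form so there is no data-dependent loop body left to mis-optimize. A caller would also advance `previousExceptionIndex` by `steps * 0xFF`; the class and method names here are illustrative, not existing Lucene code.

```java
// Compares the original data-dependent loop with an equivalent closed form.
public class ExceptionSteps {

    // Loop form, shaped like the code in LowercaseAsciiCompression.
    static int loopSteps(int i, int previousExceptionIndex) {
        int steps = 0;
        while (i - previousExceptionIndex > 0xFF) {
            steps++;
            previousExceptionIndex += 0xFF;
        }
        return steps;
    }

    // Closed form: number of 0xFF-sized hops needed until the gap is <= 0xFF.
    static int closedFormSteps(int i, int previousExceptionIndex) {
        int gap = i - previousExceptionIndex;
        return gap > 0xFF ? (gap - 1) / 0xFF : 0;
    }

    public static void main(String[] args) {
        // Exhaustively verify equivalence for a range of gaps.
        for (int gap = 0; gap < 5000; gap++) {
            if (loopSteps(gap, 0) != closedFormSteps(gap, 0)) {
                throw new AssertionError("gap=" + gap);
            }
        }
        System.out.println("ok");
    }
}
```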

> LowercaseAsciiCompression should return false when it's unable to compress
> --
>
> Key: LUCENE-10551
> URL: https://issues.apache.org/jira/browse/LUCENE-10551
> Project: Lucene - Core
>  Issue Type: Bug
> Environment: Lucene version 8.11.1
>Reporter: Peixin Li
>Priority: Major
> Attachments: LUCENE-10551-test.patch
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> {code:java}
>  Failed to commit..
> java.lang.IllegalStateException: 10 <> 5 
> cion1cion_desarrollociones_oraclecionesnaturacionesnatura2tedppsa-integrationdemotiontion
>  cloud gen2tion instance - dev1tion instance - 
> testtion-devbtion-instancetion-prdtion-promerication-qation064533tion535217tion697401tion761348tion892818tion_matrationcauto_simmonsintgic_testtioncloudprodictioncloudservicetiongateway10tioninstance-jtsundatamartprd??o
>         at 
> org.apache.lucene.util.compress.LowercaseAsciiCompression.compress(LowercaseAsciiCompression.java:115)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlock(BlockTreeTermsWriter.java:834)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlocks(BlockTreeTermsWriter.java:628)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.pushTerm(BlockTreeTermsWriter.java:947)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:912)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:318)
>         at 
> org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.write(PerFieldPostingsFormat.java:170)
>         at 
> org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:120)
>         at 
> org.apache.lucene.index.DefaultIndexingChain.flush(DefaultIndexingChain.java:267)
>         at 
> org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:350)
>         at 
> org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:476)
>         at 
> org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:656)
>         at 
> org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3364)
>         at 
> org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3770)
>         at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3728)
>        {code}
> {code:java}
> key=och-live--WorkResource.renewAssignmentToken.ResourceTime[namespace=workflow,
>  resourceGroup=workflow-service-overlay]{availabilityDomain=iad-ad-1, 
> domainId=och-live, host=workflow-service-overlay-01341.node.ad1.us-ashburn-1})
> java.lang.IllegalStateException: 29 <> 16 
> analytics-platform-test/koala/cluster-tool:1.0-20220310151438.492,mesh_istio_examples-bookinfo-details-v1:1.16.2mesh_istio_examples-bookinfo-reviews-v3:1.16.2oce-clamav:1.0.219oce-tesseract:1.0.7oce-traefik:2.5.1oci-opensearch:1.2.4.8.103oda-digital-assistant-control-plane-train-pool-workflow-v6:22.02.14oke-coresvcs-k8s-dns-dnsmasq-nanny-amd64@sha256:41aa9160ceeaf712369ddb660d02e5ec06d1679965e6930351967c8cf5ed62d4oke-coresvcs-k8s-dns-kube-dns-amd64@sha256:2cf34b04106974952996c6ef1313f165ce65b4ad68a3051f51b1b8f91ba5f838oke-coresvcs-k8s-dns-sidecar-amd64@sha256:8a82c7288725cb4de9c7cd8d5a78279208e379f35751539b406077f9a3163dcdoke-coresvcs-node-problem-detector@sha256:9d54df11804a862c54276648702a45a6a0027a9d930a86becd69c34cc84bf510oke-coresvcs-oke-fluentd-lumberjack@sha256:5f3f10b187eb804ce4e84bc3672de1cf318c0f793f00dac01cd7da8beea8f269oke-etcd-operator@sha256:4353a2e5ef02bb0f6b046a8d6219b1af359a2c1141c358ff110e395f29d0bfc8oke-oke-hyperkube-amd64@sha256:3c734f46099400507f938090eb9a874338fa25cde425ac9409df4c885759752foke-public-busybox@sha256:4cee1979ba0bf7db9fc5d28fb7b798ca69ae95a47c5fecf46327720df4ff352doke-public-coredns@sha256:86f8cfc74497f04e181ab2e1d26d2fd8bd46c4b33ce24b55620efcdfcb214670oke-public-coredns@sha256:8cd974302f1f6108f6f31312f8181ae723b514e2022089cdcc3db10666c49228oke-public-etcd@sha256:b751e459bc2a8f079f6730dd8462671b253c7c8b0d0eb47c67888d5091c6bb77oke-public-etcd@sha256:d6a76200a6e9103681bc2cf7fefbcada0dd9372d52cf8964178d846b89959d14oke-public-etcd@sha256:f

[GitHub] [lucene] rmuir commented on pull request #777: LUCENE-10488: Optimize Facets#getTopDims in ConcurrentSortedSetDocValuesFacetCounts

2022-05-09 Thread GitBox


rmuir commented on PR #777:
URL: https://github.com/apache/lucene/pull/777#issuecomment-1121805275

   Just an observation: this is a large amount of code change for a performance 
gain that may be in the noise? I'm a bit confused.





[jira] [Commented] (LUCENE-10471) Increase the number of dims for KNN vectors to 2048

2022-05-09 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534086#comment-17534086
 ] 

Robert Muir commented on LUCENE-10471:
--

I think the major problem is still no Vector API in the java APIs. It changes 
this entire conversation completely when we think about this limit.

If OpenJDK would release this low-level vector API, or barring that, provide some 
way to MR-JAR for it, or barring that, add some intrinsics such as 
SloppyMath.dotProduct and SloppyMath.matrixMultiply, then maybe Java wouldn't 
become the next COBOL.

> Increase the number of dims for KNN vectors to 2048
> ---
>
> Key: LUCENE-10471
> URL: https://issues.apache.org/jira/browse/LUCENE-10471
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Mayya Sharipova
>Priority: Trivial
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The current maximum allowed number of dimensions is equal to 1024. But we see 
> in practice a couple well-known models that produce vectors with > 1024 
> dimensions (e.g 
> [mobilenet_v2|https://tfhub.dev/google/imagenet/mobilenet_v2_035_224/feature_vector/1]
>  uses 1280d vectors, OpenAI / GPT-3 Babbage uses 2048d vectors). Increasing 
> max dims to `2048` will satisfy these use cases.
> I am wondering if anybody has strong objections against this.






[GitHub] [lucene] mocobeta commented on pull request #874: LUCENE-10471 Increse max dims for vectors to 2048

2022-05-09 Thread GitBox


mocobeta commented on PR #874:
URL: https://github.com/apache/lucene/pull/874#issuecomment-1121814970

   I'm curious how practically common such large models (large to me, anyway) are, 
or will become in the near future, in the IR area.
   I don't have enough expertise to agree or disagree; it's just a general 
(and maybe naive) question.





[GitHub] [lucene] LuXugang commented on a diff in pull request #870: LUCENE-10502: Refactor hnswVectors format

2022-05-09 Thread GitBox


LuXugang commented on code in PR #870:
URL: https://github.com/apache/lucene/pull/870#discussion_r868760034


##
lucene/core/src/java/org/apache/lucene/codecs/lucene92/Lucene92HnswVectorsFormat.java:
##
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene92;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.KnnVectorsFormat;
+import org.apache.lucene.codecs.KnnVectorsReader;
+import org.apache.lucene.codecs.KnnVectorsWriter;
+import org.apache.lucene.codecs.lucene90.IndexedDISI;
+import org.apache.lucene.index.SegmentReadState;
+import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.util.hnsw.HnswGraph;
+
+/**
+ * Lucene 9.2 vector format, which encodes numeric vector values and an 
optional associated graph
+ * connecting the documents having values. The graph is used to power HNSW 
search. The format
+ * consists of three files:
+ *
+ * .vec (vector data) file

Review Comment:
   @jtibshirani  addressed in 
https://github.com/apache/lucene/pull/870/commits/d1a26e55a6f65277e875e77fd096aa962988ef49






[GitHub] [lucene] rmuir commented on pull request #832: LUCENE-10532: remove @Slow annotation

2022-05-09 Thread GitBox


rmuir commented on PR #832:
URL: https://github.com/apache/lucene/pull/832#issuecomment-1121826968

   Thanks @cpoerschke for correcting my dyslexia :)





[GitHub] [lucene] rmuir commented on pull request #832: LUCENE-10532: remove @Slow annotation

2022-05-09 Thread GitBox


rmuir commented on PR #832:
URL: https://github.com/apache/lucene/pull/832#issuecomment-1121832123

   > I'm fine with this. Reasons for Slow (and other test groups) are various. 
I use Slow in projects where certain tests are indeed slow by nature - have to 
unpack the distribution/ fork processes or start networking layers. These are 
typically integration tests. They run on the CI but they're not mandatory for 
local developer runs. I don't think this is needed in Lucene either.
   
   I like the idea of test groups, actually; it came up on another PR: 
https://github.com/apache/lucene/pull/633 . I just think Lucene would do better 
with different test groups that better test what we need (e.g. something like 
the suggested `@Concurrent`). The `@Slow` annotation doesn't help us, IMO.





[GitHub] [lucene] rmuir merged pull request #832: LUCENE-10532: remove @Slow annotation

2022-05-09 Thread GitBox


rmuir merged PR #832:
URL: https://github.com/apache/lucene/pull/832





[jira] [Commented] (LUCENE-10532) Remove @Slow annotation

2022-05-09 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534102#comment-17534102
 ] 

ASF subversion and git services commented on LUCENE-10532:
--

Commit 3edfeb5eb224344e35f3454f5d51288ab05452c1 in lucene's branch 
refs/heads/main from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=3edfeb5eb22 ]

LUCENE-10532: remove @Slow annotation (#832)

Remove `@Slow` annotation, for more consistency with CI and local jobs. All 
tests can be fast!

> Remove @Slow annotation
> ---
>
> Key: LUCENE-10532
> URL: https://issues.apache.org/jira/browse/LUCENE-10532
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This annotation is useless, people have gotten so lazy about using it, that 
> now there are proposals to mark tests that are not actually slow, with the 
> @Slow annotation.
> Let's remove the annotation. I can't imagine a situation where we mark a test 
> @Slow and i don't veto it. we can keep tests clean.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)




[jira] [Resolved] (LUCENE-10532) Remove @Slow annotation

2022-05-09 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved LUCENE-10532.
--
Fix Version/s: 9.2
   Resolution: Fixed

> Remove @Slow annotation
> ---
>
> Key: LUCENE-10532
> URL: https://issues.apache.org/jira/browse/LUCENE-10532
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
> Fix For: 9.2
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This annotation is useless; people have gotten so lazy about using it that 
> there are now proposals to mark tests that are not actually slow with the 
> @Slow annotation.
> Let's remove the annotation. I can't imagine a situation where we mark a test 
> @Slow and I don't veto it. We can keep tests clean.






[jira] [Commented] (LUCENE-10532) Remove @Slow annotation

2022-05-09 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534112#comment-17534112
 ] 

ASF subversion and git services commented on LUCENE-10532:
--

Commit 87df4aa8511efaa6e21cac1d52c259b67ff248df in lucene's branch 
refs/heads/branch_9x from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=87df4aa8511 ]

LUCENE-10532: remove @Slow annotation (#832)

Remove `@Slow` annotation, for more consistency with CI and local jobs. All 
tests can be fast!


> Remove @Slow annotation
> ---
>
> Key: LUCENE-10532
> URL: https://issues.apache.org/jira/browse/LUCENE-10532
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
> Fix For: 9.2
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This annotation is useless; people have gotten so lazy about using it that 
> there are now proposals to mark tests that are not actually slow with the 
> @Slow annotation.
> Let's remove the annotation. I can't imagine a situation where we mark a test 
> @Slow and I don't veto it. We can keep tests clean.






[jira] [Commented] (LUCENE-9356) Add tests for corruptions caused by byte flips

2022-05-09 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534116#comment-17534116
 ] 

Robert Muir commented on LUCENE-9356:
-

The test seems wrong to me; for example, it does not consider CRC-32 collisions. 
I think we should just remove the test for this reason.

> Add tests for corruptions caused by byte flips
> --
>
> Key: LUCENE-9356
> URL: https://issues.apache.org/jira/browse/LUCENE-9356
> Project: Lucene - Core
>  Issue Type: Test
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> We already have tests verifying that file truncation and modification of the 
> index headers are caught correctly. I'd like to add another test verifying 
> that flipping a byte in a way that modifies the checksum of the file is 
> always caught gracefully by Lucene.






[jira] [Commented] (LUCENE-9356) Add tests for corruptions caused by byte flips

2022-05-09 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534117#comment-17534117
 ] 

Robert Muir commented on LUCENE-9356:
-

By the way, if we just want to improve the exception path and test the real 
corruption path, it is probably more practical for the test to modify the 
{{checksum}} of the file rather than flip a random byte. You still have the 
issue of collisions, but maybe it is more straightforward to test it this way?

> Add tests for corruptions caused by byte flips
> --
>
> Key: LUCENE-9356
> URL: https://issues.apache.org/jira/browse/LUCENE-9356
> Project: Lucene - Core
>  Issue Type: Test
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> We already have tests verifying that file truncation and modification of the 
> index headers are caught correctly. I'd like to add another test verifying 
> that flipping a byte in a way that modifies the checksum of the file is 
> always caught gracefully by Lucene.






[jira] [Commented] (LUCENE-9356) Add tests for corruptions caused by byte flips

2022-05-09 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534118#comment-17534118
 ] 

Robert Muir commented on LUCENE-9356:
-

Mulling on it more, this seems to me like the way to go. Let's rewrite the test 
to explicitly change a byte in the expected checksum. Random collisions are no 
longer an issue, and it should fail every time, right? This is easier to 
maintain and think about.

This should be able to flush out any bugs in codecs that aren't doing the right 
thing: either their checksum is buggy, or their checkIntegrity() method is 
buggy and not validating checksums. Either way, it's easier to reason about.
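To make the suggested approach concrete, here is a minimal, self-contained sketch (plain Java using java.util.zip.CRC32; the class and method names are made up for illustration and are not Lucene's actual codec or CheckIndex code) of why corrupting the stored expected checksum is caught deterministically:

```java
import java.util.zip.CRC32;

public class ChecksumCorruptionSketch {

    // Compute the CRC-32 of a payload, standing in for an index file's checksum.
    public static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    // A "file" here is payload + a stored expected checksum; verification
    // recomputes the CRC and compares it to the stored value.
    public static boolean verify(byte[] payload, long storedChecksum) {
        return crc32(payload) == storedChecksum;
    }

    public static void main(String[] args) {
        byte[] payload = "some index data".getBytes();
        long stored = crc32(payload);

        // Corrupt one byte of the STORED expected checksum, not the payload:
        // the recomputed CRC can no longer match, so detection is guaranteed.
        long corrupted = stored ^ 0xFFL;

        System.out.println(verify(payload, stored));    // true
        System.out.println(verify(payload, corrupted)); // false
    }
}
```

Random payload corruption is harder to reason about (especially when compound files re-checksum content); corrupting the stored expected value always produces a mismatch, which is what makes such a test deterministic.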

> Add tests for corruptions caused by byte flips
> --
>
> Key: LUCENE-9356
> URL: https://issues.apache.org/jira/browse/LUCENE-9356
> Project: Lucene - Core
>  Issue Type: Test
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> We already have tests verifying that file truncation and modification of the 
> index headers are caught correctly. I'd like to add another test verifying 
> that flipping a byte in a way that modifies the checksum of the file is 
> always caught gracefully by Lucene.








[GitHub] [lucene] jtibshirani commented on a diff in pull request #870: LUCENE-10502: Refactor hnswVectors format

2022-05-09 Thread GitBox


jtibshirani commented on code in PR #870:
URL: https://github.com/apache/lucene/pull/870#discussion_r868790940


##
lucene/core/src/java/org/apache/lucene/codecs/lucene92/Lucene92HnswVectorsFormat.java:
##
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene92;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.KnnVectorsFormat;
+import org.apache.lucene.codecs.KnnVectorsReader;
+import org.apache.lucene.codecs.KnnVectorsWriter;
+import org.apache.lucene.codecs.lucene90.IndexedDISI;
+import org.apache.lucene.index.SegmentReadState;
+import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.util.hnsw.HnswGraph;
+
+/**
+ * Lucene 9.2 vector format, which encodes numeric vector values and an 
optional associated graph
+ * connecting the documents having values. The graph is used to power HNSW 
search. The format
+ * consists of three files:
+ *
+ * .vec (vector data) file

Review Comment:
   Sorry to make so many small notes, but why is OrdToDoc in its own sublist 
instead of the top-level list? Also, the note "only in sparse case" applies to 
both DocIds and OrdToDoc, right? The same comments apply to the javadoc about 
`.vem` below.






[GitHub] [lucene] LuXugang commented on a diff in pull request #870: LUCENE-10502: Refactor hnswVectors format

2022-05-09 Thread GitBox


LuXugang commented on code in PR #870:
URL: https://github.com/apache/lucene/pull/870#discussion_r868794312


##
lucene/core/src/java/org/apache/lucene/codecs/lucene92/Lucene92HnswVectorsFormat.java:
##
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene92;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.KnnVectorsFormat;
+import org.apache.lucene.codecs.KnnVectorsReader;
+import org.apache.lucene.codecs.KnnVectorsWriter;
+import org.apache.lucene.codecs.lucene90.IndexedDISI;
+import org.apache.lucene.index.SegmentReadState;
+import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.util.hnsw.HnswGraph;
+
+/**
+ * Lucene 9.2 vector format, which encodes numeric vector values and an 
optional associated graph
+ * connecting the documents having values. The graph is used to power HNSW 
search. The format
+ * consists of three files:
+ *
+ * .vec (vector data) file

Review Comment:
Thanks @jtibshirani, you are right. I like these spotless things; addressed 
in 
https://github.com/apache/lucene/pull/870/commits/e3ee29f24541d9b7e9b947ca183d665583200dfb






[jira] [Created] (LUCENE-10564) SparseFixedBitSet#or doesn't update memory accounting

2022-05-09 Thread Julie Tibshirani (Jira)
Julie Tibshirani created LUCENE-10564:
-

 Summary: SparseFixedBitSet#or doesn't update memory accounting
 Key: LUCENE-10564
 URL: https://issues.apache.org/jira/browse/LUCENE-10564
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Julie Tibshirani


While debugging why a cache was using way more memory than expected, one of my 
colleagues noticed that {{SparseFixedBitSet#or}} doesn't update 
{{ramBytesUsed}}. Here's a unit test that demonstrates this:
{code:java}
  public void testRamBytesUsed() throws IOException {
BitSet bitSet = new SparseFixedBitSet(1000);
long initialBytesUsed = bitSet.ramBytesUsed();

DocIdSetIterator disi = DocIdSetIterator.all(1000);
bitSet.or(disi);
assertTrue(bitSet.ramBytesUsed() > initialBytesUsed);
  }
{code}
It also looks like we don't have any tests for {{SparseFixedBitSet}} memory 
accounting (unless I've missed them!). It'd be nice to add more coverage there 
too.
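For illustration only, here is a toy bit set (a hypothetical AccountingBitSet, not Lucene's SparseFixedBitSet, whose internals differ) showing the accounting pattern the bug report implies: any allocation triggered while setting bits must be reflected in ramBytesUsed():

```java
import java.util.Arrays;

public class AccountingBitSet {
    private long[] bits = new long[0];
    private long ramBytesUsed = 16; // rough fixed object overhead (assumed value)

    public long ramBytesUsed() {
        return ramBytesUsed;
    }

    public void set(int index) {
        int word = index >>> 6;
        if (word >= bits.length) {
            long oldBytes = (long) bits.length * Long.BYTES;
            bits = Arrays.copyOf(bits, word + 1);
            // The essential pattern: every allocation made while setting bits
            // (including bits set via an or(DocIdSetIterator)-style bulk path)
            // must update the reported estimate.
            ramBytesUsed += (long) bits.length * Long.BYTES - oldBytes;
        }
        bits[word] |= 1L << index; // Java masks the shift count to index & 63
    }

    public boolean get(int index) {
        int word = index >>> 6;
        return word < bits.length && (bits[word] & (1L << index)) != 0;
    }

    public static void main(String[] args) {
        AccountingBitSet bitSet = new AccountingBitSet();
        long before = bitSet.ramBytesUsed();
        for (int i = 0; i < 1000; i++) {
            bitSet.set(i);
        }
        System.out.println(bitSet.ramBytesUsed() > before); // true
    }
}
```

The reported bug is exactly the failure of this pattern on one code path: the estimate stays at its initial value even though the bulk-or has allocated new blocks.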






[jira] [Commented] (LUCENE-9356) Add tests for corruptions caused by byte flips

2022-05-09 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534124#comment-17534124
 ] 

Robert Muir commented on LUCENE-9356:
-

I'd disable compound files as a first pass in any such test, too. This one 
tries to take them on, but let's make progress :) especially considering how 
checksums work in compound files; they are likely the cause of some of the 
crazy failures on this one.

> Add tests for corruptions caused by byte flips
> --
>
> Key: LUCENE-9356
> URL: https://issues.apache.org/jira/browse/LUCENE-9356
> Project: Lucene - Core
>  Issue Type: Test
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> We already have tests verifying that file truncation and modification of the 
> index headers are caught correctly. I'd like to add another test verifying 
> that flipping a byte in a way that modifies the checksum of the file is 
> always caught gracefully by Lucene.






[jira] [Commented] (LUCENE-9356) Add tests for corruptions caused by byte flips

2022-05-09 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534126#comment-17534126
 ] 

Robert Muir commented on LUCENE-9356:
-

I think the "problem" in the current test is the inherent double-checksumming 
that happens inside CFE; it defeats the test's guess that it "didn't change the 
checksum". It is definitely too difficult to think about the problem this way. 
That's why I encourage changing the "expected" value instead.

> Add tests for corruptions caused by byte flips
> --
>
> Key: LUCENE-9356
> URL: https://issues.apache.org/jira/browse/LUCENE-9356
> Project: Lucene - Core
>  Issue Type: Test
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> We already have tests verifying that file truncation and modification of the 
> index headers are caught correctly. I'd like to add another test verifying 
> that flipping a byte in a way that modifies the checksum of the file is 
> always caught gracefully by Lucene.






[jira] [Commented] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query

2022-05-09 Thread Henrik Hertel (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534164#comment-17534164
 ] 

Henrik Hertel commented on LUCENE-10562:


[~tomoko] Thanks for the additional tips.

> Large system: Wildcard search leads to full index scan despite filter query
> ---
>
> Key: LUCENE-10562
> URL: https://issues.apache.org/jira/browse/LUCENE-10562
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 8.11.1
>Reporter: Henrik Hertel
>Priority: Major
>  Labels: performance
>
> I use Solr and have a large system with 1TB in one core and about 5 million 
> documents. The textual content of large PDF files is indexed there. My query 
> is extremely slow (more than 30 seconds)  as soon as I use wildcards e.g. 
> {code:java}
> *searchvalue*
> {code}
> , even though I put a filter query in front of it that reduces to less than 
> 20 documents.
> searchvalue -> less than 1 second
> searchvalue* -> less than 1 second
> My query:
> {code:java}
> select?defType=lucene&q=content_t:*searchvalue*&fq=metadataitemids_is:20950&fl=id&rows=50&start=0
>  {code}
> I've tried everything imaginable. It doesn't make sense to me why a search 
> over a small subset should take so long. If I omit the filter query 
> metadataitemids_is:20950 and search the entire inventory, it takes the same 
> amount of time. Therefore, I suspect that despite the filter query, the main 
> query runs over the entire index.
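For background on the behavior described above: a trailing wildcard can seek directly into the sorted term dictionary, while a leading/infix wildcard must enumerate every term before any filter query can narrow the work, so its cost tracks the whole index rather than the filtered subset. A toy illustration (plain Java; the class and methods are hypothetical, not Lucene's actual terms index):

```java
import java.util.Arrays;

public class TermDictScan {
    // Prefix lookup: binary search positions us at the first candidate in the
    // sorted dictionary, so only matching terms are touched (like "value*").
    public static int countPrefix(String[] sortedTerms, String prefix) {
        int lo = Arrays.binarySearch(sortedTerms, prefix);
        int start = lo >= 0 ? lo : -lo - 1; // first term >= prefix
        int count = 0;
        for (int i = start; i < sortedTerms.length && sortedTerms[i].startsWith(prefix); i++) {
            count++;
        }
        return count;
    }

    // Infix lookup: sort order doesn't help; every term must be inspected,
    // which is why "*value*" scans the whole dictionary regardless of filters.
    public static int countInfix(String[] sortedTerms, String infix) {
        int count = 0;
        for (String t : sortedTerms) {
            if (t.contains(infix)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] terms = {"alpha", "alphabet", "beta", "gamma", "megabeta"};
        System.out.println(countPrefix(terms, "alpha")); // 2
        System.out.println(countInfix(terms, "beta"));   // 2
    }
}
```

This matches the timings reported: `searchvalue` and `searchvalue*` stay fast because they seek, while `*searchvalue*` pays for a full term enumeration either way, filter query or not.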


