[ https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533982#comment-17533982 ]
Tomoko Uchida edited comment on LUCENE-10562 at 5/9/22 7:43 PM:
----------------------------------------------------------------

Yes, it's not about "query execution (retrieving the inverted index)" but "which postings you are traversing": there is no way to optimize when so many postings match the wildcard term. As a practical tip, I usually enforce two or three leading literal characters for wildcard queries when the search space needs to be reduced.

{quote}Consider using the reverse wildcard filter in Solr (there's documentation about this). But this won't help if you need a wildcard on both sides of the term{quote}

-I think you could run a conjunction query over two fields (one for the normal text field, one for the reversed field) to support the infix wildcard query- - not sure it is worth adding another text field when the index is already large.

Anyway, dictionary-based decomposition looks promising to me in the case of German (though I have little knowledge of it beyond the basics I learned in university lectures).

Correction: a conjunction query does not work in this situation - sorry. nGram or a more sophisticated term decomposition will be needed.
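The tip about requiring two or three leading characters could be enforced with a small guard before the wildcard query is built. A minimal sketch, assuming a minimum of three literal characters; the class and method names ({{WildcardGuard}}, {{hasUsablePrefix}}, {{MIN_PREFIX}}) are made up for illustration and are not part of the Lucene or Solr API:

{code:java}
// Hypothetical guard: reject wildcard terms whose literal prefix is too
// short to narrow the term dictionary scan. Names are illustrative only.
public class WildcardGuard {
    static final int MIN_PREFIX = 3;

    // Count the literal characters before the first '*' or '?'.
    static boolean hasUsablePrefix(String term) {
        int literal = 0;
        while (literal < term.length()
                && term.charAt(literal) != '*'
                && term.charAt(literal) != '?') {
            literal++;
        }
        return literal >= MIN_PREFIX;
    }

    public static void main(String[] args) {
        System.out.println(hasUsablePrefix("search*"));  // true: 6 literal chars
        System.out.println(hasUsablePrefix("*search*")); // false: leading wildcard
        System.out.println(hasUsablePrefix("se*"));      // false: only 2 literal chars
    }
}
{code}

A guard like this keeps patterns such as {{*searchvalue*}} from expanding against the entire term dictionary in the first place.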
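To illustrate the nGram route: decomposing each token into character trigrams turns an infix pattern like {{*uch*}} into an exact lookup of the term "uch" (plus a verification step), instead of a scan over the whole term dictionary. The sketch below is a simplified stand-in for what an nGram token filter configured with minGram=maxGram=3 produces; the class and method names are illustrative only:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Sketch of character trigram decomposition (illustrative names; not a
// Lucene/Solr API). Each trigram becomes its own indexed term.
public class Trigrams {
    static List<String> trigrams(String token) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 3 <= token.length(); i++) {
            grams.add(token.substring(i, i + 3));
        }
        return grams;
    }

    public static void main(String[] args) {
        // e.g. a German token decomposed into overlapping trigrams
        System.out.println(trigrams("suchwert")); // [suc, uch, chw, hwe, wer, ert]
    }
}
{code}

The trade-off is index size: every token contributes roughly (length - 2) extra terms, which matters for a 1 TB core like the one described in this issue.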
> Large system: Wildcard search leads to full index scan despite filter query
> ----------------------------------------------------------------------------
>
>                 Key: LUCENE-10562
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10562
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 8.11.1
>            Reporter: Henrik Hertel
>            Priority: Major
>              Labels: performance
>
> I use Solr and have a large system with 1 TB in one core and about 5 million documents. The textual content of large PDF files is indexed there. My query becomes extremely slow (more than 30 seconds) as soon as I use wildcards, e.g.
> {code:java}
> *searchvalue*
> {code}
> even though I put a filter query in front of it that reduces the result set to fewer than 20 documents.
> searchvalue -> less than 1 second
> searchvalue* -> less than 1 second
> My query:
> {code:java}
> select?defType=lucene&q=content_t:*searchvalue*&fq=metadataitemids_is:20950&fl=id&rows=50&start=0
> {code}
> I've tried everything imaginable. It doesn't make sense to me why a search over a small subset should take so long. If I omit the filter query metadataitemids_is:20950, i.e. search the entire inventory, it takes the same amount of time. Therefore, I suspect that despite the filter query, the main query runs over the entire index.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)