[jira] [Comment Edited] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query

Tomoko Uchida (Jira) Mon, 09 May 2022 13:09:30 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533982#comment-17533982
 ]


Tomoko Uchida edited comment on LUCENE-10562 at 5/9/22 8:08 PM:
----------------------------------------------------------------

Yes, the issue is not "query execution (retrieving the inverted index)" but 
"which postings you are traversing"; there is no way to optimize when so many 
postings match the wildcard term. As a practical tip, I usually require two or 
three leading characters in wildcard queries when the search space needs to be 
reduced.
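
The leading-character rule could be enforced before a query ever reaches the 
search engine. The following is only a minimal sketch in plain Python, not 
Lucene or Solr code: the function name has_min_literal_prefix and the default 
minimum of 2 are assumptions made for illustration, and escaped wildcard 
characters are not handled.

```python
def has_min_literal_prefix(term: str, min_prefix: int = 2) -> bool:
    """Return True if `term` starts with at least `min_prefix` characters
    that are not wildcards ('*' or '?').

    Hypothetical helper: the name and the default threshold are
    illustrative assumptions, and backslash escapes are ignored.
    """
    literal = 0
    for ch in term:
        if ch in "*?":
            break  # first wildcard ends the literal prefix
        literal += 1
    return literal >= min_prefix


# A leading wildcard has no literal prefix, so it would be rejected.
print(has_min_literal_prefix("*searchvalue*"))  # False
# Two or more literal leading characters pass the check.
print(has_min_literal_prefix("sea*"))           # True
```

A caller could reject or rewrite any wildcard term for which the check returns 
False, so that only terms sharing a literal prefix have to be enumerated.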
{quote}Consider using the reverse wildcard filter in Solr (there's 
documentation about this). But this won't help if you need a wildcard on both 
sides of the term
{quote}
-I think you could run conjunction queries of two fields (one for the normal 
text field, one for the reversed field) to support the infix wildcard query- - 
I am not sure it is worth adding another text field when the index is already 
large. Anyway, dictionary-based decomposition looks promising to me in the case 
of German (though I know little about it beyond the basics I learned in 
university lectures).

Correction: a conjunction query does not work in this situation - sorry. nGram 
or more sophisticated term decomposition will be needed. For example, in my 
language (Japanese - which does not even have spaces between terms), the 
combination of an off-the-shelf nGram filter and phrase search often works well.
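
To illustrate why nGram plus phrase search can stand in for an infix wildcard, 
here is a minimal sketch in plain Python. It is a toy model of the idea, not 
Lucene's nGram filter or its phrase query: the helper names (ngrams, 
build_index, phrase_search), the gram size of 2, and the sample documents are 
all assumptions. The text is split into overlapping character bigrams with 
positions, and a search term matches when its bigrams occur at consecutive 
positions.

```python
from collections import defaultdict


def ngrams(text, n=2):
    """Overlapping character n-grams of `text`, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_index(docs, n=2):
    """Toy positional index: gram -> doc_id -> set of gram positions."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, text in docs.items():
        for pos, gram in enumerate(ngrams(text.lower(), n)):
            index[gram][doc_id].add(pos)
    return index


def phrase_search(index, needle, n=2):
    """Doc ids where the grams of `needle` occur at consecutive positions."""
    grams = ngrams(needle.lower(), n)
    if not grams:
        return set()  # needle shorter than n: nothing to look up
    hits = set()
    for doc_id, starts in index.get(grams[0], {}).items():
        for start in starts:
            # Every following gram must sit exactly k positions later.
            if all(start + k in index.get(g, {}).get(doc_id, ())
                   for k, g in enumerate(grams)):
                hits.add(doc_id)
                break
    return hits


# Hypothetical sample data (German compounds).
docs = {1: "Donaudampfschifffahrt", 2: "Dampfmaschine", 3: "Schifffahrt"}
index = build_index(docs)
print(sorted(phrase_search(index, "dampf")))        # [1, 2]
print(sorted(phrase_search(index, "schiff")))       # [1, 3]
print(sorted(phrase_search(index, "dampfschiff")))  # [1]
```

Each lookup touches only the postings of the grams in the search term instead 
of every term that could match the wildcard. The cost is a larger index, since 
every position in the text contributes a gram.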


> Large system: Wildcard search leads to full index scan despite filter query
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-10562
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10562
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 8.11.1
>            Reporter: Henrik Hertel
>            Priority: Major
>              Labels: performance
>
> I use Solr and have a large system with 1TB in one core and about 5 million 
> documents. The textual content of large PDF files is indexed there. My query 
> is extremely slow (more than 30 seconds) as soon as I use wildcards, e.g. 
> {code:java}
> *searchvalue*
> {code}
> even though I put a filter query in front of it that reduces the result set 
> to fewer than 20 documents.
> searchvalue -> less than 1 second
> searchvalue* -> less than 1 second
> My query:
> {code:java}
> select?defType=lucene&q=content_t:*searchvalue*&fq=metadataitemids_is:20950&fl=id&rows=50&start=0
>  {code}
> I've tried everything imaginable. It doesn't make sense to me why a search 
> over a small subset should take so long. If I omit the filter query 
> metadataitemids_is:20950 and thus search the entire index, it takes the same 
> amount of time. Therefore, I suspect that despite the filter query, the main 
> query runs over the entire index.


