[
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533856#comment-17533856
]
Uwe Schindler commented on LUCENE-10562:
----------------------------------------
Hi,
I think these questions are not related to Lucene and are not issues at all.
They should be asked on the Solr mailing list:
[email protected].
This is not a bug and there is no way to improve this situation inside Lucene.
Some additional hints:
- Consider using the reversed wildcard filter in Solr (there's documentation
about this). But this won't help if you need a wildcard on both sides of the
term.
- Consider disabling wildcards for end users in your case (the flexible or
dismax query parser in Solr can do this).
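To illustrate the first hint, here is a minimal sketch in plain Python, not Solr or Lucene code. It assumes a sorted term dictionary. Indexing each term a second time in reversed form turns a leading wildcard into a cheap prefix lookup, while a wildcard on both sides still has to inspect every term. The term list and all function names are made up for the example.

```python
import bisect

def prefix_matches(sorted_terms, prefix):
    """Return all terms starting with `prefix`, using binary search on a sorted list."""
    matches = []
    # All terms sharing a prefix are contiguous in sorted order.
    for i in range(bisect.bisect_left(sorted_terms, prefix), len(sorted_terms)):
        if not sorted_terms[i].startswith(prefix):
            break
        matches.append(sorted_terms[i])
    return matches

# Hypothetical term dictionary of one field, plus the same terms reversed.
terms = sorted(["dampfschiff", "kapitän", "schiff", "schifffahrt", "segelschiff"])
reversed_terms = sorted(t[::-1] for t in terms)

def suffix_matches(suffix):
    """`*suffix` becomes a prefix lookup on the reversed terms."""
    hits = prefix_matches(reversed_terms, suffix[::-1])
    return sorted(t[::-1] for t in hits)

def infix_matches(infix):
    """`*infix*` has no anchored end, so every term must be inspected."""
    return [t for t in terms if infix in t]

print(prefix_matches(terms, "schiff"))  # schiff*  : binary search + short scan
print(suffix_matches("schiff"))         # *schiff  : same cost on reversed terms
print(infix_matches("schiff"))          # *schiff* : full scan of the dictionary
```

The last function is the case from the issue: with a wildcard on both sides, neither the original nor the reversed term order helps.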
In general, the use of wildcards in a full-text search engine indicates that
text analysis is not set up correctly. Based on your name and profile, it looks
like this is a typical "German language problem". In German, compounds are
common ("Donaudampfschifffahrtskapitän", the captain of a steam-powered ship on
the German river Donau), and people resorting to wildcards is almost always a
sign of missing decompounding. This can be done with the hyphenation compound
word token filter in combination with dictionaries. An example with minimized
data files for the German language is here:
https://github.com/uschindler/german-decompounder
When you do decompounding, wildcards should not be needed.
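As a rough illustration of why decompounding removes the need for wildcards, here is a simplified, dictionary-only sketch in plain Python. It is not the hyphenation-based filter mentioned above, which additionally uses hyphenation patterns to find candidate split points. The tiny dictionary, the function name, and the `min_subword` parameter are assumptions made for the example.

```python
# Hypothetical mini dictionary; real setups use much larger word lists.
DICTIONARY = {"donau", "dampf", "schiff", "fahrt", "kapitän"}

def decompound(token, dictionary, min_subword=4):
    """Return the lowercased token followed by the dictionary words found inside it.

    At each start offset only the longest dictionary match is kept.
    """
    token = token.lower()
    parts = []
    for start in range(len(token)):
        # Try the longest candidate first, down to min_subword characters.
        for end in range(len(token), start + min_subword - 1, -1):
            sub = token[start:end]
            if sub in dictionary and sub != token:
                parts.append(sub)
                break
    return [token] + parts

# Index time: the compound is stored together with its parts.
indexed_tokens = set(decompound("Donaudampfschifffahrtskapitän", DICTIONARY))

# Query time: a plain term query for a part matches without any wildcard.
print("schiff" in indexed_tokens)
print(sorted(indexed_tokens))
```

Because the parts are indexed as terms of their own, a query for "schiff" becomes an exact term lookup instead of `*schiff*`.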
> Large system: Wildcard search leads to full index scan despite filter query
> ---------------------------------------------------------------------------
>
> Key: LUCENE-10562
> URL: https://issues.apache.org/jira/browse/LUCENE-10562
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/search
> Affects Versions: 8.11.1
> Reporter: Henrik Hertel
> Priority: Major
> Labels: performance
>
> I use Solr and have a large system with 1TB in one core and about 5 million
> documents. The textual content of large PDF files is indexed there. My query
> is extremely slow (more than 30 seconds) as soon as I use wildcards e.g.
> {code:java}
> *searchvalue*
> {code}
> , even though I put a filter query in front of it that reduces to less than
> 20 documents.
> searchvalue -> less than 1 second
> searchvalue* -> less than 1 second
> My query:
> {code:java}
> select?defType=lucene&q=content_t:*searchvalue*&fq=metadataitemids_is:20950&fl=id&rows=50&start=0
> {code}
> I've tried everything imaginable. It doesn't make sense to me why a search
> over a small subset should take so long. If I omit the filter query
> metadataitemids_is:20950, so search the entire inventory, then it also takes
> the same amount of time. Therefore, I suspect that despite the filter query,
> the main query runs over the entire index.