Re: Solr Wildcard Search for large amount of text

Erick Erickson Sat, 27 Jun 2015 08:41:59 -0700

Try it and see ;).

My experience is that wildcards work fine (although
what counts as "fine" is up to you to decide) _if_ you
require at least two leading "real" characters, and I
actually prefer three, i.e. ab* or abc*. Note that if
you need leading wildcards, use the reversed wildcard
filter (solr.ReversedWildcardFilterFactory).
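
A minimal sketch of that restriction, assuming it is enforced in the
application before the query ever reaches Solr. The function names and
the MIN_PREFIX constant are hypothetical and not part of Solr:

```python
import re

# Assumed policy: at least two literal characters before the first wildcard.
# Raise this to 3 for the stricter variant.
MIN_PREFIX = 2

WILDCARD = re.compile(r"[*?]")


def leading_literal_length(term: str) -> int:
    """Count the characters that precede the first wildcard (* or ?)."""
    match = WILDCARD.search(term)
    return len(term) if match is None else match.start()


def is_acceptable_wildcard(term: str, min_prefix: int = MIN_PREFIX) -> bool:
    """Accept plain terms, and wildcard terms with enough leading literals."""
    if WILDCARD.search(term) is None:
        return True  # no wildcard at all, nothing to restrict
    return leading_literal_length(term) >= min_prefix
```

With these defaults "ab*" and "abc*" pass, while "a*" and "*abc" are
rejected. Leading wildcards would need separate handling, since this
check only looks at the prefix.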


I will vociferously argue that single-letter wildcards are
not useful anyway. I mean every single document in your
corpus will probably match every single-letter wildcard
(a*, b*, whatever), providing no benefit to the user.

And the need for wildcards can often be reduced or
eliminated if you can use autosuggest or autocomplete.
Of course, if you're trying to satisfy more complex use
cases where the user is composing their own complex
clauses, that may not apply.

FWIW,
Erick

On Sat, Jun 27, 2015 at 10:06 AM, Shawn Heisey <apa...@elyograg.org> wrote:
> On 6/27/2015 4:27 AM, octopus wrote:
>> Hi, I'm looking at Solr's features for wildcard search used for a large
>> amount of text. I read on the net that solr.EdgeNGramFilterFactory is used
>> to generate tokens for wildcard searching.
>>
>> For Nigerian => "ni", "nig", "nige", "niger", "nigeri", "nigeria",
>> "nigerian"
>>
>> However, I have a large amount of text out there which requires wildcard
>> search and it's not viable to use EdgeNGramFilterFactory as the amount of
>> processing will be too huge. Do you have any suggestions/advice please?
>
> Both edgengrams and wildcards are ways to do this.  There are advantages
> and disadvantages to both ways.
>
> To do a wildcard search, Solr (Lucene really) must look up all the
> matching terms in the index and substitute them into the query so that
> it becomes a large number of simple string matches.  If you have a large
> number of terms in your index, that can be slow.  The expensive work
> (expanding the terms) is done for every single query.
>
> The edgengram filter does similar work, but it does it at *index* time,
> rather than query time.  At query time, you are doing a simple string
> match with one term, although the index contains many more terms,
> because the very expensive work was done at index time.
>
> It's difficult to know which approach will be more efficient on *your*
> index without experimentation, but there is a general rule when it comes
> to Solr performance: As much as possible, do the expensive work at index
> time.
>
> Thanks,
> Shawn
>
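
To make the trade-off described in the quoted message concrete, here is
a rough Python sketch of the two approaches. It is a toy model, not how
Lucene implements either one; the function names, the min_gram/max_gram
defaults and the sample terms are all assumptions made for illustration:

```python
from bisect import bisect_left


def edge_ngrams(token, min_gram=2, max_gram=15):
    """Index-time work: emit every leading substring of the token."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def expand_prefix(sorted_terms, prefix):
    """Query-time work: collect every indexed term that starts with prefix.

    sorted_terms stands in for a sorted term dictionary.
    """
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order means no later term can match
        matches.append(term)
    return matches


terms = sorted(["nice", "niger", "nigeria", "nigerian", "norway"])

# Wildcard style: nig* is expanded against the term list on every query.
wildcard_hits = expand_prefix(terms, "nig")

# Edge n-gram style: the grams are stored once, so the query "nig"
# becomes a single exact-term lookup.
gram_index = {gram for term in terms for gram in edge_ngrams(term)}
ngram_hit = "nig" in gram_index
```

The cost moves rather than disappears: expand_prefix does more work the
more terms share the prefix, while gram_index holds many more entries
than the original term list.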
