Re: Prefix + Suffix Wildcards in Searches

Erick Erickson Tue, 30 Jun 2020 05:25:59 -0700

That’s not quite the question I was asking.

Let’s take “…that don’t contain the characters ‘paid’”.

Start with the fact that no matter what the mechanics of
implementing pre-and-post wildcards, something like

*:* -tags:*paid*

would exclude a doc with a tag of “credit-ms-reply-unpaid” or
“ms-reply-unpaid-2019”. I really think this is an XY problem.
You’re assuming that the solution is pre-and-post wildcards
without a precise definition of the problem you’re trying to solve.

Do they want to exclude things with the characters ‘ia’ or ‘id’? Or
is their “unit of exclusion” the _entire_ word ‘paid’? Or can we
define it so? Because if we can, what I wrote yesterday about
using proper tokenization and phrase queries will work.

If you break up all the tags in your example into individual
tokens on non-alphanumerics, then your problem is much simpler.
Excluding “*paid*” becomes

-tags:paid

excluding “*ms-reply*” becomes

-tags:"ms reply"

and trying to exclude “*ms-unpaid*” with

-tags:"ms unpaid"

would _not_ exclude the doc with the tag “credit-ms-reply-unpaid”
because “ms” and “unpaid” are not sequential.

_Including_ is the same argument.
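
To make the token-versus-substring distinction concrete, here is a rough Python sketch of what tokenizing on non-alphanumerics plus phrase matching would and would not match. It only simulates the idea and is not Solr code; the function names tokenize and phrase_matches are made up for illustration, and lowercasing is my assumption.

```python
import re

def tokenize(tag):
    """Split a tag into lowercase tokens on runs of non-alphanumerics."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", tag.lower()) if t]

def phrase_matches(tag, phrase):
    """True if the phrase's tokens occur consecutively among the tag's tokens."""
    tokens, wanted = tokenize(tag), tokenize(phrase)
    n = len(wanted)
    if n == 0:
        return False
    return any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1))

# Tags taken from the example earlier in this thread.
tags = [
    "paid",
    "invoice-paid",
    "ms-reply-unpaid-2019",
    "credit-ms-reply-unpaid",
    "ms-reply-paid-2019",
]

# "paid" is matched as a whole token, so "unpaid" tags are not hits.
paid_hits = [t for t in tags if phrase_matches(t, "paid")]
# "ms reply" is a two-token phrase.
ms_reply_hits = [t for t in tags if phrase_matches(t, "ms reply")]
# "ms" and "unpaid" are never adjacent, so nothing matches.
ms_unpaid_hits = [t for t in tags if phrase_matches(t, "ms unpaid")]
```

Note that under this definition “paid” does not hit the “unpaid” tags at all, which is exactly the point of pinning down the unit of exclusion first.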

BTW, this is where “positionIncrementGap” comes in. If they can
define multiple tags in each document, make each tag an entry in a
multiValued field _and_ give that field a positionIncrementGap
greater than 1 (100 is the usual default). Phrase searches then
cannot match across tags. Consider two tags “ms-tag1” and
“paid-2019”. You don’t want “*tag1-paid*” to exclude this doc, I’d
imagine. The positionIncrementGap takes care of this in the phrase
case. Remember that in this solution, the dashes aren’t included in
the tokens.
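
A small Python sketch of why the gap matters, under a simplified model of how positions are assigned: I'm assuming the first token of each later value is simply pushed `gap` extra positions along, which is close to, but not guaranteed identical to, what Lucene does internally. The helper names are made up.

```python
import re

def tokenize(tag):
    """Split a tag into lowercase tokens on runs of non-alphanumerics."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", tag.lower()) if t]

def positions(values, gap):
    """Assign a position to every token of a multi-valued field.

    Simplified model (an assumption): each value after the first starts
    `gap` extra positions after the previous value's last token.
    """
    out, pos = [], -1
    for i, value in enumerate(values):
        if i > 0:
            pos += gap
        for tok in tokenize(value):
            pos += 1
            out.append((tok, pos))
    return out

def phrase_matches(values, phrase, gap):
    """True if the phrase's tokens sit at consecutive positions."""
    indexed = set(positions(values, gap))
    wanted = tokenize(phrase)
    if not wanted:
        return False
    starts = [pos for tok, pos in indexed if tok == wanted[0]]
    return any(
        all((tok, start + k) in indexed for k, tok in enumerate(wanted))
        for start in starts
    )

doc_tags = ["ms-tag1", "paid-2019"]

across_no_gap = phrase_matches(doc_tags, "tag1 paid", gap=0)    # matches across tags
across_gap_100 = phrase_matches(doc_tags, "tag1 paid", gap=100)  # blocked by the gap
within_gap_100 = phrase_matches(doc_tags, "paid 2019", gap=100)  # same tag, still matches
```

With no gap, “tag1” and “paid” end up adjacent even though they come from different tags; with a gap of 100 they are far apart, while phrases inside a single tag are unaffected.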

Prefix-only or postfix-only would be a little tricky. One idea would be
to copyField into an _untokenized_ field and search
there in those cases. But even here, you need to determine precisely
what you expect. What would “*d-2019” return? Would it return
something ending in “ms-reply-paid-2019”?
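
To show why the expectation needs nailing down, here is a rough Python sketch of what a leading wildcard against the raw, untokenized tag string would return. It uses Python's fnmatch as a stand-in for wildcard matching on a string field, which is my assumption for illustration and not how Solr evaluates the query.

```python
from fnmatch import fnmatchcase

# Tags taken from the example earlier in this thread.
tags = [
    "paid",
    "invoice-paid",
    "ms-reply-unpaid-2019",
    "credit-ms-reply-unpaid",
    "ms-reply-paid-2019",
    "ms-reply-paid-2020",
]

def raw_wildcard_hits(pattern):
    """Tags whose whole raw string matches the wildcard pattern."""
    return [t for t in tags if fnmatchcase(t, pattern)]

# Matches on characters, so both the "paid" and the "unpaid" 2019 tags come back.
d_2019_hits = raw_wildcard_hits("*d-2019")
# Likewise "*paid" also returns the tag ending in "unpaid".
paid_suffix_hits = raw_wildcard_hits("*paid")
```

On the raw string, “*d-2019” returns both “ms-reply-unpaid-2019” and “ms-reply-paid-2019”, and “*paid” returns “credit-ms-reply-unpaid” too. Whether that is what the users want is the question to answer first.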

Alternatively, you wouldn’t need a copyField if you introduced
special tokens before and after each tag, so indexing “invoice-paid”
would index the tokens

specialbegintoken invoice paid specialendtoken

and searching for

*paid

becomes

tag:"paid specialendtoken"
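
Here is a rough Python sketch of the special-token idea, simulating both the index side and the translation of a user's wildcard pattern into a phrase. The helper names and the rule that a pattern with no wildcard at all means “exact tag” are my assumptions for illustration.

```python
import re

BEGIN, END = "specialbegintoken", "specialendtoken"

def tokenize(text):
    """Split text into lowercase tokens on runs of non-alphanumerics."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", text.lower()) if t]

def index_tokens(tag):
    """Tokens indexed for one tag, wrapped in the begin/end markers."""
    return [BEGIN] + tokenize(tag) + [END]

def to_phrase(pattern):
    """Turn a user pattern such as '*paid' or 'invoice*' into phrase tokens.

    A missing leading '*' anchors the phrase at the start of the tag,
    a missing trailing '*' anchors it at the end.
    """
    phrase = tokenize(pattern.strip("*"))
    if not pattern.startswith("*"):
        phrase = [BEGIN] + phrase
    if not pattern.endswith("*"):
        phrase = phrase + [END]
    return phrase

def matches(tag, pattern):
    """True if the translated phrase occurs consecutively in the indexed tokens."""
    tokens, wanted = index_tokens(tag), to_phrase(pattern)
    n = len(wanted)
    return any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1))

suffix_phrase = to_phrase("*paid")                     # phrase ending at the end marker
ends_in_paid = matches("invoice-paid", "*paid")        # "paid" is the last token
not_at_end = matches("ms-reply-paid-2019", "*paid")    # "paid" is followed by "2019"
starts_with = matches("invoice-paid", "invoice*")      # "invoice" is the first token
```

The same phrase machinery then covers the contains, starts-with and ends-with cases without a second field, at the cost of two extra tokens per tag.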

Best,
Erick

> On Jun 30, 2020, at 7:29 AM, Chris Dempsey <cdal...@gmail.com> wrote:
> 
> @Mikhail
> 
> Thanks for the link! I'll read through that.
> 
> On Tue, Jun 30, 2020 at 6:28 AM Chris Dempsey <cdal...@gmail.com> wrote:
> 
>> @Erick,
>> 
>> You've got the idea. Basically the users can attach zero or more tags (*that
>> they create*) to a document. So as an example say they've created the
>> tags (this example is just a small subset of the total tags):
>> 
>>   - paid
>>   - invoice-paid
>>   - ms-reply-unpaid-2019
>>   - credit-ms-reply-unpaid
>>   - ms-reply-paid-2019
>>   - ms-reply-paid-2020
>> 
>> and attached them in various combinations to documents. They then want to
>> find all documents by tag that don't contain the characters "paid" anywhere
>> in the tag, don't contain tags with the characters "ms-reply-unpaid", but
>> do include documents tagged with the characters "ms-reply-paid".
>> 
>> The obvious suggestion would be to have the users just use the entire tag
>> (i.e. don't let them do a "contains") as a condition to eliminate the
>> wildcards - which would work - but unfortunately we have customers with
>> (*not joking*) over 100K different tags (*why have a taxonomy like that is
>> yet a different issue*). I'm willing to accept that in our scenario n-grams
>> might be the Solr-based answer (the other being to change what "contains"
>> means within our application) but thought I'd check I hadn't overlooked any
>> other options. :)
>> 
>> On Mon, Jun 29, 2020 at 3:54 PM Mikhail Khludnev <m...@apache.org> wrote:
>> 
>>> Hello, Chris.
>>> I suppose index time analysis can yield these terms:
>>> "paid","ms-reply-unpaid","ms-reply-paid", and thus let you avoid these
>>> expensive wildcard queries. Here's why it's worth to avoid them
>>> 
>>> https://www.slideshare.net/lucidworks/search-like-sql-mikhail-khludnev-epam
>>> 
>>> On Mon, Jun 29, 2020 at 6:17 PM Chris Dempsey <cdal...@gmail.com> wrote:
>>> 
>>>> Hello, all! I'm relatively new to Solr and Lucene (*using Solr 7.7.1*) but
>>>> I'm looking into options for optimizing something like this:
>>>> 
>>>>> fq=(tag:* -tag:*paid*) OR (tag:* -tag:*ms-reply-unpaid*) OR tag:*ms-reply-paid*
>>>> 
>>>> It's probably not a surprise that we're seeing performance issues with
>>>> something like this. My understanding is that using the wildcard on both
>>>> ends forces a full-text index search. Something like the above can't take
>>>> advantage of something like the ReverseWordFilter either. I believe
>>>> constructing `n-grams` is an option (*at the expense of index size*) but is
>>>> there anything I'm overlooking as a possible avenue to look into?
>>>> 
>>> 
>>> 
>>> --
>>> Sincerely yours
>>> Mikhail Khludnev
>>> 
>>
