I was afraid of “totally arbitrary”.

OK, this field type is going to surprise the heck out of you. The whitespace
tokenizer is really stupid: it splits on whitespace and nothing else, so
punctuation stays attached to the tokens, for instance. Take a look at the
admin UI’s Analysis page, pick your field, put in some creative entries,
and you’ll see what I mean.
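
To make that concrete, here is a rough Python stand-in. It is not Solr’s
analyzer, just str.split() plus lowercasing, and the function name is made up
for illustration, but it shows how punctuation stays glued to the tokens:

```python
def whitespace_lowercase_tokens(text):
    """Rough stand-in for whitespace tokenization followed by lowercasing.

    Assumption: this only approximates the query-side chain in the thread;
    it is not Solr code.
    """
    # str.split() with no arguments splits on runs of whitespace only,
    # so commas, parentheses, etc. remain part of the neighboring token.
    return [token.lower() for token in text.split()]


tokens = whitespace_lowercase_tokens("Paid, (ms-reply-unpaid) URGENT!")
print(tokens)  # ['paid,', '(ms-reply-unpaid)', 'urgent!']
```

Note that a search for the bare term paid would not match the token “paid,”
in this model, which is the kind of surprise the Analysis page will show you.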

So let’s get some use-cases in place. Can users enter tags like
blahms-reply-unpaidnonsense and expect to find it with *ms-reply-unpaid*?
Or is the entry something like this:
    my dog has ms-reply-unpaid and is mangy
If the latter, simple token searching will work fine; there’s no need for
wildcards at all.
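
A small Python sketch of the two cases, assuming plain whitespace splitting
plus lowercasing as a stand-in for the analysis chain (token_match and
substring_match are hypothetical helpers, not Solr APIs):

```python
def tokens(text):
    # Assumption: whitespace splitting + lowercasing approximates the field's analysis.
    return text.lower().split()


def token_match(entry, term):
    # Plain term search: the term must equal a whole token.
    return term.lower() in tokens(entry)


def substring_match(entry, fragment):
    # What a *fragment* wildcard asks for: the fragment anywhere inside a token.
    return any(fragment.lower() in token for token in tokens(entry))


sentence_entry = "my dog has ms-reply-unpaid and is mangy"
glued_entry = "blahms-reply-unpaidnonsense"

print(token_match(sentence_entry, "ms-reply-unpaid"))   # True
print(token_match(glued_entry, "ms-reply-unpaid"))      # False
print(substring_match(glued_entry, "ms-reply-unpaid"))  # True
```

Only the glued-together entry needs the substring (double-wildcard) behavior;
the sentence-style entry is found by an ordinary term query.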

FWIW,
Erick

> On Jun 29, 2020, at 11:46 AM, Chris Dempsey <cdal...@gmail.com> wrote:
> 
> First off, thanks for taking a look, Erick! I see you helping lots of folks
> out here and I've learned a lot from your answers. Much appreciated!
> 
>> How regular are your patterns? Are they arbitrary?
> 
> Good question. :) That's data that I should have included in the initial
> post but both the values in the `tag` field and the search query itself are
> totally arbitrary (*i.e. user entered values*). I see where you're going if
> the set of either part was limited.
> 
>> What’s the field type anyway? Is this field tokenized?
> 
> <field name="tag" type="text_kwt_fd_lc" indexed="true" stored="true"
> multiValued="true"/>
> 
> <fieldType name="text_kwt_fd_lc" class="solr.TextField"
> positionIncrementGap="100" autoGeneratePhraseQueries="true">
>    <analyzer type="index">
>        <charFilter class="solr.HTMLStripCharFilterFactory"/>
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <filter class="solr.ASCIIFoldingFilterFactory"
> preserveOriginal="true" />
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>        <filter class="solr.ReversedWildcardFilterFactory"
> withOriginal="true" maxPosAsterisk="2" maxPosQuestion="1" minTrailing="2"
> maxFractionAsterisk="0"/>
>    </analyzer>
>    <analyzer type="query">
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>    </analyzer>
> </fieldType>
> 
> On Mon, Jun 29, 2020 at 10:33 AM Erick Erickson <erickerick...@gmail.com>
> wrote:
> 
>> How regular are your patterns? Are they arbitrary?
>> What I’m wondering is if you could shift your work to the
>> indexing end, perhaps even in an auxiliary field. Could you,
>> say, just index “paid”, “ms-reply-unpaid” etc? Then there
>> are no wildcards at all. This is akin to “concept search”.
>> 
>> Otherwise ngramming is your best bet.
>> 
>> What’s the field type anyway? Is this field tokenized?
>> 
>> There are lots of options, but soooo much depends on whether
>> you can process the data such that you won’t need wildcards.
>> 
>> Best,
>> Erick
>> 
>>> On Jun 29, 2020, at 11:16 AM, Chris Dempsey <cdal...@gmail.com> wrote:
>>> 
>>> Hello, all! I'm relatively new to Solr and Lucene (*using Solr 7.7.1*) but
>>> I'm looking into options for optimizing something like this:
>>> 
>>>> fq=(tag:* -tag:*paid*) OR (tag:* -tag:*ms-reply-unpaid*) OR tag:*ms-reply-paid*
>>> 
>>> It's probably not a surprise that we're seeing performance issues with
>>> something like this. My understanding is that using the wildcard on both
>>> ends forces a full-text index search. Something like the above can't take
>>> advantage of something like the ReversedWildcardFilterFactory either. I
>>> believe constructing `n-grams` is an option (*at the expense of index
>>> size*) but is there anything I'm overlooking as a possible avenue to look
>>> into?
>> 
>> 
