I was afraid of “totally arbitrary”.

OK, this field type is going to surprise the heck out of you. The whitespace tokenizer is really stupid: it splits on whitespace and nothing else, so it’ll keep punctuation attached to the tokens, for instance. Take a look at the admin UI/analysis page, pick your field, put some creative entries in, and you’ll see what I mean.
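As a rough illustration of what the analysis page is likely to show, here is a minimal Python sketch that only approximates a whitespace tokenizer followed by lowercasing. The function name and the sample sentence are made up for the example, and this is not Solr's actual implementation (it ignores the other filters in the chain).

```python
def whitespace_analyze(text):
    """Approximate whitespace tokenization plus lowercasing.

    Illustration only: a simplification, not Solr's analysis chain.
    """
    # str.split() with no arguments splits on runs of whitespace only,
    # so punctuation stays glued to the neighbouring word.
    return [token.lower() for token in text.split()]


tokens = whitespace_analyze("My dog has MS-Reply-Unpaid, and is mangy!")
print(tokens)

# The indexed token carries the trailing comma, so an exact term lookup
# for the bare tag does not find it.
print("ms-reply-unpaid" in tokens)
```

In this sketch the tag ends up indexed as `ms-reply-unpaid,` (with the comma), which is the kind of surprise to look for when trying entries on the analysis page.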
So let’s get some use-cases in place. Can users enter tags like “blahms-reply-unpaidnonsense” and expect to find it with “*ms-reply-unpaid*”? Or is the entry something like “my dog has ms-reply-unpaid and is mangy”? If the latter, simple token searching will work fine, there’s no need for wildcards at all.

FWIW,
Erick

> On Jun 29, 2020, at 11:46 AM, Chris Dempsey <cdal...@gmail.com> wrote:
>
> First off, thanks for taking a look, Erick! I see you helping lots of folks
> out here and I've learned a lot from your answers. Much appreciated!
>
>> How regular are your patterns? Are they arbitrary?
>
> Good question. :) That's data that I should have included in the initial
> post but both the values in the `tag` field and the search query itself are
> totally arbitrary (*i.e. user entered values*). I see where you're going if
> the set of either part was limited.
>
>> What’s the field type anyway? Is this field tokenized?
>
> <field name="tag" type="text_kwt_fd_lc" indexed="true" stored="true"
>        multiValued="true"/>
>
> <fieldType name="text_kwt_fd_lc" class="solr.TextField"
>            positionIncrementGap="100" autoGeneratePhraseQueries="true">
>   <analyzer type="index">
>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.ASCIIFoldingFilterFactory"
>             preserveOriginal="true" />
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>     <filter class="solr.ReversedWildcardFilterFactory"
>             withOriginal="true" maxPosAsterisk="2" maxPosQuestion="1" minTrailing="2"
>             maxFractionAsterisk="0"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> On Mon, Jun 29, 2020 at 10:33 AM Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> How regular are your patterns? Are they arbitrary?
>>
>> What I’m wondering is if you could shift your work to the
>> indexing end, perhaps even in an auxiliary field. Could you,
>> say, just index “paid”, “ms-reply-unpaid” etc? Then there
>> are no wildcards at all. This is akin to “concept search”.
>>
>> Otherwise ngramming is your best bet.
>>
>> What’s the field type anyway? Is this field tokenized?
>>
>> There are lots of options, but soooo much depends on whether
>> you can process the data such that you won’t need wildcards.
>>
>> Best,
>> Erick
>>
>>> On Jun 29, 2020, at 11:16 AM, Chris Dempsey <cdal...@gmail.com> wrote:
>>>
>>> Hello, all! I'm relatively new to Solr and Lucene (*using Solr 7.7.1*) but
>>> I'm looking into options for optimizing something like this:
>>>
>>>> fq=(tag:* -tag:*paid*) OR (tag:* -tag:*ms-reply-unpaid*) OR tag:*ms-reply-paid*
>>>
>>> It's probably not a surprise that we're seeing performance issues with
>>> something like this. My understanding is that using the wildcard on both
>>> ends forces a full-text index search. Something like the above can't take
>>> advantage of something like the ReversedWildcardFilterFactory either. I believe
>>> constructing `n-grams` is an option (*at the expense of index size*) but is
>>> there anything I'm overlooking as a possible avenue to look into?
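
To make the n-gram suggestion from the thread concrete, here is a small Python sketch of the idea: generate substrings at index time so that a `*fragment*` search becomes an exact term lookup. Everything in it is an assumption for illustration (the function names, the gram sizes of 3 to 20, and the sample tags); it is not how Solr implements its n-gram filters.

```python
def ngrams(token, min_size, max_size):
    """Return every substring of token with length in [min_size, max_size]."""
    grams = set()
    for size in range(min_size, max_size + 1):
        # range() is empty when size exceeds the token length.
        for start in range(len(token) - size + 1):
            grams.add(token[start:start + size])
    return grams


def build_index(tags, min_size=3, max_size=20):
    """Map each gram to the set of ids of the tags that contain it."""
    index = {}
    for doc_id, tag in enumerate(tags):
        # Lowercase and split on whitespace, then n-gram each token.
        for token in tag.lower().split():
            for gram in ngrams(token, min_size, max_size):
                index.setdefault(gram, set()).add(doc_id)
    return index


def contains(index, fragment):
    """Exact lookup of the fragment stands in for a *fragment* wildcard."""
    return index.get(fragment.lower(), set())


# Hypothetical sample data.
tags = ["blahms-reply-unpaidnonsense", "ms-reply-paid", "something else"]
index = build_index(tags)

print(sorted(contains(index, "ms-reply-unpaid")))
print(sorted(contains(index, "paid")))
```

The trade-off mentioned in the thread shows up directly: every token contributes many grams, so the index grows, and in this sketch fragments shorter than the minimum or longer than the maximum gram size find nothing.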