Re: edge ngram/find as you type sorting

Erick Erickson Thu, 26 Mar 2020 04:59:11 -0700

The ngramming is a time/space tradeoff. Typically,
if you restrict the wildcards to have three or more
“real” characters performance is fine. One real
character (i.e. a*) will be your worst-case. I’ve
seen requiring two characters in the prefix work well
too. It Depends (tm).


Conceptually what happens here is that Lucene has
to enumerate all of the terms that start with the prefix
and create a ginormous OR clause. The term
enumeration will take longer the more terms there are.
The actual implementation is more efficient than a
literal OR clause, but the cost still grows with the
number of matching terms.

So make sure you’re testing with a real corpus. Having
a test index with just a few terms will be misleading.
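
To make the enumeration cost concrete, here is a minimal Python sketch of
the conceptual model only. It is not how Lucene is implemented; the term
list and the expand_prefix helper are hypothetical, chosen for illustration.
It walks a sorted term dictionary from the first term >= the prefix and
collects every term that starts with it.

```python
from bisect import bisect_left


def expand_prefix(sorted_terms, prefix):
    """Collect every term in a sorted term list that starts with prefix.

    Conceptual stand-in for enumerating terms for a prefix query;
    the returned list plays the role of the big OR clause.
    """
    # All terms sharing the prefix are contiguous in sorted order,
    # starting at the first term that is >= the prefix itself.
    i = bisect_left(sorted_terms, prefix)
    matches = []
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        matches.append(sorted_terms[i])
        i += 1
    return matches


# Hypothetical toy term dictionary; a real index has vastly more terms.
terms = sorted([
    "apple", "lova", "lovable", "lovage", "lovan",
    "lovanox", "love", "low", "what",
])

for prefix in ("l", "lov", "lova"):
    clause = expand_prefix(terms, prefix)
    print(f"{prefix}* -> {len(clause)} terms: {' OR '.join(clause)}")
```

The shorter the prefix, the more terms end up in the clause, which is why
a one-character prefix is the worst case and why a tiny test index hides
the problem.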

Best,
Erick

> On Mar 25, 2020, at 9:37 PM, matthew sporleder <msporle...@gmail.com> wrote:
> 
> Okay confirmed-
> I am getting a more predictable result set after adding an additional field:
>  <fieldType name="string_alpha" class="solr.TextField"
> sortMissingLast="true" omitNorms="true">
>     <analyzer>
>          <tokenizer class="solr.KeywordTokenizerFactory"/>
>          <filter class="solr.LowerCaseFilterFactory" />
>          <filter class="solr.PatternReplaceFilterFactory"
> pattern="\p{Punct}" replacement=""/>
>     </analyzer>
>  </fieldType>
> 
> q=slug:what_is_lo*&fl=slug&rows=1000&wt=csv&sort=slug_alpha%20asc
> 
> So it appears I can skip edge ngram entirely using this method, as
> slug:foo* appears to return exactly the same results as fayt:foo, but I have
> the cost of the alphaOnly field :)
> 
> I will try to figure out some benchmarks or something to decide how to go.
> 
> Thanks again for the help so far.
> 
> 
> On Wed, Mar 25, 2020 at 2:39 PM Erick Erickson <erickerick...@gmail.com> wrote:
>> 
>> You’re getting the correct sorted order… The underscore character is 
>> confusing you.
>> 
>> The ASCII code for underscore is 0x5F (%5F), which sorts after the
>> uppercase letters but before any lowercase letter, so in a lowercased
>> sort it comes before every letter.
>> 
>> See the alphaOnlySort type for a way to remove this, although the output
>> there can also be confusing.
>> 
>> Best,
>> Erick
>> 
>>> On Mar 25, 2020, at 1:30 PM, matthew sporleder <msporle...@gmail.com> wrote:
>>> 
>>> What_is_Lov_Holtz_known_for
>>> What_is_lova_after_it_harddens
>>> What_is_Lova_Moor's_birthday
>>> What_is_lovable_in_Spanish
>>> What_is_lovage
>>> What_is_Lovagny's_population
>>> What_is_lovan_for
>>> What_is_lovanox
>>> What_is_lovarstan_for
>>> What_is_Lovasatin
>>
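
The ordering discussed in the thread can be reproduced outside Solr. The
following is a rough Python sketch, not the Solr analysis chain itself:
it assumes the original listing was sorted on a lowercased value, and it
approximates the string_alpha field type (keyword tokenizer, lowercase,
strip \p{Punct}) with str.lower plus removal of string.punctuation. The
alpha_key helper is a made-up name for illustration.

```python
import string

# The ten slugs from the thread, in the order originally reported.
titles = [
    "What_is_Lov_Holtz_known_for",
    "What_is_lova_after_it_harddens",
    "What_is_Lova_Moor's_birthday",
    "What_is_lovable_in_Spanish",
    "What_is_lovage",
    "What_is_Lovagny's_population",
    "What_is_lovan_for",
    "What_is_lovanox",
    "What_is_lovarstan_for",
    "What_is_Lovasatin",
]

# Translation table that deletes ASCII punctuation, including "_" and "'".
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def alpha_key(value):
    """Approximate the string_alpha analysis: lowercase, drop punctuation."""
    return value.lower().translate(_STRIP_PUNCT)


# Sorting on the lowercased value keeps "_" (0x5F) ahead of a-z.
by_lowercase = sorted(titles, key=str.lower)

# Sorting on the punctuation-free key ignores "_" entirely.
by_alpha = sorted(titles, key=alpha_key)

for title in by_alpha:
    print(alpha_key(title), "<-", title)
```

With the lowercased key the list comes back in the order shown in the
thread, with "Lov_Holtz" first because "_" precedes "a". With the
punctuation-free key "Lov_Holtz" moves to the end, since "lovh" sorts
after every "lova..." entry.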
