Re: edge ngram/find as you type sorting

matthew sporleder Thu, 26 Mar 2020 05:53:12 -0700

That explains the OOMs I've been getting in the initial test cycle.
I'm working with about 50M (small) documents.


On Thu, Mar 26, 2020 at 7:58 AM Erick Erickson <erickerick...@gmail.com> wrote:
>
> The ngramming is a time/space tradeoff. Typically,
> if you restrict the wildcards to have three or more
> “real” characters performance is fine. One real
> character (i.e. a*) will be your worst-case. I’ve
> seen requiring two characters in the prefix work well
> too. It Depends (tm).
>
> Conceptually what happens here is that Lucene has
> to enumerate all of the terms that start with the prefix
> and create a ginormous OR clause. The term
> enumeration will take longer the more terms there are.
> Things are more efficient than that, but still...
>
> So make sure you’re testing with a real corpus. Having
> a test index with just a few terms will be misleading.
>
> Best,
> Erick
>
> > On Mar 25, 2020, at 9:37 PM, matthew sporleder <msporle...@gmail.com> wrote:
> >
> > Okay, confirmed.
> > I am getting a more predictable result set after adding an additional
> > field:
> >  <fieldType name="string_alpha" class="solr.TextField"
> > sortMissingLast="true" omitNorms="true">
> >     <analyzer>
> >          <tokenizer class="solr.KeywordTokenizerFactory"/>
> >          <filter class="solr.LowerCaseFilterFactory" />
> >          <filter class="solr.PatternReplaceFilterFactory"
> > pattern="\p{Punct}" replacement=""/>
> >     </analyzer>
> >  </fieldType>
> >
> > q=slug:what_is_lo*&fl=slug&rows=1000&wt=csv&sort=slug_alpha%20asc
> >
> > So it appears I can skip edge ngrams entirely with this method, since
> > slug:foo* appears to return exactly the same results as fayt:foo, but I
> > pay the cost of the alphaOnly field :)
> >
> > I will try to figure out some benchmarks or something to decide how to go.
> >
> > Thanks again for the help so far.
> >
> >
> > On Wed, Mar 25, 2020 at 2:39 PM Erick Erickson <erickerick...@gmail.com> wrote:
> >>
> >> You’re getting the correct sorted order… The underscore character is 
> >> confusing you.
> >>
> >> The ASCII code for underscore is 0x5F (%5F when URL-encoded), which sorts
> >> before any lowercase letter (and after the uppercase ones).
> >>
> >> See the alphaOnlySort type for a way to remove this, although the output
> >> there can also be confusing.
> >>
> >> Best,
> >> Erick
> >>
> >>> On Mar 25, 2020, at 1:30 PM, matthew sporleder <msporle...@gmail.com> wrote:
> >>>
> >>> What_is_Lov_Holtz_known_for
> >>> What_is_lova_after_it_harddens
> >>> What_is_Lova_Moor's_birthday
> >>> What_is_lovable_in_Spanish
> >>> What_is_lovage
> >>> What_is_Lovagny's_population
> >>> What_is_lovan_for
> >>> What_is_lovanox
> >>> What_is_lovarstan_for
> >>> What_is_Lovasatin
> >>
>
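The sort-order question in this thread can be checked outside Solr. The sketch below only approximates the string_alpha analyzer chain (keyword tokenizer, lowercase, strip \p{Punct}) in plain Python; alpha_key and the sample list are illustrative names, not part of Solr, and the sketch assumes ASCII-only slugs.

```python
import string

# ASCII punctuation; Java's \p{Punct} covers the same 32 characters,
# including the underscore and the apostrophe.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def alpha_key(value: str) -> str:
    """Approximate string_alpha: whole value, lowercased, punctuation removed."""
    return value.lower().translate(_PUNCT_TABLE)


# A few slugs taken from the result list quoted in the thread.
slugs = [
    "What_is_lovage",
    "What_is_Lov_Holtz_known_for",
    "What_is_lovable_in_Spanish",
    "What_is_Lova_Moor's_birthday",
]

# Plain code-point order: uppercase letters < underscore < lowercase letters.
raw_order = sorted(slugs)

# Order after normalizing, which is roughly what sorting on the extra field gives.
alpha_order = sorted(slugs, key=alpha_key)

print(raw_order)
print(alpha_order)
```

With raw code points, every "What_is_L..." slug sorts ahead of every "What_is_l..." slug; with the normalized key, case and underscores no longer influence the order.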

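Erick's point about prefix queries versus edge ngrams can also be illustrated with a toy term dictionary. This is a conceptual sketch only, not how Lucene implements term enumeration; terms_with_prefix and edge_ngrams are hypothetical helpers, and the upper-bound trick assumes no term contains U+10FFFF.

```python
from bisect import bisect_left


def terms_with_prefix(sorted_terms, prefix):
    """Return the terms in a sorted list that start with prefix.

    A prefix query has to visit every one of these terms, so the work
    grows with the number of matches (query-time cost).
    """
    lo = bisect_left(sorted_terms, prefix)
    # Everything starting with prefix sorts below prefix + highest code point.
    hi = bisect_left(sorted_terms, prefix + "\U0010ffff")
    return sorted_terms[lo:hi]


def edge_ngrams(term, min_gram=1, max_gram=10):
    """Return the leading substrings an edge ngram filter would index.

    Each prefix becomes its own indexed term, so a lookup is a single
    term match, at the price of a larger index (index-time cost).
    """
    return [term[:n] for n in range(min_gram, min(len(term), max_gram) + 1)]


terms = sorted(["lova", "lovable", "lovage", "lovan", "lovanox", "low", "map"])

# Shorter prefixes match more terms, which is why a* is the worst case.
for prefix in ("l", "lova", "lovan"):
    print(prefix, len(terms_with_prefix(terms, prefix)))

print(edge_ngrams("lovage", min_gram=2, max_gram=4))
```

On a real corpus the number of terms behind a one-character prefix can be very large, which is why testing against a small index is misleading.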