That explains the OOMs I've been getting in the initial test cycle. I'm working with about 50M (small) documents.
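For reference, here is a rough sketch of the term enumeration described in the quoted message below: a prefix query has to walk every indexed term that starts with the prefix, so shorter prefixes expand to more terms. This is only a toy model over a sorted Python list, not Lucene's actual implementation (which, as the quoted message says, is more efficient). The function name `terms_with_prefix` and the sample terms are made up for illustration.

```python
from bisect import bisect_left


def terms_with_prefix(sorted_terms, prefix):
    """Return every term in a sorted list that starts with `prefix`.

    Toy stand-in for term enumeration: seek to the first candidate with a
    binary search, then scan forward until the prefix no longer matches.
    """
    matches = []
    i = bisect_left(sorted_terms, prefix)
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        matches.append(sorted_terms[i])
        i += 1
    return matches


# Hypothetical term dictionary; a real index would hold millions of terms.
terms = sorted([
    "apple", "apply", "april", "apt", "avocado", "axe",
    "banana", "band", "bandit", "whale", "what",
])

# The shorter the prefix, the more terms end up in the expanded OR clause.
for prefix in ("a", "ap", "app"):
    expanded = terms_with_prefix(terms, prefix)
    print(f"{prefix}* -> {len(expanded)} terms: {expanded}")
```

With a realistic corpus the gap between a one-character and a three-character prefix is far larger than in this toy list, which is why testing against a small index can be misleading.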
On Thu, Mar 26, 2020 at 7:58 AM Erick Erickson <erickerick...@gmail.com> wrote:
>
> the ngramming is a time/space tradeoff. Typically,
> if you restrict the wildcards to have three or more
> “real” characters performance is fine. One real
> character (i.e. a*) will be your worst-case. I’ve
> seen requiring two characters in the prefix work well
> too. It Depends (tm).
>
> Conceptually what happens here is that Lucene has
> to enumerate all of the terms that start with the prefix
> and create a ginormous OR clause. The term
> enumeration will take longer the more terms there are.
> Things are more efficient than that, but still...
>
> So make sure you’re testing with a real corpus. Having
> a test index with just a few terms will be misleading.
>
> Best,
> Erick
>
> > On Mar 25, 2020, at 9:37 PM, matthew sporleder <msporle...@gmail.com> wrote:
> >
> > Okay confirmed-
> > I am getting a more predictable result set after adding an additional
> > field:
> > <fieldType name="string_alpha" class="solr.TextField"
> >            sortMissingLast="true" omitNorms="true">
> >   <analyzer>
> >     <tokenizer class="solr.KeywordTokenizerFactory"/>
> >     <filter class="solr.LowerCaseFilterFactory" />
> >     <filter class="solr.PatternReplaceFilterFactory"
> >             pattern="\p{Punct}" replacement=""/>
> >   </analyzer>
> > </fieldType>
> >
> > q=slug:what_is_lo*&fl=slug&rows=1000&wt=csv&sort=slug_alpha%20asc
> >
> > So it appears I can skip edge ngram entirely using this method as
> > slug:foo* appears to be the exact same results as fayt:foo, but I have
> > the cost of the alphaOnly field :)
> >
> > I will try to figure out some benchmarks or something to decide how to go.
> >
> > Thanks again for the help so far.
> >
> >
> > On Wed, Mar 25, 2020 at 2:39 PM Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >>
> >> You’re getting the correct sorted order… The underscore character is
> >> confusing you.
> >>
> >> The ASCII code for underscore is 0x5F, which sorts after the uppercase
> >> letters but before any lowercase letter.
> >>
> >> See the alphaOnlySort type for a way to remove this, although the output
> >> there can also be confusing.
> >>
> >> Best,
> >> Erick
> >>
> >>> On Mar 25, 2020, at 1:30 PM, matthew sporleder <msporle...@gmail.com>
> >>> wrote:
> >>>
> >>> What_is_Lov_Holtz_known_for
> >>> What_is_lova_after_it_harddens
> >>> What_is_Lova_Moor's_birthday
> >>> What_is_lovable_in_Spanish
> >>> What_is_lovage
> >>> What_is_Lovagny's_population
> >>> What_is_lovan_for
> >>> What_is_lovanox
> >>> What_is_lovarstan_for
> >>> What_is_Lovasatin
> >>
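To make the sort-order discussion in this thread concrete, here is a small sketch comparing plain code point order with an order based on a lowercased, punctuation-stripped key. The key only approximates what the string_alpha analysis chain quoted above does (KeywordTokenizer, lowercase, remove \p{Punct}). It assumes ASCII input and uses Python's `string.punctuation` as a stand-in for Java's `\p{Punct}`. The helper name `alpha_key` is made up for illustration.

```python
import string

# A few slugs taken from the listing quoted in the thread.
slugs = [
    "What_is_lovage",
    "What_is_Lov_Holtz_known_for",
    "What_is_lova_after_it_harddens",
    "What_is_Lova_Moor's_birthday",
    "What_is_lovable_in_Spanish",
]

# Translation table that deletes ASCII punctuation, including "_" and "'".
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def alpha_key(slug: str) -> str:
    """Approximate sort key: lowercase the whole slug, then drop punctuation."""
    return slug.lower().translate(_STRIP_PUNCT)


# Plain code point order: "_" (0x5F) sits after "Z" (0x5A) and before "a" (0x61),
# so every uppercase "L" slug comes ahead of every lowercase "l" slug.
raw_order = sorted(slugs)

# Order by the normalized key: case and underscores no longer influence the result.
alpha_order = sorted(slugs, key=alpha_key)

print("raw:  ", raw_order)
print("alpha:", alpha_order)
```

Note that under the normalized key What_is_Lov_Holtz_known_for moves to the end of this sample, because its key becomes "whatislovholtzknownfor" and "h" sorts after "a". That is one way the alpha-only output can look confusing next to the original slugs.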