Re: edge ngram/find as you type sorting

matthew sporleder Wed, 25 Mar 2020 08:08:36 -0700

My original goal was to avoid indexing the string length because I
wanted edge ngram to "score" based on how "exact" the match was:


q=abc
"abc" has a high score
"abcd" has a lower score
"abcde" has an even lower score

You say sorting by the original field will do that but in practice
it is not happening so I am probably missing something.
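
One way to get the ordering above from scoring rather than from a pure sort, as a minimal sketch: query both the edge ngram field and the keyword-lowercase field with edismax and boost the latter, so an exact match on slug outranks prefix-only matches. The handler name /fayt_scored, the boost of 10, and putting this in solrconfig.xml are assumptions for illustration; fayt, slug and qt_len are the fields from the schema quoted below. It still leans on qt_len to order the non-exact matches.

```xml
<!-- Hypothetical handler for solrconfig.xml; the name and boost are illustrative. -->
<requestHandler name="/fayt_scored" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- fayt matches any indexed prefix; slug matches only the full lowercased value -->
    <str name="qf">fayt slug^10</str>
    <!-- highest score first (the exact match), then shortest value first -->
    <str name="sort">score desc, qt_len asc</str>
  </lst>
</requestHandler>
```

With this in place, a request such as /fayt_scored?q=abc would be expected to return "abc" first, then "abcd", then "abcde".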

I *am* getting a close version of what I said above with sorting on
the length, which I added to the index.
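
A sketch of that length sort with a tie-breaker, assuming the qt_len and slug fields from the schema quoted below: adding slug as a secondary sort should make the order of equal-length matches (abcd vs. abce) alphabetical instead of arbitrary. The handler name /fayt_sorted is made up, and sorting on slug assumes it stays single-valued with the keyword tokenizer.

```xml
<!-- Hypothetical handler for solrconfig.xml; only the sort parameter matters here. -->
<requestHandler name="/fayt_sorted" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- search the edge ngram field by default -->
    <str name="df">fayt</str>
    <!-- shortest match first; equal lengths fall back to alphabetical order -->
    <str name="sort">qt_len asc, slug asc</str>
  </lst>
</requestHandler>
```

The same thing as an ad hoc request would be q=fayt:abc&sort=qt_len asc, slug asc.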

Searching for my keyword-lowercase field:abc* + sorting by length is
also working, so maybe I can skip the edge ngram field entirely and
just do that, but I was hoping to trade some disk space for
performance.  This field will get queried a lot.


On Wed, Mar 25, 2020 at 10:39 AM Erick Erickson <erickerick...@gmail.com> wrote:
>
> Why do you want to deal with score at all? Sorting
> overrides score-based sorting. Well, unless you
> specify score as a secondary sort. But since you’re
> sorting by length anyway, trying to score
> based on proximity to the end does nothing.
>
> The weirdness you’re going to get here, though, is
> that the order of the results will not be alphabetical.
> Say you have two docs, one with abcd and one with
> abce. Now say you search on abc. Whether abcd or
> abce comes first is indeterminate.
>
> If you simply stored the keyword-lowercased value
> in a copyfield and sorted on _that_, you wouldn’t have
> this problem. But if you’re really worried about space,
> that might not be an option.
>
> Best,
> Erick
>
> > On Mar 25, 2020, at 9:49 AM, matthew sporleder <msporle...@gmail.com> wrote:
> >
> > Where I landed:
> >
> >  <fieldType name="string_ci" class="solr.TextField"
> > sortMissingLast="true" omitNorms="false">
> >     <analyzer>
> >          <tokenizer class="solr.KeywordTokenizerFactory"/>
> >          <filter class="solr.LowerCaseFilterFactory" />
> >     </analyzer>
> >  </fieldType>
> >
> > <fieldType name="edgytext" class="solr.TextField" 
> > positionIncrementGap="100">
> > <analyzer type="index">
> >   <tokenizer class="solr.KeywordTokenizerFactory"/>
> >   <filter class="solr.LowerCaseFilterFactory" />
> >   <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
> > maxGramSize="25" />
> > </analyzer>
> > <analyzer type="query">
> >   <tokenizer class="solr.KeywordTokenizerFactory"/>
> >   <filter class="solr.LowerCaseFilterFactory"/>
> > </analyzer>
> > </fieldType>
> >
> >
> >  <field name="slug" type="string_ci" indexed="true" stored="true"
> > multiValued="false" />
> >  <field name="fayt" type="edgytext" indexed="true" stored="false"
> > omitNorms="false" omitTermFreqAndPositions="false" multiValued="true"
> > />
> >  <field name="qt_len" type="int" indexed="true" stored="true"
> > multiValued="false" />
> >
> > ---
> >
> > I can then do a search for
> >
> > q=fayt:my_article_slu&sort=qt_len asc
> >
> > to get the shortest/most exact find-as-you-type match.  I couldn't get
> > around all results having the same score (can I boost proximity to the
> > end of a string?) in the edge ngram search but I am hoping this is the
> > fastest way to do this type of search since I can avoid wildcards
> > "my_article_slu*" and stuff.
> >
> > More suggestions welcome and thanks for the help.  I will re-index
> > with omitNorms=true again to see if I can save a little space.
> >
> >
> >
> >
> >
> > On Tue, Mar 24, 2020 at 11:39 AM matthew sporleder <msporle...@gmail.com> 
> > wrote:
> >>
> >> Okay I appreciate you responding.
> >>
> > >> Switching "slug" from "string_ci" to class="solr.StrField" accomplished
> > >> about the same results, which makes sense to me now :)
> >>
> >> The previous definition of string_ci was:
> >>  <fieldType name="string_ci" class="solr.TextField"
> >> sortMissingLast="true" omitNorms="true">
> >>     <analyzer>
> >>          <tokenizer class="solr.KeywordTokenizerFactory"/>
> >>          <filter class="solr.LowerCaseFilterFactory" />
> >>     </analyzer>
> >>  </fieldType>
> >>
> >> So lowercase + KeywordTokenizerFactory;
> >>
> >> I am trying again with omitNorms=false  to see if I can get the more
> >> "exact" matches to score better this time around.
> >>
> >>
> >> On Tue, Mar 24, 2020 at 9:54 AM Erick Erickson <erickerick...@gmail.com> 
> >> wrote:
> >>>
> >>> Won’t work. String types are totally unanalyzed. Your string_ci fieldType 
> >>> is what I was looking for.
> >>>
> > >>> No, you shouldn’t kill the lowercasefilter unless you want all of your 
> > >>> searches to be case-sensitive.
> >>>
> >>> So you should try:
> >>>
> >>> q=edgy_text:whatever&sort=string_ci asc
> >>>
> >>> Please use the admin>>pick_core>>analysis page when thinking about 
> >>> changing your schema, it’ll answer a _lot_ of these questions immediately.
> >>>
> >>> Best,
> >>> Erick
> >>>
> >>>> On Mar 24, 2020, at 8:37 AM, matthew sporleder <msporle...@gmail.com> 
> >>>> wrote:
> >>>>
> >>>> Oh maybe a schema bug!
> >>>>
> >>>> my string_ci:
> >>>> <fieldType name="string_ci" class="solr.TextField"
> >>>> sortMissingLast="true" omitNorms="true">
> >>>>    <analyzer>
> >>>>         <tokenizer class="solr.KeywordTokenizerFactory"/>
> >>>>         <filter class="solr.LowerCaseFilterFactory" />
> >>>>    </analyzer>
> >>>> </fieldType>
> >>>>
> >>>> going to try this instead:
> >>>> <fieldType name="string_lctoken" class="solr.StrField"
> >>>> sortMissingLast="true" omitNorms="true">
> >>>>    <analyzer>
> >>>>         <tokenizer class="solr.KeywordTokenizerFactory"/>
> >>>>         <filter class="solr.LowerCaseFilterFactory" />
> >>>>    </analyzer>
> >>>> </fieldType>
> >>>>
> >>>> Then I can probably kill the lowercasefilter on edgeytext:
> >>>>
> >>>>
> >>>>
> >>>> On Tue, Mar 24, 2020 at 7:44 AM Erick Erickson <erickerick...@gmail.com> 
> >>>> wrote:
> >>>>>
> >>>>> Sort by the full field. You’ll need to copy to a field with 
> >>>>> keywordTokenizer and lowercaseFilter (string_ci? assuming it’s not 
> > >>>>> really a “string”) type.
> >>>>>
> >>>>> Best,
> >>>>> Erick
> >>>>>
> >>>>>> On Mar 24, 2020, at 7:10 AM, matthew sporleder <msporle...@gmail.com> 
> >>>>>> wrote:
> >>>>>>
> >>>>>> I have added an edge ngram field to my index and get decent results
> >>>>>> with partial words but the results appear randomly sorted and all
> >>>>>> contain the same score.  Ideally I would like to sort by shortest
> >>>>>> ngram match within my other qualifiers.
> >>>>>>
> >>>>>> Is there a canonical solution to this?
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Matt
> >>>>>>
> >>>>>> p.s. I mostly followed
> >>>>>> https://lucidworks.com/post/auto-suggest-from-popular-queries-using-edgengrams/
> >>>>>>
> >>>>>> schema bits:
> >>>>>>
> >>>>>> <fieldType name="edgytext" class="solr.TextField" 
> >>>>>> positionIncrementGap="100">
> >>>>>> <analyzer type="index">
> >>>>>> <tokenizer class="solr.KeywordTokenizerFactory"/>
> >>>>>> <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>> <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
> >>>>>> maxGramSize="25" />
> >>>>>> </analyzer>
> >>>>>>
> >>>>>> <field name="slug" type="string_ci" indexed="true" stored="true"
> >>>>>> multiValued="false" />
> >>>>>>
> >>>>>> <field name="fayt" type="edgytext" indexed="true" stored="false"
> >>>>>> omitNorms="false" omitTermFreqAndPositions="true" multiValued="true"
> >>>>>> />
> >>>>>>
> >>>>>>
> >>>>>> <copyField source="slug" dest="fayt" maxChars="65" />
> >>>>>
> >>>
>
