Re: edge ngram/find as you type sorting

matthew sporleder Wed, 25 Mar 2020 10:31:22 -0700

Okay.  I am getting pretty much a random order of documents containing
the prefix.


Does my "string_ci" defined below count as
"keywordtokenizer+lowercasefilter"?  (assumption)
Does my "fayt" copy field below look right? (assumption)
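
For anyone checking the two assumptions above by hand, here is a minimal sketch in plain Python (not Solr code) of the terms the two analysis chains in this thread would be expected to emit. The function names are made up for illustration, Python's str.lower() only approximates LowerCaseFilter, and the copyField maxChars truncation is ignored:

```python
# Pure-Python emulation of the analysis chains discussed in this thread.
# This is NOT Solr code; it only mimics KeywordTokenizer + LowerCaseFilter
# and a front-anchored EdgeNGramFilter (minGramSize=1, maxGramSize=25).

def string_ci_terms(value):
    """Keyword tokenizer + lowercase: the whole value becomes one term."""
    return [value.lower()]


def edgytext_index_terms(value, min_gram=1, max_gram=25):
    """Index-time chain: one lowercased token, then its leading n-grams."""
    token = value.lower()
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


def fayt_matches(query, slug):
    """Query-time chain is keyword + lowercase, so the query is one term."""
    query_term = query.lower()
    return query_term in edgytext_index_terms(slug)


if __name__ == "__main__":
    print(string_ci_terms("What_is_lovage"))
    print(edgytext_index_terms("What_is_lovage")[:4])
    print(fayt_matches("what_is_lov", "What_is_lovage"))
```

Under these assumptions a prefix query can only match if it is no longer than maxGramSize, since no longer gram is ever indexed. The Solr admin analysis page remains the authoritative check.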

I have a bunch of web pages indexed with "slug" fields with the prefix
"what_is_lov"
so I search:
select?q=fayt:what_is_lov&fl=slug&rows=1000&sort=slug%20asc&wt=csv

and get:
slug
What_is_Lov_Holtz_known_for
What_is_lova_after_it_harddens
What_is_Lova_Moor's_birthday
What_is_lovable_in_Spanish
What_is_lovage
What_is_Lovagny's_population
What_is_lovan_for
What_is_lovanox
What_is_lovarstan_for
What_is_Lovasatin
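
One way to compare the listing above with the order a lowercased keyword sort would give is to run the returned slugs through a short script. This is a minimal sketch, assuming plain code-point comparison of lowercased strings is close enough to the indexed term order. The last step shows the shortest-first order that sorting on a length field would give, with the lowercased slug as a tie-breaker:

```python
# Slugs copied from the csv response above, in the order Solr returned them.
slugs = [
    "What_is_Lov_Holtz_known_for",
    "What_is_lova_after_it_harddens",
    "What_is_Lova_Moor's_birthday",
    "What_is_lovable_in_Spanish",
    "What_is_lovage",
    "What_is_Lovagny's_population",
    "What_is_lovan_for",
    "What_is_lovanox",
    "What_is_lovarstan_for",
    "What_is_Lovasatin",
]

# Is the returned order the same as sorting the lowercased values?
lowered = [s.lower() for s in slugs]
print("lowercased order:", lowered == sorted(lowered))

# Is it the same as a case-sensitive sort of the raw values?
print("case-sensitive order:", slugs == sorted(slugs))

# Shortest-first, then alphabetical: what a length sort plus a
# secondary sort on the lowercased slug would produce.
by_length = sorted(slugs, key=lambda s: (len(s), s.lower()))
for slug in by_length[:3]:
    print(len(slug), slug)
```

If the first check prints True, the listing is already in the order a lowercased keyword sort produces, and the open question is whether shortest-first is the order actually wanted.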



On Wed, Mar 25, 2020 at 1:15 PM Erick Erickson <erickerick...@gmail.com> wrote:
>
> What _is_ happening? Please provide examples of the inputs
> and outputs that don’t work for you. ‘cause
> the sort order should be “nothing comes before something”
> so sorting ascending on a keywordtokenizer+lowercasefilter
> should give you exactly what you’re asking for with no
> need for a length field.
>
> Best,
> Erick
>
> > On Mar 25, 2020, at 11:07 AM, matthew sporleder <msporle...@gmail.com> 
> > wrote:
> >
> > My original goal was to avoid indexing the string length because I
> > wanted edge ngram to "score" based on how "exact" the match was:
> >
> > q=abc
> > "abc" has a high score
> > "abcd" has a lower score
> > "abcde" has an even lower score
> >
> > You say sorting by the original field will do that but in practice
> > it is not happening so I am probably missing something.
> >
> > I *am* getting a close version of what I said above with sorting on
> > the length, which I added to the index.
> >
> > searching for my keyword-lowercase field:abc* + sorting by length is
> > also working so maybe I can skip the edge ngram field entirely and
> > just do that but I was hoping to trade some disk space for
> > performance.  This field will get queried a lot.
> >
> >
> > On Wed, Mar 25, 2020 at 10:39 AM Erick Erickson <erickerick...@gmail.com> 
> > wrote:
> >>
> >> Why do you want to deal with score at all? Sorting
> >> overrides score-based sorting. Well, unless you
> >> specify score as a secondary sort. But since you’re
> >> sorting by length anyway, trying to score
> >> based on proximity to the end does nothing.
> >>
> >> The weirdness you’re going to get here, though, is
> >> that the order of the results will not be alphabetical.
> >> Say you have two docs, one with abcd and one with
> >> abce. Now say you search on abc. Whether abcd or
> >> abce comes first is indeterminate.
> >>
> >> If you simply stored the keyword-lowercased value
> >> in a copyfield and sorted on _that_, you wouldn’t have
> >> this problem. But if you’re really worried about space,
> >> that might not be an option.
> >>
> >> Best,
> >> Erick
> >>
> >>> On Mar 25, 2020, at 9:49 AM, matthew sporleder <msporle...@gmail.com> 
> >>> wrote:
> >>>
> >>> Where I landed:
> >>>
> >>> <fieldType name="string_ci" class="solr.TextField"
> >>> sortMissingLast="true" omitNorms="false">
> >>>    <analyzer>
> >>>         <tokenizer class="solr.KeywordTokenizerFactory"/>
> >>>         <filter class="solr.LowerCaseFilterFactory" />
> >>>    </analyzer>
> >>> </fieldType>
> >>>
> >>> <fieldType name="edgytext" class="solr.TextField" 
> >>> positionIncrementGap="100">
> >>> <analyzer type="index">
> >>>  <tokenizer class="solr.KeywordTokenizerFactory"/>
> >>>  <filter class="solr.LowerCaseFilterFactory" />
> >>>  <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
> >>> maxGramSize="25" />
> >>> </analyzer>
> >>> <analyzer type="query">
> >>>  <tokenizer class="solr.KeywordTokenizerFactory"/>
> >>>  <filter class="solr.LowerCaseFilterFactory"/>
> >>> </analyzer>
> >>> </fieldType>
> >>>
> >>>
> >>> <field name="slug" type="string_ci" indexed="true" stored="true"
> >>> multiValued="false" />
> >>> <field name="fayt" type="edgytext" indexed="true" stored="false"
> >>> omitNorms="false" omitTermFreqAndPositions="false" multiValued="true"
> >>> />
> >>> <field name="qt_len" type="int" indexed="true" stored="true"
> >>> multiValued="false" />
> >>>
> >>> ---
> >>>
> >>> I can then do a search for
> >>>
> >>> q=fayt:my_article_slu&sort=qt_len asc
> >>>
> >>> to get the shortest/most exact find-as-you-type match.  I couldn't get
> >>> around all results having the same score (can I boost proximity to the
> >>> end of a string?) in the edge ngram search but I am hoping this is the
> >>> fastest way to do this type of search since I can avoid wildcards
> >>> "my_article_slu*" and stuff.
> >>>
> >>> More suggestions welcome and thanks for the help.  I will re-index
> >>> with omitNorms=true again to see if I can save a little space.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Tue, Mar 24, 2020 at 11:39 AM matthew sporleder <msporle...@gmail.com> 
> >>> wrote:
> >>>>
> >>>> Okay I appreciate you responding.
> >>>>
> >>>> Switching "slug" from "string_ci" to class="solr.StrField" accomplished
> >>>> about the same results, which makes sense to me now :)
> >>>>
> >>>> The previous definition of string_ci was:
> >>>> <fieldType name="string_ci" class="solr.TextField"
> >>>> sortMissingLast="true" omitNorms="true">
> >>>>    <analyzer>
> >>>>         <tokenizer class="solr.KeywordTokenizerFactory"/>
> >>>>         <filter class="solr.LowerCaseFilterFactory" />
> >>>>    </analyzer>
> >>>> </fieldType>
> >>>>
> >>>> So lowercase + KeywordTokenizerFactory;
> >>>>
> >>>> I am trying again with omitNorms=false to see if I can get the more
> >>>> "exact" matches to score better this time around.
> >>>>
> >>>>
> >>>> On Tue, Mar 24, 2020 at 9:54 AM Erick Erickson <erickerick...@gmail.com> 
> >>>> wrote:
> >>>>>
> >>>>> Won’t work. String types are totally unanalyzed. Your string_ci 
> >>>>> fieldType is what I was looking for.
> >>>>>
> >>>>> No, you shouldn’t kill the lowercasefilter unless you want all of your
> >>>>> searches to be case-sensitive.
> >>>>>
> >>>>> So you should try:
> >>>>>
> >>>>> q=edgy_text:whatever&sort=string_ci asc
> >>>>>
> >>>>> Please use the admin>>pick_core>>analysis page when thinking about 
> >>>>> changing your schema, it’ll answer a _lot_ of these questions 
> >>>>> immediately.
> >>>>>
> >>>>> Best,
> >>>>> Erick
> >>>>>
> >>>>>> On Mar 24, 2020, at 8:37 AM, matthew sporleder <msporle...@gmail.com> 
> >>>>>> wrote:
> >>>>>>
> >>>>>> Oh maybe a schema bug!
> >>>>>>
> >>>>>> my string_ci:
> >>>>>> <fieldType name="string_ci" class="solr.TextField"
> >>>>>> sortMissingLast="true" omitNorms="true">
> >>>>>>   <analyzer>
> >>>>>>        <tokenizer class="solr.KeywordTokenizerFactory"/>
> >>>>>>        <filter class="solr.LowerCaseFilterFactory" />
> >>>>>>   </analyzer>
> >>>>>> </fieldType>
> >>>>>>
> >>>>>> going to try this instead:
> >>>>>> <fieldType name="string_lctoken" class="solr.StrField"
> >>>>>> sortMissingLast="true" omitNorms="true">
> >>>>>>   <analyzer>
> >>>>>>        <tokenizer class="solr.KeywordTokenizerFactory"/>
> >>>>>>        <filter class="solr.LowerCaseFilterFactory" />
> >>>>>>   </analyzer>
> >>>>>> </fieldType>
> >>>>>>
> >>>>>> Then I can probably kill the lowercasefilter on edgytext:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Mar 24, 2020 at 7:44 AM Erick Erickson 
> >>>>>> <erickerick...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Sort by the full field. You’ll need to copy to a field with 
> >>>>>>> keywordTokenizer and lowercaseFilter (string_ci? assuming it’s not
> >>>>>>> really a “string” type).
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Erick
> >>>>>>>
> >>>>>>>> On Mar 24, 2020, at 7:10 AM, matthew sporleder 
> >>>>>>>> <msporle...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> I have added an edge ngram field to my index and get decent results
> >>>>>>>> with partial words but the results appear randomly sorted and all
> >>>>>>>> contain the same score.  Ideally I would like to sort by shortest
> >>>>>>>> ngram match within my other qualifiers.
> >>>>>>>>
> >>>>>>>> Is there a canonical solution to this?
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Matt
> >>>>>>>>
> >>>>>>>> p.s. I mostly followed
> >>>>>>>> https://lucidworks.com/post/auto-suggest-from-popular-queries-using-edgengrams/
> >>>>>>>>
> >>>>>>>> schema bits:
> >>>>>>>>
> >>>>>>>> <fieldType name="edgytext" class="solr.TextField" 
> >>>>>>>> positionIncrementGap="100">
> >>>>>>>> <analyzer type="index">
> >>>>>>>> <tokenizer class="solr.KeywordTokenizerFactory"/>
> >>>>>>>> <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>> <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
> >>>>>>>> maxGramSize="25" />
> >>>>>>>> </analyzer>
> >>>>>>>>
> >>>>>>>> <field name="slug" type="string_ci" indexed="true" stored="true"
> >>>>>>>> multiValued="false" />
> >>>>>>>>
> >>>>>>>> <field name="fayt" type="edgytext" indexed="true" stored="false"
> >>>>>>>> omitNorms="false" omitTermFreqAndPositions="true" multiValued="true"
> >>>>>>>> />
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> <copyField source="slug" dest="fayt" maxChars="65" />
> >>>>>>>
> >>>>>
> >>
>
