Okay. I am getting pretty much a random order of documents containing the prefix.
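One quick way to check whether that order really is random is to compare it against a couple of candidate orderings offline. The following is only a minimal sketch: it assumes Python's `str.lower()` is a fair stand-in for what KeywordTokenizer + LowerCaseFilter produce for these ASCII slugs, and the helper names are made up for illustration. The slugs are the ones from the CSV output quoted in this message.

```python
# Slugs copied from the CSV output in this message, in the order Solr returned them.
returned = [
    "What_is_Lov_Holtz_known_for",
    "What_is_lova_after_it_harddens",
    "What_is_Lova_Moor's_birthday",
    "What_is_lovable_in_Spanish",
    "What_is_lovage",
    "What_is_Lovagny's_population",
    "What_is_lovan_for",
    "What_is_lovanox",
    "What_is_lovarstan_for",
    "What_is_Lovasatin",
]


def lowercase_key(slug: str) -> str:
    """Assumed approximation of string_ci: the whole value as one lowercased token."""
    return slug.lower()


# Candidate 1: plain ascending sort on the lowercased value.
by_lowercase = sorted(returned, key=lowercase_key)

# Candidate 2: shortest first, ties broken by the lowercased value.
by_length = sorted(returned, key=lambda s: (len(s), lowercase_key(s)))

matches_lowercase_sort = returned == by_lowercase
matches_length_sort = returned == by_length

print("matches lowercase sort:", matches_lowercase_sort)
print("matches shortest-first sort:", matches_length_sort)
print("shortest-first would start with:", by_length[0])
```

If the first comparison holds and the second does not, the results are ordered alphabetically on the lowercased value rather than shortest-first, which could be why the list looks random when the most exact match is expected at the top.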
Does my "string_ci" defined below count as
"keywordtokenizer+lowercasefilter"? (assumption)

Does my "fayt" copy field below look right? (assumption)

I have a bunch of web pages indexed with "slug" fields with the prefix
"what_is_lov" so I search:

select?q=fayt:what_is_lov&fl=slug&rows=1000&sort=slug%20asc&wt=csv

and get:

slug
What_is_Lov_Holtz_known_for
What_is_lova_after_it_harddens
What_is_Lova_Moor's_birthday
What_is_lovable_in_Spanish
What_is_lovage
What_is_Lovagny's_population
What_is_lovan_for
What_is_lovanox
What_is_lovarstan_for
What_is_Lovasatin

On Wed, Mar 25, 2020 at 1:15 PM Erick Erickson <erickerick...@gmail.com> wrote:
>
> What _is_ happening? Please provide examples of the inputs
> and outputs that don’t work for you. ‘cause
> the sort order should be “nothing comes before something”
> so sorting ascending on a keywordtokenizer+lowercasefilter
> should give you exactly what you’re asking for with no
> need for a length field.
>
> Best,
> Erick
>
> > On Mar 25, 2020, at 11:07 AM, matthew sporleder <msporle...@gmail.com>
> > wrote:
> >
> > My original goal was to avoid indexing the string length because I
> > wanted edge ngram to "score" based on how "exact" the match was:
> >
> > q=abc
> > "abc" has a high score
> > "abcd" has a lower score
> > "abcde" has an even lower score
> >
> > You say sorting by the original field will do that but in practice
> > it is not happening so I am probably missing something.
> >
> > I *am* getting a close version of what I said above with sorting on
> > the length, which I added to the index.
> >
> > searching for my keyword-lowercase field:abc* + sorting by length is
> > also working so maybe I can skip the edge ngram field entirely and
> > just do that but I was hoping to trade some disk space for
> > performance. This field will get queried a lot.
> >
> >
> > On Wed, Mar 25, 2020 at 10:39 AM Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >>
> >> Why do you want to deal with score at all?
> >> Sorting overrides score-based sorting. Well, unless you
> >> specify score as a secondary sort. But since you’re
> >> sorting by length anyway, trying to score
> >> based on proximity to the end does nothing.
> >>
> >> The weirdness you’re going to get here, though, is
> >> that the order of the results will not be alphabetical.
> >> Say you have two docs, one with abcd and one with
> >> abce. Now say you search on abc. Whether abcd or
> >> abce comes first is indeterminate.
> >>
> >> If you simply stored the keyword-lowercased value
> >> in a copyfield and sorted on _that_, you wouldn’t have
> >> this problem. But if you’re really worried about space,
> >> that might not be an option.
> >>
> >> Best,
> >> Erick
> >>
> >>> On Mar 25, 2020, at 9:49 AM, matthew sporleder <msporle...@gmail.com>
> >>> wrote:
> >>>
> >>> Where I landed:
> >>>
> >>> <fieldType name="string_ci" class="solr.TextField"
> >>>     sortMissingLast="true" omitNorms="false">
> >>>   <analyzer>
> >>>     <tokenizer class="solr.KeywordTokenizerFactory"/>
> >>>     <filter class="solr.LowerCaseFilterFactory" />
> >>>   </analyzer>
> >>> </fieldType>
> >>>
> >>> <fieldType name="edgytext" class="solr.TextField"
> >>>     positionIncrementGap="100">
> >>>   <analyzer type="index">
> >>>     <tokenizer class="solr.KeywordTokenizerFactory"/>
> >>>     <filter class="solr.LowerCaseFilterFactory" />
> >>>     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
> >>>         maxGramSize="25" />
> >>>   </analyzer>
> >>>   <analyzer type="query">
> >>>     <tokenizer class="solr.KeywordTokenizerFactory"/>
> >>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>   </analyzer>
> >>> </fieldType>
> >>>
> >>>
> >>> <field name="slug" type="string_ci" indexed="true" stored="true"
> >>>     multiValued="false" />
> >>> <field name="fayt" type="edgytext" indexed="true" stored="false"
> >>>     omitNorms="false" omitTermFreqAndPositions="false" multiValued="true"
> >>>     />
> >>> <field name="qt_len" type="int" indexed="true" stored="true"
> >>>     multiValued="false" />
> >>>
> >>> ---
> >>>
> >>> I can then do a search for
> >>>
> >>> q=fayt:my_article_slu&sort=qt_len asc
> >>>
> >>> to get the shortest/most exact find-as-you-type match. I couldn't get
> >>> around all results having the same score (can I boost proximity to the
> >>> end of a string?) in the edge ngram search but I am hoping this is the
> >>> fastest way to do this type of search since I can avoid wildcards
> >>> "my_article_slu*" and stuff.
> >>>
> >>> More suggestions welcome and thanks for the help. I will re-index
> >>> with omitNorms=true again to see if I can save a little space.
> >>>
> >>>
> >>> On Tue, Mar 24, 2020 at 11:39 AM matthew sporleder <msporle...@gmail.com>
> >>> wrote:
> >>>>
> >>>> Okay I appreciate you responding.
> >>>>
> >>>> Switching "slug" from "string_ci" to class="solr.StrField" accomplished
> >>>> about the same results, which makes sense to me now :)
> >>>>
> >>>> The previous definition of string_ci was:
> >>>> <fieldType name="string_ci" class="solr.TextField"
> >>>>     sortMissingLast="true" omitNorms="true">
> >>>>   <analyzer>
> >>>>     <tokenizer class="solr.KeywordTokenizerFactory"/>
> >>>>     <filter class="solr.LowerCaseFilterFactory" />
> >>>>   </analyzer>
> >>>> </fieldType>
> >>>>
> >>>> So lowercase + KeywordTokenizerFactory;
> >>>>
> >>>> I am trying again with omitNorms=false to see if I can get the more
> >>>> "exact" matches to score better this time around.
> >>>>
> >>>>
> >>>> On Tue, Mar 24, 2020 at 9:54 AM Erick Erickson <erickerick...@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> Won’t work. String types are totally unanalyzed. Your string_ci
> >>>>> fieldType is what I was looking for.
> >>>>>
> >>>>> No, you shouldn’t kill the lowercasefilter unless you want all of your
> >>>>> searches to be case-sensitive.
> >>>>>
> >>>>> So you should try:
> >>>>>
> >>>>> q=edgy_text:whatever&sort=string_ci asc
> >>>>>
> >>>>> Please use the admin>>pick_core>>analysis page when thinking about
> >>>>> changing your schema, it’ll answer a _lot_ of these questions
> >>>>> immediately.
> >>>>>
> >>>>> Best,
> >>>>> Erick
> >>>>>
> >>>>>> On Mar 24, 2020, at 8:37 AM, matthew sporleder <msporle...@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>> Oh maybe a schema bug!
> >>>>>>
> >>>>>> my string_ci:
> >>>>>> <fieldType name="string_ci" class="solr.TextField"
> >>>>>>     sortMissingLast="true" omitNorms="true">
> >>>>>>   <analyzer>
> >>>>>>     <tokenizer class="solr.KeywordTokenizerFactory"/>
> >>>>>>     <filter class="solr.LowerCaseFilterFactory" />
> >>>>>>   </analyzer>
> >>>>>> </fieldType>
> >>>>>>
> >>>>>> going to try this instead:
> >>>>>> <fieldType name="string_lctoken" class="solr.StrField"
> >>>>>>     sortMissingLast="true" omitNorms="true">
> >>>>>>   <analyzer>
> >>>>>>     <tokenizer class="solr.KeywordTokenizerFactory"/>
> >>>>>>     <filter class="solr.LowerCaseFilterFactory" />
> >>>>>>   </analyzer>
> >>>>>> </fieldType>
> >>>>>>
> >>>>>> Then I can probably kill the lowercasefilter on edgytext:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Mar 24, 2020 at 7:44 AM Erick Erickson
> >>>>>> <erickerick...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Sort by the full field. You’ll need to copy to a field with
> >>>>>>> keywordTokenizer and lowercaseFilter (string_ci? assuming it’s not
> >>>>>>> really a “string”) type.
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Erick
> >>>>>>>
> >>>>>>>> On Mar 24, 2020, at 7:10 AM, matthew sporleder
> >>>>>>>> <msporle...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> I have added an edge ngram field to my index and get decent results
> >>>>>>>> with partial words but the results appear randomly sorted and all
> >>>>>>>> contain the same score. Ideally I would like to sort by shortest
> >>>>>>>> ngram match within my other qualifiers.
> >>>>>>>>
> >>>>>>>> Is there a canonical solution to this?
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Matt
> >>>>>>>>
> >>>>>>>> p.s. I mostly followed
> >>>>>>>> https://lucidworks.com/post/auto-suggest-from-popular-queries-using-edgengrams/
> >>>>>>>>
> >>>>>>>> schema bits:
> >>>>>>>>
> >>>>>>>> <fieldType name="edgytext" class="solr.TextField"
> >>>>>>>>     positionIncrementGap="100">
> >>>>>>>>   <analyzer type="index">
> >>>>>>>>     <tokenizer class="solr.KeywordTokenizerFactory"/>
> >>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>>     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
> >>>>>>>>         maxGramSize="25" />
> >>>>>>>>   </analyzer>
> >>>>>>>>
> >>>>>>>> <field name="slug" type="string_ci" indexed="true" stored="true"
> >>>>>>>>     multiValued="false" />
> >>>>>>>>
> >>>>>>>> <field name="fayt" type="edgytext" indexed="true" stored="false"
> >>>>>>>>     omitNorms="false" omitTermFreqAndPositions="true" multiValued="true"
> >>>>>>>>     />
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> <copyField source="slug" dest="fayt" maxChars="65" />
> >>>>>>>
> >>>>>
> >>
>