Re: strange sorting results: each word in field is sorted

Erik Hatcher Wed, 19 Aug 2009 13:30:31 -0700


On Aug 19, 2009, at 3:50 PM, Paul Rosen wrote:

I'm surprised you're not seeing an exception when trying to sort on title given this configuration. Sorting must be done on single-valued indexed fields that have at most a single term indexed per document. I recommend you use copyField to copy title to title_sort and configure title_sort as a "string" field, or as a field type that analyzes only to a single term (for example, keyword tokenizing -> lower case filter).
   Erik
I want to double check this (since you probably remember how long it takes to recreate the indexes). I think you're saying to add these two lines, then re-index:
<field name="title_sort" type="string" indexed="true" stored="true"/>
<copyField source="title" dest="title_sort"/>

For the simplest case, yes. You do have to be careful the sort field is not multiValued - and I believe the NINES model allowed for multiple titles. So it might be necessary for your indexing client to specify the single sort field value instead of leveraging copyField.

Now, this is case-sensitive, right? So would this make it case-insensitive?


Yes, the above would be case sensitive.

<fieldtype name="sort_string" class="solr.StrField" sortMissingLast="true">
 <analyzer>
   <filter class="solr.LowerCaseFilterFactory"/>
 </analyzer>
</fieldtype>
<field name="title_sort" type="sort_string" indexed="true" stored="true"/>
<copyField source="title" dest="title_sort"/>

That <analyzer> definition isn't quite right - you must have at least a tokenizer. The KeywordTokenizer "tokenizes" the entire string into a single token, though. In Solr's example schema there is a field type like this:

    <fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">

      <analyzer>
        <!-- KeywordTokenizer does no actual tokenizing, so the entire
             input string is preserved as a single token
          -->
        <tokenizer class="solr.KeywordTokenizerFactory"/>

        <!-- The LowerCase TokenFilter does what you expect, which can be
             when you want your sorting to be case insensitive
          -->
        <filter class="solr.LowerCaseFilterFactory" />

        <filter class="solr.TrimFilterFactory" />
        <!-- The PatternReplaceFilter gives you the flexibility to use
             Java Regular expression to replace any sequence of characters
             matching a pattern with an arbitrary replacement string,
             which may include back references to portions of the original
             string matched by the pattern.

             See the Java Regular Expression documentation for more
             information on pattern and replacement string syntax.

             http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/package-summary.html
          -->
        <filter class="solr.PatternReplaceFilterFactory"
                pattern="([^a-z])" replacement="" replace="all"
        />
      </analyzer>
    </fieldType>

Also, I'm guessing from seeing the current results that this wouldn't collate the characters with diacritical marks correctly. Is there a way to indicate that, for instance, A-grave would sort next to A?

Yes, you can incorporate a diacritic-normalizing filter into the analyzer definition above: ASCIIFoldingFilter or the ISO Latin1 accent filter.
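
As a rough sketch of what that could look like (not tested against this index), here is the alphaOnlySort analyzer with a folding filter added. The field type name "alphaOnlySortFolded" is made up for illustration, and this assumes a Solr version that ships solr.ASCIIFoldingFilterFactory; on older versions solr.ISOLatin1AccentFilterFactory would take its place. The folding filter is placed ahead of the PatternReplaceFilter because the pattern ([^a-z]) would otherwise strip accented letters instead of sorting them next to their base letter.

```xml
<!-- Hypothetical field type name; a variant of alphaOnlySort from the example schema -->
<fieldType name="alphaOnlySortFolded" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- Keep the whole field value as a single token so the field stays sortable -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Fold accented characters to their ASCII equivalents (e.g. A-grave to A) -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <!-- Runs after folding, so folded letters survive the a-z only cleanup -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^a-z])" replacement="" replace="all"/>
  </analyzer>
</fieldType>
<field name="title_sort" type="alphaOnlySortFolded" indexed="true" stored="true"/>
<copyField source="title" dest="title_sort"/>
```

As with any schema change of this kind, the documents have to be re-indexed before the new sort order takes effect.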

And, while I'm on the subject, I have to do the same thing with the Author field, but unfortunately, that is sometimes "First Last" and sometimes "Last, First". Is there any way to sort those by last name, or do I just have to encourage the index people to be more consistent?


Good luck with getting consistency in your domain!  :)

But it certainly makes sense to request that from the data providers, in at least some form that can be turned into the sortable value.

I can think of a fairly simple algorithm, but am not sure where to implement it:
- if the word "and" or "&" appears, just look at the left side of the field (in other words, sort by the first name that appears).
- if there is a comma, but it is part of ", jr." or some other common suffix like that, ignore it.
- otherwise, if there is no comma, sort by the last word, unless it is "jr", "sr", "III", etc., then sort by the word before that.
- otherwise, sort by the first word.

Probably best to implement that in the indexing client code, but simple transformations could be implemented using the PatternReplaceFilter like above.
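
To give an idea of the kind of simple transformation that fits in the schema, here is a minimal sketch covering only one of the rules: when there is no comma, move the last word to the front. The field type name "authorSort", the field name "author_sort", and the regular expression are illustrative assumptions. It does not handle "and"/"&", or suffixes such as "jr", which is part of why the full algorithm is probably better off in the indexing client.

```xml
<!-- Hypothetical field type: sorts "First Last" alongside "Last, First" -->
<fieldType name="authorSort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- Whole value as one token, then normalize case and surrounding whitespace -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <!-- Only matches values without a comma: "jane austen" becomes "austen jane".
         A value such as "austen, jane" does not match and is left unchanged. -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^([^,]+)\s+(\S+)$" replacement="$2 $1" replace="first"/>
  </analyzer>
</fieldType>
<field name="author_sort" type="authorSort" indexed="true" stored="false"/>
```

The same caveat applies as for titles: the sort field must receive a single value per document, so a multi-valued author field would still need the client to pick the value to sort on.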


        Erik
