On Aug 19, 2009, at 3:50 PM, Paul Rosen wrote:
I'm surprised you're not seeing an exception when trying to sort on title given this configuration. Sorting must be done on single-valued indexed fields that have at most a single term indexed per document. I recommend you use copyField to copy title to title_sort and configure title_sort as a "string" field or as a field type that analyzes to only a single term (for example, keyword tokenizing -> lower case filter).
   Erik

I want to double check this (since you probably remember how long it takes to recreate the indexes). I think you're saying to add these two lines, then re-index:

<field name="title_sort" type="string" indexed="true" stored="true"/>
<copyField source="title" dest="title_sort"/>

For the simplest case, yes. You do have to be careful the sort field is not multiValued - and I believe the NINES model allowed for multiple titles. So it might be necessary for your indexing client to specify the single sort field value instead of leveraging copyField.
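
As a sketch of that alternative, the schema would declare the sort field on its own and leave out the copyField line. This assumes the indexing client sends exactly one title_sort value per document (for example, the first or preferred title). The explicit multiValued="false" is included here only to make that intent visible:

<field name="title_sort" type="string" indexed="true" stored="true" multiValued="false"/>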

Now, this is case-sensitive, right? So would this make it case-insensitive?

Yes, the above would be case sensitive.

<fieldtype name="sort_string" class="solr.StrField" sortMissingLast="true">
 <analyzer>
   <filter class="solr.LowerCaseFilterFactory"/>
 </analyzer>
</fieldtype>
<field name="title_sort" type="sort_string" indexed="true" stored="true"/>
<copyField source="title" dest="title_sort"/>

That <analyzer> definition isn't quite right - you must have at least a tokenizer. The KeywordTokenizer "tokenizes" the entire string into a single token, though. In Solr's example schema there is a field type like this:

<fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- KeywordTokenizer does no actual tokenizing, so the entire
         input string is preserved as a single token
      -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- The LowerCase TokenFilter does what you expect, which can be
         when you want your sorting to be case insensitive
      -->
    <filter class="solr.LowerCaseFilterFactory" />
    <!-- The TrimFilter removes any leading or trailing whitespace -->
    <filter class="solr.TrimFilterFactory" />
    <!-- The PatternReplaceFilter gives you the flexibility to use
         Java Regular expression to replace any sequence of characters
         matching a pattern with an arbitrary replacement string,
         which may include back references to portions of the original
         string matched by the pattern.

         See the Java Regular Expression documentation for more
         information on pattern and replacement string syntax.

         http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/package-summary.html
      -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^a-z])" replacement="" replace="all"
    />
  </analyzer>
</fieldType>
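
Note that alphaOnlySort also strips every character outside a-z, which includes digits and spaces. If that is too aggressive for titles, a smaller variant along the lines of the sort_string attempt above might look like the following. This is only a sketch, and the type name sort_string_ci is made up for illustration:

<fieldType name="sort_string_ci" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- Keep the whole value as one token so the field stays sortable -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Lower-case so that "apple" and "Apple" sort together -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Drop leading and trailing whitespace -->
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>
<field name="title_sort" type="sort_string_ci" indexed="true" stored="true"/>
<copyField source="title" dest="title_sort"/>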

Also, I'm guessing from seeing the current results that this wouldn't collate the characters with diacritical marks correctly. Is there a way to indicate that, for instance, A-grave would sort next to A?

Yes, you can incorporate a diacritic-normalizing filter into the analyzer definition above: ASCIIFoldingFilterFactory, or the ISO Latin-1 one (ISOLatin1AccentFilterFactory).
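
One detail to watch if the folding filter is added to alphaOnlySort: the PatternReplaceFilter there deletes every character outside a-z, so an accented letter would be removed rather than folded unless the folding filter runs first. A possible arrangement is sketched below, assuming ASCIIFoldingFilterFactory is available in the Solr version in use. The type name alphaOnlySortFolded is hypothetical:

<fieldType name="alphaOnlySortFolded" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Fold accented characters to their ASCII equivalents first,
         so that A-grave becomes A before anything is stripped -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <!-- Only now remove whatever is still outside a-z -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^a-z])" replacement="" replace="all"/>
  </analyzer>
</fieldType>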

And, while I'm on the subject, I have to do the same thing with the Author field, but unfortunately, that is sometimes "First Last" and sometimes "Last, First". Is there any way to sort those by last name, or do I just have to encourage the index people to be more consistent?

Good luck with getting consistency in your domain!  :)

But it certainly makes sense to request that from the data providers, in at least some form that can be turned into the sortable value.

I can think of a fairly simple algorithm, but am not sure where to implement it:

- if the word "and" or "&" appears, just look at the left side of the field (in other words, sort by the first name that appears.)
- if there is a comma, but it is part of ", jr." or some other common suffixes like that, ignore it.
- otherwise, if there is no comma, sort by the last word, unless it is "jr", "sr", "III", etc., then sort by the word before that.
- otherwise, sort by the first word.

Probably best to implement that in the indexing client code, but simple transformations could be implemented using the PatternReplaceFilter like above.
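
As an illustration of how far PatternReplaceFilter alone might go, the chain below approximates the rules listed above by rewriting "First Last" into "last, first" and leaving "Last, First" values as they are (lower-cased). It is a sketch only: the type name author_sort, the three patterns, and the short suffix list are all assumptions and have not been tried against real author data.

<fieldType name="author_sort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <!-- Keep only the first author: cut at " and " or " & " -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="\s+(and|&amp;)\s+.*$" replacement="" replace="first"/>
    <!-- Drop a trailing suffix such as ", jr." or " iii" -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern=",?\s+(jr|sr|iii)\.?$" replacement="" replace="first"/>
    <!-- No comma left: move the last word to the front -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^([^,]+)\s+([^,\s]+)$" replacement="$2, $1" replace="first"/>
  </analyzer>
</fieldType>

With these patterns, "John Smith, Jr. and Jane Doe" and "Smith, John" would both be indexed as "smith, john". Anything more irregular than that is probably easier to handle in the indexing client.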

        Erik
