On Aug 19, 2009, at 3:50 PM, Paul Rosen wrote:
I'm surprised you're not seeing an exception when trying to sort on
title given this configuration. Sorting must be done on single-valued
indexed fields that have at most a single term indexed per
document. I recommend you use copyField to copy title to
title_sort and configure title_sort as a "string" or as a
field type that analyzes to only a single term (for example, keyword
tokenizing -> lower case filter).
Erik
I want to double check this (since you probably remember how long it
takes to recreate the indexes). I think you're saying to add these
two lines, then re-index:
<field name="title_sort" type="string" indexed="true" stored="true"/>
<copyField source="title" dest="title_sort"/>
For the simplest case, yes. You do have to be careful the sort field
is not multiValued - and I believe the NINES model allowed for
multiple titles. So it might be necessary for your indexing client to
specify the single sort field value instead of leveraging copyField.
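A minimal sketch of that alternative, assuming the multiple titles stay in a multiValued title field and the indexing client picks one value to sort by. The id field, the sample values, and the choice of which title to sort on are illustrative assumptions, not part of the NINES schema. The sort field is declared single-valued and the copyField line is left out:

```xml
<!-- schema.xml: single-valued sort field, filled in by the client
     (no copyField, so only one value ever arrives) -->
<field name="title_sort" type="string" indexed="true" stored="true"
       multiValued="false"/>

<!-- update message sent by the indexing client (hypothetical values) -->
<add>
  <doc>
    <field name="id">example-1</field>
    <field name="title">A First Title</field>
    <field name="title">The Second Title</field>
    <field name="title_sort">A First Title</field>
  </doc>
</add>
```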
Now, this is case-sensitive, right? So would this make it
case-insensitive?
Yes, the above would be case sensitive.
<fieldtype name="sort_string" class="solr.StrField"
sortMissingLast="true">
<analyzer>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldtype>
<field name="title_sort" type="sort_string" indexed="true"
stored="true"/>
<copyField source="title" dest="title_sort"/>
That <analyzer> definition isn't quite right - you must have at least
a tokenizer, and the field type needs to be a solr.TextField, since
solr.StrField is not analyzed. The KeywordTokenizer "tokenizes" the
entire string into a single token, which is what sorting needs. In
Solr's example schema there is a field type like this:
<fieldType name="alphaOnlySort" class="solr.TextField"
sortMissingLast="true" omitNorms="true">
<analyzer>
<!-- KeywordTokenizer does no actual tokenizing, so the entire
input string is preserved as a single token
-->
<tokenizer class="solr.KeywordTokenizerFactory"/>
<!-- The LowerCase TokenFilter does what you expect, which can be
useful when you want your sorting to be case insensitive
-->
<filter class="solr.LowerCaseFilterFactory" />
<!-- The TrimFilter removes any leading or trailing
whitespace -->
<filter class="solr.TrimFilterFactory" />
<!-- The PatternReplaceFilter gives you the flexibility to use
Java Regular expression to replace any sequence of characters
matching a pattern with an arbitrary replacement string,
which may include back references to portions of the original
string matched by the pattern.
See the Java Regular Expression documentation for more
information on pattern and replacement string syntax.
http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/package-summary.html
-->
<filter class="solr.PatternReplaceFilterFactory"
pattern="([^a-z])" replacement="" replace="all"
/>
</analyzer>
</fieldType>
Also, I'm guessing from seeing the current results that this
wouldn't collate the characters with diacritical marks correctly. Is
there a way to indicate that, for instance, A-grave would sort next
to A?
Yes, you can incorporate a diacritic-normalizing filter into the
analyzer definition above: the ASCIIFoldingFilter or the
ISOLatin1AccentFilter.
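One way that could look, as a sketch based on the alphaOnlySort type above. The type name alphaFoldedSort is made up for illustration, and solr.ASCIIFoldingFilterFactory is assumed to be available in your Solr version (solr.ISOLatin1AccentFilterFactory is the older alternative). The folding filter is placed before the PatternReplaceFilter because that pattern deletes everything outside a-z, which would otherwise drop an accented letter rather than sort it next to its base letter.

```xml
<fieldType name="alphaFoldedSort" class="solr.TextField"
           sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- keep the whole value as one token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- fold accented characters to ASCII, e.g. À -> A -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <!-- strip whatever is still not a-z after folding -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^a-z])" replacement="" replace="all"/>
  </analyzer>
</fieldType>
```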
And, while I'm on the subject, I have to do the same thing with the
Author field, but unfortunately, that is sometimes "First Last" and
sometimes "Last, First". Is there any way to sort those by last
name, or do I just have to encourage the index people to be more
consistent?
Good luck with getting consistency in your domain! :)
But it certainly makes sense to request that from the data providers,
in at least some form that can be turned into the sortable value.
I can think of a fairly simple algorithm, but am not sure where to
implement it:
- if the word "and" or "&" appears, just look at the left side of
the field (in other words, sort by the first name that appears.)
- if there is a comma, but it is part of ", jr." or another common
suffix like that, ignore it.
- otherwise, if there is no comma, sort by the last word, unless it
is "jr", "sr", "III", etc., in which case sort by the word before that.
- otherwise, sort by the first word.
Probably best to implement that in the indexing client code, but
simple transformations could be implemented using the
PatternReplaceFilter like above.
Erik
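As a rough sketch of the PatternReplaceFilter route, the chain below covers only the simple cases from the list above: it keeps the first of several authors, drops a trailing suffix, and moves the last word to the front when there is no comma. The type name authorSort, the suffix list, and the patterns themselves are illustrative assumptions and have not been run against a real index; anything messier is probably better handled in the indexing client.

```xml
<fieldType name="authorSort" class="solr.TextField"
           sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <!-- "a and b" or "a & b": keep only the left side -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="\s+(and|&amp;)\s.*$" replacement="" replace="all"/>
    <!-- drop a trailing ", jr." style suffix -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern=",?\s+(jr|sr|ii|iii|iv)\.?$" replacement="" replace="all"/>
    <!-- no comma left: "first last" becomes "last first" -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^([^,]+)\s+([^,\s]+)$" replacement="$2 $1" replace="all"/>
    <!-- finally strip everything that is not a-z, as in alphaOnlySort -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^a-z])" replacement="" replace="all"/>
  </analyzer>
</fieldType>
```

With this chain, both "John Smith, Jr." and "Smith, John" should reduce to smithjohn, so the two name styles sort together.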