Hi Joel,

On 12/08/2008 at 5:37 PM, Joel Karlsson wrote:
> Is there any way to get Solr to sort properly on a text field containing
> international, in my case Swedish, letters?  It doesn't sort å, ä and ö
> in the proper order.

I wrote a Lucene patch that stores CollationKeys generated by a user-specified
Collator as index terms: <https://issues.apache.org/jira/browse/LUCENE-1435>.
Note that this patch depends on another Lucene patch I wrote to convert
arbitrary byte sequences into indexable String terms:
<https://issues.apache.org/jira/browse/LUCENE-1434>.  There are two versions of
the filter/analyzer in the patch: one that uses Java's built-in Collator, and
another that depends on ICU4J for collation.
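
To make the idea concrete, here is a minimal sketch using only java.text (it
is not code from either patch, and the class and method names are made up for
illustration).  It sorts a few strings by the CollationKeys of a Swedish
Collator, and compares the raw key bytes in the way an index that stores keys
as terms would have to.  It assumes the JDK in use ships Swedish collation
data; the non-ASCII letters are written as Unicode escapes to sidestep source
encoding issues.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical class name; illustration only, not part of LUCENE-1435.
class SwedishCollationKeySketch {

    // Sorts the words by the CollationKeys the given Collator generates.
    static String[] sortByCollationKey(Collator collator, String[] words) {
        CollationKey[] keys = new CollationKey[words.length];
        for (int i = 0; i < words.length; i++) {
            keys[i] = collator.getCollationKey(words[i]);
        }
        // CollationKey implements Comparable<CollationKey>.
        Arrays.sort(keys);
        String[] sorted = new String[keys.length];
        for (int i = 0; i < keys.length; i++) {
            sorted[i] = keys[i].getSourceString();
        }
        return sorted;
    }

    // Compares two keys from CollationKey.toByteArray() as unsigned bytes,
    // most significant byte first.
    static int compareKeyBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        Collator swedish = Collator.getInstance(Locale.forLanguageTag("sv-SE"));
        // o-umlaut, z, a-umlaut, a, a-ring
        String[] words = {"\u00f6", "z", "\u00e4", "a", "\u00e5"};
        System.out.println(String.join(" ", sortByCollationKey(swedish, words)));
    }
}
```

The byte comparison is the part that matters for indexing: once the keys are
stored as terms, the index only ever compares the stored bytes, never the
original strings.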

I haven't written a Solr factory to hook these in, but theoretically :) it 
would be fairly simple to do so.  That would allow you to copyField from an 
indexed-as-is field to one that has CollationKeyAnalyzer or 
ICUCollationKeyAnalyzer in its analyzer chain, and then include a sort param 
over the collation key field in your query.

Vote for the patch if you'd like to see it included in Lucene.

Caveats: 

1. Mike McCandless posted to the LUCENE-1435 issue
<https://issues.apache.org/jira/browse/LUCENE-1435?focusedCommentId=12646525#action_12646525>
that the approach taken is not ideal, and that the Lucene index should
directly handle collation.  (See his other comments on the issue for more info.)

2. CollationKeys are fragile: to remain comparable, you must ensure that the
algorithm used to generate them remains constant.  The implementation can
differ by JVM vendor and/or version, so the only safe thing to do is to fix the
JVM vendor and version.  If you change JVMs, you should re-index.

> Also, is there any way to get Solr to sort, e.g., á, à or â together with the
> "regular" a's?

Assuming you can use the approach outlined above, check out RuleBasedCollator
(Java 1.4.2:
<http://java.sun.com/j2se/1.4.2/docs/api/java/text/RuleBasedCollator.html>;
ICU4J:
<http://icu-project.org/apiref/icu4j/com/ibm/icu/text/RuleBasedCollator.html>).
It lets you write your own collation rules to handle this situation.
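
As a rough illustration of what such rules could look like with the built-in
java.text.RuleBasedCollator, the sketch below uses a deliberately tiny rule
string of my own (a real one would have to cover the whole alphabet; the class
name is hypothetical too).  In the rule syntax '<' marks a primary (letter)
difference and ';' a secondary (accent) difference, so the accented forms are
declared as accent variants of plain a.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

// Hypothetical class name; the rule string is a toy example, not a full alphabet.
class AccentFoldingRulesSketch {

    // a, then a-acute, a-grave and a-circumflex as secondary (accent)
    // variants of a, followed by b, c and z as separate primary letters.
    static final String RULES = "< a ; \u00e1 ; \u00e0 ; \u00e2 < b < c < z";

    // Builds a collator from RULES with the requested comparison strength.
    static RuleBasedCollator build(int strength) throws ParseException {
        RuleBasedCollator collator = new RuleBasedCollator(RULES);
        collator.setStrength(strength);
        return collator;
    }

    public static void main(String[] args) throws ParseException {
        // At PRIMARY strength accent differences are ignored entirely.
        RuleBasedCollator primary = build(Collator.PRIMARY);
        System.out.println(primary.compare("a", "\u00e1"));

        // At TERTIARY strength accents only break ties between otherwise
        // equal strings, so a-acute + b is compared against a + z by b vs z.
        RuleBasedCollator tertiary = build(Collator.TERTIARY);
        System.out.println(tertiary.compare("\u00e1b", "az") < 0);
    }
}
```

If you never need to tell the accented forms apart when sorting, setting the
strength to PRIMARY on an existing Collator may be enough, without custom
rules.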

Steve
