Hi Joel,

On 12/08/2008 at 5:37 PM, Joel Karlsson wrote:
> Is there any way to get Solr to sort properly on a text field containing
> international, in my case swedish, letters? It doesn't sort å,ä and ö
> in the proper order.

I wrote a Lucene patch that stores CollationKeys generated by a user-specified Collator as index terms: <https://issues.apache.org/jira/browse/LUCENE-1435>. Note that this patch depends on another Lucene patch I wrote to convert arbitrary byte sequences into indexable String terms: <https://issues.apache.org/jira/browse/LUCENE-1434>.

There are two versions of the filter/analyzer in the patch: one that uses Java's built-in Collator, and another that depends on ICU4J for collation.

I haven't written a Solr factory to hook these in, but theoretically :) it would be fairly simple to do so. That would allow you to copyField from an indexed-as-is field to one that has CollationKeyAnalyzer or ICUCollationKeyAnalyzer in its analyzer chain, and then include a sort param over the collation key field in your query.

Vote for the patch if you'd like to see it included in Lucene.

Caveats:

1. Mike McCandless posted to the LUCENE-1435 issue <https://issues.apache.org/jira/browse/LUCENE-1435?focusedCommentId=12646525#action_12646525> that the approach taken is not ideal, and that the Lucene index should directly handle collation. (See his other comments on the issue for more info.)

2. CollationKeys are fragile: to remain comparable, you must ensure that the algorithm used to generate them remains constant. The implementation can differ by JVM vendor and/or version, so the only safe thing to do is to fix the JVM vendor and version. When you change the JVM, you should re-index.

> Also, is there any way to get Solr to sort, i.e, á, à or â together with the
> "regular" a's?

Assuming you can use the approach outlined above, check out RuleBasedCollator (Java 1.4.2: <http://java.sun.com/j2se/1.4.2/docs/api/java/text/RuleBasedCollator.html>; ICU4J: <http://icu-project.org/apiref/icu4j/com/ibm/icu/text/RuleBasedCollator.html>): you can write your own collation rules to handle this situation.

Steve
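
For illustration only, here is a minimal sketch of the java.text pieces mentioned above: sorting with a locale-specific Collator, sorting by precomputed CollationKeys, and treating accented letters as equal to their base letter. It does not use Lucene, Solr, or the LUCENE-1435 patch. The class and method names (CollationSketch, sortWith, sortByKeys, samePrimary) are hypothetical, and the accent case uses the built-in collator at PRIMARY strength rather than hand-written RuleBasedCollator rules, which is an assumption about what is good enough for this example.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; not part of Lucene, Solr, or the LUCENE-1435 patch.
class CollationSketch {

    // Sort a copy of the input using the JVM's built-in collator for the locale.
    static List<String> sortWith(Locale locale, List<String> words) {
        Collator collator = Collator.getInstance(locale);
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(collator); // Collator implements Comparator<Object>
        return sorted;
    }

    // Sort by precomputed CollationKeys, then map the keys back to their
    // source strings. Keys are only comparable when they come from the
    // same Collator (and, per the caveat above, the same JVM).
    static List<String> sortByKeys(Locale locale, List<String> words) {
        Collator collator = Collator.getInstance(locale);
        List<CollationKey> keys = new ArrayList<>();
        for (String word : words) {
            keys.add(collator.getCollationKey(word));
        }
        Collections.sort(keys); // CollationKey is Comparable<CollationKey>
        List<String> sorted = new ArrayList<>();
        for (CollationKey key : keys) {
            sorted.add(key.getSourceString());
        }
        return sorted;
    }

    // True if the collator sees no primary (base letter) difference
    // between the strings, i.e. accents and case are ignored.
    static boolean samePrimary(Locale locale, String a, String b) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        Locale swedish = Locale.forLanguageTag("sv-SE");
        // o-umlaut, z, a-umlaut, a, a-ring (escaped to keep the source ASCII)
        List<String> letters = Arrays.asList("\u00F6", "z", "\u00E4", "a", "\u00E5");
        System.out.println(sortWith(swedish, letters));
        System.out.println(sortByKeys(swedish, letters));

        // "abd", "a-acute bc", "abc": accents are only a secondary difference
        List<String> words = Arrays.asList("abd", "\u00E1bc", "abc");
        System.out.println(sortWith(Locale.ENGLISH, words));
        System.out.println(samePrimary(Locale.ENGLISH, "a", "\u00E1"));
    }
}
```

On a JDK that ships Swedish collation data, both Swedish sorts should place å, ä and ö after z, in that order, and the English sort should place "ábc" between "abc" and "abd". If the built-in rules do not give the order you want, a RuleBasedCollator built from your own rule string can be dropped in wherever Collator.getInstance is called here.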