Re: Unicode collation

Paul Davis Sat, 14 Aug 2010 18:27:00 -0700

On Sat, Aug 14, 2010 at 2:01 AM, Daniel Ehrenberg <[email protected]> wrote:
> Hi,
>
> I noticed that CouchDB uses ICU for Unicode collation. Great job on
> that decision! I've been interested in Unicode for a while, so I
> looked into the implementation of this. I saw a couple things that
> confused me, though.
>
> In the Version 0.3.0 changelog, it says that locale-specific collation
> is supported, but I don't see how this works in the current
> implementation. couch_icu_driver.c initializes a case-sensitive
> collator and a case-insensitive collator both with calls to the ICU
> function as ucol_open("", &status). But from the ICU documentation, it
> looks like passing "" as the locale (the first argument) selects the
> default collation rules as specified in the UCA and DUCET. Is there
> some other way that the locale is being passed to ICU?


We don't currently support changing the collation locale away from the
system default. It's been discussed before to allow setting this at a
much finer granularity (per database, or possibly even per view), but no
one has been motivated enough to write a patch. Any effort on this front
would be appreciated.

> It looks like strings are being compared in CouchDB using the
> ucol_strcollIter call. From what I understand, this is fine if used in
> a simple binary comparison, but when comparing strings multiple times
> (as in a B-tree), it can be more efficient to pre-calculate a
> collation key using ucol_getSortKey (or, to be really fancy,
> calculating only the used part of the collation key, on-demand, with
> ucol_nextSortKeyPart, though this may be difficult to reconcile with
> an append-only file structure). Has anyone evaluated this strategy
> within CouchDB to see if it might yield better performance?
>
> Dan
>

I haven't investigated the ICU API very much at all beyond what's used
in the ICU driver. I doubt that anyone really took a thorough look
after it started working. That's actually one of the older untouched
areas of the source tree. Off the top of my head I wouldn't expect to
get *too* much of a speedup from doing fancy caching things there,
judging by the difference between raw collation and Unicode collation.
Though it could be noticeable, so don't let me keep you from trying
anything.
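
To make the trade-off concrete, here is a minimal sketch that uses the C
standard library rather than ICU: strcoll() stands in for a per-comparison
collation call, and strxfrm() stands in for ucol_getSortKey. The
keyed_string type and all function names are made up for illustration;
none of this is CouchDB code, and the ICU calls have different signatures.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record: a string plus its precomputed sort key. */
typedef struct {
    const char *text;  /* original string, not owned */
    char *sort_key;    /* strxfrm() output, owned by this struct */
} keyed_string;

/* Build the sort key once. Returns 0 on success, -1 if malloc fails. */
int keyed_string_init(keyed_string *ks, const char *text)
{
    assert(ks != NULL && text != NULL);

    /* With a size of 0, strxfrm only reports how long the key would be. */
    size_t needed = strxfrm(NULL, text, 0) + 1;

    ks->text = text;
    ks->sort_key = malloc(needed);
    if (ks->sort_key == NULL)
        return -1;

    strxfrm(ks->sort_key, text, needed);
    return 0;
}

void keyed_string_free(keyed_string *ks)
{
    free(ks->sort_key);
    ks->sort_key = NULL;
}

/* Direct comparison: the collation rules are applied on every call. */
int compare_direct(const keyed_string *a, const keyed_string *b)
{
    return strcoll(a->text, b->text);
}

/* Key comparison: a plain byte comparison of the precomputed keys,
 * which orders the strings the same way strcoll() does. */
int compare_keys(const keyed_string *a, const keyed_string *b)
{
    return strcmp(a->sort_key, b->sort_key);
}
```

Each key costs one transform plus some extra storage, so this only pays
off when the same string is compared many times, as it would be during
B-tree lookups.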

If you want to sink your teeth into either of those issues, feel
free to send any questions you have. I would be quite happy to help
with anything regarding the build system and Erlang integration.

HTH,
Paul Davis
