On Sat, Aug 14, 2010 at 2:01 AM, Daniel Ehrenberg <[email protected]> wrote:
> Hi,
>
> I noticed that CouchDB uses ICU for Unicode collation. Great job on
> that decision! I've been interested in Unicode for a while, so I
> looked into the implementation of this. I saw a couple things that
> confused me, though.
>
> In the Version 0.3.0 changelog, it says that locale-specific collation
> is supported, but I don't see how this works in the current
> implementation. couch_icu_driver.c initializes a case-sensitive
> collator and a case-insensitive collator both with calls to the ICU
> function as ucol_open("", &status). But from the ICU documentation, it
> looks like passing "" as the locale (the first argument) selects the
> default collation rules as specified in the UCA and DUCET. Is there
> some other way that the locale is being passed to ICU?

We don't currently support changing the collation locale away from the
system default. It's been discussed before to allow setting this at a
much finer level of detail (per database, or possibly even per view),
but no one has been motivated enough to write a patch. Any effort on
this front would be appreciated.

> It looks like strings are being compared in CouchDB using the
> ucol_strcollIter call. From what I understand, this is fine if used in
> a simple binary comparison, but when comparing strings multiple times
> (as in a B-tree), it can be more efficient to pre-calculate a
> collation key using ucol_getSortKey (or, to be really fancy,
> calculating only the used part of the collation key, on-demand, with
> ucol_nextSortKeyPart, though this may be difficult to reconcile with
> an append-only file structure). Has anyone evaluated this strategy
> within CouchDB to see if it might yield better performance?
>
> Dan
>

I haven't investigated the ICU API much at all beyond what's used in
the ICU driver. I doubt that anyone took a thorough look after it
started working. That's actually one of the older untouched areas of
the source tree.

Off the top of my head, I wouldn't expect to get *too* much of a
speedup from doing fancy caching there, judging by the difference
between raw collation and Unicode. It could still be noticeable,
though, so don't let me keep you from trying anything.

If you want to sink your teeth into either of those issues, feel free
to send any questions you have. I'd be quite happy to help with
anything regarding the build system and Erlang integration.

HTH,
Paul Davis
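
For readers unfamiliar with the sort-key idea raised in the quoted question, here is a minimal sketch of the trade-off. It does not use ICU and is not CouchDB code: it swaps in the C standard library's strcoll/strxfrm pair, which plays roughly the same roles as ICU's pairwise comparison and ucol_getSortKey. The helper names (make_sort_key, compare_direct, compare_keys) are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: build a collation key for s under the current
 * LC_COLLATE locale. The caller frees the result. Returns NULL if the
 * allocation fails. */
static char *make_sort_key(const char *s)
{
    assert(s != NULL);
    /* With a zero-size buffer, strxfrm only reports the length needed. */
    size_t n = strxfrm(NULL, s, 0);
    char *key = malloc(n + 1);
    if (key != NULL)
        strxfrm(key, s, n + 1);
    return key;
}

/* Pairwise comparison: the collation work is redone on every call. */
static int compare_direct(const char *a, const char *b)
{
    return strcoll(a, b);
}

/* Key comparison: a plain byte comparison of two precomputed keys.
 * For keys built by make_sort_key, the sign of the result matches
 * compare_direct on the original strings. */
static int compare_keys(const char *ka, const char *kb)
{
    return strcmp(ka, kb);
}
```

A caller would compute make_sort_key once per string, keep the key next to it, and use compare_keys for every later comparison. The cost is the extra storage for the keys, and the keys go stale if the collation rules change, so whether this pays off in a B-tree is something to measure rather than assume.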
