+1 to ICU. A normalizer is not very easy to build (I've built one). The 'Java Normalizer' is not available until Java 1.6.
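
For reference, here is roughly what using the Java 6 `java.text.Normalizer` API looks like. This is only a minimal sketch: the class name `NormalizerSketch` and the helper `toNfc` are made up for illustration and are not part of Xerces or any proposal for it. It checks whether the string is already in NFC before normalizing, which is the same general idea as a "check first, normalize only if needed" fast path, though I am not claiming it matches ICU4J's quick check in speed or behavior.

```java
import java.text.Normalizer;

public class NormalizerSketch {
    // Hypothetical helper: returns the input unchanged when it is already
    // in NFC, otherwise returns the NFC-normalized form.
    static String toNfc(String s) {
        if (Normalizer.isNormalized(s, Normalizer.Form.NFC)) {
            return s;
        }
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // 'e' followed by U+0301 COMBINING ACUTE ACCENT (decomposed form)
        String decomposed = "e\u0301";
        String composed = toNfc(decomposed);
        // NFC composes the pair into the single code point U+00E9
        System.out.println(composed.equals("\u00e9")); // prints "true"
    }
}
```

Since this class only exists from Java 6 onward, it would not help a build that still has to compile on older JDKs.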
On Sat, May 9, 2009 at 6:15 PM, Michael Glavassevich <[email protected]> wrote:

> Hi Richard,
>
> Richard Kelly <[email protected]> wrote on 05/08/2009 02:12:56 AM:
>
> > Hi everyone,
> >
> > Just thought I would give an update on how I've been preparing for my
> > GSoC work. I managed to get my environment set up and I've been
> > building some basic XNI components to get a feel for the code.
>
> Sounds good.
>
> > I've also been researching the different
> > options of implementing the Unicode normalization functions.
> > Here are some pros/cons of the various approaches that I've thought of:
> >
> > ICU4J: [1]
> > (This is effectively the reference implementation of unicode normalization)
> > Pros:
> > - Currently compiles with Java 1.3
> > - Is fully tested with all the exception
> > - Implements 'quick check' optimizations which allows you to pass
> >   documents many times faster.
> > - License seems to be compatible with Xerces license.
>
> Yes, I think it is. It's been reviewed before on the legal-discuss list [3]
> and I believe there are other Apache projects (e.g. Harmony [4]) that
> already bundle it.
>
> > - Normalization code can be built as a modular component, so you
> >   don't need the whole ICU4J library.
> > Cons:
> > - Future versions of ICU4J are not guaranteed to compile Java 1.3 in
> >   future versions
> > - requires an additional license file to be added to the distribution
> > - adds a ~500kb jar file to the build
> >
> >
> > Java Normalizer [2]
> > Pros:
> > - No additional libraries needed.
> > - Functionality built into java so smaller file size.
> > - No license required.
> > Cons:
> > - Not available until Java 1.4+
> > - Doesn't implement 'quick check' optimizations so its much slower.
> >
> >
> > Build from scratch:
> > Pros:
> > - Complete control of source code
> > - Can ensure that code compiles with Java 1.3
> > Cons:
> > - Although the main functionality is fairly straight-forward,
> >   some legacy Unicode requirements and edge cases make implementing the code
> >   pretty complicated.
> > - Additional code maintenance if unicode standards change
> >
> >
> > I am leaning towards the first option (ICU4J) but welcome any other
> > input / comments before I decide.
>
> +1. I think that's the best choice of the three you've presented. No sense
> reinventing the wheel (building it from scratch) if we don't need to and
> can't depend on [2] because it's only available in Java 6.
>
> > In this case, since the ICU4J license needs to be attached, would it be okay to
> > create a text file called "LICENSE.normalizer" to handle this requirement?
>
> Yes, that's exactly how we handle the licenses for other dependencies. It
> should get included in the packages produced by the build.
>
> > Thanks,
> > - Richard
> >
> > [1] http://site.icu-project.org/
> > [2] http://java.sun.com/javase/6/docs/api/java/text/Normalizer.html
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
>
> Thanks.
>
> [3] http://markmail.org/thread/rkdg4u5ziusxnqat
> [4] http://harmony.markmail.org/search/?q=ICU
>
> Michael Glavassevich
> XML Parser Development
> IBM Toronto Lab
> E-mail: [email protected]
> E-mail: [email protected]
