Hi Richard,

Richard Kelly <[email protected]> wrote on 05/08/2009 02:12:56 AM:

> Hi everyone,
>
> Just thought I would give an update on how I've been preparing for my
> GSoC work.  I managed to get my environment set up and I've been
> building some basic XNI components to get a feel for the code.

Sounds good.

> I've also been researching the different
> options of implementing the Unicode normalization functions.
> Here are some pros/cons of the various approaches that I've thought of:
>
> ICU4J: [1]
> (This is effectively the reference implementation of Unicode
> normalization)
> Pros:
>   - Currently compiles with Java 1.3
>   - Is fully tested, including all the exceptional cases
>   - Implements 'quick check' optimizations, which allow you to process
> documents many times faster.
>   - License seems to be compatible with Xerces license.

Yes, I think it is. It's been reviewed before on the legal-discuss list [3]
and I believe there are other Apache projects (e.g. Harmony [4]) that
already bundle it.

>   - Normalization code can be built as a modular component, so you
> don't need the whole ICU4J library.
> Cons:
>   - Future versions of ICU4J are not guaranteed to compile with
> Java 1.3
>   - requires an additional license file to be added to the distribution
>   - adds a ~500kb jar file to the build
>
>
> Java Normalizer [2]
> Pros:
>   - No additional libraries needed.
>   - Functionality built into Java, so smaller file size.
>   - No license required.
> Cons:
>   - Not available until Java 1.4+
>   - Doesn't implement 'quick check' optimizations, so it's much slower.
>
>
> Build from scratch:
> Pros:
>   - Complete control of source code
>   - Can ensure that code compiles with Java 1.3
> Cons:
>   - Although the main functionality is fairly straightforward,
>     some legacy Unicode requirements and edge cases make implementing
>     the code pretty complicated.
>   - Additional code maintenance if Unicode standards change
>
>
>
> I am leaning towards the first option (ICU4J) but welcome any other
> input / comments before I decide.

+1. I think that's the best choice of the three you've presented. There's
no sense reinventing the wheel (building it from scratch) if we don't need
to, and we can't depend on [2] because it's only available in Java 6.

> In this case, since the ICU4J license needs to be attached, would it be
> okay to create a text file called "LICENSE.normalizer" to handle this
> requirement?

Yes, that's exactly how we handle the licenses for other dependencies. It
should get included in the packages produced by the build.

> Thanks,
> - Richard
>
> [1] http://site.icu-project.org/
> [2] http://java.sun.com/javase/6/docs/api/java/text/Normalizer.html
>

Thanks.

[3] http://markmail.org/thread/rkdg4u5ziusxnqat
[4] http://harmony.markmail.org/search/?q=ICU

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: [email protected]
