Hi Richard,

Richard Kelly <[email protected]> wrote on 07/12/2009 05:59:36 AM:

> Hi everyone,
>
> I've made some progress on my character normalization, and I
> would like to get some feedback on my work to ensure I'm on the
> right path.

I've had an opportunity to review your code. What you have so far is
looking really good. Great work!

> I've uploaded the current state of my patches on JIRA [1].

I do have some suggestions for improvements which I'll attach to the JIRA
issue.

> CharacterNormalizer.java is the new component that does the actual work.
> CharacterNormalizer.patch is all the changes to existing files that I
> needed to make.
>
> The relevant SAX [2] and DOM [3][4] character normalization features
> do appear to be working as intended with these changes (except for the
> tasks mentioned below).  I've implemented it as an XNI component as we
> discussed and use two Xerces features to control this component and
> determine whether or not it gets added to the pipeline.
>
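
For anyone following the thread, here is a rough sketch of how the two DOM
Level 3 parameters referenced in [3] and [4] can be probed through
DOMConfiguration. It uses only the standard JAXP and DOM Level 3 APIs. The
class and method names (NormalizationParams, canSet, newDocument) are made
up for illustration and are not part of the patch. Whether either parameter
can be set to true depends on the DOM implementation in use.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.Document;

// Hypothetical helper, not part of the XERCESJ-1383 patch.
final class NormalizationParams {

    // Asks the document's DOMConfiguration whether a boolean parameter
    // can be set to the given value, without actually changing it.
    static boolean canSet(Document doc, String name, boolean value) {
        DOMConfiguration config = doc.getDomConfig();
        return config.canSetParameter(name, Boolean.valueOf(value));
    }

    // Creates an empty document using whichever JAXP implementation is configured.
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = newDocument();
        String[] names = { "check-character-normalization", "normalize-characters" };
        for (String name : names) {
            // The answer for 'true' varies by implementation; 'false' is the default.
            System.out.println(name + " can be enabled: " + canSet(doc, name, true));
        }
    }
}
```
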
> Still on my to-do list:
> - DOM Level 3 normalizeDocument() and Node.normalize() functions:
> These functions don't use the pipeline so I am planning to add code to
> directly call the component from within these functions.
> - Multiple character data stream events are not handled correctly:
> Since Unicode characters can be larger than 16 bits they may get split
> up across multiple calls to 'characters' events.  If this happens the
> character may not be normalized correctly.  In order to avoid this, I
> plan to use a buffer within my component to keep track of characters
> that overlap these events.
> - A comprehensive set of tests to check that the features work as
> described in the standards.  I've done basic testing for a number of
> cases (which it passed successfully) but obviously we would want
> something more comprehensive and also do some performance testing.
>
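
One possible shape for the buffering described above, purely as a sketch:
hold back a trailing high surrogate at the end of a 'characters' chunk and
prepend it to the next chunk, so that each piece handed to the normalizer
ends on a code point boundary. SurrogateCarry and its methods are
hypothetical names, not code from the patch. The sketch deals only with
split surrogate pairs. A base character and its combining marks arriving in
separate events would need additional handling.

```java
// Hypothetical helper, not part of the XERCESJ-1383 patch.
final class SurrogateCarry {

    // High surrogate held back from the previous chunk, or 0 if none.
    private char pending = 0;

    // Returns the carried char (if any) plus this chunk, minus a trailing
    // high surrogate, which is held back until the next call.
    String accept(char[] ch, int start, int length) {
        StringBuilder out = new StringBuilder(length + 1);
        if (pending != 0) {
            out.append(pending);
            pending = 0;
        }
        out.append(ch, start, length);
        int n = out.length();
        if (n > 0 && Character.isHighSurrogate(out.charAt(n - 1))) {
            pending = out.charAt(n - 1);
            out.setLength(n - 1);
        }
        return out.toString();
    }

    // Returns whatever is still held back, for use at the end of the content.
    String flush() {
        if (pending == 0) {
            return "";
        }
        String rest = String.valueOf(pending);
        pending = 0;
        return rest;
    }

    public static void main(String[] args) {
        // U+1D504 is encoded as the surrogate pair D835 DD04.
        char[] data = "a\uD835\uDD04b".toCharArray();
        SurrogateCarry carry = new SurrogateCarry();
        // Simulate a parser that splits the pair across two events.
        String first = carry.accept(data, 0, 2);
        String second = carry.accept(data, 2, 2);
        System.out.println(first.length() + " char(s), then " + second.length() + " char(s)");
    }
}
```
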
> If anyone would like to take a look and see if there are any obvious
> problems, that would be great.
>
> thanks,
> Richard
>
> [1] https://issues.apache.org/jira/browse/XERCESJ-1383
> [2] http://www.saxproject.org/apidoc/org/xml/sax/package-summary.html#package_description
> [3] http://www.w3.org/TR/DOM-Level-3-Core/core.html#parameter-check-character-normalization
> [4] http://www.w3.org/TR/DOM-Level-3-Core/core.html#parameter-normalize-characters

Thanks.

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: [email protected]