Hi Richard,

Richard Kelly <[email protected]> wrote on 06/23/2009 12:00:18 PM:

> Hi all,
>
> I'm finally finishing my exams this week, so I'll be able to dedicate
> more time to this project. I thought I'd give an update of where I'm
> at.
> So far, I've done this:
> - Created a character normalization component that performs Unicode
> normalization.
> - Modified XML11Configuration to handle the new features and to add
> and remove the component from the pipeline when appropriate.
> - Modified AbstractSAXParser to handle the SAX character normalization flags.
> - Created basic test files to ensure the features are working as expected.
> - Extended the character normalization component to deal with
> composing characters.
> - Updated the XML messages for character normalization errors
> - Built the ICU4J component and updated build.xml to use it.

This sounds really good. Looking forward to seeing your first patch.

> At the moment, I'm trying to map the 'relevant constructs' [1] in the
> XML specification to relevant Document Handler events. These
> constructs consist of:
> 1. The replacement text of all parsed entities
> 2. All text matching, in context, one of the following
> productions: CData, CharData, content, Name, Nmtoken.
>
> After looking through the XML specification and correlating the above
> with DocumentHandler functions [2], I've interpreted this to mean:
> - normalize the text of 'characters' events (since this event matches
> replacement text, CData, CharData and content productions)
> - normalize QNames and XMLAttributes in any events where they occur
> (this matches most Name and Nmtoken productions)
> - normalize name parameters in doctypeDecl, startGeneralEntity,
> processingInstruction, and endGeneralEntity events (additional
> structures in which Name productions occur)

Possibly more than that.
I think normalization applies to all content in the document (including
comments), with the additional requirement "that none of the relevant
constructs listed above begins (after character references are expanded)
with a composing character as defined by B Definitions for Character
Normalization".

> If anyone can think of other events in which these productions are
> used, I would be most grateful if you could point them out.
>
> Thanks for all your assistance so far, it has been a great help.
> regards,
> Richard
>
>
> [1] http://www.w3.org/TR/xml11/#sec-normalization-checking
> [2] http://xerces.apache.org/xerces2-j/javadocs/xni/org/apache/xerces/xni/XMLDocumentHandler.html

Thanks.

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: [email protected]
E-mail: [email protected]
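
For illustration only, the two-part check discussed above (text is fully normalized, and a relevant construct does not begin with a composing character) could be sketched roughly as follows. This is not the component from the patch: the class and method names are hypothetical, it uses the JDK's java.text.Normalizer rather than ICU4J so that it is self-contained, and it approximates "composing character" as any Unicode combining mark (categories Mn, Mc, Me), which is not the exact definition in Appendix B of the XML 1.1 specification.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Xerces or the patch under discussion.
final class NormalizationSketch {
    private NormalizationSketch() {}

    // True if the text is already in Unicode Normalization Form C.
    static boolean isNfc(CharSequence text) {
        return Normalizer.isNormalized(text, Normalizer.Form.NFC);
    }

    // ASSUMPTION: approximates "composing character" as a combining mark
    // (general category Mn, Mc or Me). The XML 1.1 definition is based on
    // canonical combining class and decomposition mappings instead.
    static boolean startsWithCombiningMark(CharSequence text) {
        if (text.length() == 0) {
            return false;
        }
        int first = Character.codePointAt(text, 0);
        int type = Character.getType(first);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    // Combined check for one construct, e.g. the text of a 'characters'
    // event or a Name, after character references have been expanded.
    static boolean passesCheck(CharSequence text) {
        return isNfc(text) && !startsWithCombiningMark(text);
    }
}
```

A handler could call passesCheck on each piece of text it receives and report a normalization error when it returns false. With ICU4J available, the combining-mark test could presumably be replaced by a lookup of the canonical combining class, which would be closer to the specification's definition.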
