Re: [PROPOSAL] New OOXML import framework

Kay Schenk Mon, 19 May 2014 15:31:35 -0700

[top posting for a moment]

Thank you for this initial introduction to planning better support for
OOXML. The reality is this is necessary, and I would imagine most
involved in the project realize this.  OK, just a bit more below.


On 05/19/2014 06:39 AM, Andre Fischer wrote:
> As one of the first tasks in the OOXML area I would like to propose to
> redesign and re-implement the OOXML parser.
> 
> At the moment each application has its own OOXML import design. Those of
> Impress and Calc are basically classic hand written push parser designs
> while that of Writer is semi-automatically derived from the
> WordprocessingML specification.  For all three designs there is hardly
> any documentation and their implementation is hard to understand and
> hard to maintain. All that means that you have to work hard to obtain a
> working knowledge about the OOXML parser for one application and then,
> once you have it, can not transfer it to the other applications.
> 
> I propose a new and unified approach that will essentially replace the
> current design and implementation.  Using the same framework in all
> applications has several advantages:
> 
> - You only have to learn how to use one well documented framework
> instead of three different and badly documented XML import techniques.
> 
> - It exploits the information given by the OOXML schema to produce
> automatically some of the code that has to be hand written today.
> 
> - It allows automatic analysis of the coverage of the OOXML
> specification so that we can easily see which parts have already been
> implemented and which are still missing.
> 
> - It will be much more easily understandable than the current OOXML
> import (especially that of Writer).
> 
> The one big downside is that the new design requires basically a
> reimplementation of the OOXML import.  But everyone who has seen the
> current implementation might not see that as a downside at all :-)
> 
> 
> 
> Development and migration
> 
> I propose to do the implementation in a new module (possibly called
> main/ooxml/) with the goal to eventually (i.e. in a couple of releases)
> replace main/oox/ and other places that contain OOXML import code.  It
> will not be active by default until everyone agrees that it is release
> ready.  Of course, there will be switches to easily (but not
> accidentally) activate it for development builds.
> 
> I also propose to focus first on Impress.  Its complexity regarding
> OOXML is less than that of Writer and Calc and the still existing
> expertise in this area of OpenOffice is probably larger than in Writer
> and definitely larger than in Calc.
> 
> Development will start with implementation of the new framework that is
> hinted at above and explained in more detail below.  Then the existing
> Impress import is migrated to the new design by copying and adapting the
> code.  The existing import in main/oox/ remains unchanged.
> 
> 
> 
> The new framework
> 
> The design of the new framework is based on exploiting the OOXML
> specifications (plural because there are different versions, migration
> addendums and MS Office specific extensions).  A parser generator reads
> the specs and creates the actual OOXML parser from that.  The generated
> parser will basically be a (nested) stack automaton where each state
> corresponds roughly to a complex type as defined by the spec. 
> Transitions from one state to another correspond to start and end tags
> that move from one complex type to another.
> 
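A minimal sketch of the stack automaton idea described above, in C++11.
It is an illustration only, not the proposed generator output: the class
name StackAutomaton, the State enum and the three hand-written table
entries are assumptions made for this example.  A real generator would
emit the transition table from the schema.

```cpp
#include <cstddef>
#include <map>
#include <stack>
#include <string>
#include <utility>

// Each state stands for one complex type (names chosen for illustration).
enum class State { Document, CT_Slide, CT_CommonSlideData, CT_GroupShape };

class StackAutomaton {
public:
    StackAutomaton() { states_.push(State::Document); }

    // Start tag: look up (current state, element name) and push the
    // state of the complex type that the element opens.
    // Returns false if the table has no transition for this element.
    bool startElement(const std::string& name) {
        auto it = transitions_.find({states_.top(), name});
        if (it == transitions_.end())
            return false;
        states_.push(it->second);
        return true;
    }

    // End tag: pop back to the enclosing complex type.
    bool endElement() {
        if (states_.size() <= 1)
            return false;  // nothing left to close
        states_.pop();
        return true;
    }

    State current() const { return states_.top(); }
    std::size_t depth() const { return states_.size(); }

private:
    std::stack<State> states_;
    // Hand-written stand-in for a table derived from the schema.
    const std::map<std::pair<State, std::string>, State> transitions_ = {
        {{State::Document, "sld"}, State::CT_Slide},
        {{State::CT_Slide, "cSld"}, State::CT_CommonSlideData},
        {{State::CT_CommonSlideData, "spTree"}, State::CT_GroupShape},
    };
};
```

Feeding the start tags "sld" and "cSld" leaves the automaton in the
CT_CommonSlideData state; an element without a table entry is rejected
without changing the stack.
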
> The actions that are executed on transitions and which do the actual
> import work, still have to be provided manually.  With an intermediate
> DSL (domain specific language) that represents the interface between
> OOXML parser and developer, even this step will become easier and
> more robust.
> 
> The use of an intermediate DSL also allows tweaking of the rules derived
> from the OOXML specification should the need arise (to e.g. cope with
> OOXML files that are not 100% conformant to the specs).
> 
> The compile time part of the framework is to be implemented in Java to
> allow an efficient and fast development process. 

Does this basically mean that we will need to use both Java and C++ for
future builds?

> The runtime part of
> the framework, including the generated parser will be implemented in C++
> and be an integral part of OpenOffice.
> 
> 
> 
> Details
> 
> At the moment we are using a bare bones XML push parser for reading
> OOXML files.  That means that as the XML parser reads the stream of XML
> elements it asks the OOXML import code to handle start tags, end tags,
> and the text in between.  It is the task of these callbacks to provide
> so called contexts for each element. These contexts can then be used to
> make information like attribute values (which the parser only provides
> to start tags) accessible to the callbacks of text and end tags.
> 
> The creation of contexts and persistence of intermediate data is done
> manually in the existing import code.  The new import framework,
> however, will create it automatically, based on the OOXML specifications
> and semi-automatically based on DSL requests.  The automatic part is
> extracted from the specs and responsible for preprocessing attribute
> values (e.g. conversion from string to boolean, integer, float/double or
> enumerations). The semi-automatic part is driven by developer supplied
> information in DSL files and defines the subset of attributes that are
> really evaluated by the import code.
> 
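The attribute preprocessing mentioned above could look roughly like the
following C++11 helpers.  This is a sketch under assumptions: the
function names parseBool, parseInt and parseDirection and the Direction
enumeration are made up for this example and are not part of any
existing OpenOffice API.

```cpp
#include <cerrno>
#include <cstdlib>
#include <string>

// Accepts the xsd:boolean lexical forms "true", "false", "1" and "0".
// Returns false and leaves 'out' untouched for anything else.
bool parseBool(const std::string& text, bool& out) {
    if (text == "true" || text == "1") { out = true; return true; }
    if (text == "false" || text == "0") { out = false; return true; }
    return false;
}

// Converts a decimal integer.  Rejects empty input, trailing
// characters and values that do not fit into a long.
bool parseInt(const std::string& text, long& out) {
    if (text.empty())
        return false;
    errno = 0;
    char* end = nullptr;
    const long value = std::strtol(text.c_str(), &end, 10);
    if (errno == ERANGE || *end != '\0')
        return false;
    out = value;
    return true;
}

// Illustrative enumeration with two string tokens.
enum class Direction { Horizontal, Vertical };

// Maps an attribute string to the enumeration value.
bool parseDirection(const std::string& text, Direction& out) {
    if (text == "horz") { out = Direction::Horizontal; return true; }
    if (text == "vert") { out = Direction::Vertical; return true; }
    return false;
}
```

Returning a success flag instead of throwing lets the caller decide how
to treat files that are not fully conformant.
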
> An example of a DSL file snippet could look like this:
> 
> DefineContext(p:CT_Slide, p_CT_Slide_context, attribute bool show,
> attribute bool showMasterSp, int nSlideCounter);
> ProcessTypeStart(p:CT_Slide, p_CT_Slide_context aContext)
> {
>     // C++ code to import a single slide
>     if (aContext.show)
>        <do-something>
>     ++aContext.nSlideCounter;
> }
> ProcessTypeEnd(p:CT_Slide, p_CT_Slide_context aContext)
> {
>     cout << aContext.nSlideCounter << endl;
> }
> 
> 
> It centers on the CT_Slide complex type that is started by the top level
> 'sld' element in namespace
> http://schemas.openxmlformats.org/presentationml/2006/main which is
> typically abbreviated as 'p'.  It defines a context class
> p_CT_Slide_context that contains two attributes show and showMasterSp
> and an additional variable nSlideCounter.  The attributes are filled
> automatically with values when the 'sld' start tag is seen.  Two code
> snippets are defined to handle the 'sld' start and end tags.  Both are
> provided with an object of the p_CT_Slide_context and can read and write
> its values.
> 
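For illustration, the context class that such a DefineContext line
might expand to could look like this in C++11.  Everything here is an
assumption: the AttributeMap typedef, the fillFromAttributes member,
the free function processTypeStart and the default value true for both
attributes are chosen for the sketch, since the DSL syntax and the
generated code are not yet fixed.

```cpp
#include <map>
#include <string>

// Stand-in for whatever the parser uses to hand over attributes.
typedef std::map<std::string, std::string> AttributeMap;

// Hypothetical expansion of
// DefineContext(p:CT_Slide, p_CT_Slide_context, ...).
struct p_CT_Slide_context {
    bool show;          // attribute of the 'sld' element
    bool showMasterSp;  // attribute of the 'sld' element
    int nSlideCounter;  // additional variable, not an attribute

    // Defaults are assumed to be true when an attribute is absent.
    p_CT_Slide_context()
        : show(true), showMasterSp(true), nSlideCounter(0) {}

    // Called when the 'sld' start tag is seen: copies the declared
    // attributes into the context and ignores all others.
    void fillFromAttributes(const AttributeMap& attrs) {
        readBool(attrs, "show", show);
        readBool(attrs, "showMasterSp", showMasterSp);
    }

private:
    static void readBool(const AttributeMap& attrs,
                         const std::string& name, bool& target) {
        AttributeMap::const_iterator it = attrs.find(name);
        if (it == attrs.end())
            return;  // attribute absent: keep the default
        target = (it->second == "true" || it->second == "1");
    }
};

// Counterpart of the ProcessTypeStart body from the DSL snippet.
void processTypeStart(p_CT_Slide_context& aContext) {
    ++aContext.nSlideCounter;
}
```

The developer-written handlers only see typed members; the string
handling stays in the generated part.
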
> I have made several experiments regarding the reading of the
> specification and generation of parsers and am confident that the
> outlined approach will work.  The details, like syntax of the DSL, are
> not yet fixed.
> 
> This may sound like a fixed concept that just needs implementation.  It
> is not.  Many details have yet to be figured out.  Help on all levels
> (design, implementation, testing, documentation) is needed and welcome.
> 
> 
> Best regards,
> Andre
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
> 

-- 
-------------------------------------------------------------------------
MzK

"Life is either a daring adventure, or nothing."
                               -- Helen Keller

