[top posting for a moment]
Thank you for this initial introduction to planning better support for
OOXML. The reality is that this is necessary, and I imagine most people
involved in the project realize it. OK, just a bit more below.

On 05/19/2014 06:39 AM, Andre Fischer wrote:
> As one of the first tasks in the OOXML area I would like to propose to
> redesign and re-implement the OOXML parser.
>
> At the moment each application has its own OOXML import design. Those of
> Impress and Calc are basically classic hand written push parser designs
> while that of Writer is semi-automatically derived from the
> WordprocessingML specification. For all three designs there is hardly
> any documentation and their implementation is hard to understand and
> hard to maintain. All that means that you have to work hard to obtain a
> working knowledge of the OOXML parser for one application and then,
> once you have it, cannot transfer it to the other applications.
>
> I propose a new and unified approach that will essentially replace the
> current design and implementation. Using the same framework in all
> applications has several advantages:
>
> - You only have to learn how to use one well documented framework
> instead of three different and badly documented XML import techniques.
>
> - It exploits the information given by the OOXML schema to automatically
> generate some of the code that has to be hand-written today.
>
> - It allows automatic analysis of the coverage of the OOXML
> specification so that we can easily see which parts have already been
> implemented and which are still missing.
>
> - It will be much more easily understandable than the current OOXML
> import (especially that of Writer).
>
> The one big downside is that the new design basically requires a
> reimplementation of the OOXML import. But anyone who has seen the
> current implementation might not see that as a downside at all :-)
>
>
>
> Development and migration
>
> I propose to do the implementation in a new module (possibly called
> main/ooxml/) with the goal to eventually (i.e. in a couple of releases)
> replace main/oox/ and other places that contain OOXML import code. It
> will not be active by default until everyone agrees that it is release
> ready. Of course, there will be switches to easily (but not
> accidentally) activate it for development builds.
>
> I also propose to focus first on Impress. Its complexity regarding
> OOXML is less than that of Writer and Calc and the still existing
> expertise in this area of OpenOffice is probably larger than in Writer
> and definitely larger than in Calc.
>
> Development will start with implementation of the new framework that is
> hinted at above and explained in more detail below. Then the existing
> Impress import is migrated to the new design by copying and adapting the
> code. The existing import in main/oox/ remains unchanged.
>
>
>
> The new framework
>
> The design of the new framework is based on exploiting the OOXML
> specifications (plural because there are different versions, migration
> addendums and MS Office specific extensions). A parser generator reads
> the specs and creates the actual OOXML parser from that. The generated
> parser will basically be a (nested) stack automaton where each state
> corresponds roughly to a complex type as defined by the spec.
> Transitions from one state to another correspond to start and end tags
> that move from one complex type to another.
>
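To check my understanding of the stack automaton idea, here is a rough
C++11 sketch of how I read it. It is not Andre's design: the State enum,
the StackAutomaton class and makeSlideAutomaton() are names I made up, and
a real generated parser would derive the transition table from the schema
instead of registering it by hand.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical state ids, one per complex type (illustrative only).
enum class State { Document, CT_Slide, CT_CommonSlideData };

// Minimal stack automaton: the top of the stack is the complex type
// that is currently being read.
class StackAutomaton
{
public:
    StackAutomaton() { maStack.push_back(State::Document); }

    // Register "in state eFrom, start tag rName enters state eTo".
    void addTransition(State eFrom, const std::string& rName, State eTo)
    {
        maTransitions[std::make_pair(eFrom, rName)] = eTo;
    }

    // Start tag: push the state of the complex type the element starts.
    void startElement(const std::string& rName)
    {
        auto it = maTransitions.find(std::make_pair(maStack.back(), rName));
        if (it == maTransitions.end())
            throw std::runtime_error("unexpected element: " + rName);
        maStack.push_back(it->second);
    }

    // End tag: return to the enclosing complex type.
    void endElement()
    {
        if (maStack.size() <= 1)
            throw std::runtime_error("unbalanced end tag");
        maStack.pop_back();
    }

    State current() const { return maStack.back(); }
    std::size_t depth() const { return maStack.size(); }

private:
    std::map<std::pair<State, std::string>, State> maTransitions;
    std::vector<State> maStack;
};

// Hand-written stand-in for the table a generator might emit for p:sld.
inline StackAutomaton makeSlideAutomaton()
{
    StackAutomaton aAutomaton;
    aAutomaton.addTransition(State::Document, "p:sld", State::CT_Slide);
    aAutomaton.addTransition(State::CT_Slide, "p:cSld", State::CT_CommonSlideData);
    return aAutomaton;
}
```

Feeding it startElement("p:sld"), startElement("p:cSld") and two
endElement() calls walks Document, CT_Slide, CT_CommonSlideData and back
again. The import actions would hang off those push and pop points.
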
> The actions that are executed on transitions, and which do the actual
> import work, still have to be provided manually. With an intermediate
> DSL (domain specific language) that represents the interface between
> OOXML parser and developer, even this step will become easier and
> more robust.
>
> The use of an intermediate DSL also allows tweaking of the rules derived
> from the OOXML specification should the need arise (to e.g. cope with
> OOXML files that are not 100% conformant to the specs).
>
> The compile time part of the framework is to be implemented in Java to
> allow an efficient and fast development process.

Does this basically mean that we will need to use both Java and C++ for
future builds?

> The runtime part of
> the framework, including the generated parser will be implemented in C++
> and be an integral part of OpenOffice.
>
>
>
> Details
>
> At the moment we are using a bare bones XML push parser for reading
> OOXML files. That means that as the XML parser reads the stream of XML
> elements it asks the OOXML import code to handle start tags, end tags,
> and the text in between. It is the task of these callbacks to provide
> so called contexts for each element. These contexts can then be used to
> make information like attribute values (which the parser only provides
> to start tags) accessible to the callbacks of text and end tags.
> The creation of contexts and persistence of intermediate data is done
> manually in the existing import code. The new import framework,
> however, will create them automatically, based on the OOXML specifications,
> and semi-automatically, based on DSL requests. The automatic part is
> extracted from the specs and is responsible for preprocessing attribute
> values (e.g. conversion from string to boolean, integer, float/double or
> enumerations). The semi-automatic part is driven by developer-supplied
> information in DSL files and defines the subset of attributes that are
> really evaluated by the import code.
>
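For the attribute preprocessing step, this is roughly what I picture the
generated conversion helpers doing. It is only a sketch under my own
assumptions: toBool(), toInt(), toDirection() and the Direction enum are
invented names, and the "horz"/"vert" strings merely stand in for whatever
enumeration values the schema defines.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// xsd:boolean allows exactly four lexical forms: true, false, 1 and 0.
inline bool toBool(const std::string& rValue)
{
    if (rValue == "1" || rValue == "true")
        return true;
    if (rValue == "0" || rValue == "false")
        return false;
    throw std::invalid_argument("not a boolean: " + rValue);
}

// Integer conversion that rejects trailing garbage such as "4x".
inline int toInt(const std::string& rValue)
{
    std::size_t nUsed = 0;
    const int nResult = std::stoi(rValue, &nUsed);
    if (nUsed != rValue.size())
        throw std::invalid_argument("not an integer: " + rValue);
    return nResult;
}

// Hypothetical enumeration; a generator would emit one table per simple type.
enum class Direction { Horizontal, Vertical };

inline Direction toDirection(const std::string& rValue)
{
    static const std::map<std::string, Direction> aTable = {
        {"horz", Direction::Horizontal},
        {"vert", Direction::Vertical}};
    const auto it = aTable.find(rValue);
    if (it == aTable.end())
        throw std::invalid_argument("unknown direction: " + rValue);
    return it->second;
}
```

Throwing on bad input is just one choice. Given the remark above about
files that are not 100% conformant, falling back to the schema default may
turn out to be the friendlier behaviour.
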
> An example of a DSL file snippet could look like this:
>
> DefineContext(p:CT_Slide, p_CT_Slide_context,
>     attribute bool show,
>     attribute bool showMasterSp,
>     int nSlideCounter);
>
> ProcessTypeStart(p:CT_Slide, p_CT_Slide_context aContext)
> {
>     // C++ code to import a single slide
>     if (aContext.show)
>         <do-something>
>     ++aContext.nSlideCounter;
> }
>
> ProcessTypeEnd(p:CT_Slide, p_CT_Slide_context aContext)
> {
>     cout << aContext.nSlideCounter << endl;
> }
>
>
> It centers on the CT_Slide complex type that is started by the top level
> 'sld' element in namespace
> http://schemas.openxmlformats.org/presentationml/2006/main which is
> typically abbreviated as 'p'. It defines a context class
> p_CT_Slide_context that contains two attributes show and showMasterSp
> and an additional variable nSlideCounter. The attributes are filled
> automatically with values when the 'sld' start tag is seen. Two code
> snippets are defined to handle the 'sld' start and end tags. Both are
> provided with an object of the p_CT_Slide_context and can read and write
> its values.
>
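If I follow the description, the C++ generated from that DSL snippet could
look something like the sketch below. This is my guess, not the proposed
output: AttributeMap, readBool() and the two handler function names are
invented, I pass the context by reference so that the counter update
survives the call, and I assume both attributes default to true when they
are absent.

```cpp
#include <cassert>
#include <iostream>
#include <map>
#include <string>

// Hypothetical representation of the attributes of one start tag.
typedef std::map<std::string, std::string> AttributeMap;

// Context for p:CT_Slide; the attributes are filled when 'sld' starts.
struct p_CT_Slide_context
{
    bool show;
    bool showMasterSp;
    int nSlideCounter;

    explicit p_CT_Slide_context(const AttributeMap& rAttributes)
        : show(readBool(rAttributes, "show", true)),
          showMasterSp(readBool(rAttributes, "showMasterSp", true)),
          nSlideCounter(0)
    {
    }

private:
    // Returns the attribute as a bool, or bDefault when it is missing.
    static bool readBool(const AttributeMap& rAttributes,
                         const std::string& rName, bool bDefault)
    {
        const AttributeMap::const_iterator it = rAttributes.find(rName);
        if (it == rAttributes.end())
            return bDefault;
        return it->second == "1" || it->second == "true";
    }
};

// Body of ProcessTypeStart from the DSL snippet.
inline void ProcessTypeStart_CT_Slide(p_CT_Slide_context& rContext)
{
    if (rContext.show)
    {
        // Import of the visible slide would go here.
    }
    ++rContext.nSlideCounter;
}

// Body of ProcessTypeEnd from the DSL snippet.
inline void ProcessTypeEnd_CT_Slide(const p_CT_Slide_context& rContext)
{
    std::cout << rContext.nSlideCounter << std::endl;
}
```

One thing this makes visible: if a fresh context is created for every
'sld' start tag, nSlideCounter never gets past 1, so a counter across
slides would have to live somewhere outside the per-element context.
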
> I have made several experiments regarding the reading of the
> specification and generation of parsers and am confident that the
> outlined approach will work. The details, like syntax of the DSL, are
> not yet fixed.
>
> This may sound like a fixed concept that just needs implementation. It
> is not. Many details have yet to be figured out. Help on all levels
> (design, implementation, testing, documentation) is needed and welcome.
>
>
> Best regards,
> Andre
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
--
-------------------------------------------------------------------------
MzK
"Life is either a daring adventure, or nothing."
-- Helen Keller