As one of the first tasks in the OOXML area I would like to propose to
redesign and re-implement the OOXML parser.
At the moment each application has its own OOXML import design. Those of
Impress and Calc are basically classic hand written push parser designs
while that of Writer is semi-automatically derived from the
WordprocessingML specification. For all three designs there is hardly
any documentation and their implementation is hard to understand and
hard to maintain. As a result you have to work hard to obtain a
working knowledge of the OOXML parser for one application and then,
once you have it, cannot transfer it to the other applications.
I propose a new and unified approach that will essentially replace the
current design and implementation. Using the same framework in all
applications has several advantages:
- You only have to learn how to use one well documented framework
instead of three different and badly documented XML import techniques.
- It exploits the information given by the OOXML schema to generate
automatically some of the code that has to be written by hand today.
- It allows automatic analysis of the coverage of the OOXML
specification so that we can easily see which parts have already been
implemented and which are still missing.
- It will be much more easily understandable than the current OOXML
import (especially that of Writer).
The one big downside is that the new design basically requires a
reimplementation of the OOXML import. But anyone who has seen the
current implementation might not see that as a downside at all :-)
Development and migration
I propose to do the implementation in a new module (possibly called
main/ooxml/) with the goal to eventually (i.e. in a couple of releases)
replace main/oox/ and other places that contain OOXML import code. It
will not be active by default until everyone agrees that it is release
ready. Of course, there will be switches to easily (but not
accidentally) activate it for development builds.
I also propose to focus first on Impress. Its OOXML import is less
complex than that of Writer and Calc, and the expertise that still
exists in OpenOffice for this area is probably greater than for Writer
and definitely greater than for Calc.
Development will start with implementation of the new framework that is
hinted at above and explained in more detail below. The existing
Impress import will then be migrated to the new design by copying and
adapting the code. The existing import in main/oox/ remains unchanged.
The new framework
The design of the new framework is based on exploiting the OOXML
specifications (plural because there are different versions, migration
addenda and MS Office specific extensions). A parser generator reads
the specs and creates the actual OOXML parser from them. The generated
parser will basically be a (nested) stack automaton where each state
corresponds roughly to a complex type as defined by the spec.
Transitions from one state to another correspond to start and end tags
that move from one complex type to another.
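To make the automaton idea a bit more concrete, here is a minimal
hand-written sketch of what the generated parser might boil down to.
It is only an illustration under assumptions: the class name
StackAutomaton, the State enumeration and the extra Unknown state are
made up for this example, and a real generator would derive the
transition table from the schema instead of having it filled in by hand.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical state ids; a generator would emit one per complex type.
// Unknown is an assumption of this sketch: it keeps the stack balanced
// when an element is not covered by the transition table.
enum class State { Document, CT_Slide, CT_CommonSlideData, Unknown };

class StackAutomaton
{
public:
    StackAutomaton() { maStack.push_back(State::Document); }

    // Register a transition: (current state, element name) -> next state.
    void addTransition(State eFrom, const std::string& rElement, State eTo)
    {
        maTransitions[std::make_pair(eFrom, rElement)] = eTo;
    }

    // A start tag pushes the state of the complex type that it opens.
    void startElement(const std::string& rElement)
    {
        const auto it = maTransitions.find(std::make_pair(maStack.back(), rElement));
        maStack.push_back(it == maTransitions.end() ? State::Unknown : it->second);
    }

    // An end tag pops back to the enclosing complex type.
    void endElement()
    {
        assert(maStack.size() > 1);
        maStack.pop_back();
    }

    State current() const { return maStack.back(); }
    std::size_t depth() const { return maStack.size(); }

private:
    std::map<std::pair<State, std::string>, State> maTransitions;
    std::vector<State> maStack;
};
```

The import actions would hook into startElement() and endElement(); in
this sketch they are left out so that only the state handling is visible.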
The actions that are executed on transitions, and which do the actual
import work, still have to be provided manually. With an intermediate
DSL (domain specific language) that represents the interface between
OOXML parser and developer, even this step will become easier and
more robust.
The use of an intermediate DSL also allows tweaking of the rules derived
from the OOXML specification should the need arise (e.g. to cope with
OOXML files that are not 100% conformant to the specs).
The compile-time part of the framework is to be implemented in Java to
allow an efficient and fast development process. The runtime part of
the framework, including the generated parser, will be implemented in
C++ and be an integral part of OpenOffice.
Details
At the moment we are using a bare-bones XML push parser for reading
OOXML files. That means that as the XML parser reads the stream of XML
elements it asks the OOXML import code to handle start tags, end tags,
and the text in between. It is the task of these callbacks to provide
so-called contexts for each element. These contexts can then be used to
make information like attribute values (which the parser passes only
to the start tag callbacks) accessible to the callbacks of text and end
tags.
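For readers who have not worked with push parsers, the following is a
rough sketch of the pattern described above. It is not the code in
main/oox/; the names ElementContext and ManualImportHandler are
hypothetical and the attribute list is simplified to a string map. It
only shows why contexts are needed: the end tag callback receives no
attributes, so the start tag callback has to store them somewhere.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using Attributes = std::map<std::string, std::string>;

// Hand-written context: remembers what the start tag saw so that the
// text and end tag callbacks can still use it.
struct ElementContext
{
    std::string maName;
    Attributes maAttributes;
    std::string maText;
};

class ManualImportHandler
{
public:
    // Called by the XML parser for every start tag.
    void startElement(const std::string& rName, const Attributes& rAttributes)
    {
        maContexts.push_back(ElementContext{rName, rAttributes, std::string()});
    }

    // Called for text between tags; it belongs to the innermost open element.
    void characters(const std::string& rText)
    {
        if (!maContexts.empty())
            maContexts.back().maText += rText;
    }

    // Called for every end tag. The attributes are only available here
    // because startElement() stored them in the context.
    void endElement(const std::string& rName)
    {
        assert(!maContexts.empty() && maContexts.back().maName == rName);
        maFinished.push_back(maContexts.back());
        maContexts.pop_back();
    }

    const std::vector<ElementContext>& finished() const { return maFinished; }

private:
    std::vector<ElementContext> maContexts;  // stack of open elements
    std::vector<ElementContext> maFinished;  // elements in closing order
};
```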
The creation of contexts and persistence of intermediate data is done
manually in the existing import code. The new import framework,
however, will handle both automatically, based on the OOXML
specifications, and semi-automatically, based on DSL requests. The
automatic part is extracted from the specs and is responsible for
preprocessing attribute values (e.g. conversion from string to boolean,
integer, float/double or enumerations). The semi-automatic part is
driven by developer-supplied information in DSL files and defines the
subset of attributes that are actually evaluated by the import code.
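As an illustration of the automatic part, the attribute preprocessing
could come down to small conversion helpers like the ones below. This is
a sketch under assumptions: the function names and the Alignment
enumeration with its tokens are invented for the example, and generated
code would take types and enumeration values from the schema.

```cpp
#include <cerrno>
#include <cstdlib>
#include <map>
#include <string>

// XML Schema booleans have four lexical forms: true, false, 1 and 0.
bool parseBool(const std::string& rValue, bool& rbResult)
{
    if (rValue == "true" || rValue == "1") { rbResult = true; return true; }
    if (rValue == "false" || rValue == "0") { rbResult = false; return true; }
    return false;
}

// Accepts a decimal integer; rejects empty input and trailing garbage.
bool parseInt(const std::string& rValue, long& rnResult)
{
    if (rValue.empty())
        return false;
    char* pEnd = nullptr;
    errno = 0;
    const long nValue = std::strtol(rValue.c_str(), &pEnd, 10);
    if (errno != 0 || *pEnd != '\0')
        return false;
    rnResult = nValue;
    return true;
}

// Note: strtod depends on the C locale; real code would need a
// locale-independent conversion.
bool parseDouble(const std::string& rValue, double& rfResult)
{
    if (rValue.empty())
        return false;
    char* pEnd = nullptr;
    errno = 0;
    const double fValue = std::strtod(rValue.c_str(), &pEnd);
    if (errno != 0 || *pEnd != '\0')
        return false;
    rfResult = fValue;
    return true;
}

// Hypothetical enumeration with illustrative tokens.
enum class Alignment { Left, Center, Right };

bool parseAlignment(const std::string& rValue, Alignment& reResult)
{
    static const std::map<std::string, Alignment> aTokens = {
        {"l", Alignment::Left}, {"ctr", Alignment::Center}, {"r", Alignment::Right}};
    const auto it = aTokens.find(rValue);
    if (it == aTokens.end())
        return false;
    reResult = it->second;
    return true;
}
```

Each helper reports failure instead of throwing, so the caller can
decide how to treat files that do not conform to the specs.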
An example of a DSL file snippet could look like this:
    DefineContext(p:CT_Slide, p_CT_Slide_context, attribute bool show,
        attribute bool showMasterSp, int nSlideCounter);

    ProcessTypeStart(p:CT_Slide, p_CT_Slide_context aContext)
    {
        // C++ code to import a single slide
        if (aContext.show)
            <do-something>
        ++aContext.nSlideCounter;
    }

    ProcessTypeEnd(p:CT_Slide, p_CT_Slide_context aContext)
    {
        cout << aContext.nSlideCounter << endl;
    }
It centers on the CT_Slide complex type that is started by the top level
'sld' element in namespace
http://schemas.openxmlformats.org/presentationml/2006/main which is
typically abbreviated as 'p'. It defines a context class
p_CT_Slide_context that contains two attributes, show and showMasterSp,
and an additional variable nSlideCounter. The attributes are filled
automatically with values when the 'sld' start tag is seen. Two code
snippets are defined to handle the 'sld' start and end tags. Both are
provided with an object of class p_CT_Slide_context and can read and
write its values.
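To give an idea of what the framework might generate from such a DSL
snippet, here is a hand-written approximation in plain C++. Nothing of
this is fixed: the helper readBool, the function names and the use of a
string map for attributes are assumptions made for the sketch, and the
default value true for both attributes is my reading of the schema that
should be checked against the spec.

```cpp
#include <iostream>
#include <map>
#include <string>

using Attributes = std::map<std::string, std::string>;

// What DefineContext(...) could expand to.
struct p_CT_Slide_context
{
    bool show = true;          // assumed schema default
    bool showMasterSp = true;  // assumed schema default
    int nSlideCounter = 0;     // developer-supplied variable
};

// Hypothetical helper: returns the default when the attribute is absent.
static bool readBool(const Attributes& rAttributes, const std::string& rName,
                     bool bDefault)
{
    const auto it = rAttributes.find(rName);
    if (it == rAttributes.end())
        return bDefault;
    return it->second == "1" || it->second == "true";
}

// Generated part: fill the context when the 'sld' start tag is seen.
p_CT_Slide_context createSlideContext(const Attributes& rAttributes)
{
    p_CT_Slide_context aContext;
    aContext.show = readBool(rAttributes, "show", aContext.show);
    aContext.showMasterSp =
        readBool(rAttributes, "showMasterSp", aContext.showMasterSp);
    return aContext;
}

// Developer part: body of ProcessTypeStart.
void processSlideStart(p_CT_Slide_context& aContext)
{
    if (aContext.show)
    {
        // the actual import of a visible slide would go here
    }
    ++aContext.nSlideCounter;
}

// Developer part: body of ProcessTypeEnd.
void processSlideEnd(const p_CT_Slide_context& aContext)
{
    std::cout << aContext.nSlideCounter << std::endl;
}
```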
I have made several experiments regarding the reading of the
specification and generation of parsers and am confident that the
outlined approach will work. The details, like syntax of the DSL, are
not yet fixed.
This may sound like a fixed concept that just needs implementation. It
is not. Many details have yet to be figured out. Help on all levels
(design, implementation, testing, documentation) is needed and welcome.
Best regards,
Andre