On Monday, March 25, 2013 17:54:53 matus.u...@gmail.com wrote:
> Hi,
>
> sorry for not discussing earlier, but I did not have much free time the
> last two weeks.
>
> I think we should continue the parser type discussion in order to also
> improve the state of things in libmsooxml. What we have there is a PULL
> parser. And I identified the following problems (would be cool if Lassi
> could check those):
>
> 1. OOXML sometimes requires us to run the parser twice at one element
> in order to first collect selected information required to convert the
> content of child elements.
>
> 2. There are situations when conversion of the 1st child of the root
> element requires information from the last child of the root element.

It would be interesting to see some examples of these two issues.
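Whatever the concrete examples turn out to be, the generic way to attack
1 with a pull parser like QXmlStreamReader is to buffer the subtree of
the element and then walk the buffer twice, since a pull parser cannot
rewind. Here is a minimal sketch of what I mean; all names in it are
invented for this mail, it is not code from libmsooxml:

    // Illustration only: buffer the subtree of the current element so
    // that it can be processed in two passes.
    #include <QXmlStreamReader>
    #include <QVector>

    struct BufferedToken {
        QXmlStreamReader::TokenType type;
        QString qualifiedName;       // prefix included
        QString namespaceUri;        // tells foo:xxx and bar:xxx apart
        QString text;
        QXmlStreamAttributes attributes;
    };

    // Call with the reader positioned on the start element; reads up
    // to and including the matching end element.
    QVector<BufferedToken> bufferSubtree(QXmlStreamReader &reader)
    {
        QVector<BufferedToken> tokens;
        int depth = 1;
        while (depth > 0 && !reader.atEnd()) {
            QXmlStreamReader::TokenType type = reader.readNext();
            if (type == QXmlStreamReader::StartElement)
                ++depth;
            else if (type == QXmlStreamReader::EndElement)
                --depth;

            BufferedToken token;
            token.type = type;
            token.qualifiedName = reader.qualifiedName().toString();
            token.namespaceUri = reader.namespaceUri().toString();
            token.text = reader.text().toString();
            token.attributes = reader.attributes();
            tokens.append(token);
        }
        return tokens;
    }

The first pass then scans the buffered tokens to collect whatever
information is needed, and the second pass does the actual conversion.
Whether that is nicer than what libmsooxml does today I can't say
without seeing the code in question.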
> 3. Interpretation of OOXML elements differs based on the namespace, and
> that happens within the scope of one single filter implementation (the
> set of namespaces is not limited to WordprocessingML, DrawingML and VML
> - that would be the docx filter for example). That forces us to
> maintain a context in order to interpret attribute values properly.
> There also might be totally different child elements. It's good that
> the namespace is always checked, because that avoids creation of
> invalid ODF, but it also ignores an element in an unexpected namespace.

That foo:xxx and bar:xxx are different tags is not a property of OOXML
only. It's a property of any XML tree that uses namespaces. So yes, we
need to check the namespace for all tags.

> 4. Variations of 1, 2 and 3.
>
> It sounds like we need to adopt attributes of a SAX parser in order to
> solve point 3. And the code becomes a bit fluffy when we try to solve
> 1, 2 and 4, which is not an attribute of a PULL parser.

I don't see why this follows. As long as we make sure that we parse the
tag names, including namespaces, correctly, it shouldn't make any
difference for correctness alone which method (SAX, PULL, DOM) we use
to traverse the XML tree.

> We will also need to fight with this when doing the ODF->OOXML
> conversion. As Inge wrote, the current plan is to export text and
> simple formatting into DOCX. But I'm afraid we will hit one of the
> problems soon.
>
> I have also read comments from Jos about using XSLT to do the
> conversion. Do you think it would be easier to solve points 1, 2, 3
> and 4 that way? When I imagine the code in XSLT using XPath, it could
> be OK. But not that OK in terms of performance.

I am against using XSLT for this for several reasons:

1. It leads to unreadable code. There are some famous XSLT filters that
even those who wrote them fear to fix bugs in.

2. As far as I know, it's a one-stop solution. I don't think you can mix
XSLT and other types of data conversion. And since both ODT and
especially OOXML spread the data into many different subfiles, it
doesn't fit very well.

3. As Jos wrote (I think in the review request), XSLT has difficulties
with some constructs, especially those that you solve by sending in a
context.

To show some of the big picture of what I'm trying to do: I want to
create a so-called recursive descent parser for ODF. This type of
parser has one function per non-terminal in the grammar and normally
uses one token of look-ahead. In the XML case we can simulate this by
using an XML parser as the tokenizer and analyzing the XML tree. The
parser functions call each other recursively as the input is parsed. In
the epub and html filters in filters/words/epub/ you can see this
applied to ODT, with HTML as output.

But the odfparser library takes this one step further: instead of using
the parser functions themselves to generate the output, it allows a
"backend" to be plugged into it, where the actual output is generated.
This allows us to use the same ODF parser for all export filters. Some
filters with very simple output can even ignore most of the input by
not implementing the corresponding backend functions. A good example of
this can be seen in the ascii (actually text) export filter in
filters/words/ascii.
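To make the pattern more concrete, here is a rough sketch of the shape
of parser and backend. The class names and the two non-terminals below
are invented for this mail and do not match the actual odfparser API:

    // Illustration only; names do not match the real odfparser API.
    #include <QXmlStreamReader>

    // The backend receives parse events and produces the actual
    // output. A filter only reimplements the functions it cares about.
    class OdfBackend {
    public:
        virtual ~OdfBackend() {}
        virtual void paragraphStart(const QXmlStreamAttributes &) {}
        virtual void paragraphEnd() {}
        virtual void plainText(const QString &) {}
    };

    // One parser function per non-terminal in the grammar. The parser
    // walks the XML and calls into the backend; it never writes any
    // output itself.
    class OdfParser {
    public:
        OdfParser(QXmlStreamReader &reader, OdfBackend *backend)
            : m_reader(reader), m_backend(backend) {}

        void parseParagraph()        // non-terminal: text:p
        {
            m_backend->paragraphStart(m_reader.attributes());
            while (m_reader.readNextStartElement()) {
                // Check both namespace and local name, as noted above.
                if (m_reader.namespaceUri() == QLatin1String(
                        "urn:oasis:names:tc:opendocument:xmlns:text:1.0")
                    && m_reader.name() == QLatin1String("span"))
                    parseSpan();
                else
                    m_reader.skipCurrentElement();
            }
            m_backend->paragraphEnd();
        }

        void parseSpan()             // non-terminal: text:span
        {
            m_backend->plainText(m_reader.readElementText());
        }

    private:
        QXmlStreamReader &m_reader;
        OdfBackend *m_backend;
    };

A filter like the ascii one would then only implement plainText(),
while a filter like the epub one would also implement the start/end
functions to produce the HTML tags.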
Now, there have been discussions of how to parse the XML of ODF to
implement the tokenizer for this recursive descent parser. Jos
suggested that the DOM approach taken by the KoXmlReader is not very
efficient in a case like this, and he is right. It would be more
efficient to use QXmlStreamReader, which uses a PULL approach. It might
be even more efficient to use SAX, but my experience is that that would
lead to code that is more difficult to read.

It has also been suggested that the parser should be autogenerated from
the RelaxNG schema, and that too is right. But that's a big project,
which could perhaps be a good GSoC project. In any case, I don't see
that it would change the API to the backend, which is where the actual
file conversion will take place. So whether we will use PULL or SAX in
the long term, or whether we will stick with the KoXml DOM approach out
of laziness, the actual filter in the backend can still be written
without concern.

And I suspect that it's also correct that some constructs need to be
parsed twice: once to collect information and once when the output is
generated. This can also be seen in the EPUB export filter: since EPUB
contains several HTML files in a ZIP container, it is not clear before
the odt is parsed once which of these HTML files internal links should
point to. This can be done by using two different backends in the two
passes.

	-Inge

> br,
>
> Matus Uzak

_______________________________________________
calligra-devel mailing list
calligra-devel@kde.org
https://mail.kde.org/mailman/listinfo/calligra-devel