-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://git.reviewboard.kde.org/r/102577/
-----------------------------------------------------------

Review request for Calligra.


Summary
-------

A huge patch: but also a huge speedup?

The last week, I have been busy with porting the calligra filters from using 
one type of parser to another. Both of these parsers are generated from the 
same XML description of large parts of the Microsoft Office binary file format, 
notably drawings and PowerPoint slides.

Both parsers create a hierarchical object tree that is easy to use in the 
filter code. The big difference is in the implementation of this tree. The tree 
contains objects with optional members, arrays and members have a variable 
type, I call these choices. The optional members, arrays and choices all 
require some runtime way of representing them. This is the point where the two 
parsers differ.

The parser that is currently in uses QVector, QList and QByteArray for arrays 
and QSharedPointer for the optional members and the choices. These classes 
require memory allocation and reference counting, which is a significant 
overhead. Also, the data is consumed: all information that is read is stored in 
the object tree.

The new parser takes a different approach. It does not allocate any memory on 
the heap. This means that parsing is much faster. The is however a downside: an 
array has a variable size and to store the parsed data, one must allocate 
memory. The alternative is the re-parse the data when needed, and that is what 
this parser does when an array member is accessed.

For a single pass this new parsers is 5 to 8 times as fast. When converting 
from PPT to ODP, some parts of the tree have to be accessed multiple times. I 
was hoping that the number of accesses was low enough that only a few reparses 
were needed.

Most of last week was spent changing the API from the one parser to the other. 
This was a lot of work and benchmarks could only be done at the end. And the 
results were not good: ppttoodp is now 5 times slower than before. When I first 
measured this, was hopeful that I could find a few obvious new bottlenecks.

It turns out that the main culprit is in the drawings. These contain a 
structure called OfficeArtFOPTEChoice. For a test file with 100 slides, this 
structure is parsed 10000 times in the old code and 28000 times in the new 
code. OfficeArtFOPTEChoice instances contain information about many things and 
they occur in an array. When a linewidth is needed, all these are traversed and 
checked for information about linewidth. In the old parser this meant looping 
through the array, in the new parser this means reparsing all the instances in 
the array.

This could be fixed having an object that collects all different types of 
information once, but that requires quite a bit of rewriting in the current 
code.

Now I'm wondering how the effors of the week can be best saved. The new parser 
has an API which does not expose the internals of the code. E.g.

Old:
  public:
    QSharedPointer<MenuNameAtom> menuNameAtom;
    QList<LPStd> rglpstd;
New:
  public:
    inline const MSONullable<MenuNameAtom>& menuNameAtom() const;
    inline const MSOArray<LPStd>& rglpstd() const;

The class MSONullable can be implemented in many ways, the point of the 
abstraction is to signal to the caller that this member may not be present. 
MSOArray is a simple abstraction over an iteratable and random accessable array.

However, since Calligra is a Qt project, hiding these classes behind an 
abstraction is only useful if we want the change the implementation at some 
point. Keeping two wildly different implementations under a single API and 
being able to swap it out with a compile flag is very appealing and possible, 
but we have to be realistic about the cost: it will take an additional week of 
work. And this work would need to be done pretty soon, because keeping this 
large patch/branch mergeable is quite some work too.

Keeping an abstracted interface would allow seamless replacements of a number 
of QList instances with QVector or allow keeping a thread-safe parser and using 
that when required or allow doing away with the throwing of exceptions in the 
current parser (which takes up 26% of cpu time at the moment).

Another alternative is to keep things as they are.

I will mail two callgrind calls and this patch to the mailing list.

Note: do not approve this review request: it is up hear so it's easy to browse 
the patch. The patch in it's current form slows down ppt conversion 5x.


Diffs
-----

  filters/libmso/generated/api.h PRE-CREATION 
  filters/libmso/drawstyle.cpp 1a07dff 
  filters/libmso/drawstyle.h db2d8d0 
  filters/libmso/CMakeLists.txt 3dc62f5 
  filters/libmso/ODrawToOdf.h 0016f8b 
  filters/libmso/ODrawToOdf.cpp 6bf6557 
  filters/libmso/generated/api.cpp PRE-CREATION 
  filters/libmso/generated/leinput.h PRE-CREATION 
  filters/libmso/generated/leinputstream.h 9f2e0f1 
  filters/libmso/generated/simpleParser.h e5d64c0 
  filters/libmso/generated/simpleParser.cpp 21b866fc 
  filters/libmso/pictures.h 6425a39 
  filters/libmso/pictures.cpp f8bfa7c 
  filters/libmso/shapes.cpp 81964bb 
  filters/libmso/shapes2.cpp 9f794e2 
  filters/stage/powerpoint/ParsedPresentation.h e8f8bb9 
  filters/stage/powerpoint/ParsedPresentation.cpp 1f4e2db 
  filters/stage/powerpoint/PptToOdp.h 6f56e53 
  filters/stage/powerpoint/PptToOdp.cpp d89f44e 
  filters/stage/powerpoint/globalobjectcollectors.h 5e0a9ac 
  filters/stage/powerpoint/pptstyle.h fba157b 
  filters/stage/powerpoint/pptstyle.cpp 1bcbacd 
  filters/tables/excel/import/ExcelImport.cpp 9d6d456 
  filters/tables/excel/import/ODrawClient.h 83ed45c 
  filters/tables/excel/import/ODrawClient.cpp 5a6507b 
  filters/tables/excel/import/excelimporttoods.cc 287aef7 
  filters/tables/excel/sidewinder/cell.h bc080f9 
  filters/tables/excel/sidewinder/excel.h 2be1f5a 
  filters/tables/excel/sidewinder/excel.cpp b475e5a 
  filters/tables/excel/sidewinder/objects.cpp ea60560 
  filters/tables/excel/sidewinder/sheet.h 9ff346d 
  filters/tables/excel/sidewinder/sheet.cpp 440f44b 
  filters/tables/excel/sidewinder/workbook.h 2e3c2fa 
  filters/tables/excel/sidewinder/workbook.cpp 9ad61bc 
  filters/tables/excel/sidewinder/worksheetsubstreamhandler.cpp 879f76f 
  filters/words/html-odf/document.h 1fefccc 
  filters/words/html-odf/document.cpp 83da55b 
  filters/words/msword-odf/document.h acaced9 
  filters/words/msword-odf/document.cpp f876230 
  filters/words/msword-odf/drawclient.cpp f4689fa 
  filters/words/msword-odf/graphicshandler.h 801c678 
  filters/words/msword-odf/graphicshandler.cpp 9478895 
  filters/words/msword-odf/mswordodfimport.cpp fdb8382 

Diff: http://git.reviewboard.kde.org/r/102577/diff


Testing
-------

This patch has been tested on hundreds of ppt files. None of them caused a 
crash. Correctness of the resulting odp file has been assessed by opening the 
files in calligra and libreoffice.


Thanks,

Jos

_______________________________________________
calligra-devel mailing list
calligra-devel@kde.org
https://mail.kde.org/mailman/listinfo/calligra-devel

Reply via email to