On Thursday, June 16, 2011 02:24:57 AM Sebastian Sauer wrote:
> > The current parser that Calligra uses, uses QSharedPointer, QList,
> > QVector and QByteArray. api.h does not use any of these.
>
> In the MSWord filter we do:
>
>     QBuffer buffer;
>     QByteArray array;
>     array.resize(stream.size());
>     unsigned long r = stream.read((unsigned char*)array.data(), stream.size());
>     buffer.setData(array);
>     LEInputStream wdstm(&buffer);
>
> where, according to massif, stream.read takes >70% of the memory during the
> doc=>odt conversion. Your note above made me wonder whether we could save
> that allocation and operate directly on the stream...
This is how the new parser (api) works, and at the same time not how it works. Let me explain.

The old approach (simpleParser) uses a stream. The stream reads data which is converted into memory structures. In the old parser there is no need to read the entire stream content into memory at once, yet this is done. One could improve on that by reading the data in small pieces, but the total memory use after converting the data into memory structures would be the same. For converting from ppt to odp, the current implementation needs all of the data in memory at once: the ppt memory structures are converted to xml, and to do this, information is collected from various places in the original data.

In the old parser, the memory usage of the parsed information has a lot of overhead. There are three types of data with overhead:
- choices: at a given position, a number of different structures may occur
- arrays: a variable number of structures may occur
- optional structures: again, a variable number may occur

To make these types possible, simpleParser uses QSharedPointer, QList and QVector. These are convenient, but costly: they require memory allocation and bookkeeping overhead, and the allocations also add fragmentation and cache misses.

In the new approach, no memory is allocated on the heap. If you parse a structure Xyz, you do:

    Xyz(array.data(), stream.size());

This copies the data into a struct on the stack, except in the cases where the size is unknown. In those cases, only the size and position of that information is retained. When such a structure is actually needed, it is parsed again. This parsing is not expensive, and most parts are only parsed a few times.

So in the new method the data stream is read into memory completely. And this is more efficient, because the original stream is kept intact. In simpleParser, the data is blown up into a scattered, dynamically allocated structure.
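To make the idea concrete, here is a minimal sketch of that parsing style. The structure and field names (Xyz, recType, recLen, payload) are hypothetical illustrations, not the actual generated Calligra code: fixed-size fields are copied into a struct on the stack, while a variable-size child is recorded only as a position and size inside the stream buffer, to be re-parsed on demand.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <stdexcept>

// Position + size of a deferred, variable-size substructure.
// No copy is made; this just points back into the stream buffer.
struct ChildRef {
    const unsigned char* data;
    std::size_t size;
};

// Hypothetical record: a 2-byte type, a 4-byte length, then a payload.
struct Xyz {
    std::uint16_t recType;  // fixed-size field, copied onto the stack
    std::uint32_t recLen;   // fixed-size field, copied onto the stack
    ChildRef payload;       // variable-size part: position + size only

    Xyz(const unsigned char* d, std::size_t n) {
        if (n < 6) throw std::runtime_error("stream too short");
        std::memcpy(&recType, d, 2);      // little-endian host assumed
        std::memcpy(&recLen, d + 2, 4);   // for this simplified sketch
        if (6 + recLen > n) throw std::runtime_error("record truncated");
        payload = ChildRef{ d + 6, recLen };  // no heap allocation, no copy
    }
};
```

Because the constructor only copies a few bytes and notes where the payload lives, parsing the same record again later is cheap, which is why re-parsing on demand is affordable.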
Note that the main mso stream typically does not contain large pictures and is usually less than a megabyte. So in summary: the memory optimization can be done by keeping the stream in memory. In cases where you only need to read a substructure, this is still possible.

As an added note, we could probably improve even more by not copying any of the data, not even onto the stack, but only keeping track of position and size pointers. Since the heap buffer holding the stream is compact and warm in the cache, such reads would be fast. This would involve a large but simple change in the interface though: instead of just reading a structure member, you would always call a read function, which means adding '()' in a lot of places in the code. The parser is not ready for that, so let's not do it yet and first see what improvement we get from boud's current work.

Cheers,
Jos

-- 
Jos van den Oever, software architect
+49 391 25 19 15 53
074 3491911
http://kogmbh.com/legal/

_______________________________________________
calligra-devel mailing list
calligra-devel@kde.org
https://mail.kde.org/mailman/listinfo/calligra-devel
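For reference, the "no copy at all" variant discussed above could look roughly like this sketch (hypothetical names, not the actual Calligra API): the object keeps only a pointer and size into the stream buffer, and every field becomes a small read function, so accessing a member means calling recType() instead of reading a recType member.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Zero-copy view over one record in the stream buffer. Nothing is
// copied at construction; each accessor decodes its field on demand.
class XyzView {
    const unsigned char* d_;
    std::size_t n_;
public:
    XyzView(const unsigned char* d, std::size_t n) : d_(d), n_(n) {}

    std::uint16_t recType() const {          // decoded on every access
        std::uint16_t v;
        std::memcpy(&v, d_, 2);              // little-endian host assumed
        return v;
    }
    std::uint32_t recLen() const {
        std::uint32_t v;
        std::memcpy(&v, d_ + 2, 4);
        return v;
    }
    const unsigned char* payload() const { return d_ + 6; }
};
```

The cost of this design is exactly the interface change described above: every member access in calling code gains a '()'.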