On Mon, 2012-07-09 at 00:25 +0100, Flavio Moringa wrote:
> nice to hear from someone so "up the ranks" like you.. makes me feel
> much more important :-)
Ho hum; we try to avoid unpleasant hierarchy as much as possible.
> I probably won't be able to do a conversion engine by myself...
> but I can definitely mess around with code...
Great :-)
> Yes, it's definitely something I can do... I do believe that the
> harder part is getting that "large corpus of documents out
> there...". At least in my experience, I've found that it's hard
> to get users to send us documents they use... either due to privacy
> concerns or enterprise policies... But a tool like that makes a lot
> of sense
Oh - so; getting the documents is not -that- hard; Google has a
document-type search that can be automated; just search for:
filetype:docx
And start scraping; as well as 7 million files, we get to take
advantage of Google's popularity ranking to get the most popular first
100 or whatever :-)
> For now then I'll start doing as you suggest and look in bugzilla for
> documents with conversion problems to try and compile as many examples
> as I can. Then maybe use the latest beta to do the conversion and
> see which problems are still there. Then maybe start a perl script
> that can scrape the OOXML files to find the most used tags... and start
> from there...
We also have tools for dumping all the documents out of bugzilla - see
the main 'core' repository:
bin/get-bugzilla-attachments-by-mimetype
so really the fun piece is writing the parser & element / attribute-value
database to analyse which pieces are popular, and providing a pretty UI
or command-line for hackers to grok that.
It'd be just great to have that data in hand.
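As a rough starting point, the counting step could be sketched as below. This is only a minimal sketch in Python, not an existing LibreOffice tool: the function name count_ooxml_names is made up for illustration, and it assumes each input file is a zip-based OOXML package (docx, xlsx, pptx) whose parts ending in .xml or .rels are plain XML. Names come out in ElementTree's "{namespace}local" form.

```python
import sys
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter


def count_ooxml_names(paths):
    """Count element and attribute names across the XML parts of OOXML files.

    Hypothetical helper for illustration only.
    """
    elements = Counter()
    attributes = Counter()
    for path in paths:
        try:
            with zipfile.ZipFile(path) as archive:
                for name in archive.namelist():
                    # Only look at the XML parts of the package.
                    if not name.endswith((".xml", ".rels")):
                        continue
                    try:
                        root = ET.fromstring(archive.read(name))
                    except ET.ParseError:
                        continue  # skip malformed parts rather than abort
                    for node in root.iter():
                        elements[node.tag] += 1
                        # Key attributes by (element, attribute) pair.
                        for attr in node.attrib:
                            attributes[(node.tag, attr)] += 1
        except (zipfile.BadZipFile, OSError):
            continue  # missing file or not a readable zip package
    return elements, attributes


if __name__ == "__main__":
    elems, attrs = count_ooxml_names(sys.argv[1:])
    for tag, n in elems.most_common(20):
        print(n, tag)
```

Run over a directory of scraped or bugzilla-dumped files, the two counters are the raw data a database or UI could then sit on top of; counting attribute values as well would be a natural next step.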
Thanks !
Michael.
--
[email protected] <><, Pseudo Engineer, itinerant idiot
_______________________________________________
List Name: Libreoffice-qa mailing list