Control: severity -1 wishlist
Control: tags -1 +help

On Fri, Sep 20, 2013 at 04:15:29PM +0200, Sébastien Hinderer wrote:
> Hello Olly, thanks for your e-mail.
>
> > I'm not expecting absolute proof, but it'd be good to test it on a
> > selection of word documents, and compare output with and without
> > the patch.
>
> Okay, will do once the patch is ready, which as I said will not happen
> shortly because it's a lot of work.

Did you give up on the idea of working on a patch for this?

> > It might be worth trying some of the other options (if you haven't
> > already).
>
> So far I tried catdoc and maybe wordview, which were not more
> successful.
>
> > wv has a command line extractor (wvText), which in my experience handles
> > some files better than antiword (and others less well). Sadly it isn't
> > actively maintained upstream either these days (last release was just
> > under 3 years ago). ISTR antiword is faster than wvText.
> >
> > There's wv2, but that doesn't come with a command line tool - it's
> > just a library. That's also not active upstream (last release nearly 4
> > years ago).
> >
> > There's also unoconv which uses libreoffice to do the extraction - that
> > means the extraction code is actively maintained upstream, and it seems
> > to work with most files I've tried. The downside is it is rather slow
> > and memory hungry, and I've found it randomly fails sometimes. I think
> > the issues stem from trying to remote control libreoffice, which of
> > course thinks it's a GUI application rather than a command line tool
> > or library.
>
> Will give a new look to all these, thanks. I think libreoffice also
> misses the conversion.

There's also now lloconv as an alternative way to use libreoffice to
convert files.

But I checked lloconv and wvText and they produce similar output to
antiword with your example file, so it seems none of the available tools
on Linux handle such files.

> I don't know but this font-as-codepage trick
> seems not very well supported. It looks as if people are mostly unaware of
> the problem. Perhaps it's because it has been used only for exotic fonts
> such as Tibetan and Sanskrit ones.

I know it used to be used in a more limited way for Maori which has
macrons on some vowels - in the days of 8-bit character sets it was
common to instead use vowels with umlauts and a font which displayed
these as macrons.
This variant of the trick is less problematic though as most characters
match and the ones which don't are at least visually similar to the
correct character.

However font-as-codepage really doesn't seem a very common trick, so I'm
lowering the severity to wishlist. I've also tagged it "help" so it's
more discoverable as a bug looking for a patch.

Cheers,
    Olly