Mike Cantrell created TIKA-1713:
-----------------------------------
Summary: RTF parser misses text content
Key: TIKA-1713
URL: https://issues.apache.org/jira/browse/TIKA-1713
Project: Tika
Issue Type: Bug
Affects Versions: 1.10
Reporter: Mike Cantrell
We have a lot of Outlook msg files that have RTF body content. Tika is not
extracting any text from these messages. The body appears to be a mixture of
RTF and HTML.
I've extracted an example RTF body (see attachment) for use with the following
test case:
{code}
// Parse the sample RTF body and collect the extracted text as bytes
ByteArrayOutputStream bytes = new ByteArrayOutputStream()
rtfParser.parse(
    this.class.getResourceAsStream("/problems/no-text.rtf"),
    new EmbeddedContentHandler(new BodyContentHandler(bytes)),
    new Metadata(), new ParseContext()
)
// Fails when the parser produced no output at all
assertTrue("Document is missing required text", bytes.toByteArray().length > 0)
{code}
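A body that mixes RTF and HTML, as described above, may be HTML encapsulated in RTF, which Outlook marks with the \fromhtml control word in the RTF header. The following is a minimal sketch in plain Java for checking a body for that marker. The class and method names are hypothetical, it is not part of Tika, and whether the attached sample actually carries the marker is an assumption that would need to be verified.

```java
import java.nio.charset.StandardCharsets;

public class EncapsulatedHtmlCheck {

    // Control word Outlook writes when the RTF body wraps HTML.
    // Assumption: the problem files carry this marker; not confirmed here.
    private static final String FROM_HTML = "\\fromhtml";

    /** Heuristic: true if the raw RTF bytes contain the \fromhtml control word. */
    public static boolean looksLikeEncapsulatedHtml(byte[] rtf) {
        // RTF control words are plain ASCII, so an ASCII decode is enough for this scan.
        String text = new String(rtf, StandardCharsets.US_ASCII);
        return text.contains(FROM_HTML);
    }

    public static void main(String[] args) {
        byte[] encapsulated = "{\\rtf1\\ansi\\fromhtml1 {\\*\\htmltag64 <p>}Hello}"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] plain = "{\\rtf1\\ansi Hello}".getBytes(StandardCharsets.US_ASCII);

        System.out.println(looksLikeEncapsulatedHtml(encapsulated));
        System.out.println(looksLikeEncapsulatedHtml(plain));
    }
}
```

If the check returns true for a failing body, that would narrow the problem down to how the parser handles encapsulated HTML rather than RTF text in general.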
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)