Mike Cantrell created TIKA-1713:
-----------------------------------

             Summary: RTF parser misses text content 
                 Key: TIKA-1713
                 URL: https://issues.apache.org/jira/browse/TIKA-1713
             Project: Tika
          Issue Type: Bug
    Affects Versions: 1.10
            Reporter: Mike Cantrell


We have a lot of Outlook .msg files that have RTF body content. Tika is not 
extracting any text from these message bodies. The content appears to be a 
mixture of RTF and HTML.

I've extracted an example RTF body (see attachment) for use with the following 
test case:

{code}
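// Groovy test snippet; rtfParser is assumed to be an
// org.apache.tika.parser.rtf.RTFParser instance created by the test class.
ByteArrayOutputStream bytes = new ByteArrayOutputStream();
rtfParser.parse(
        this.class.getResourceAsStream("/problems/no-text.rtf"),
        new EmbeddedContentHandler(new BodyContentHandler(bytes)),
        new Metadata(), new ParseContext()
);
// The handler writes extracted text to the stream, so an empty stream means no text was found.
assertTrue("Document is missing required text", bytes.toByteArray().length > 0);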
{code}
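
One possible explanation for the "mixture of RTF and HTML" (not confirmed 
for the attached file) is Outlook's HTML-encapsulated-in-RTF format, which 
marks itself with a \fromhtml1 control word in the RTF header. The sketch 
below is a rough way to check a body for that marker. The class name, method 
name and the 1024-byte window are made up for illustration and are not part 
of Tika.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper, not a Tika API: guesses whether an RTF body
// is HTML encapsulated in RTF by looking for \fromhtml in the header.
public class EncapsulatedHtmlCheck {

    // Literal backslash + "fromhtml", not followed by another letter
    // (so a longer control word would not match). A numeric parameter
    // such as the "1" in \fromhtml1 is allowed.
    private static final Pattern FROM_HTML =
            Pattern.compile("\\\\fromhtml(?![a-z])");

    public static boolean looksLikeEncapsulatedHtml(byte[] rtf) {
        // Assumption: the marker appears early, so only the first
        // 1024 bytes are inspected.
        int len = Math.min(rtf.length, 1024);
        String head = new String(rtf, 0, len, StandardCharsets.US_ASCII);
        return head.startsWith("{\\rtf") && FROM_HTML.matcher(head).find();
    }
}
```

If this returns true for no-text.rtf, the missing text is probably inside 
the HTML-related destinations rather than in ordinary RTF text runs, which 
would narrow down where the parser drops it.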



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
