[
https://issues.apache.org/jira/browse/TIKA-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mike Cantrell updated TIKA-1713:
--------------------------------
Description:
We have a lot of Outlook msg files that have RTF body content. Tika is not
finding any text within these messages. It appears to be a mixture of RTF and
HTML.
I've extracted an example RTF body (see attachment) for use with the following
test case:
{code}
ByteArrayOutputStream bytes = new ByteArrayOutputStream()
rtfParser.parse(
    this.class.getResourceAsStream("/problems/no-text.rtf"),
    new EmbeddedContentHandler(new BodyContentHandler(bytes)),
    new Metadata(), new ParseContext()
);
assertTrue("Document is missing required text", bytes.toByteArray().length > 0)
{code}
was:
We have a lot of Outlook msg files that have RTF body content. Tika is not
fixing any text within these messages. It appears to be a mixture of RTF and
HTML.
I've extracted an example RTF body (see attachment) for use with the following
test case:
{code}
ByteArrayOutputStream bytes = new ByteArrayOutputStream()
rtfParser.parse(
    this.class.getResourceAsStream("/problems/no-text.rtf"),
    new EmbeddedContentHandler(new BodyContentHandler(bytes)),
    new Metadata(), new ParseContext()
);
assertTrue("Document is missing required text", bytes.toByteArray().length > 0)
{code}
> RTF parser misses text content
> -------------------------------
>
> Key: TIKA-1713
> URL: https://issues.apache.org/jira/browse/TIKA-1713
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.10
> Reporter: Mike Cantrell
> Attachments: no-text.rtf
>
>
> We have a lot of Outlook msg files that have RTF body content. Tika is not
> finding any text within these messages. It appears to be a mixture of RTF and
> HTML.
> I've extracted an example RTF body (see attachment) for use with the
> following test case:
> {code}
> ByteArrayOutputStream bytes = new ByteArrayOutputStream()
> rtfParser.parse(
>     this.class.getResourceAsStream("/problems/no-text.rtf"),
>     new EmbeddedContentHandler(new BodyContentHandler(bytes)),
>     new Metadata(), new ParseContext()
> );
> assertTrue("Document is missing required text", bytes.toByteArray().length > 0)
> {code}
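
To help triage which msg bodies fall into the "mixture of RTF and HTML" case, here is a minimal sketch of a check that does not depend on Tika. It assumes the affected bodies are Outlook's HTML-encapsulated RTF, which as far as I know is marked by the \fromhtml control word in the RTF header (the attachment has not been verified against this). The class and method names are hypothetical and not part of Tika, and the check is a heuristic: an escaped backslash followed by the literal text "fromhtml" in the body would also match.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

/** Hypothetical triage helper; not part of Tika. */
final class RtfBodyCheck {

    // Literal "\fromhtml" that is not followed by another letter, so a
    // longer control word such as "\fromhtmlx" does not count, while
    // "\fromhtml1" (control word plus numeric parameter) does.
    private static final Pattern FROM_HTML =
            Pattern.compile("\\\\fromhtml(?![A-Za-z])");

    private RtfBodyCheck() {
    }

    /** Returns true if the RTF source contains the \fromhtml control word. */
    static boolean looksLikeEncapsulatedHtml(String rtf) {
        return FROM_HTML.matcher(rtf).find();
    }

    /** Same check on raw bytes, e.g. the contents of an extracted RTF body. */
    static boolean looksLikeEncapsulatedHtml(byte[] rtfBytes) {
        // ISO-8859-1 maps every byte to one char, so ASCII control words
        // survive the conversion regardless of the document's code page.
        return looksLikeEncapsulatedHtml(
                new String(rtfBytes, StandardCharsets.ISO_8859_1));
    }
}
```

Running this over the extracted bodies before handing them to the RTF parser would show whether every file that yields no text is in this encapsulated form, which should narrow down where the parser drops the content.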