[
https://issues.apache.org/jira/browse/TIKA-4701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18068481#comment-18068481
]
Tim Allison commented on TIKA-4701:
-----------------------------------
We're losing alt= text in a few Italian files. Also, in two files the RTF
actually has embedded images (image+wmf) in \pict elements that are not in the
MAPI content. You can't make this up.
I propose opening a follow-on ticket to deal with the embedded content in RTF.
I ran this against ~1100 MSG files that I pulled out of Common Crawl. I'll add
these to the corpora server shortly.
> Use unencapsulated HTML body when it exists in MSGs
> ---------------------------------------------------
>
> Key: TIKA-4701
> URL: https://issues.apache.org/jira/browse/TIKA-4701
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
> Attachments: eval-reports.tar.gz
>
>
> We recently added a hack to decapsulate HTML from RTF in MSG files for the
> purpose of identifying inline images.
> On a set of MSG files from recent Common Crawl crawls, it is clear that HTML
> encapsulated within RTF is very common. I propose improving our decapsulation
> code and using the decapsulated HTML as the body text.