[ 
https://issues.apache.org/jira/browse/TIKA-4701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18068481#comment-18068481
 ] 

Tim Allison commented on TIKA-4701:
-----------------------------------

We're losing alt= text in a few Italian files. And there are two files that 
actually have embedded image+wmf in \pict elements but not in the MAPI 
content. You can't make this up.

I propose opening a follow-on ticket to deal with the embedded content in RTF.

I ran this against ~1100 msgs that I pulled out of Common Crawl. I'll add 
these to the corpora server shortly.

> Use unencapsulated HTML body when it exists in MSGs
> ---------------------------------------------------
>
>                 Key: TIKA-4701
>                 URL: https://issues.apache.org/jira/browse/TIKA-4701
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: eval-reports.tar.gz
>
>
> We recently added a hack to decapsulate html from RTF in msgs for the 
> purposes of identifying inline images.
> On a set of msgs from recent commoncrawls, it is clear that encapsulated html 
> within RTF is a major thing. I propose improving our decapsulate code and 
> using the decapsulated html as the body text.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
