Tim Allison created TIKA-4701:
---------------------------------

             Summary: Use unencapsulated HTML body when it exists in MSGs
                 Key: TIKA-4701
                 URL: https://issues.apache.org/jira/browse/TIKA-4701
             Project: Tika
          Issue Type: Task
            Reporter: Tim Allison


We recently added a hack to decapsulate HTML from RTF in MSG files for the 
purpose of identifying inline images.

On a set of MSG files from recent Common Crawl crawls, it is clear that HTML 
encapsulated within RTF is common. I propose improving our decapsulation code 
and using the decapsulated HTML as the body text.
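
For illustration only, here is a minimal sketch of the first step such a code 
path needs: deciding whether an RTF body wraps HTML at all. It relies on the 
\fromhtml1 control word that the MS-OXRTFEX specification uses to mark 
encapsulated HTML. The class and method names are hypothetical and are not 
Tika's existing code, and the sketch assumes the RTF bytes have already been 
decompressed from the MSG. It does not perform the decapsulation itself.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: checks an RTF header for the
// \fromhtml1 control word that marks encapsulated HTML (MS-OXRTFEX).
public class EncapsulatedHtmlCheck {

    private static final String RTF_MAGIC = "{\\rtf";
    private static final String FROM_HTML = "\\fromhtml";

    /**
     * Returns true if the first maxScan bytes of the (already decompressed)
     * RTF look like an RTF header that declares \fromhtml1.
     * Assumption: scanning a bounded prefix is enough, because the control
     * word belongs to the RTF header.
     */
    public static boolean looksLikeEncapsulatedHtml(byte[] rtf, int maxScan) {
        int len = Math.min(rtf.length, Math.max(maxScan, 0));
        // RTF control words are plain ASCII, so a lossy ASCII decode is fine here.
        String head = new String(rtf, 0, len, StandardCharsets.US_ASCII);
        if (!head.startsWith(RTF_MAGIC)) {
            return false;
        }
        int idx = head.indexOf(FROM_HTML);
        if (idx < 0) {
            return false;
        }
        int param = idx + FROM_HTML.length();
        // Require the numeric parameter to be exactly 1 (not 0, 10, ...).
        if (param >= head.length() || head.charAt(param) != '1') {
            return false;
        }
        int after = param + 1;
        return after >= head.length() || !Character.isDigit(head.charAt(after));
    }

    public static void main(String[] args) {
        byte[] sample = "{\\rtf1\\ansi\\ansicpg1252\\fromhtml1 \\deff0{\\fonttbl}}"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikeEncapsulatedHtml(sample, 1024));
    }
}
```

A check like this would only gate the decision; when it returns false, the 
existing RTF or plain-text body handling would still apply.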



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
