Tim Allison created TIKA-4701:
---------------------------------
Summary: Use unencapsulated HTML body when it exists in MSGs
Key: TIKA-4701
URL: https://issues.apache.org/jira/browse/TIKA-4701
Project: Tika
Issue Type: Task
Reporter: Tim Allison
We recently added a hack to decapsulate HTML from RTF in MSG files for the
purpose of identifying inline images.
On a set of MSG files from recent Common Crawl crawls, it is clear that HTML
encapsulated within RTF is very common. I propose improving our decapsulation
code and using the decapsulated HTML as the body text.
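
For illustration only, a minimal sketch of how one might detect that an RTF
body carries encapsulated HTML before attempting to decapsulate it. It relies
on the \fromhtml1 control word that the MS-OXRTFEX format places in the RTF
header. The class name, method name, and the 1024-character scan limit are
hypothetical and are not taken from the Tika codebase.

```java
import java.util.regex.Pattern;

public class EncapsulatedHtmlCheck {

    // Matches \fromhtml1 as a complete control word, i.e. not followed by
    // another letter or digit (so \fromhtml10 would not match).
    private static final Pattern FROM_HTML =
            Pattern.compile("\\\\fromhtml1(?![0-9A-Za-z])");

    // Assumption: the control word sits near the start of the document,
    // so only a bounded prefix is scanned. The value is arbitrary.
    private static final int HEADER_SCAN_LIMIT = 1024;

    /**
     * Heuristic check: returns true if the RTF text starts like an RTF
     * document and its leading portion contains the \fromhtml1 control word.
     */
    public static boolean isEncapsulatedHtml(String rtf) {
        if (rtf == null || !rtf.startsWith("{\\rtf1")) {
            return false;
        }
        String head = rtf.substring(0, Math.min(rtf.length(), HEADER_SCAN_LIMIT));
        return FROM_HTML.matcher(head).find();
    }

    public static void main(String[] args) {
        String encapsulated = "{\\rtf1\\ansi\\ansicpg1252\\fromhtml1 \\deff0{\\fonttbl}}";
        String plainRtf = "{\\rtf1\\ansi\\deff0 Hello}";
        System.out.println(isEncapsulatedHtml(encapsulated)); // true
        System.out.println(isEncapsulatedHtml(plainRtf));     // false
    }
}
```

A check like this would only decide whether to take the decapsulation path;
when it returns false, falling back to the existing RTF body handling seems
the safer choice.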