Markus Jelsma created TIKA-2759:
-----------------------------------
Summary: ScriptsExtractor incorrectly reports Javascript to
characters() in SAX ContentHandler
Key: TIKA-2759
URL: https://issues.apache.org/jira/browse/TIKA-2759
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.18
Reporter: Markus Jelsma
Fix For: 1.20
We get JavaScript extracted as text content, although it actually comes from a
script tag with inline base64 content. The inline code is decoded and reported
to the characters() method of our custom ContentHandler, so it ends up in the
extracted text, yet the script start tag itself never seems to be reported to
startElement(). The JavaScript is reported to characters() only after the
parser has left the head and entered the body.
The HTML file is attached.
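
To illustrate why the missing startElement() call matters, here is a minimal sketch of a handler that skips script content by tracking element boundaries. It is not our actual ContentHandler: the class name ScriptAwareHandler and its getText() method are hypothetical, and it relies only on the JDK's SAX classes. A handler of this kind can only suppress script text if the start tag is reported before characters().

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical handler: collects text, but ignores characters() calls
 * that arrive between a script start tag and its end tag.
 */
class ScriptAwareHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int scriptDepth = 0; // > 0 while inside a reported script element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isScript(localName, qName)) {
            scriptDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isScript(localName, qName) && scriptDepth > 0) {
            scriptDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text is kept only when no script element is currently open.
        if (scriptDepth == 0) {
            text.append(ch, start, length);
        }
    }

    private static boolean isScript(String localName, String qName) {
        return "script".equalsIgnoreCase(localName) || "script".equalsIgnoreCase(qName);
    }

    public String getText() {
        return text.toString();
    }
}
```

If characters() receives the decoded script without a preceding startElement() for the script tag, scriptDepth stays at 0 and the JavaScript is appended to the extracted text, which matches the behaviour described above.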