[
https://issues.apache.org/jira/browse/TIKA-2759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated TIKA-2759:
--------------------------------
Attachment: petrolicious.html
> ScriptsExtractor incorrectly reports Javascript to characters() in SAX
> ContentHandler
> -------------------------------------------------------------------------------------
>
> Key: TIKA-2759
> URL: https://issues.apache.org/jira/browse/TIKA-2759
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.18
> Reporter: Markus Jelsma
> Priority: Major
> Fix For: 1.20
>
> Attachments: petrolicious.html
>
>
> We extract JavaScript as text content, although the source is actually a
> script tag with inline base64 data. The inline code is decoded and reported
> to the characters() method of our custom ContentHandler, so it ends up in
> the extracted text, yet the script start tag itself never seems to be
> reported to startElement(). The JavaScript is reported to characters() after
> we have left the head and entered the body.
> The HTML file is attached.
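
One way to check the event order described above is a small tracing handler that records each SAX event together with the element that was open when it arrived. The sketch below is only an illustration: the class names TracingHandler and SaxEventTrace are hypothetical, and it drives the handler with the JDK's own SAX parser on a tiny well-formed document rather than with Tika and the attached file. The assumption is that the same handler could be handed to the HTML parser under test in place of the custom ContentHandler.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical diagnostic handler (not part of Tika): records every SAX event
// along with the element that was open when the event arrived.
class TracingHandler extends DefaultHandler {
    final List<String> events = new ArrayList<>();
    private final Deque<String> open = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // qName is used because the default JDK parser is not namespace-aware.
        open.push(qName);
        events.add("start:" + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end:" + qName);
        if (!open.isEmpty()) {
            open.pop();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            // Text arriving with no matching open element is the symptom to look for.
            String parent = open.isEmpty() ? "(none)" : open.peek();
            events.add("chars@" + parent + ":" + text);
        }
    }
}

class SaxEventTrace {
    // Parses a well-formed XHTML string with the JDK SAX parser and returns the trace.
    static List<String> trace(String xhtml) throws Exception {
        TracingHandler handler = new TracingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        String doc = "<html><head><script>var a = 1;</script></head>"
                + "<body><p>Hello</p></body></html>";
        for (String event : trace(doc)) {
            System.out.println(event);
        }
    }
}
```

In a well-behaved event stream, script text shows up as a chars@script entry preceded by start:script. If the behaviour reported in this issue is present, the decoded script would instead appear under a body-level element, with no start:script entry before it.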