Markus Jelsma created TIKA-2759:
-----------------------------------

             Summary: ScriptsExtractor incorrectly reports Javascript to characters() in SAX ContentHandler
                 Key: TIKA-2759
                 URL: https://issues.apache.org/jira/browse/TIKA-2759
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.18
            Reporter: Markus Jelsma
             Fix For: 1.20


Tika extracts Javascript as text content. The source is a script tag with 
inline base64 data. That inline code is decoded and reported to the 
characters() method of our custom ContentHandler, so it ends up in the 
extracted text, yet the script start tag itself never seems to be reported to 
startElement(). The Javascript reaches characters() only after the parser has 
left the head and entered the body.
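
To make the symptom easier to check, here is a minimal sketch of a diagnostic 
handler. It is not the reporter's handler: the class name 
CharactersContextHandler and the trace() helper are hypothetical, and the demo 
uses the JDK's own SAX parser on a small well-formed document rather than Tika. 
It records which element is open whenever characters() is called, so script 
text that arrives without a preceding startElement() for the script tag would 
show up under the wrong parent.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical diagnostic handler: logs the enclosing element of each characters() call. */
public class CharactersContextHandler extends DefaultHandler {
    // Stack of currently open elements, innermost on top.
    private final Deque<String> openElements = new ArrayDeque<>();
    // Flat log of SAX events in the order they were received.
    private final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        openElements.push(qName);
        events.add("start:" + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (!openElements.isEmpty()) {
            openElements.pop();
        }
        events.add("end:" + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            // Record the innermost open element so misplaced text is visible.
            String parent = openElements.isEmpty() ? "(none)" : openElements.peek();
            events.add("chars@" + parent + ":" + text);
        }
    }

    public List<String> getEvents() {
        return events;
    }

    /** Parses a well-formed XHTML string with the JDK SAX parser and returns the event log. */
    public static List<String> trace(String xhtml) {
        CharactersContextHandler handler = new CharactersContextHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return handler.getEvents();
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><script>var a = 1;</script></head>"
                + "<body><p>Hello</p></body></html>";
        for (String event : trace(xhtml)) {
            System.out.println(event);
        }
    }
}
```

With a parser that behaves as expected, the script text is logged under the 
script element and after its start event. With the attached file, the same 
handler passed to Tika would presumably log the decoded script under body, 
with no start event for script.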

The HTML file is attached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
