[
https://issues.apache.org/jira/browse/TIKA-2759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated TIKA-2759:
--------------------------------
Description:
We extract Javascript as text content while instead it is actually a script tag
with base64 inline. This inline code is decoded and reported in the
characters() method of our custom ContentHandler, and ends up as text being
extracted, but it seems the Javascript start tag itself is never reported to
startElement(). The Javascript is reported to characters() after we left the
head and entered the body.
HTML file is attached
The following script tag:
{code}
<script
src="data:text/javascript;base64,Oyh3aW5kb3cuanExODN8fGpRdWVyeSkoZnVuY3Rpb24oJCl7bmV3IEltcHJvdmVkQUpBWExvZ2luKHsNCmlkOiAxNTcsDQppc0d1ZXN0OiAxLA0Kb2F1dGg6IHsiZmFjZWJvb2siOiJodHRwczpcL1wvd3d3LmZhY2Vib29rLmNvbVwvZGlhbG9nXC9vYXV0aD9zY29wZT1lbWFpbCZyZXNwb25zZV90eXBlPWNvZGUmZGlzcGxheT1wb3B1cCZjbGllbnRfaWQ9MTcyODk0MjQzMDY1MDQ4NiZyZWRpcmVjdF91cmk9aHR0cCUzQSUyRiUyRnBldHJvbGljaW91cy5jb20lMkZpbmRleC5waHAlM0ZvcHRpb24lM0Rjb21faW1wcm92ZWRfYWpheF9sb2dpbiUyNnRhc2slM0RmYWNlYm9vayIsImdvb2dsZSI6Imh0dHBzOlwvXC9hY2NvdW50cy5nb29nbGUuY29tXC9vXC9vYXV0aDJcL2F1dGg/c2NvcGU9aHR0cHMlM0ElMkYlMkZ3d3cuZ29vZ2xlYXBpcy5jb20lMkZhdXRoJTJGdXNlcmluZm8uZW1haWwraHR0cHMlM0ElMkYlMkZ3d3cuZ29vZ2xlYXBpcy5jb20lMkZhdXRoJTJGdXNlcmluZm8ucHJvZmlsZSZyZXNwb25zZV90eXBlPWNvZGUmZGlzcGxheT1wb3B1cCZjbGllbnRfaWQ9ODQ5NDk3NjQ3ODUzLW1mOThqNGdlOGkwYzlkaTFrbG9zc2YxbmdibWI2cG12LmFwcHMuZ29vZ2xldXNlcmNvbnRlbnQuY29tJnJlZGlyZWN0X3VyaT1odHRwJTNBJTJGJTJGcGV0cm9saWNpb3VzLmNvbSUyRmluZGV4LnBocCUzRm9wdGlvbiUzRGNvbV9pbXByb3ZlZF9hamF4X2xvZ2luJTI2dGFzayUzRGdvb2dsZSJ9LA0KYmdPcGFjaXR5OiAwLjQsDQpyZXR1cm5Vcmw6ICcvaXMtdGhpcy1kdXRjaC1jbGFzc2ljLWZpbmFsbHktYXMtY29vbC1hcy1hLWJtdycsDQpib3JkZXI6IHBhcnNlSW50KCdmNWY1ZjV8KnwzfCp8YzRjNGM0fCp8Nycuc3BsaXQoJ3wqfCcpWzFdKSwNCnBhZGRpbmc6IDQsDQp1c2VBSkFYOiAwLA0Kb3BlbkV2ZW50OiAnb25jbGljaycsDQp3bmRDZW50ZXI6IDAsDQpyZWdQb3B1cDogMSwNCmR1cjogMzAwLA0KdGltZW91dDogMCwNCmJhc2U6ICcvJywNCnRoZW1lOiAncGV0cm9saWNpb3VzJywNCnNvY2lhbFByb2ZpbGU6ICcnLA0Kc29jaWFsVHlwZTogJ2J0bkljbycsDQpjc3NQYXRoOiAnL21vZHVsZXMvbW9kX2ltcHJvdmVkX2FqYXhfbG9naW4vY2FjaGUvMTU3LzNkNDE4Mzk2NDk2N2Y2ZWVlYjI5MTdhOTI2OGM2MTIxLmNzcycsDQpyZWdQYWdlOiAnam9vbWxhJywNCmNhcHRjaGE6ICcnLA0Kc2hvd0hpbnQ6IDAsDQpnZW9sb2NhdGlvbjogZmFsc2UsDQp3aW5kb3dBbmltOiAnJw0KfSl9KTs="
type="text/javascript"></script>
{code}
gets reported outside the head (in html.p) as:
{code}
;(window.jq183||jQuery)(function($){new ImprovedAJAXLogin({
id: 157,
isGuest: 1,
oauth:
{"facebook":"https:\/\/www.facebook.com\/dialog\/oauth?scope=email&response_type=code&display=popup&client_id=1728942430650486&redirect_uri=http%3A%2F%2Fpetrolicious.com%2Findex.php%3Foption%3Dcom_improved_ajax_login%26task%3Dfacebook","google":"https:\/\/accounts.google.com\/o\/oauth2\/auth?scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.profile&response_type=code&display=popup&client_id=849497647853-mf98j4ge8i0c9di1klossf1ngbmb6pmv.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Fpetrolicious.com%2Findex.php%3Foption%3Dcom_improved_ajax_login%26task%3Dgoogle"},
bgOpacity: 0.4,
returnUrl: '/is-this-dutch-classic-finally-as-cool-as-a-bmw',
border: parseInt('f5f5f5|*|3|*|c4c4c4|*|7'.split('|*|')[1]),
padding: 4,
useAJAX: 0,
openEvent: 'onclick',
wndCenter: 0,
regPopup: 1,
dur: 300,
timeout: 0,
base: '/',
theme: 'petrolicious',
socialProfile: '',
socialType: 'btnIco',
cssPath:
'/modules/mod_improved_ajax_login/cache/157/3d4183964967f6eeeb2917a9268c6121.css',
regPage: 'joomla',
captcha: '',
showHint: 0,
geolocation: false,
windowAnim: ''
})});
{code}
was:
We extract Javascript as text content while instead it is actually a script tag
with base64 inline. This inline code is decoded and reported in the
characters() method of our custom ContentHandler, and ends up as text being
extracted, but it seems the Javascript start tag itself is never reported to
startElement(). The Javascript is reported to characters() after we left the
head and entered the body.
HTML file is attached
> ScriptsExtractor incorrectly reports Javascript to characters() in SAX
> ContentHandler
> -------------------------------------------------------------------------------------
>
> Key: TIKA-2759
> URL: https://issues.apache.org/jira/browse/TIKA-2759
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.18
> Reporter: Markus Jelsma
> Priority: Major
> Fix For: 1.20
>
> Attachments: petrolicious.html
>
>
> We extract Javascript as text content while instead it is actually a script
> tag with base64 inline. This inline code is decoded and reported in the
> characters() method of our custom ContentHandler, and ends up as text being
> extracted, but it seems the Javascript start tag itself is never reported to
> startElement(). The Javascript is reported to characters() after we left the
> head and entered the body.
> HTML file is attached
> The following script tag:
> {code}
> <script
> src="data:text/javascript;base64,Oyh3aW5kb3cuanExODN8fGpRdWVyeSkoZnVuY3Rpb24oJCl7bmV3IEltcHJvdmVkQUpBWExvZ2luKHsNCmlkOiAxNTcsDQppc0d1ZXN0OiAxLA0Kb2F1dGg6IHsiZmFjZWJvb2siOiJodHRwczpcL1wvd3d3LmZhY2Vib29rLmNvbVwvZGlhbG9nXC9vYXV0aD9zY29wZT1lbWFpbCZyZXNwb25zZV90eXBlPWNvZGUmZGlzcGxheT1wb3B1cCZjbGllbnRfaWQ9MTcyODk0MjQzMDY1MDQ4NiZyZWRpcmVjdF91cmk9aHR0cCUzQSUyRiUyRnBldHJvbGljaW91cy5jb20lMkZpbmRleC5waHAlM0ZvcHRpb24lM0Rjb21faW1wcm92ZWRfYWpheF9sb2dpbiUyNnRhc2slM0RmYWNlYm9vayIsImdvb2dsZSI6Imh0dHBzOlwvXC9hY2NvdW50cy5nb29nbGUuY29tXC9vXC9vYXV0aDJcL2F1dGg/c2NvcGU9aHR0cHMlM0ElMkYlMkZ3d3cuZ29vZ2xlYXBpcy5jb20lMkZhdXRoJTJGdXNlcmluZm8uZW1haWwraHR0cHMlM0ElMkYlMkZ3d3cuZ29vZ2xlYXBpcy5jb20lMkZhdXRoJTJGdXNlcmluZm8ucHJvZmlsZSZyZXNwb25zZV90eXBlPWNvZGUmZGlzcGxheT1wb3B1cCZjbGllbnRfaWQ9ODQ5NDk3NjQ3ODUzLW1mOThqNGdlOGkwYzlkaTFrbG9zc2YxbmdibWI2cG12LmFwcHMuZ29vZ2xldXNlcmNvbnRlbnQuY29tJnJlZGlyZWN0X3VyaT1odHRwJTNBJTJGJTJGcGV0cm9saWNpb3VzLmNvbSUyRmluZGV4LnBocCUzRm9wdGlvbiUzRGNvbV9pbXByb3ZlZF9hamF4X2xvZ2luJTI2dGFzayUzRGdvb2dsZSJ9LA0KYmdPcGFjaXR5OiAwLjQsDQpyZXR1cm5Vcmw6ICcvaXMtdGhpcy1kdXRjaC1jbGFzc2ljLWZpbmFsbHktYXMtY29vbC1hcy1hLWJtdycsDQpib3JkZXI6IHBhcnNlSW50KCdmNWY1ZjV8KnwzfCp8YzRjNGM0fCp8Nycuc3BsaXQoJ3wqfCcpWzFdKSwNCnBhZGRpbmc6IDQsDQp1c2VBSkFYOiAwLA0Kb3BlbkV2ZW50OiAnb25jbGljaycsDQp3bmRDZW50ZXI6IDAsDQpyZWdQb3B1cDogMSwNCmR1cjogMzAwLA0KdGltZW91dDogMCwNCmJhc2U6ICcvJywNCnRoZW1lOiAncGV0cm9saWNpb3VzJywNCnNvY2lhbFByb2ZpbGU6ICcnLA0Kc29jaWFsVHlwZTogJ2J0bkljbycsDQpjc3NQYXRoOiAnL21vZHVsZXMvbW9kX2ltcHJvdmVkX2FqYXhfbG9naW4vY2FjaGUvMTU3LzNkNDE4Mzk2NDk2N2Y2ZWVlYjI5MTdhOTI2OGM2MTIxLmNzcycsDQpyZWdQYWdlOiAnam9vbWxhJywNCmNhcHRjaGE6ICcnLA0Kc2hvd0hpbnQ6IDAsDQpnZW9sb2NhdGlvbjogZmFsc2UsDQp3aW5kb3dBbmltOiAnJw0KfSl9KTs="
> type="text/javascript"></script>
> {code}
> gets reported outside the head (in html.p) as:
> {code}
> ;(window.jq183||jQuery)(function($){new ImprovedAJAXLogin({
> id: 157,
> isGuest: 1,
> oauth:
> {"facebook":"https:\/\/www.facebook.com\/dialog\/oauth?scope=email&response_type=code&display=popup&client_id=1728942430650486&redirect_uri=http%3A%2F%2Fpetrolicious.com%2Findex.php%3Foption%3Dcom_improved_ajax_login%26task%3Dfacebook","google":"https:\/\/accounts.google.com\/o\/oauth2\/auth?scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.profile&response_type=code&display=popup&client_id=849497647853-mf98j4ge8i0c9di1klossf1ngbmb6pmv.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Fpetrolicious.com%2Findex.php%3Foption%3Dcom_improved_ajax_login%26task%3Dgoogle"},
> bgOpacity: 0.4,
> returnUrl: '/is-this-dutch-classic-finally-as-cool-as-a-bmw',
> border: parseInt('f5f5f5|*|3|*|c4c4c4|*|7'.split('|*|')[1]),
> padding: 4,
> useAJAX: 0,
> openEvent: 'onclick',
> wndCenter: 0,
> regPopup: 1,
> dur: 300,
> timeout: 0,
> base: '/',
> theme: 'petrolicious',
> socialProfile: '',
> socialType: 'btnIco',
> cssPath:
> '/modules/mod_improved_ajax_login/cache/157/3d4183964967f6eeeb2917a9268c6121.css',
> regPage: 'joomla',
> captcha: '',
> showHint: 0,
> geolocation: false,
> windowAnim: ''
> })});
> {code}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)