Gerard Bouchar created TIKA-2700:
------------------------------------
Summary: The HTML parser should parse the contents of the title
tag as raw text, not HTML
Key: TIKA-2700
URL: https://issues.apache.org/jira/browse/TIKA-2700
Project: Tika
Issue Type: Bug
Reporter: Gerard Bouchar
Attachments: title.html
The current HTML parser in Tika fails to extract the correct document title
when the title contains at least one unescaped '<' character: the extracted
title is cut off at that character.
For instance, in the following HTML document:
{code:html}
<html><title>title with a <b>tag</b> in it</title><body></body></html>
{code}
the extracted title is
{code}
title with a
{code}
Browsers, however, follow the [HTML parsing
specification|https://www.w3.org/TR/2011/WD-html5-20110113/tokenization.html#parsing-main-inhead],
which treats the contents of the title element as raw text, and display this title as
{code}
title with a <b>tag</b> in it
{code}
(with a literal _<b>_)
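
To make the expected behaviour concrete, here is a minimal sketch of raw-text title extraction. It is not Tika's code and does not use Tika's API; the class and method names (RawTitleSketch, extractTitle) are hypothetical. It assumes a simplified reading of the rule: everything after the title start tag up to the next "</title" is text, and only a handful of character references are decoded.

```java
class RawTitleSketch {

    // Case-insensitive indexOf; avoids toLowerCase(), which can change string length.
    static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    // Decodes only four named references (assumption: enough for illustration).
    // "&amp;" is handled last so that "&amp;lt;" becomes "&lt;", not "<".
    static String decodeBasicEntities(String text) {
        return text.replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
    }

    // Returns the title contents as raw text, or null if there is no title start tag.
    // Simplification: any tag name beginning with "title" is accepted.
    static String extractTitle(String html) {
        int open = indexOfIgnoreCase(html, "<title", 0);
        if (open < 0) {
            return null;
        }
        int start = html.indexOf('>', open);
        if (start < 0) {
            return null;
        }
        start++; // first character after the start tag
        int end = indexOfIgnoreCase(html, "</title", start);
        if (end < 0) {
            end = html.length(); // unterminated title: take the rest of the input
        }
        // Nested markup such as <b> is deliberately NOT interpreted as a tag here.
        return decodeBasicEntities(html.substring(start, end));
    }

    public static void main(String[] args) {
        String doc = "<html><title>title with a <b>tag</b> in it</title><body></body></html>";
        System.out.println(extractTitle(doc)); // prints: title with a <b>tag</b> in it
    }
}
```

With the document from this report, the sketch returns the full title including the literal _<b>_ markup, which is the result the report asks the HTML parser to produce.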