[jira] [Created] (TIKA-2700) The HTML parser should parse the contents of the title tag as raw text, not HTML

Gerard Bouchar (JIRA) Tue, 31 Jul 2018 06:22:16 -0700

Gerard Bouchar created TIKA-2700:
------------------------------------

             Summary: The HTML parser should parse the contents of the title 
tag as raw text, not HTML
                 Key: TIKA-2700
                 URL: https://issues.apache.org/jira/browse/TIKA-2700
             Project: Tika
          Issue Type: Bug
            Reporter: Gerard Bouchar
         Attachments: title.html


The current HTML parser in Tika fails to extract the correct document title 
when the title contains at least one unescaped '<' character.

 

For instance, in the following HTML document:

{code:html}
<html><title>title with a <b>tag</b> in it</title><body></body></html>
{code}

the extracted title is

{code}
title with a
{code}


Browsers, however, follow the [HTML parsing 
specification|https://www.w3.org/TR/2011/WD-html5-20110113/tokenization.html#parsing-main-inhead],
 which treats the contents of the title element as text rather than markup, 
and display this title as 

{code}
title with a <b>tag</b> in it
{code}

(with a literal _<b>_)
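
As a rough illustration of the expected behaviour (this is not Tika's implementation, and not a full HTML tokenizer), the sketch below takes everything between the title start tag and the next title end tag verbatim, without interpreting any markup in between. The class name TitleRawText and the helpers extractTitle and indexOfIgnoreCase are made up for this example. It also does not decode character references such as &amp;amp;, which a spec-compliant parser would additionally do inside a title.

```java
class TitleRawText {

    // Case-insensitive indexOf; returns -1 when the needle is not found.
    static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    // Hypothetical helper: returns the text between the first "<title...>"
    // start tag and the next "</title" end tag, leaving any '<' in between
    // untouched. Returns null if no complete title element is found.
    // Simplification: any tag whose name merely starts with "title" matches.
    static String extractTitle(String html) {
        int open = indexOfIgnoreCase(html, "<title", 0);
        if (open < 0) {
            return null;
        }
        int start = html.indexOf('>', open);
        if (start < 0) {
            return null;
        }
        start++; // first character after the start tag
        int end = indexOfIgnoreCase(html, "</title", start);
        if (end < 0) {
            return null;
        }
        return html.substring(start, end);
    }

    public static void main(String[] args) {
        String html =
            "<html><title>title with a <b>tag</b> in it</title><body></body></html>";
        // prints: title with a <b>tag</b> in it
        System.out.println(extractTitle(html));
    }
}
```

Only the title end tag terminates the title here, so the _<b>_ and _</b>_ inside it are kept as literal text.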



