[jira] [Commented] (TIKA-2700) The HTML parser should parse the contents of the title tag as raw text, not HTML

Gerard Bouchar (JIRA) Tue, 31 Jul 2018 08:41:21 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16563851#comment-16563851
 ]


Gerard Bouchar commented on TIKA-2700:
--------------------------------------

The bug is in TagSoup, which Tika uses. Unfortunately, the library seems to be
completely unmaintained (and it doesn't have any tests).
I made [a pull request|https://github.com/jukka/tagsoup/pull/4] to them, but I
doubt they will react anytime soon.

Are there plans to move to a better HTML parsing library by default?

> The HTML parser should parse the contents of the title tag as raw text, not 
> HTML
> --------------------------------------------------------------------------------
>
>                 Key: TIKA-2700
>                 URL: https://issues.apache.org/jira/browse/TIKA-2700
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Gerard Bouchar
>            Priority: Major
>         Attachments: title.html
>
>
> The current HTML parser in Tika fails to extract the correct document title
> when the title contains at least one unescaped '<' character.
>  
> For instance, in the following HTML document:
> {code:html}
> <html><title>title with a <b>tag</b> in it</title><body></body></html>
> {code}
> the extracted title is
> {code}
> title with a
> {code}
> Browsers, however, follow the [HTML parsing
> specification|https://www.w3.org/TR/2011/WD-html5-20110113/tokenization.html#parsing-main-inhead]
> and display this title as
> {code}
> title with a <b>tag</b> in it
> {code}
> (with a literal _<b>_)
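
To make the expected behavior concrete, here is a minimal sketch in plain Java of
what "title as raw text" means: everything between the title start tag and the
next title end tag is taken verbatim, without interpreting any markup in between.
The class and method names (RawTitleSketch, extractTitle) are hypothetical and
are not part of Tika or TagSoup; this is not how either library is implemented,
and it is not the fix proposed in the pull request. It also does not decode
character references such as &amp;, which a real parser would.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Tika or TagSoup.
final class RawTitleSketch {

    // Matches a <title ...> start tag, then lazily captures everything up to
    // the next "</title" that is followed by whitespace, '/' or '>'.
    // Markup inside the captured text (such as <b>) is left untouched.
    private static final Pattern TITLE = Pattern.compile(
            "<title(?:\\s[^>]*)?>(.*?)</title(?=[\\s/>])",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private RawTitleSketch() {
    }

    /** Returns the raw title text, or an empty Optional if no title element is found. */
    static Optional<String> extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

Called on the document from the issue description, extractTitle returns the
full string "title with a <b>tag</b> in it" rather than the truncated
"title with a".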



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
