[
https://issues.apache.org/jira/browse/TIKA-2758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16659213#comment-16659213
]
Tim Allison commented on TIKA-2758:
-----------------------------------
I agree with Ken: in the {{independent.html}} file, the meta-header is about
27k characters in, so it is never even read because our default setting
is 8k.
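To make the read-limit point concrete, here is a minimal sketch. It is not Tika's actual detector code; the class name {{MetaCharsetSniffer}}, the {{markLimit}} parameter name and the simplified regex are all made up for illustration. It only shows why a meta charset that sits past the limit is invisible to a detector that reads a bounded prefix.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, NOT Tika's implementation: it illustrates why a
// meta charset declared beyond the read limit is never seen.
final class MetaCharsetSniffer {

    // Simplified pattern (an assumption); real HTML prescanning covers more cases.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9_\\-]+)",
            Pattern.CASE_INSENSITIVE);

    /**
     * Returns the charset label declared in a meta tag within the first
     * markLimit bytes, or null if none is found in that prefix.
     */
    static String sniff(byte[] html, int markLimit) {
        int len = Math.min(html.length, markLimit);
        // ISO-8859-1 maps each byte to exactly one char, so ASCII markup survives.
        String head = new String(html, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }
}
```

With a declaration roughly 27k bytes in, {{sniff(bytes, 8192)}} returns null while a larger limit such as 65536 finds the label, which matches what we see with {{independent.html}}.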
Of more importance, though, to Ken's point about running against a corpus...we
did that before each release and we didn't detect any significant problems --
either there were actually no significant problems, or, also likely, I didn't
see the problem in the reports -- I missed the massive MP3Parser regression in
our 1.19 release :P.
So, where do we go from here with a) improving the regression corpus and eval
methods, and b) actually fixing the html charset detector?
a) let's discuss on TIKA-2750.
b) The newer {{StandardHtmlEncodingDetector}} gets the encoding right because
it has a fairly substantial lookup list of aliases in {{CharsetAliases}}.
Perhaps we should check that first as a whitelist?
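A rough sketch of what "check the alias list first" could look like, under stated assumptions: the class {{AliasFirstLookup}} and its tiny table are hypothetical and do not reflect the real contents or API of {{CharsetAliases}}. The idea is just to consult a known-good alias table before falling back to the JVM's own lookup.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; the real alias list is far longer.
final class AliasFirstLookup {

    // Assumed sample entries, limited to charsets every JVM must support.
    private static final Map<String, Charset> ALIASES = new HashMap<>();
    static {
        ALIASES.put("utf8", StandardCharsets.UTF_8);
        ALIASES.put("unicode11utf8", StandardCharsets.UTF_8);
        ALIASES.put("latin1", StandardCharsets.ISO_8859_1);
        ALIASES.put("ascii", StandardCharsets.US_ASCII);
    }

    /** Resolves a declared label: alias table first, then the JVM; null if unknown. */
    static Charset resolve(String label) {
        if (label == null) {
            return null;
        }
        String key = label.trim().toLowerCase(Locale.ROOT);
        Charset known = ALIASES.get(key);
        if (known != null) {
            return known;
        }
        try {
            return Charset.forName(key);
        } catch (IllegalArgumentException e) {
            // Covers both illegal names and unsupported charsets.
            return null;
        }
    }
}
```

Returning null for unknown labels leaves room for a caller to fall through to statistical detection instead of failing outright.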
> Possible error charset detection
> --------------------------------
>
> Key: TIKA-2758
> URL: https://issues.apache.org/jira/browse/TIKA-2758
> Project: Tika
> Issue Type: Bug
> Components: core
> Affects Versions: 1.18
> Reporter: Markus Jelsma
> Priority: Major
> Fix For: 1.20
>
> Attachments: detroidnews.html, independent.html
>
>
> I started to upgrade our SAX parser Tika dependency from 1.17 to 1.19, ran
> all 995 unit tests and observed three failures, two encoding issues and one
> other weird thing. The tests use real HTML.
> Where we previously extracted text such as 'Spokane, Wash. [— The solar' we
> now got 'Spokane, Wash. [â€" The solar' in one test. The other had 'could
> take ["weeks, or' but we now get 'could take [“weeks, or' extracted. Our
> tests pass with 1.17 but fail with 1.18 and 1.19.1.
> Attached are the two HTML files.
> Reading our tests again, I see an old note beside the independent test
> complaining about the character encoding being incorrect. It seems that
> somewhere before 1.17 it was faulty, just as it is now with 1.18 and higher.