[jira] [Updated] (TIKA-2758) Possible error charset detection

Markus Jelsma (JIRA) Thu, 18 Oct 2018 04:25:19 -0700


     [ https://issues.apache.org/jira/browse/TIKA-2758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Markus Jelsma updated TIKA-2758:
--------------------------------
    Description: 
I started to upgrade our SAX parser Tika dependency from 1.17 to 1.19, ran all 
995 unit tests and observed three failures, two encoding issues and one other 
weird thing. The tests use real HTML.

Where we previously extracted text such as 'Spokane, Wash. [— The solar' we 
now get 'Spokane, Wash. [â€" The solar' in one test. The other had 'could take 
["weeks, or' but we now get 'could take [â€œweeks, or' extracted. Our tests 
pass with 1.17 but fail with 1.18 and 1.19.1.

Attached are the two HTML files.

Reading our tests again, I see an old note beside the independent test 
complaining about the character encoding being incorrect. It seems that 
somewhere before 1.17 it was faulty, just as it is now with 1.18 and higher.

  was:
I started to upgrade our SAX parser Tika dependency from 1.17 to 1.19, ran all 
995 unit tests and observed three failures, two encoding issues and one other 
weird thing. The tests use real HTML.

Where we previously extracted text such as 'Spokane, Wash. [— The solar' we 
now get 'Spokane, Wash. [â€" The solar' in one test. The other had 'could take 
["weeks, or' but we now get 'could take [â€œweeks, or' extracted. Our tests 
pass with 1.17 but fail with 1.18 and 1.19.1.

Attached are the two HTML files.


> Possible error charset detection
> --------------------------------
>
>                 Key: TIKA-2758
>                 URL: https://issues.apache.org/jira/browse/TIKA-2758
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.18
>            Reporter: Markus Jelsma
>            Priority: Major
>             Fix For: 1.20
>
>         Attachments: detroidnews.html, independent.html
>
>
> I started to upgrade our SAX parser Tika dependency from 1.17 to 1.19, ran 
> all 995 unit tests and observed three failures, two encoding issues and one 
> other weird thing. The tests use real HTML.
> Where we previously extracted text such as 'Spokane, Wash. [— The solar' we 
> now get 'Spokane, Wash. [â€" The solar' in one test. The other had 'could 
> take ["weeks, or' but we now get 'could take [â€œweeks, or' extracted. Our 
> tests pass with 1.17 but fail with 1.18 and 1.19.1.
> Attached are the two HTML files.
> Reading our tests again, I see an old note beside the independent test 
> complaining about the character encoding being incorrect. It seems that 
> somewhere before 1.17 it was faulty, just as it is now with 1.18 and higher.
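
The garbled strings quoted in the description look like UTF-8 bytes that were 
decoded as windows-1252. That is an assumption about the cause, not something 
the report states, and the helper names below are hypothetical rather than 
part of Tika. A minimal Python sketch that reproduces the pattern and reverses 
it:

```python
# Assumption: the page bytes are UTF-8 but were decoded as windows-1252.
# misdecode() and repair() are hypothetical helpers, not Tika APIs.

def misdecode(text, actual="utf-8", assumed="windows-1252"):
    """Encode with the real charset, then decode with the wrongly detected one."""
    return text.encode(actual).decode(assumed)


def repair(garbled, actual="utf-8", assumed="windows-1252"):
    """Reverse the wrong decoding; works only if every character round-trips."""
    return garbled.encode(assumed).decode(actual)


samples = [
    "Spokane, Wash. \u2014 The solar",  # U+2014 EM DASH
    "could take \u201cweeks, or",       # U+201C LEFT DOUBLE QUOTATION MARK
]

for original in samples:
    garbled = misdecode(original)
    # ascii() shows the code points so the output survives any terminal charset
    print(ascii(original), "->", ascii(garbled))
    assert repair(garbled) == original
```

If the strings extracted by 1.18 and 1.19.1 match what misdecode() produces for 
the 1.17 output, that would point at the charset detection step rather than at 
the HTML parsing itself.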



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
