Markus Jelsma created TIKA-2758:
-----------------------------------

             Summary: Possible error charset detection
                 Key: TIKA-2758
                 URL: https://issues.apache.org/jira/browse/TIKA-2758
             Project: Tika
          Issue Type: Bug
          Components: core
    Affects Versions: 1.18
            Reporter: Markus Jelsma
             Fix For: 1.20
         Attachments: detroidnews.html, independent.html

I started to upgrade our SAX parser Tika dependency from 1.17 to 1.19, ran all 
995 unit tests and observed three failures: two encoding issues and one other 
oddity. The tests use real HTML.

Where we previously extracted text such as 'Spokane, Wash. [— The solar' we 
now got 'Spokane, Wash. [â€" The solar' in one test. The other had 'could take 
["weeks, or' but we now get 'could take [“weeks, or' extracted. Our tests 
pass with 1.17 but fail with 1.18 and 1.19.1.
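
The first symptom looks like what you get when UTF-8 bytes are decoded with a 
single-byte Windows charset. Below is a minimal sketch of that effect, assuming 
windows-1252 is the wrongly chosen charset; the report does not say which 
charset was actually detected. It is written in Python for brevity and does not 
use Tika. The helper names misdecode and repair are hypothetical.

```python
# Illustration only: reproduces the kind of garbling described in the report.
# Assumption: the bytes are UTF-8 and the detector wrongly picked windows-1252.


def misdecode(text: str, wrong_charset: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode those bytes with the wrong charset."""
    return text.encode("utf-8").decode(wrong_charset)


def repair(garbled: str, wrong_charset: str = "cp1252") -> str:
    """Undo misdecode() by reversing the two steps."""
    return garbled.encode(wrong_charset).decode("utf-8")


if __name__ == "__main__":
    em_dash = "\u2014"     # the dash in 'Spokane, Wash. ... The solar'
    left_quote = "\u201c"  # a left double quotation mark

    # Each character is three bytes in UTF-8, so each becomes three characters.
    print(ascii(misdecode(em_dash)))     # '\xe2\u20ac\u201d'
    print(ascii(misdecode(left_quote)))  # '\xe2\u20ac\u0153'

    # The damage is reversible as long as every byte maps to a character.
    sample = "Spokane, Wash. \u2014 The solar"
    print(repair(misdecode(sample)) == sample)  # True
```

If the extracted text round-trips through repair() back to the expected string, 
that would point to a wrong charset guess rather than corrupted input.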

Attached are the two HTML files.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
