https://bugs.kde.org/show_bug.cgi?id=491824

            Bug ID: 491824
           Summary: KEncodingProber misdetects short UTF-8 text as
                    Shift_JIS or gb18030 with high confidence=0.99
    Classification: Frameworks and Libraries
           Product: frameworks-kcodecs
           Version: 6.5.0
          Platform: Compiled Sources
                OS: Linux
            Status: REPORTED
          Severity: normal
          Priority: NOR
         Component: general
          Assignee: kdelibs-b...@kde.org
          Reporter: igor...@gmail.com
  Target Milestone: ---

Created attachment 172704
  --> https://bugs.kde.org/attachment.cgi?id=172704&action=edit
Patch for kencodingprobertest.cpp that demonstrates the bug

SUMMARY
KEncodingProber detects 4 UTF-8-encoded Russian characters (8 bytes) as
Shift_JIS with confidence=0.99. Appending the end-of-line character '\n' to
these 8 bytes makes KEncodingProber detect this short text as gb18030 with the
same high confidence of 0.99. This issue was discovered while testing
KDevelop's single use of KEncodingProber:
https://invent.kde.org/kdevelop/kdevelop/-/issues/71#note_1007105
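
To make the input in question concrete, here is a minimal sketch in standard
C++ (no Qt or KCodecs). It assumes a hypothetical sample word, "тест", since
the actual test bytes are only in the attached patch. It shows that 4 Cyrillic
characters occupy 8 bytes in UTF-8 and that these bytes, with or without a
trailing '\n', pass a purely structural UTF-8 check. The names
isWellFormedUtf8(), sampleRussianUtf8() and checkSample() are made up for this
illustration, and the check is not the algorithm KEncodingProber uses.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if `bytes` is structurally well-formed UTF-8: valid lead and
// continuation bytes, no overlong forms, no surrogates, nothing > U+10FFFF.
bool isWellFormedUtf8(const std::string &bytes)
{
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        if (lead < 0x80) { // plain ASCII, e.g. '\n'
            ++i;
            continue;
        }
        std::size_t extra = 0;     // number of continuation bytes expected
        unsigned long cp = 0;      // decoded code point
        unsigned long minimum = 0; // smallest code point valid for this length
        if ((lead & 0xE0) == 0xC0) {
            extra = 1; cp = lead & 0x1F; minimum = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2; cp = lead & 0x0F; minimum = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            extra = 3; cp = lead & 0x07; minimum = 0x10000;
        } else {
            return false; // stray continuation byte or invalid lead byte
        }
        if (i + extra >= n) {
            return false; // sequence is truncated
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) {
                return false; // not a continuation byte
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < minimum || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return false; // overlong form, out of range, or surrogate
        }
        i += extra + 1;
    }
    return true;
}

// Hypothetical sample (an assumption, not the bytes from the attachment):
// "тест" spelled out as explicit UTF-8 bytes, 2 bytes per Cyrillic letter.
std::string sampleRussianUtf8()
{
    return std::string("\xD1\x82\xD0\xB5\xD1\x81\xD1\x82");
}

// Usage example: both variants described in the report are well-formed UTF-8.
void checkSample()
{
    const std::string text = sampleRussianUtf8();
    assert(text.size() == 8);
    assert(isWellFormedUtf8(text));
    assert(isWellFormedUtf8(text + "\n"));
}
```

Passing such a check does not prove that the text is UTF-8, but it is the kind
of evidence one would expect to weigh against a 0.99 confidence for Shift_JIS
or gb18030 on the same bytes.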

Also, KEncodingProber::reset() leaves previously fed data behind, although it
is documented as "reset the prober's internal state and data." So either
reset()'s behavior is wrong or the documentation is misleading.

STEPS TO REPRODUCE
1. Download the attached patch, apply it to kcodecs and build.
2. Run the following command from the build directory of kcodecs:
QT_LOGGING_RULES='default.debug=true' ./bin/kencodingprobertest
3. Replace `#if 1` with `#if 0` in the code added by the patch and rebuild
kcodecs.
4. Repeat step 2.
5. Read and compare the two test run outputs and the patch itself.
