[frameworks-baloo] [Bug 439857] New: baloo only indexes first 4096 bytes of non-UTF-8 text and html files

skierpage Wed, 14 Jul 2021 15:57:45 -0700

https://bugs.kde.org/show_bug.cgi?id=439857


            Bug ID: 439857
           Summary: baloo only indexes first 4096 bytes of non-UTF-8 text
                    and html files
           Product: frameworks-baloo
           Version: 5.83.0
          Platform: Fedora RPMs
                OS: Linux
            Status: REPORTED
          Severity: major
          Priority: NOR
         Component: Baloo File Daemon
          Assignee: baloo-bugs-n...@kde.org
          Reporter: skierp...@gmail.com
  Target Milestone: ---

SUMMARY
While investigating bug 410680, @tagwerk19 figured out that a problematic file
had an ISO 8859 copyright symbol at the start. By laboriously running
`strace --follow-forks` on baloo_file I determined that some child process
(baloo_file_extractor?) reads the first 4096 bytes of the file, then gives up.
Sure enough, baloo_file only indexes terms that appear in the first 4096 bytes
of the file.

This is terrible behavior for anyone relying on Baloo. Your file appears to be
indexed with no errors, but baloo will only return it for searches that match
terms in its first 4096 bytes. Until this is fixed (the bug may lie in
frameworks/kfilemetadata) there _has_ to be some warning to this effect, both
in the documentation and in the operation of baloo_file.

STEPS TO REPRODUCE
0. Run `balooctl monitor`
1. Save an HTML file with a non-UTF-8 character near the start to a location
that Baloo indexes. I used
https://demo.borland.com/testsite/stadyn_largepagewithimages.html
2. balooctl monitor should report "Indexing: /path/to/file.html"
3. Run `balooshow -x /path/to/file.html`
4. To prove only the first 4096 bytes are indexed, save them to a new file,
file_start.html (use vim's :goto 4096 to go to byte offset 4096).
5. Run `balooshow -x /path/to/file_start.html`

6. Repeat these steps with a text file. I saved the demo file as text in my
browser.
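
Step 4 can also be done without an editor. The following is a minimal Python
sketch, not part of the original report; the function name `write_prefix` and
the file names in the comments are hypothetical.

```python
def write_prefix(src, dst, nbytes=4096):
    """Copy the first `nbytes` bytes of `src` to `dst`; return bytes written."""
    # Read in binary mode so no character decoding is attempted.
    with open(src, "rb") as f:
        data = f.read(nbytes)
    with open(dst, "wb") as f:
        f.write(data)
    return len(data)


# Hypothetical usage, with paths inside a folder that Baloo indexes:
# write_prefix("/path/to/file.html", "/path/to/file_start.html")
```

After the truncated copy has been indexed, compare the Terms: lines that
`balooshow -x` prints for the two files.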

OBSERVED RESULT
baloo only indexes terms found in the first 4096 bytes of the HTML and text
files.
The output of `balooshow -x` on the truncated file includes exactly the same
Terms: line as the output for the full file.

EXPECTED RESULT
Baloo should index the full contents of all text and HTML files.
While this bug exists, better warnings and logging from the `baloo_file` daemon
and `balooctl monitor` are essential.

SOFTWARE/OS VERSIONS
Linux/KDE Plasma: Fedora 34 KDE spin
KDE Plasma Version: 5.22.3
KDE Frameworks Version: 5.83.0
Qt Version: 5.12.2 on Wayland

ADDITIONAL INFORMATION
Text and HTML files in other encodings, which also contain bytes that are
invalid in UTF-8, probably trigger this bug too.
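
To check whether a given file is likely to be affected, one can look for the
first byte that is not valid UTF-8. This is a rough Python sketch under that
assumption; `first_invalid_utf8_offset` is a hypothetical helper and says
nothing about how Baloo itself detects encodings.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")  # strict decoding raises on the first bad byte
    except UnicodeDecodeError as err:
        return err.start
    return None


# Hypothetical usage on a file in an indexed folder:
# with open("/path/to/file.html", "rb") as f:
#     print(first_invalid_utf8_offset(f.read()))
```

A result of None means the file decodes cleanly as UTF-8; any integer is the
offset of the first offending byte, such as an ISO-8859-1 copyright symbol
(byte 0xA9).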

Detecting a file's character encoding is hard, but browsers do it pretty well
and have open-source implementations. Simply continuing to read and index the
file despite any character encoding issues would be better.
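
As an illustration of "continue reading despite encoding issues", the sketch
below decodes leniently, substituting U+FFFD for each undecodable byte instead
of stopping. It is only a Python illustration of the idea, with a hypothetical
function name; Baloo and KFileMetaData are C++ and may need a different
approach.

```python
def decode_lenient(data: bytes) -> str:
    """Decode as UTF-8, replacing invalid bytes with U+FFFD instead of failing."""
    return data.decode("utf-8", errors="replace")


# An ISO-8859-1 copyright byte at the start, then more than 4096 bytes of text.
sample = b"\xa9 2021 " + b"filler " * 1000 + b"needle"
text = decode_lenient(sample)
# The word near the end of the data is still available for indexing.
print("needle" in text)
```

With this approach one bad byte costs at most one garbled character rather
than the rest of the file.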

It is very difficult to trace what's going on because baloo_file_extractor,
baloo_filemetadata_temp_extractor, the kfileextractors, and the file indexing
process in general are largely undocumented. This is mentioned in bug 398101
but it's a more extensive problem.

As a workaround you can convert files to UTF-8; @tagwerk19 suggests `iconv -f
ISO-8859-1 -t utf-8 /path/to/file.extension > /path/to/file_utf8.ext`. There
seem to be other bugs in indexing large files; see later comments in bug
410680.

