https://bugs.kde.org/show_bug.cgi?id=439857
Bug ID: 439857
Summary: baloo only indexes first 4096 bytes of non-UTF-8 text and html files
Product: frameworks-baloo
Version: 5.83.0
Platform: Fedora RPMs
OS: Linux
Status: REPORTED
Severity: major
Priority: NOR
Component: Baloo File Daemon
Assignee: baloo-bugs-n...@kde.org
Reporter: skierp...@gmail.com
Target Milestone: ---

SUMMARY
While investigating bug 410680, @tagwerk19 figured out that a problematic file had an ISO-8859 copyright symbol at the start. By laboriously running `strace --follow-forks` on baloo_file, I determined that some child process (baloo_file_extractor?) reads the first 4096 bytes of the file, then gives up. Sure enough, baloo_file only indexes terms that appear in the first 4096 bytes of the file.

This is terrible behavior for anyone relying on Baloo. Your file appears to be indexed with no errors, but Baloo will only return it for searches on terms that appear in its first 4096 bytes. Until this is fixed (the bug may lie in frameworks/kfilemetadata) there _has_ to be some warning to this effect, both in the documentation and in the operation of baloo_file.

STEPS TO REPRODUCE
0. Run `balooctl monitor`.
1. Save an HTML file with a non-UTF-8 character near the start to a location that Baloo indexes. I used https://demo.borland.com/testsite/stadyn_largepagewithimages.html
2. `balooctl monitor` should report "Indexing: /path/to/file.html".
3. Run `balooshow -x /path/to/file.html`.
4. To prove that only the first 4096 bytes are indexed, save them to a new file, file_start.html (use vim's :goto 4096 to go to byte offset 4096).
5. Run `balooshow -x /path/to/file_start.html`.
6. Repeat these steps with a text file. I saved the demo file as text in my browser.

OBSERVED RESULT
Baloo only indexes terms found in the first 4096 bytes of the HTML and text files. The output of `balooshow -x` on the shorter file includes exactly the same Terms: line.

EXPECTED RESULT
Baloo should index the entire contents of text and HTML files.
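Step 4 of the reproduction can also be done without an editor. The following is only a sketch, assuming a `head` that supports `-c` (GNU coreutils and the BSDs do); `make_prefix` is a hypothetical helper name, not part of Baloo, and all paths are placeholders.

```shell
#!/bin/sh
# Hypothetical helper (not part of Baloo): copy the first 4096 bytes of a
# file to a new file, mirroring step 4 of the reproduction.
make_prefix() {
    src=$1
    dst=$2
    head -c 4096 "$src" > "$dst"
}

# Usage (placeholder paths; the copy must also be in a folder Baloo indexes):
#   make_prefix /path/to/file.html /path/to/file_start.html
#   balooshow -x /path/to/file.html
#   balooshow -x /path/to/file_start.html
# Then compare the Terms: line in the two outputs.
```

If the two Terms: lines match, the content after byte 4096 contributed nothing to the index.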
While this bug exists, better warnings and logging from the `baloo_file` daemon and `balooctl monitor` are essential.

SOFTWARE/OS VERSIONS
Linux/KDE Plasma: Fedora 34 KDE spin
KDE Plasma Version: 5.22.3
KDE Frameworks Version: 5.83.0
Qt Version: 5.12.2 on Wayland

ADDITIONAL INFORMATION
Text and HTML files in other encodings that contain bytes invalid in UTF-8 probably trigger this bug too. Detecting a file's character encoding is hard, but browsers do it pretty well and have open-source implementations. Simply continuing to read and index the file despite any character encoding issues would be better.

It is very difficult to trace what's going on because baloo_file_extractor, baloo_filemetadata_temp_extractor, the kfileextractors, and the file indexing process in general are largely undocumented. This is mentioned in bug 398101, but it's a more extensive problem.

As a workaround you can convert files to UTF-8; @tagwerk19 suggests `iconv -f ISO-8859-1 -t utf-8 /path/to/file.extension > /path/to/file_utf8.ext`.

There seem to be other bugs in indexing large files; see the later comments in bug 410680.
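To find files that may be affected and apply the `iconv` workaround to them, something like the following might help. It is a sketch only: `is_utf8` and `to_utf8_copy` are hypothetical helper names, not part of Baloo or KFileMetaData. It relies on `iconv` exiting non-zero when it meets an invalid input sequence, and it assumes the source encoding is ISO-8859-1 unless told otherwise, which will produce wrong characters for files in some other encoding.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Baloo or KFileMetaData).

# Succeeds only if the file decodes cleanly as UTF-8: iconv exits non-zero
# when it encounters an invalid input sequence.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Write a UTF-8 copy of a file. The source encoding is an assumption:
# ISO-8859-1 unless a third argument names another encoding.
to_utf8_copy() {
    src=$1
    dst=$2
    from=${3:-ISO-8859-1}
    iconv -f "$from" -t UTF-8 "$src" > "$dst"
}

# Usage (placeholder paths):
#   is_utf8 /path/to/file.html ||
#       to_utf8_copy /path/to/file.html /path/to/file_utf8.html
```

The original file is left untouched, so the converted copy can be checked with `balooshow -x` before anything is replaced.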