https://bugs.kde.org/show_bug.cgi?id=410680

Bug ID: 410680
Summary: baloo doesn't index words far down in HTML documents
Product: frameworks-baloo
Version: 5.59.0
Platform: Fedora RPMs
OS: Linux
Status: REPORTED
Severity: normal
Priority: NOR
Component: Engine
Assignee: stefan.bru...@rwth-aachen.de
Reporter: skierp...@gmail.com
Target Milestone: ---

SUMMARY
I realized `baloosearch TERM` wasn't returning a 750 kB HTML document that I knew contained TERM starting at byte offset 72,814. But it does work if TERM is nearer the start. I reproduced this with a 100 kB file: baloosearch doesn't return the file if TERM is more than about 61,600 bytes from the start. I also reproduced it with a big HTML file off the web.

STEPS TO REPRODUCE
1. Find a big HTML file (over 100 kB) and look for a word that only appears near the end, or just insert <p>NOSUCHWORD</p> somewhere near the end of the file. (I found https://demo.borland.com/testsite/stadyn_largepagewithimages.html but got inconsistent results.)
2. Run `balooctl monitor` in a terminal.
3. Copy the HTML file to a location that Baloo indexes, e.g. your home directory.
4. After `balooctl monitor` reports it's Indexing: file, then Idle, enter `baloosearch NOSUCHWORD`.

E.g. I found (using `rg --byte-offset NOSUCHWORD`) that "SSLv3" first appears 85,249 bytes into that test file, and baloosearch doesn't return it.

OBSERVED RESULT
Baloo doesn't index words beyond "a certain point" in an HTML file.

EXPECTED RESULT
Baloo should index the entire file... except when it intentionally doesn't. I found a five-year-old plasma-devel thread https://plasma-devel.kde.narkive.com/TJAmjxUb/baloo-not-indexing-everything-by-default in which someone suggested "Just index the first say 100 KiB or so of a file". I don't know if that was implemented. If it has been, there *MUST* be good documentation of this, plus logging and warnings when Baloo intentionally doesn't index part or all of a file. E.g. `balooshow path/to/file` could say "Large file, only the first 64 KiB of text in it was indexed."
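
To narrow down where the cutoff lies, the test files from the steps to reproduce could be generated rather than edited by hand. The following is only a sketch: the helper names (`make_html`, `write_test_files`), the filler text, and the file naming are hypothetical and not part of Baloo. It writes HTML files of a fixed size in which NOSUCHWORD starts at a known byte offset.

```python
from pathlib import Path

# Hypothetical filler and document skeleton; any valid HTML would do.
HEAD = b"<!DOCTYPE html>\n<html><head><title>offset test</title></head><body>\n"
FILLER = b"<p>lorem ipsum dolor sit amet</p>\n"
TAIL = b"</body></html>\n"


def _pad(n: int) -> bytes:
    """Return exactly n bytes: whole filler paragraphs, then spaces."""
    count, rest = divmod(n, len(FILLER))
    return FILLER * count + b" " * rest


def make_html(marker: str, marker_offset: int, total_size: int) -> bytes:
    """Build an HTML document of total_size bytes in which the marker
    word starts exactly at byte marker_offset."""
    pad_len = marker_offset - len(HEAD) - len(b"<p>")
    if pad_len < 0:
        raise ValueError("marker_offset is too small to fit the document head")
    doc = HEAD + _pad(pad_len) + b"<p>" + marker.encode("ascii") + b"</p>\n"
    remaining = total_size - len(doc) - len(TAIL)
    if remaining < 0:
        raise ValueError("total_size is too small for this marker_offset")
    return doc + _pad(remaining) + TAIL


def write_test_files(directory: Path, offsets, total_size: int = 100_000,
                     marker: str = "NOSUCHWORD") -> list[Path]:
    """Write one file per offset; the offset is encoded in the file name."""
    directory.mkdir(parents=True, exist_ok=True)
    paths = []
    for offset in offsets:
        path = directory / f"offset_{offset:06d}.html"
        path.write_bytes(make_html(marker, offset, total_size))
        paths.append(path)
    return paths
```

For example, calling `write_test_files` with a directory that Baloo indexes and offsets from 50,000 to 90,000 in steps of 10,000 gives five 100 kB files. Once `balooctl monitor` goes idle, `baloosearch NOSUCHWORD` should list only the files whose marker was indexed, and the file names show roughly where the cutoff falls.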

SOFTWARE/OS VERSIONS
Linux/KDE Plasma: (available in About System)
KDE Plasma Version: 5.15.5
KDE Frameworks Version: 5.59.0
Qt Version: 5.12.4, xcb

ADDITIONAL INFORMATION
There doesn't seem to be any way to run baloo_file_indexer yourself to find out what it gets from a file. Nor could I figure out what the baloo-widgets baloo_filemetadata_temp_extractor does, or how to get useful logging of text extraction. This all makes debugging painful.

The Baloo source README.md says "Baloo relies on [KFileMetaData](https://api.kde.org/frameworks/kfilemetadata/html/index.html) to extract content from the files", so maybe the problem lies in that library. There's no specific extractor in either project for HTML files.