https://bugs.kde.org/show_bug.cgi?id=410680

            Bug ID: 410680
           Summary: baloo doesn't index words far down in HTML documents
           Product: frameworks-baloo
           Version: 5.59.0
          Platform: Fedora RPMs
                OS: Linux
            Status: REPORTED
          Severity: normal
          Priority: NOR
         Component: Engine
          Assignee: stefan.bru...@rwth-aachen.de
          Reporter: skierp...@gmail.com
  Target Milestone: ---

SUMMARY
I realized `baloosearch TERM` wasn't returning a 750 kB HTML document that I
knew contained TERM starting at byte offset 72,814. The search does work if
TERM is nearer the start. I reproduced this with a 100 kB file: baloosearch
doesn't return the file if TERM is more than about 61,600 bytes from the
start. I also reproduced it with a big HTML file off the web.

STEPS TO REPRODUCE
1. Find a big HTML file (over 100 kB) and look for a word that only appears near
the end, or just insert <p>NOSUCHWORD</p> somewhere near the end of the file.
(I found https://demo.borland.com/testsite/stadyn_largepagewithimages.html but
got inconsistent results.)
2. Run `balooctl monitor` in a terminal
3. Copy the HTML file to a location that Baloo indexes, e.g. your home
directory
4. After `balooctl monitor` reports it's Indexing: file, then Idle, enter
`baloosearch NOSUCHWORD`. E.g. I found (using `rg --byte-offset NOSUCHWORD`)
that "SSLv3" first appears 85,249 bytes into that test file, and baloosearch
doesn't return the file.
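
To make step 1 repeatable, here is a minimal Python sketch that builds a test
file with the marker word at a chosen byte offset. The function name
`build_html` and the filler text are my own inventions for illustration; the
file is plain ASCII so byte offsets and character offsets coincide.

```python
def build_html(total_size: int, marker: str, offset: int) -> bytes:
    """Return an ASCII HTML document of total_size bytes in which the
    word `marker` starts exactly at byte `offset` (hypothetical helper)."""
    head = b"<html><body>\n"
    tail = b"</body></html>\n"
    filler = b"<p>lorem ipsum dolor sit amet</p>\n"
    open_tag = b"<p>"

    # Bytes of padding needed before "<p>MARKER</p>".
    pad_len = offset - len(head) - len(open_tag)
    if pad_len < 0:
        raise ValueError("offset is too small to fit the document header")

    # Whole filler paragraphs, topped up with spaces to hit the exact offset.
    before = filler * (pad_len // len(filler)) + b" " * (pad_len % len(filler))
    body = head + before + open_tag + marker.encode("ascii") + b"</p>\n"

    # Pad the remainder the same way so the file reaches total_size.
    rest = max(0, total_size - len(body) - len(tail))
    after = filler * (rest // len(filler)) + b" " * (rest % len(filler))
    return body + after + tail
```

Writing the result of `build_html(100_000, "NOSUCHWORD", 85_249)` to a file in
an indexed directory gives a 100 kB file for steps 2 to 4.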

OBSERVED RESULT
Baloo doesn't index words beyond "a certain point" in an HTML file.
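
To narrow down where "a certain point" is, one approach is a probe file with a
unique word every few kilobytes, then searching for each word in turn. The
Python sketch below is only an outline: `build_probe`, `last_indexed`,
`baloosearch_finds` and the `zqprobeNNNN` words are hypothetical names, and I
am assuming that baloosearch prints the paths of matching files on stdout and
that each probe word is indexed as a single term.

```python
import subprocess
from typing import Callable, List, Optional, Tuple

Marker = Tuple[int, str]  # (byte offset of the word, the word itself)


def build_probe(total_size: int, step: int) -> Tuple[bytes, List[Marker]]:
    """Build an ASCII HTML document with a unique word roughly every
    `step` bytes, and return it with the exact offset of each word."""
    filler = b"<p>lorem ipsum dolor sit amet</p>\n"
    parts = [b"<html><body>\n"]
    size = len(parts[0])
    markers: List[Marker] = []
    next_mark = step
    while size < total_size:
        if size >= next_mark:
            word = f"zqprobe{len(markers):04d}"
            para = f"<p>{word}</p>\n".encode("ascii")
            markers.append((size + len("<p>"), word))
            next_mark += step
        else:
            para = filler
        parts.append(para)
        size += len(para)
    parts.append(b"</body></html>\n")
    return b"".join(parts), markers


def last_indexed(markers: List[Marker],
                 is_found: Callable[[str], bool]) -> Optional[int]:
    """Return the offset of the last word found before the first miss,
    or None if even the first word is missing."""
    last = None
    for offset, word in markers:
        if not is_found(word):
            break
        last = offset
    return last


def baloosearch_finds(word: str, path: str) -> bool:
    """Assumption: baloosearch lists matching file paths on stdout."""
    result = subprocess.run(["baloosearch", word],
                            capture_output=True, text=True)
    return path in result.stdout
```

After the probe file has been indexed, calling `last_indexed(markers, lambda
w: baloosearch_finds(w, "/path/to/probe.html"))` should give the offset of the
last probe word that was indexed; the cutoff lies between that word and the
next one.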

EXPECTED RESULT
Baloo should index the entire file... except when it intentionally doesn't.

I found a five-year-old plasma-devel thread
https://plasma-devel.kde.narkive.com/TJAmjxUb/baloo-not-indexing-everything-by-default
in which someone suggested "Just index the first say 100 KiB or so of a file".
I don't know whether that was implemented. If it has been, there *MUST* be good
documentation of this and logging and warnings when Baloo intentionally doesn't
index part or all of a file. E.g. `balooshow path/to/file` could say "Large
file, only the first 64 KiB of text in it was indexed."

SOFTWARE/OS VERSIONS
Linux/KDE Plasma: 
(available in About System)
KDE Plasma Version: 5.15.5
KDE Frameworks Version: 5.59.0
Qt Version: 5.12.4, xcb

ADDITIONAL INFORMATION
There doesn't seem to be any way to run baloo_file_indexer yourself to find out
what it gets from a file. Nor could I figure out what the baloo-widgets
baloo_filemetadata_temp_extractor does, or how to get useful logging of text
extraction. This all makes debugging painful.

The Baloo source README.md says "Baloo relies on
[KFileMetaData](https://api.kde.org/frameworks/kfilemetadata/html/index.html)
to extract content from the files", so maybe the problem lies in that library.
There's no specific extractor in either project for HTML files.
