https://bugs.kde.org/show_bug.cgi?id=438455

            Bug ID: 438455
           Summary: Baloo doesn't index Microsoft Office .doc files
           Product: frameworks-baloo
           Version: 5.82.0
          Platform: Fedora RPMs
                OS: Linux
            Status: REPORTED
          Severity: normal
          Priority: NOR
         Component: Baloo File Daemon
          Assignee: stefan.bru...@rwth-aachen.de
          Reporter: skierp...@gmail.com
                CC: baloo-bugs-n...@kde.org, n...@kde.org
  Target Milestone: ---

SUMMARY
`baloosearch` couldn't find a word processing file when searching for a term
the file contains. The file was a .doc file, not .docx or .odt.

STEPS TO REPRODUCE
1. In LibreOffice Writer, create a document containing just
"baloopleaseindexme"
2. File > Save As in Word 97-2003 format as baloo_indexing_test.doc in some
directory that Baloo indexes.
3. In a terminal, run `baloosearch baloopleaseindexme`
4. In a terminal, run `balooshow -x /path/to/baloo_indexing_test.doc`

OBSERVED RESULT
The document contents aren't indexed, so baloosearch for the content fails.
balooshow doesn't list any words in the document, just
  Terms: Mapplication Mmsword T5 X19-0 X20-0
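
The check above can be scripted. This is only a sketch, and it rests on an
assumption about Baloo's term format that this report does not confirm: that
metadata terms carry an uppercase prefix (as M, T and X do in the line above)
while indexed content words are stored without one. `content_terms` is a
hypothetical helper name, not a Baloo tool.

```shell
#!/bin/sh
# Print the unprefixed terms from a balooshow "Terms:" line, one per line.
# ASSUMPTION: terms starting with an uppercase letter (Mapplication, T5,
# X19-0, ...) are metadata, and content words are unprefixed.
content_terms() (
    set -f                      # disable globbing while word-splitting terms
    line=${1#*Terms:}           # drop everything up to and including "Terms:"
    for term in $line; do
        case $term in
            [[:upper:]]*) ;;    # prefixed metadata term, skip it
            *) printf '%s\n' "$term" ;;
        esac
    done
)

# The line from this report yields no output, i.e. no content terms.
content_terms "  Terms: Mapplication Mmsword T5 X19-0 X20-0"
```

If the assumption holds, an indexed document would print at least
"baloopleaseindexme" here.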


EXPECTED RESULT
baloo should index these files as it does .odt and .docx files.

SOFTWARE/OS VERSIONS
Linux/KDE Plasma: 
KDE Plasma Version: 5.21.5
KDE Frameworks Version: 5.82.0
Qt Version: 5.15.2 on Wayland

ADDITIONAL INFORMATION
There are tools to extract text from MS Office files. For example,
  % flatpak run org.libreoffice.LibreOffice --invisible --convert-to txt \
      --outdir /tmp/ /path/to/baloo_indexing_test.doc
converts a .doc file to .txt. TDF's Document Liberation Project also offers
introspection tools such as mso-dumper's doc-dump, which dumps the file in an
unusual XML format.
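
As an illustration only, the LibreOffice command above can be wrapped in a
small shell function. `doc_to_txt` and the `SOFFICE` override variable are
hypothetical names introduced for this sketch; they are not part of
LibreOffice or Baloo, and this is not a proposal for how Baloo itself should
extract text.

```shell
#!/bin/sh
# Convert one .doc file to plain text with the LibreOffice command from this
# report. SOFFICE (hypothetical) lets you substitute another launcher, e.g. a
# natively installed soffice binary instead of the flatpak.
doc_to_txt() {
    src=$1                      # path to the .doc file
    outdir=${2:-/tmp}           # output directory, /tmp by default
    # Left unquoted on purpose so the default splits into separate words.
    ${SOFFICE:-flatpak run org.libreoffice.LibreOffice} \
        --invisible --convert-to txt --outdir "$outdir" "$src"
}

# Usage: doc_to_txt /path/to/baloo_indexing_test.doc /tmp
```

Setting `SOFFICE=echo` prints the arguments instead of running LibreOffice,
which is a cheap way to inspect the command before executing it.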

In the interim this limitation should be mentioned somewhere, but I can't see
where Baloo describes the file types whose content it does index.

I don't know whether Baloo indexes the contents of other MS Office formats
from the 1990-2000 era. Again, I shouldn't have to create test files to find
out; known limitations should be documented.

