https://bugs.kde.org/show_bug.cgi?id=410680
--- Comment #5 from skierpage <skierp...@gmail.com> ---
(In reply to tagwerk19 from comment #4)
> (In reply to skierpage from comment #2)
> > ... stadyn_largpagewithimages.html ...
> ... There's a plain A9 hex there, maybe a bit "old school". Try converting the
> file to unicode...
>
>     iconv -f ISO-8859-1 -t utf-8 stadyn_largepagewithimages.html > test.html

And terms indexed according to `balooshow -x` jumped from 129 words to 2671! You win teh InterWebz.

Now "Design" and "Principles" are indexed 🎉 ... but still not words later on like "SSLv3" and "CANPENDING". However, after laboriously running strace --follow-forks on baloo_file, it seems some child process (baloo_file_extractor?) does read the entire UTF-8 file's contents. I'll try to research that problem more.

I strace'd baloo_file on the original non-UTF-8 files, and some child process does one 4096-byte read of the start of the file, then packs it in! That's why baloo indexed so few terms in the original files; I filed bug 439857.
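
For anyone wanting to repeat the conversion on several files, here is a minimal sketch of the workaround, not anything Baloo itself does. The helper names `is_utf8` and `to_utf8` are hypothetical, and ISO-8859-1 is only an assumed source encoding (it was the suggestion in comment #4; a file that is really Windows-1252 or something else would convert without error but with wrong characters).

```shell
#!/bin/sh
# Hypothetical helpers for illustration; they are not part of Baloo.

# Succeeds only if the file decodes cleanly as UTF-8.
# iconv exits non-zero when it meets an invalid byte sequence,
# such as a bare 0xA9 byte.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Writes a UTF-8 copy of file $1 to $2.
# ASSUMPTION: anything that is not valid UTF-8 is treated as ISO-8859-1.
to_utf8() {
    if is_utf8 "$1"; then
        cp "$1" "$2"
    else
        iconv -f ISO-8859-1 -t UTF-8 "$1" > "$2"
    fi
}
```

With these defined, `to_utf8 stadyn_largepagewithimages.html test.html` followed by `balooshow -x test.html` should reproduce the comparison described above. Files that are already valid UTF-8 (including plain ASCII) are copied unchanged, so running it twice does not double-encode.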