https://bugs.kde.org/show_bug.cgi?id=410680

--- Comment #5 from skierpage <skierp...@gmail.com> ---
(In reply to tagwerk19 from comment #4)
> (In reply to skierpage from comment #2)
> > ... stadyn_largpagewithimages.html ...
> ... There's a plain A9 hex there, maybe a bit "old school". Try converting the
> file to unicode...
> 
>     iconv -f ISO-8859-1 -t utf-8 stadyn_largepagewithimages.html > test.html

And the number of terms indexed according to `balooshow -x` jumped from 129 to
2671! You win teh InterWebz. Now "Design" and "Principles" are indexed 🎉 ... but
words later in the file, like "SSLv3" and "CANPENDING", still are not. However,
after laboriously running strace --follow-forks on baloo_file, it seems some
child process (baloo_file_extractor?) does read the entire contents of the UTF-8
file. I'll try to research that problem more.

I also strace'd baloo_file on the original non-UTF-8 files, and some child
process does one 4096-byte read of the start of the file, then packs it in!
That's why baloo indexed so few terms in the original files; I filed bug 439857.
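
For anyone who wants to check their own files, here is a minimal Python sketch
that reports the offset of the first byte that is not valid UTF-8. It only
illustrates the encoding difference discussed above; it is not baloo's actual
detection logic, and the helper name and sample text are made up for
illustration.

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the offset of the first byte that is not valid UTF-8, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


# Hypothetical sample: Latin-1 text containing a bare 0xA9 (copyright sign).
latin1 = "Design Principles \u00a9 2021".encode("iso-8859-1")
print(first_invalid_utf8_offset(latin1))

# Same conversion as the iconv command quoted above, done in memory.
utf8 = latin1.decode("iso-8859-1").encode("utf-8")
print(first_invalid_utf8_offset(utf8))
```

To try it on a real file, pass open(path, "rb").read() instead of the sample
bytes.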

-- 
You are receiving this mail because:
You are watching all bug changes.