https://bugs.kde.org/show_bug.cgi?id=380456

--- Comment #32 from tagwer...@innerjoin.org ---
(In reply to Adam Fontenot from comment #31)
> I've given up on testing for the time being.
That's OK, thank you for your efforts.

> ... I find it hard to believe that any
> reasonable database format would need to write 150+ GB to disk to delete
> entries from a database that was only 8 GB ...
I'm sure it's possible to come up with a better explanation, with less
handwaving, more detail and probably more accuracy, but...

The way that Baloo provides results for searches so quickly is that it jumps to
the word in the database and pulls a page from disk that lists all the files
that word appears in. When you index a file, you extract a list of words,
look up each word in the database, get the list of files it appears in, insert
this new file (as an ID rather than a filename) into the list and save it back.
Or rather, it saves it in memory, waiting for a commit...
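In rough code it looks something like the toy model below. This is only a
sketch of the access pattern, not Baloo's actual LMDB code; the names and the
in-memory std::map standing in for the on-disk index are made up for
illustration.

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

using PostingsList = std::vector<uint64_t>;         // sorted file IDs
using Index = std::map<std::string, PostingsList>;  // word -> files it appears in

// Indexing one file: for every word it contains, fetch that word's
// postings list, insert this file's ID, and store the list back.
void indexFile(Index& index, uint64_t fileId,
               const std::vector<std::string>& words)
{
    for (const auto& word : words) {
        PostingsList list = index[word];                               // "read the page"
        auto pos = std::lower_bound(list.begin(), list.end(), fileId);
        if (pos == list.end() || *pos != fileId)
            list.insert(pos, fileId);                                  // add this file's ID
        index[word] = std::move(list);                                 // "write it back"
    }
}

int main()
{
    Index index;
    indexFile(index, 42, {"report", "budget", "the"});
    indexFile(index, 43, {"letter", "the"});
    std::cout << "\"the\" appears in " << index["the"].size() << " files\n";
}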

The same process happens in reverse when you delete a file: for each word, the
list is read, the ID removed and the list written back. For common words, these
lists can be *large* (and there's position information to be considered as
well, so you can look for phrases as well as words).

Baloo_file_extractor sidesteps the problem by dealing with 40 files at once:
the lists are not read and written back (committed) after each file; instead,
the somewhat extended lists are written back after indexing 40 files.
Baloo_file ought to do the same for deletes; I think it would make a difference.
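Continuing the toy model from above (again, not the real code; only the 40-file
batch size comes from what baloo_file_extractor actually does), batching
deletes means each touched postings list is rewritten once per batch instead of
once per deleted file:

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

using PostingsList = std::vector<uint64_t>;
using Index = std::map<std::string, PostingsList>;

// Remove fileId from one word's postings list.
void removeFromList(PostingsList& list, uint64_t fileId)
{
    list.erase(std::remove(list.begin(), list.end(), fileId), list.end());
}

// Delete files in batches: all lists touched by the batch are updated in
// memory first, then "committed" (written back) once per batch. Returns
// how many list write-backs that costs.
size_t deleteFiles(Index& index,
                   const std::map<uint64_t, std::vector<std::string>>& files,
                   size_t batchSize)
{
    size_t listWrites = 0;
    std::set<std::string> dirtyWords;   // lists touched in this batch
    size_t inBatch = 0;

    for (const auto& [fileId, words] : files) {
        for (const auto& word : words) {
            removeFromList(index[word], fileId);
            dirtyWords.insert(word);
        }
        if (++inBatch == batchSize) {            // "commit" the batch
            listWrites += dirtyWords.size();
            dirtyWords.clear();
            inBatch = 0;
        }
    }
    listWrites += dirtyWords.size();             // final partial batch
    return listWrites;
}

int main()
{
    // Every file contains the common word "the" plus one unique word.
    auto makeFiles = [] {
        std::map<uint64_t, std::vector<std::string>> files;
        for (uint64_t id = 0; id < 400; ++id)
            files[id] = {"the", "unique" + std::to_string(id)};
        return files;
    };
    auto makeIndex = [&] {
        Index index;
        for (const auto& [id, words] : makeFiles())
            for (const auto& w : words)
                index[w].push_back(id);
        return index;
    };

    Index a = makeIndex(), b = makeIndex();
    std::cout << "list writes, commit per file:  "
              << deleteFiles(a, makeFiles(), 1) << "\n";   // 800
    std::cout << "list writes, 40-file batches:  "
              << deleteFiles(b, makeFiles(), 40) << "\n";  // 410
}

With one commit per file, every deleted file that contains a common word forces
that word's (large) list to be rewritten again; with 40-file batches it is
rewritten once per batch, which is where the saving comes from.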

> This is more of a question, but what is the *intent* behind the 512MB memory
> limit? I think that's an entirely reasonable upper bound on a file indexer,
> personally, but I'm not sure what's supposed to happen when indexing some file
> would cause the indexer to exceed that. It is supposed to intelligently skip
> the file? Crash and then continue with another file? Hang entirely and no
> longer make progress?
The 512M is a somewhat arbitrary external constraint and is probably OK in the
majority of cases. The intent was to stop Baloo competing for memory with the
rest of the system. As a technique it works very well; it's just that the
chosen limit is too tight, in my view. The bugs that arrive here are the tough
cases where Baloo really does need more space, and in a lot of these cases,
setting a higher limit works.

From my experience, when Baloo approaches the limit, it starts dropping "clean"
pages and rereading them when needed. What you see is a lot more reads. When it
is indexing and is building a very large transaction, where it cannot drop
dirty pages, it can swap (which is bad news), or the kernel responds ever more
slowly when Baloo asks for more memory (which is bad news), or eventually the
process is killed OOM (which is bad news).

As to whether Baloo knows when it is hitting the limit, I think not; the limit
is external to the code. Whether it *can* know, that's interesting, but I don't
know.

> I'm willing to re-test if someone wants to provide a patch to batch up deletes
> in baloo_file.
Thank you!
