https://bugs.kde.org/show_bug.cgi?id=434926

--- Comment #13 from nyanpasu64 <nyanpas...@tuta.io> ---
After renaming my corrupted database to data.mdb (and keeping a backup copy), I
decided to try checking if the corruption occurred in Baloo's memory or if the
database was already corrupt on-disk. It's corrupt on-disk.

 > mdb_dump -s documenttimedb .|pv>/dev/null
    mdb.c:5856: Assertion 'IS_BRANCH(mc->mc_pg[mc->mc_top])' failed in mdb_cursor_sibling()
    10.1MiB 0:00:00
    fish: Process 127289, 'mdb_dump' from job 1, 'mdb_dump -s documenttimedb .|pv…' terminated by signal SIGABRT (Abort)
 > mdb_dump -a .|pv>/dev/null
    mdb.c:5856: Assertion 'IS_BRANCH(mc->mc_pg[mc->mc_top])' failed in mdb_cursor_sibling()
    68.7MiB 0:00:00

Running gdb on both `mdb_dump -s documenttimedb . -f /dev/null` and `mdb_dump
-a . -f /dev/null`, I found that the bad page that triggers the crash (a
sibling of other midway pages, but holding block-like data) occurs at a
*different* file/mmap offset (0x03CAB000) than my initial Baloo crash
(0x5925000)! Are the two corrupt pages similar? Somewhat:

- 0x03CAB000 comes after data containing strings like "Fpathpercent" and
"Fpathetique", which I believe was created by
TermGenerator::indexFileNameText() inserting "F" + (words appearing in
filenames) into LMDB. The page starting at 0x03CAB000 itself has a weak 10-byte
periodicity. The 32-bit integer 0x00003CAB (page address >> 12) appears 10
times in the database file.

- 0x5925000 exists in a region with a strong 10-byte periodicity both before
and after the page starts. The 32-bit integer 0x00005925 appears a whopping 307
times in the file!

Seeking Audacity to offset 93474816, I see data with a periodicity of 10. Valid
page headers show 2-byte periodicities. This pointer doesn't point to page
metadata! Either the page contents were overwritten or never written, or the
page pointer was written incorrectly.

I haven't tried modifying LMDB to scan the *entire* database, continuing on
errors, and logging *all* data inconsistencies. I think that would help gather
more data to understand what kind of corruption is happening.

(In reply to tagwerk19 from comment #12)
> I'm not so sure how/when baloo_file recognises when the index is being
> "read" and therefore has to append instead of update however it's clear that
> this is happening is you look at Bug 437754 (where you see that a "balooctl
> status", which seems to enumerate files to be indexed, means that updates
> are "appends" and the index grows dramatically).
https://schd.ws/hosted_files/buildstuff14/96/20141120-BuildStuff-Lightning.pdf
describes page reclamation. Of note:
> LMDB maintains a free list tracking the IDs of unused pages
> Old pages are reused as soon as possible, so data volumes don't grow without 
> bound
And if you get this code wrong, it's a fast path to data corruption.

If I understand correctly, a write transaction never erases the pages it stops
using itself; it can only reuse pages abandoned by an *earlier* write
transaction, and only if no active reader predates that earlier transaction's
commit. So an active read transaction, which I assume snapshots the root page
and relies on writers not overwriting the tree it references, prevents writers
from reusing pages freed by *all* writes which commit after the read
transaction started.

So yeah, a long-running read transaction causes freed-but-unreclaimable pages
to pile up. And since the PDF says "No compaction or garbage collection phase
is ever needed", I suspect Baloo's index file size will *never* decrease, even
if data gets freed (e.g. by closing a long-running read transaction, excluding
folders from indexing, deleting files, or turning off content indexing). This
is... suboptimal.
> > ... I don't know who wrote the corrupted file
> I know there was a flood of "corruption" reports (Bug 389848). This issue
> was found but the fix left the index corrupt and it became normal to
> recommend purging and rebuilding the index (Bug 431664). Yes, still quite a
> while ago and the number of these reports is dropping away but it did
> resurface when people upgraded from Debian 10 to 11 (which was only the end
> of last year)
Reading https://www.openldap.org/lists/openldap-devel/201710/msg00019.html, I'm
scared of yet another category of corruption: corrupting in-memory data queued
in a write transaction, *before being committed to disk*!

Does baloo_file have any memory corruption bugs overwriting data with a 10-byte
stride? I don't know!
> Interesting in that baloo "batches up" its content indexing work (where it
> analyses 40 files at a time and writes the results to the index) however it
> deals with the initial scan of files it needs to index in a single tranche;
> give it a hundred thousand files it needs to index, it will collect the
> information for all of them and write the results to the index in one go.
> This can be pretty horrible (see Bug 394750)
> 
> No reason that this is a cause but it is a behaviour that might raise the
> stakes...
This could be fixed separately, I assume.
> > ... evaluate the performance differences
> One of the joys of baloo is it's amazing speed, that you can type a search
> string and see the results refine themselves on screen.
https://github.com/LumoSQL/LumoSQL claims LMDB is still somewhat faster than
SQLite's standard engine (though SQLite is catching up). I trust LMDB less to
avoid corrupting data though.
