https://bugs.kde.org/show_bug.cgi?id=394750

tagwer...@innerjoin.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
             Status|REPORTED                    |CONFIRMED

--- Comment #13 from tagwer...@innerjoin.org ---
(In reply to Thaddee Tyl from comment #0)
> ... My setup is barely unusual ...
May I read that as "My setup is fairly unusual"? 8-)

> ... I have some directories with a million small files (records of
> Go games obtained from this command:
> https://github.com/espadrine/badukjs/blob/master/Makefile#L13)

Wow...

Maybe some time has gone by and the number of recorded games has crept up but
I've just downloaded and unpacked nearly 2 million .sgf files (that end up in a
single, flat, directory).

That's going to be a torture test!

First off: yes, I see the described behaviour:

    baloo_file fills RAM and disk for hours with no visible progress

This is with the current Neon Unstable...

    Plasma: 5.22.80
    Frameworks: 5.85.0
    Qt: 5.15.3
    Filesystem: Ext4 

This hadn't been marked "Confirmed" but, yes, reproducible...

Digging down into the "torture test": extracting the files from the tar
archives overwhelms inotify. Baloo reports

    Inotify - too many event - Overflowed

Baloo attempts to index the files for which it gets a notification, but it will
only discover "the remainder" on a "balooctl check" or on the next logon.
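
For reference, the kernel drops inotify events once a watcher's queue is full,
and on Linux the queue length is a tunable under /proc/sys/fs/inotify. A minimal
sketch for reading the current limits follows; the function name is made up for
illustration, and whether raising max_queued_events would help in this case is
an assumption that was not tested here.

```python
from pathlib import Path

# Standard Linux location of the inotify tunables.
INOTIFY_DIR = Path("/proc/sys/fs/inotify")


def read_inotify_limits(base=INOTIFY_DIR):
    """Return the inotify limits found under `base` as a dict of ints.

    `read_inotify_limits` is a hypothetical helper, not part of Baloo.
    Tunables that are missing (e.g. on a non-Linux system) are skipped.
    """
    limits = {}
    for name in ("max_queued_events", "max_user_watches", "max_user_instances"):
        path = Path(base) / name
        if path.is_file():
            limits[name] = int(path.read_text().strip())
    return limits


if __name__ == "__main__":
    for name, value in read_inotify_limits().items():
        print(f"{name} = {value}")
```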

I see "baloo_file" running at 100% and with steadily growing memory use. It's
listing all the files it will need to index (it's not got as far as indexing
content). However I see the same behaviour with content indexing disabled, so
it is an issue with baloo_file and not baloo_file_extractor.

It seems that baloo_file wants to build the list of unindexed files as a single
transaction. "balooctl check" does not show anything happening; the information
is being collected but not appearing on disc.

Testing on a VM with 16 GB RAM, I could index 1.4 million files (it took almost
an hour, without content indexing) and it was possible to see the memory use
creeping up during the process and the results committed to disc right at the
end.

With the full 2 million files, it filled RAM and swap in 90 minutes and
baloo_file hung with what looked like a corrupt/truncated index written to disc
(the size of the index file matched the size of RAM; interesting, but maybe a
coincidence).

It was possible to index the full 2 million files if they were copied "in
batches" into an indexed directory and baloo_file allowed to catch up after
each copy.
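
The "in batches" workaround can be sketched roughly as below. This is an
illustration only: copy_in_batches and the wait_for_indexer hook are made-up
names, and how to decide that baloo_file has caught up (for example by watching
"balooctl status") is left to the caller.

```python
import itertools
import os
import shutil


def copy_in_batches(src_dir, dst_dir, batch_size, wait_for_indexer):
    """Copy regular files from src_dir to dst_dir, batch_size at a time.

    After each batch, wait_for_indexer() is called; it is a hypothetical
    hook that should return once the indexer has caught up.
    Returns the number of files copied.
    """
    os.makedirs(dst_dir, exist_ok=True)
    copied = 0
    with os.scandir(src_dir) as entries:
        # Stream the directory listing instead of building one huge list.
        files = (e for e in entries if e.is_file())
        while True:
            batch = list(itertools.islice(files, batch_size))
            if not batch:
                break
            for entry in batch:
                shutil.copy2(entry.path, os.path.join(dst_dir, entry.name))
                copied += 1
            wait_for_indexer()
    return copied
```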

I think there is something to be fixed here...

    When baloo is indexing content it does it with batches of files
    (40 files, then the next 40 and so on) and commits the results after
    each. It would make sense to batch the initial indexing too, something
    like a commit every 15 seconds perhaps. That would also allow people
    to see that something was happening with "balooctl status".
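
As a rough illustration of the "commit every 15 seconds" idea (not Baloo's
code; the class name and the commit callback are hypothetical), a writer could
collect entries and flush them whenever the interval has elapsed, so memory use
stays bounded and progress becomes visible:

```python
import time


class TimedBatchWriter:
    """Collect items and commit them whenever `interval` seconds have passed."""

    def __init__(self, commit, interval=15.0, clock=time.monotonic):
        self._commit = commit        # hypothetical callback: writes one batch
        self._interval = interval    # seconds between commits (assumed value)
        self._clock = clock          # injectable so the logic can be tested
        self._pending = []
        self._last_commit = clock()

    def add(self, item):
        """Queue one item and commit if the interval has elapsed."""
        self._pending.append(item)
        if self._clock() - self._last_commit >= self._interval:
            self.flush()

    def flush(self):
        """Commit whatever is pending and restart the interval."""
        if self._pending:
            self._commit(self._pending)
            self._pending = []
        self._last_commit = self._clock()
```

The caller still has to call flush() once at the end so the last partial batch
is not lost.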

More speculatively...

    The "40 file" batch size for content indexing is very, very low for
    the small .sgf files; the full text index would take days (weeks?) to
    complete. This limit can shrink, maybe it should be allowed to grow
    as well.
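
One possible shape for a batch size that can grow as well as shrink is sketched
below. It is a generic time-based heuristic, not how Baloo adjusts its limit;
the function name, the target time and the bounds are all assumptions.

```python
def next_batch_size(current, elapsed, target=1.0, minimum=1, maximum=4096):
    """Return the batch size to use next, given how long the last batch took.

    elapsed and target are in seconds. The size is halved when a batch was
    much slower than the target and doubled when it was much faster.
    """
    if elapsed > 2 * target:
        return max(minimum, current // 2)
    if elapsed < target / 2:
        return min(maximum, current * 2)
    return current
```

With many tiny files each batch finishes quickly, so the size would climb from
40 towards the maximum; a batch of large documents would push it back down.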

I'd place the baloo_file and baloo_file_extractor issues into different
pigeonholes here.
