https://bugs.kde.org/show_bug.cgi?id=394750
tagwer...@innerjoin.org changed:

           What            |Removed     |Added
----------------------------------------------------------------------------
             Ever confirmed|0           |1
                     Status|REPORTED    |CONFIRMED

--- Comment #13 from tagwer...@innerjoin.org ---
(In reply to Thaddee Tyl from comment #0)
> ... My setup is barely unusual ...
May I read that as "My setup is fairly unusual"? 8-)

> ... I have some directories with a million small files (records of
> Go games obtained from this command:
> https://github.com/espadrine/badukjs/blob/master/Makefile#L13)
Wow... Maybe some time has gone by and the number of recorded games has
crept up, but I've just downloaded and unpacked nearly 2 million .sgf files
(that end up in a single, flat directory). That's going to be a torture
test!

First off: yes, I see the described behaviour. baloo_file fills RAM and
disk for hours with no visible progress.

This is with the current Neon Unstable...
    Plasma:     5.22.80
    Frameworks: 5.85.0
    Qt:         5.15.3
    Filesystem: Ext4

This hadn't been marked "Confirmed" but, yes, it is reproducible.

Digging down into the "torture test": extracting the files from the tar
archives overwhelms inotify. Baloo reports

    Inotify - too many event - Overflowed

Baloo attempts to index the files for which it got a notification, but it
will only discover "the remainder" on a "balooctl check" or on the next
logon.

I see baloo_file running at 100% CPU and with steadily growing memory use.
It is listing all the files it will need to index (it has not got as far as
indexing content). However, I see the same behaviour with content indexing
disabled, so it is an issue with baloo_file and not baloo_file_extractor.

It seems that baloo_file wants to build the list of unindexed files as a
single transaction. "balooctl check" does not show anything happening; the
information is being collected but not appearing on disc.
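For reference, the workload above can be reproduced at a smaller scale with a sketch like the following (a hypothetical helper, not part of Baloo; the file names and the minimal SGF payload are made up for illustration):

```python
import os

def make_flat_tree(root, count, payload=b"(;GM[1]FF[4]SZ[19])\n"):
    """Create `count` tiny .sgf-like files directly under `root` (flat, no subdirs)."""
    os.makedirs(root, exist_ok=True)
    for i in range(count):
        # zero-padded names keep the flat directory listing stable and sortable
        with open(os.path.join(root, f"game{i:07d}.sgf"), "wb") as f:
            f.write(payload)
```

Scaling `count` up towards 2 million and pointing `root` at a location inside an indexed directory should overflow the inotify event queue during creation, matching the behaviour described above.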
Testing on a VM with 16 GB RAM, I could index 1.4 million files (it took
almost an hour, without content indexing); it was possible to see the
memory use creeping up during the process, with the results committed to
disc right at the end. With the full 2 million files, it filled RAM and
swap in 90 minutes and baloo_file hung, with what looked like a
corrupt/truncated index written to disc (the file size of the index was the
size of RAM. Interesting, but maybe a coincidence).

It was possible to index the full 2 million files if they were copied "in
batches" into an indexed directory and baloo_file was allowed to catch up
after each copy.

I think there is something to be fixed here... When baloo indexes content,
it does it in batches of files (40 files, then the next 40, and so on) and
commits the results after each batch. It would make sense to batch the
initial indexing as well, with something like a commit every 15 seconds
perhaps. That would also allow people to see that something was happening
with "balooctl status".

More speculatively... The 40-file batch size for content indexing is very,
very low for the small .sgf files; the full text index would take days
(weeks?) to complete. This limit can shrink; maybe it should be allowed to
grow as well.

I'd place the baloo_file and baloo_file_extractor issues into different
pigeon holes here.
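The batched-commit idea above could look something like this. This is a minimal sketch of the concept only, not Baloo's actual code; the `BatchedIndexer` class, the commit callback, and the size/time limits are all assumptions for illustration:

```python
import time

class BatchedIndexer:
    """Stage discovered files and commit in bounded batches instead of
    one giant transaction: flush when a size cap is hit or a time budget
    (e.g. 15 seconds) has elapsed since the last commit."""

    def __init__(self, commit, max_batch=10_000, max_interval=15.0):
        self._commit = commit          # callback that writes one batch to disk
        self.max_batch = max_batch
        self.max_interval = max_interval
        self._pending = []
        self._last_commit = time.monotonic()

    def add(self, path):
        self._pending.append(path)
        due = (len(self._pending) >= self.max_batch or
               time.monotonic() - self._last_commit >= self.max_interval)
        if due:
            self.flush()

    def flush(self):
        if self._pending:
            self._commit(self._pending)  # one small, bounded transaction
            self._pending = []
        self._last_commit = time.monotonic()

committed = []
idx = BatchedIndexer(committed.append, max_batch=3, max_interval=15.0)
for p in ["a.sgf", "b.sgf", "c.sgf", "d.sgf"]:
    idx.add(p)
idx.flush()
print(committed)  # → [['a.sgf', 'b.sgf', 'c.sgf'], ['d.sgf']]
```

Because each commit is bounded, memory use stays flat regardless of how many files remain to be discovered, and "balooctl status" would show progress after every flush rather than only at the very end.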