https://bugs.kde.org/show_bug.cgi?id=444520

--- Comment #8 from tagwer...@innerjoin.org ---
(In reply to Adam Fontenot from comment #7)
> Here's the original file that caused the problem:
> https://ipfs.io/ipfs/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC
You are right about the warning... also best not to open the link with a
browser that wants to render PDFs itself 8-]

Yes, 20MB, but the plot content is compressed; as plain text it could be
*very* much larger. It is titled "R Graphics Output", so maybe there's a
possibility to recognise such files - even though I'm sure "R" allows you to
set a title yourself.

> That's a fair point. Let me put it a different way. 
Good arguments...

> ... Perhaps an
> option to limit the size of the Baloo cache could be provided: either X GB
> or X% of free space. Given the available space, Baloo could manage its
> storage to not index files that are less usefully indexed. E.g. if there's
> one file that is 20 MB but using 2 GB of index space, it's going to be the
> first to go ...
I don't know "the internals" well enough to say. I do know that the underlying
library (LMDB) is designed to withstand normal desktop misuse (killing
processes, turning things off in the middle of an update). You can get times
when the index grows because a transaction is being appended while another
process is reading the index... Another design decision.

For the 20MB PDFs, it may be that indexing the first file generates a 2 GB
index but the second one only adds a few additional MB. There's no guessing
with edge cases...

> ... For example, biologists frequently use
> plain text "SAM" files, which contain long strings of meaningful but not
> indexable text, representing bits of DNA and metadata. E.g.
> "ATAGCACTCAAGCAATCAAATCAAATAGCCAACTCCTTATCTCAACTCTCC". These files might be
> under 10 MB, and they might have a .sam, .txt, or no extension at all.
In this case, I'd hope that SAM files have their own Mimetype (although it
looks like they don't... perhaps it's possible to build a rule if the files
follow the "Recommended Practice").
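
As a rough idea of what such a rule would test: the SAM specification makes the
@HD header line optional, but says it must be the first line when present, and
the "Recommended Practice" section says it should be present. The check below
is only a sketch under that assumption; looks_like_sam is a made-up helper, not
something Baloo or shared-mime-info provides, and a headerless SAM file would
slip past it.

```shell
#!/bin/sh
# Rough check, assuming the file follows the SAM "Recommended Practice"
# and therefore starts with an @HD header line.
# looks_like_sam is a hypothetical helper, not part of Baloo.
looks_like_sam() {
    # Read only the first line of the file given as $1.
    first_line=$(head -n 1 -- "$1") || return 1
    tab=$(printf '\t')
    # Succeed if the line begins with "@HD" followed by a tab.
    case "$first_line" in
        "@HD${tab}"*) return 0 ;;
        *)            return 1 ;;
    esac
}

# Usage: pass a file name as the first argument.
if [ -n "$1" ]; then
    if looks_like_sam "$1"; then
        echo "$1: looks like SAM"
    else
        echo "$1: no @HD header on the first line"
    fi
fi
```

Looking at the content rather than the name matters here because, as noted, the
files might end in .sam, .txt, or have no extension at all.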

I know the SAM files were just an example but if you _did_ want to index them,
you'd hit baloo's "25 character limit" (Bug 412421) :-/  See this with:

    $ echo "abcdefghijklmnopqrstuvwxyz" > testfile.txt

    $ balooshow -x testfile.txt
    13fc000000fc01 64513 1309696 testfile.txt [/home/user/Documents/testfile.txt]
            Mtime: 1637231394 2021-11-18T11:29:54
            Ctime: 1637231394 2021-11-18T11:29:54
            Cached properties:
                    Line Count: 1

    Internal Info
    Terms: Mplain Mtext T5 T8 X20-1 abcdefghijklmnopqrstuvwxy
    File Name Terms: Ftestfile Ftxt
    XAttr Terms:
    lineCount: 1

    $ baloosearch abc
    /home/user/Documents/testfile.txt
    Elapsed: 0.31964 msecs

    $ baloosearch abcdefghijklmnopqrstuvwxyz
    Elapsed: 0.215223 msecs
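
The session above is consistent with a simple picture, sketched here as a toy
model and not as Baloo's actual code: assume each term is cut to 25 characters
when it is indexed, and that a query hits when it is a prefix of a stored term.
The names index_term and matches are invented for the illustration.

```shell
#!/bin/sh
# Toy model only: assumes terms are cut to 25 characters at index time
# and that a query matches when it is a prefix of a stored term.
# index_term and matches are hypothetical helpers, not Baloo commands.
LIMIT=25

index_term() {
    # Keep only the first $LIMIT characters of the word in $1.
    printf '%s' "$1" | cut -c1-"$LIMIT"
}

matches() {
    # $1 = query, $2 = stored term; succeed if the query is a prefix.
    case "$2" in
        "$1"*) return 0 ;;
        *)     return 1 ;;
    esac
}

stored=$(index_term "abcdefghijklmnopqrstuvwxyz")
echo "stored term: $stored (${#stored} characters)"

for query in abc abcdefghijklmnopqrstuvwxyz; do
    if matches "$query" "$stored"; then
        echo "$query: hit"
    else
        echo "$query: no hit"
    fi
done
```

In this model the 26-character query can never hit, because it is longer than
anything that was stored, while any shorter prefix such as "abc" still does.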

So there are compromises here as well... In a way it's a question of what you
mean by "just works"...
