Hey Joel, thanks for chiming in!
Regarding dependencies: while it's possible to provide
pluggable interfaces, the issue I'm concerned about is
conflicting versions of transitive dependencies at
runtime. For example, I used a Java agent that had a
different version of snakeyaml, and it ended up breaking
C*'s startup sequence [1]. I suggest putting external
modules on separate threads with their own classpath to
avoid this issue.
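To make the "own thread, own classpath" idea concrete, here is a minimal sketch using only JDK classes. The names `IsolatedPluginLoader`, `forJars` and `startIsolated` are made up for illustration and are not Cassandra or CEP-36 APIs. The assumption is that parenting the plugin's class loader to the platform loader (Java 9+) is enough to keep the node's application classpath, and whatever snakeyaml version lives there, invisible to the plugin.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;

public final class IsolatedPluginLoader {
    private IsolatedPluginLoader() {}

    /**
     * Builds a class loader that sees only the given jars plus the JDK itself.
     * The parent is the platform loader, not the application loader, so classes
     * on the host application's classpath are not visible through it.
     */
    public static URLClassLoader forJars(Path... jars) throws MalformedURLException {
        URL[] urls = new URL[jars.length];
        for (int i = 0; i < jars.length; i++) {
            urls[i] = jars[i].toUri().toURL();
        }
        return new URLClassLoader(urls, ClassLoader.getPlatformClassLoader());
    }

    /**
     * Runs the task on a dedicated daemon thread whose context class loader is
     * the isolated loader, so ServiceLoader-style lookups made on that thread
     * resolve against the plugin's own jars.
     */
    public static Thread startIsolated(ClassLoader loader, Runnable task) {
        Thread thread = new Thread(task, "storage-plugin");
        thread.setContextClassLoader(loader);
        thread.setDaemon(true);
        thread.start();
        return thread;
    }
}
```

This only isolates class resolution. A Java agent that modifies the bootstrap or system classpath would still bypass it, so it is a sketch of the direction rather than a full answer to [1].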
I think there's quite a bit of overlap between the two
desires expressed in this thread, even though they achieve
very different results. I personally can't see myself
using something that treats an object store as cold
storage where SSTables are moved (implying they weren't
there before), and I've expressed my concerns with this,
but other folks seem to want it and that's OK. I feel
very strongly that treating local storage as a cache with
the full dataset on object store is a better approach, but
ultimately different people have different priorities.
Either way, stuff is moved to object store at some point,
and pulled to the local disk on demand.
I am *firmly* of the position that this CEP should not
exclude the local-storage-as-cache option, and that the
option should be accounted for in the design.
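For anyone who wants the cache model spelled out, here is a rough read-through sketch. Everything in it is hypothetical: `LocalDiskCache`, `ObjectStore`, `get` and `evict` are names invented for this example, not proposed CEP-36 interfaces, and it assumes flat keys with no path separators. The point is only that the object store holds the full dataset, a local miss triggers a fetch, and eviction never loses data.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public final class LocalDiskCache {
    /** Hypothetical minimal view of an object store that holds the full dataset. */
    public interface ObjectStore {
        InputStream open(String key) throws IOException;
    }

    private final Path cacheDir;
    private final ObjectStore store;

    public LocalDiskCache(Path cacheDir, ObjectStore store) {
        this.cacheDir = cacheDir;
        this.store = store;
    }

    /** Returns a local path for the key, pulling it from the object store on a miss. */
    public synchronized Path get(String key) throws IOException {
        Path local = cacheDir.resolve(key);
        if (Files.exists(local)) {
            return local; // cache hit: no remote call
        }
        // Download to a temp file first so readers never see a partial file.
        Path tmp = Files.createTempFile(cacheDir, "fetch-", ".tmp");
        try (InputStream in = store.open(key)) {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            Files.move(tmp, local, StandardCopyOption.ATOMIC_MOVE);
        } finally {
            Files.deleteIfExists(tmp); // no-op after a successful move
        }
        return local;
    }

    /** Drops the local copy only; the object store still has the data. */
    public synchronized boolean evict(String key) throws IOException {
        return Files.deleteIfExists(cacheDir.resolve(key));
    }
}
```

A real design would also need a size-bounded eviction policy and a write path that uploads before acknowledging, neither of which is shown here.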
Jon
[1] https://issues.apache.org/jira/browse/CASSANDRA-19663
On Thu, Mar 6, 2025 at 10:31 AM Joel Shepherd
<sheph...@amazon.com> wrote:
On 3/6/2025 7:16 AM, Jon Haddad wrote:
Assuming everything else is identical, might not
matter for S3. However, not every object store has a
filesystem mount.
Regarding sprawling dependencies, we can always make
the provider specific libraries available as a
separate download and put them on their own thread
with a separate class path. I think in-JVM dtest does
this already. Someone just started asking about IAM
for login, it sounds like a similar problem.
That was me. :-) Cassandra's auth already has fairly
well defined interfaces and a plug-in mechanism, so
it's easy to vend alternative auth solutions without
polluting the main project's dependency graph, at
build-time anyway. A similar approach could be
beneficial for CEP-36, particularly (IMO) for
cold-storage purposes. I suspect decoupling pluggable
alternate channel proxies for cold storage from
configurable alternate channel proxies for redirecting
data locally to free up space, migrate to a different
storage device, etc., would make both easier. The CEP
seems to be trying to do both, but they smell like
pretty different goals to me.
Thanks -- Joel.
On Thu, Mar 6, 2025 at 12:53 AM Benedict
<bened...@apache.org> wrote:
I think another way of saying what Stefan may be
getting at is what does a library give us that an
appropriately configured mount dir doesn’t?
We don’t want to treat S3 the same as local disk,
but this can be achieved easily with config. Is
there some other benefit of direct integration?
Well defined exceptions if we need to distinguish
cases is one that maybe springs to mind but
perhaps there are others?
On 6 Mar 2025, at 08:39, Štefan Miklošovič
<smikloso...@apache.org> wrote:
That is cool but this still does not show or
explain how it would look when it comes to the
dependencies needed for actually talking to
storage like S3.
Maybe I am missing something here, so please
correct me if I am mistaken, but if I understand
correctly, for talking to S3 we would need
to use a library like this, right? (1). So that
would be added among Cassandra dependencies?
Hence Cassandra starts to be biased toward S3?
Why S3? Every time somebody comes up with a new
remote storage support, that would be added to
classpath as well? How are these dependencies
going to play with each other and with Cassandra
in general? Will all these storage
provider libraries for arbitrary clouds be even
compatible with Cassandra licence-wise?
I am sorry I keep repeating these questions but
this is the part that I just don't get at all.
We can indeed add an API for this, sure sure,
why not. But for people who do not want to deal
with this at all and are OK with a mounted
FS, why would we block them from doing that?
(1)
https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-s3/pom.xml
On Wed, Mar 5, 2025 at 3:28 PM Mick Semb Wever
<m...@apache.org> wrote:
It’s not an area where I can currently
dedicate engineering effort. But if
others are interested in contributing a
feature like this, I’d see it as
valuable for the project and would be
happy to collaborate on
design/architecture/goals.
Jake mentioned 17 months ago a custom
FileSystemProvider we could offer.
None of us at DataStax has gotten around to
providing that, but to quickly throw
something over the wall this is it:
https://github.com/datastax/cassandra/blob/main/src/java/org/apache/cassandra/io/storage/StorageProvider.java
(with a few friend classes under
o.a.c.io.util)
We then have a RemoteStorageProvider,
private in another repo, that implements
that and also provides the
RemoteFileSystemProvider that Jake refers to.
Hopefully that's a start to get people
thinking about CEP-level details, while we
get a cleaned-up abstraction of
RemoteStorageProvider and friends to offer.
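As a rough illustration of why a custom FileSystemProvider is attractive, the sketch below writes and reads through `java.nio.file.Path` without knowing which provider backs it. It uses the JDK's built-in zip provider purely as a stand-in for a remote provider. `ProviderAgnosticIo`, `roundTrip` and `viaZipProvider` are invented names, and nothing here reflects the actual StorageProvider or RemoteFileSystemProvider code.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

public final class ProviderAgnosticIo {
    private ProviderAgnosticIo() {}

    /** Writes and reads back through whichever FileSystemProvider backs the path. */
    public static String roundTrip(Path file, String payload) throws IOException {
        Files.writeString(file, payload, StandardCharsets.UTF_8);
        return Files.readString(file, StandardCharsets.UTF_8);
    }

    /**
     * Same caller code, different provider: the "jar" scheme selects the JDK's
     * zip file system provider, standing in here for a remote provider.
     */
    public static String viaZipProvider(Path zipFile, String payload) throws IOException {
        URI uri = URI.create("jar:" + zipFile.toUri());
        try (FileSystem fs = FileSystems.newFileSystem(uri, Map.of("create", "true"))) {
            return roundTrip(fs.getPath("/component.txt"), payload);
        }
    }
}
```

Calling `roundTrip` with an ordinary local path and calling `viaZipProvider` run the same I/O code. The provider behind the path is the only thing that changes, which is the property a remote provider would rely on.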