Right, exactly. Which (I think) makes the object store about as valuable
as an ephemeral disk if you don't keep everything on there. It's a
tradeoff I'd never accept given the cost / benefit.
Does that mean you agree that we should focus on writethrough cache mode
first?
Jon
On Fri, Apr 11, 202
> On Apr 11, 2025, at 1:15 PM, Jon Haddad wrote:
>
>
> I also keep running up against my concern about treating the object store as a
> write-back cache instead of write-through. "Tiering" data off has real
> consequences for the user, the big one being data loss, especially with
> regards to
I've been thinking about this a bit more recently, and I think Joey's
suggestion about improving the yaml-based disk configuration is a better
first step than what I wrote (table definition), for a couple of reasons.
1. Attaching it to the schema means we need to have the disk configuration
as part o
I really like the data directories and replication configuration. I think
it makes a ton of sense to put it in the yaml, but we should probably yaml
it all, and not nest JSON :), and we can probably simplify it a little with
a file uri scheme, something like this:
data_file_locations:
disk:
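To make the shape concrete, here is a purely hypothetical sketch of that
file-uri idea; the key names are invented for illustration and are not the
actual proposal:

    data_file_locations:
      - uri: file:///var/lib/cassandra/data    # plain local disk
      - uri: s3://my-bucket/cassandra/data     # object store; scheme assumed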
That is cool, but this still does not show / explain what it would look like
when it comes to the dependencies needed for actually talking to storage like
S3.
Maybe I am missing something here, and please explain if I am mistaken, but
if I understand that correctly, for talking to S3 we would need to u
Jon, I like where you are headed with that, just brainstorming out what the
end interface might look like (might be getting a bit ahead of things
talking about directories if we don't even have files implemented yet).
What do folks think about pairing data_file_locations (fka
data_file_directories)
Taking this a step further, this opens up a different way of
bootstrapping new nodes, by using the object store's native copy commands.
This is something that's impossible when just using the filesystem mount.
I think Joey mentioned something along these lines to me several years ago,
maybe at t
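To illustrate what "native copy" could mean in practice, here is a hedged
Java sketch using the AWS SDK v2 server-side copy, so the sstable bytes never
pass through either Cassandra process; the bucket name and per-node key
layout are invented assumptions:

    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.CopyObjectRequest;

    // Hypothetical bootstrap step: ask S3 to copy one sstable object from
    // the source node's prefix to the new node's prefix. Bucket and key
    // names are invented for illustration.
    public class NativeCopyBootstrapSketch {
        public static void main(String[] args) {
            try (S3Client s3 = S3Client.create()) {
                s3.copyObject(CopyObjectRequest.builder()
                        .sourceBucket("cassandra-data")
                        .sourceKey("node-a/ks1/tbl1/nb-1-big-Data.db")
                        .destinationBucket("cassandra-data")
                        .destinationKey("node-b/ks1/tbl1/nb-1-big-Data.db")
                        .build());
            }
        }
    }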
Thanks Jordan and Joey for the additional info.
One thing I'd like to clarify - what I'm mostly after is 100% of my data on
object store, local disk acting as an LRU cache, although there's also a
case for the mirror.
What I see so far are three high level ways of running this:
1. Mirror Mode
Th
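As a toy illustration of the "100% in the object store, local disk as an LRU
cache" model (all names below are invented, Java):

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.function.Function;

    // Toy read path: a hit is served from local disk, a miss is fetched
    // from the object store and cached, and the least-recently-used entry
    // is evicted once the local budget is exceeded. Eviction is safe since
    // the authoritative copy always remains in the object store.
    class LocalDiskLruSketch<K, V> {
        private final Map<K, V> localCache;

        LocalDiskLruSketch(int maxLocalEntries) {
            // accessOrder=true gives least-recently-used iteration order
            this.localCache = new LinkedHashMap<K, V>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                    return size() > maxLocalEntries;
                }
            };
        }

        V read(K key, Function<K, V> fetchFromObjectStore) {
            return localCache.computeIfAbsent(key, fetchFromObjectStore);
        }
    }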
Great discussion - I agree strongly with Jon's points; giving operators
this option will make many operators' lives easier. Even if you still have
to have 100% disk space to meet performance requirements, that's still much
more efficient than you can run C* with just disks (as you need to leave
hea
Because an earlier reply hinted that mounting a bucket yields "terrible
results". That has moved the discussion, in my mind, practically to the
place of "we are not going to do this", to which I explained that in this
particular case I do not find the speed important, because the use cases
you want
I too initially felt we should just use mounts and was excited by e.g.
Single Zone Express mounting. As Cheng mentioned, we tried it…and the
results were disappointing (except for use cases that could sometimes
tolerate seconds of p99 latency). That brought me around to needing an
implementation we ow
Supporting a filesystem mount is perfectly reasonable. If you wanted to use
that with the S3 mount, there's nothing that should prevent you from doing
so, and the filesystem version is probably the default implementation that
we'd want to ship with since, to your point, it doesn't require additiona
If that's not your intent, then you should be more careful with your
replies. When you write something like this:
> While this might work, what I find tricky is that we are forcing this to
users. Not everybody is interested in putting everything to a bucket and
server traffic from that. They just
I have explained multiple times (1) that I don't have anything against what
is discussed here.
Having questions about what that is going to look like does not mean I am
dismissive.
(1) https://lists.apache.org/thread/ofh2q52p92cr89wh2l3djsm5n9dmzzsg
On Fri, Mar 7, 2025 at 5:44 PM Jon Haddad wro
The only way I see that working is that, if everything was in a bucket and
you took a snapshot, these SSTables would be "copied" from the live data dir
(living in a bucket) to the snapshots dir (living in a bucket). Basically, we
would need to say "and if you go to take a snapshot on this table, instead
of
Nobody is saying you can't work with a mount, and this isn't a conversation
about snapshots.
Nobody is forcing users to use object storage either.
You're making a ton of negative assumptions here about both the discussion,
and the people you're having it with. Try to be more open minded.
On Fr
Thank you very much, I'm certainly interested. I'll start working on the
update for CEP-36 next week.
Mick Semb Wever wrote on Friday, March 7, 2025 at 7:07 PM:
>
>
> On Thu, 6 Mar 2025 at 09:40, Štefan Miklošovič
> wrote:
>
>> That is cool, but this still does not show / explain what it would look
>> like when it c
On Thu, 6 Mar 2025 at 09:40, Štefan Miklošovič
wrote:
> That is cool, but this still does not show / explain what it would look like
> when it comes to the dependencies needed for actually talking to storage like
> S3.
>
As Benedict writes, dealing with optional dependencies is not hard (and as
Jon
BTW, snapshots are quite special because these are not "files", they are
just hard links. They "materialize" as regular files once the underlying
SSTables are compacted away. How are you going to hardlink from local
storage to object storage anyway? We will always need to "upload".
On Fri, Mar 7, 2
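A small Java illustration of that point: a snapshot stays cheap only while it
can be a hard link on the same local filesystem; an object-store-backed
provider will typically reject createLink, so the snapshot degrades into a
copy, i.e. an upload (sketch only):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    // Sketch: hard link when the filesystem supports it, otherwise fall
    // back to copying the bytes (which, for a remote store, is an upload).
    final class SnapshotLinkSketch {
        static void snapshot(Path liveSSTable, Path snapshotTarget) throws IOException {
            try {
                Files.createLink(snapshotTarget, liveSSTable); // O(1) on local disk
            } catch (UnsupportedOperationException e) {
                // remote providers generally cannot hard link
                Files.copy(liveSSTable, snapshotTarget, StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }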
Jon,
all "big three" support mounting a bucket locally. That being said, I do
not think that completely ditching this possibility for Cassandra working
with a mount, e.g. for just uploading snapshots there etc, is reasonable.
GCP
https://cloud.google.com/storage/docs/cloud-storage-fuse/quickstar
On 3/6/2025 7:16 AM, Jon Haddad wrote:
Assuming everything else is identical, it might not matter for S3.
However, not every object store has a filesystem mount.
Regarding sprawling dependencies, we can always make the provider-specific
libraries available as a separate download and put them on
Hey Joel, thanks for chiming in!
Regarding dependencies - while it's possible to provide pluggable
interfaces, the issue I'm concerned about is conflicting versions of
transitive dependencies at runtime. For example, I used a java agent that
had a different version of snakeyaml, and it ended up b
It anyway seems reasonable to me that we would support multiple FileSystemProviders. So perhaps these are really two problems we're conflating: 1) a mechanism for dropping in jars that can register a FileSystemProvider for Cassandra to utilise; 2) a way to mark directories (from any provider) as "rem
Assuming everything else is identical, it might not matter for S3. However,
not every object store has a filesystem mount.
Regarding sprawling dependencies, we can always make the provider-specific
libraries available as a separate download and put them on their own thread
with a separate class path.
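A hedged Java sketch of that "separate class path" idea: load the provider
jar through its own classloader so its transitive dependencies never touch
Cassandra's classpath (the jar location is an invented example):

    import java.net.URL;
    import java.net.URLClassLoader;
    import java.nio.file.Paths;
    import java.nio.file.spi.FileSystemProvider;
    import java.util.ServiceLoader;

    // Sketch: discover a FileSystemProvider from an isolated jar. Using the
    // platform classloader as the parent keeps Cassandra's own classpath
    // (and any conflicting library versions) out of the provider's view.
    final class IsolatedProviderLoaderSketch {
        static FileSystemProvider load(String scheme) throws Exception {
            URL[] jars = { Paths.get("/opt/cassandra/providers/s3/provider-all.jar")
                                .toUri().toURL() };
            ClassLoader isolated =
                    new URLClassLoader(jars, ClassLoader.getPlatformClassLoader());
            for (FileSystemProvider p : ServiceLoader.load(FileSystemProvider.class, isolated)) {
                if (p.getScheme().equals(scheme))
                    return p;
            }
            throw new IllegalStateException("No provider found for scheme: " + scheme);
        }
    }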
I think another way of saying what Stefan may be getting at is: what does a library give us that an appropriately configured mount dir doesn't? We don't want to treat S3 the same as local disk, but this can be achieved easily with config. Is there some other benefit of direct integration? Well defin
It’s not an area where I can currently dedicate engineering effort. But if
> others are interested in contributing a feature like this, I’d see it as
> valuable for the project and would be happy to collaborate on
> design/architecture/goals.
>
Jake mentioned 17 months ago a custom FileSys
Scott,
what you wrote is all correct, but I have a feeling that both you and Jeff
are talking about something different, some other aspect of using this.
It seems that I still need to explain that I don't consider object
storage to be useless; it is as if everybody has to make the point ab
To Jeff’s point on tactical vs. strategic, here’s the big picture for me on object storage:– Object storage is 70% cheaper:Replicated flash block storage is extremely expensive, and more so with compute resources constantly attached. If one were to build a storage platform on top of a cloud provide
I agree with all the points mentioned by Scott. We are actually very
interested in exploring tiered storage for the same reasons above. Our
first experiment with S3 single zone express was, unfortunately, awfully
slow compared to ephemeral and EBS.
On Tue, Mar 4, 2025 at 9:22 PM C. Scott Andreas
I've come around on the rsync vs. built-in topic, and I think this is the
same. Having this managed in-process gives us more options for control.
I think it's critical that 100% of the data be pushed to the object store.
I mentioned this in my email on Dec 15, but nobody directly responded to
that
Jeff,
when it comes to snapshots, there was already a discussion in the other
thread that I am not sure you are aware of (1); here (2) I am talking about
Sidecar + snapshots specifically. One "caveat" of Sidecar is that you
actually _need_ Sidecar if we ever contemplated Sidecar doing upload /
backup (by
Mounted dirs give up the opportunity to change the IO model to account for different behaviors. The configurable channel proxy may suffer from the same IO constraints depending on implementation, too. But it may also become viable. The snapshot outside of the mounted file system seems like you’re i
For what it's worth, as it might seem to somebody that I am rejecting this
altogether (which is not the case; all I am trying to say is that we should
just think about it more) - it would be cool to know more about the
experience of others when it comes to this, maybe somebody already tried to
mount and
If we want to do this, we should wrap the object storage downwards and
provide the file system API capabilities upwards (at the Cassandra layer), if
my understanding is correct.
Brandon Williams wrote on Tuesday, March 4, 2025 at 9:55 PM:
> A failing remote API that you are calling and a failing filesystem you
> are using h
I would be very cautious about "reinventing the wheel" here. Are we
confident that we can implement it in a way which is bug-free / robust
enough? Do we think we can do a better job than the authors of the "s3 driver"?
If they did a good job (which I assume they did) then failing an operation
whil
Most obviously, you don't need to move all components of the sstable to s3, you could keep index + compression offsets locally.
On Mar 4, 2025, at 1:46 PM, Štefan Miklošovič wrote:
I don't say that using remote object storage is useless. I am just saying that I don't see the difference. I have not
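To sketch that per-component placement in Java (the component suffixes are
real sstable components, but the routing policy itself is an invented
illustration):

    import java.util.Map;

    // Sketch: ship the large immutable Data component to the object store
    // while keeping the small, latency-sensitive components on local disk.
    final class ComponentPlacementSketch {
        enum Tier { LOCAL_DISK, OBJECT_STORE }

        static final Map<String, Tier> PLACEMENT = Map.of(
                "Data.db", Tier.OBJECT_STORE,          // bulk of the bytes
                "Index.db", Tier.LOCAL_DISK,           // read-path lookups
                "CompressionInfo.db", Tier.LOCAL_DISK, // chunk offsets
                "Filter.db", Tier.LOCAL_DISK,          // bloom filter
                "TOC.txt", Tier.LOCAL_DISK);

        static Tier tierFor(String component) {
            return PLACEMENT.getOrDefault(component, Tier.LOCAL_DISK);
        }
    }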
A failing remote API that you are calling and a failing filesystem you
are using have different implications.
Kind Regards,
Brandon
On Tue, Mar 4, 2025 at 7:47 AM Štefan Miklošovič wrote:
>
> I don't say that using remote object storage is useless.
>
> I am just saying that I don't see the diffe
I don't say that using remote object storage is useless.
I am just saying that I don't see the difference. I have not measured it,
but I can imagine that a mounted S3 bucket would use, under the hood, the
same calls to the S3 API. How else would it be done? You need to talk to
remote S3 storage eventually any
Mounting an S3 bucket as a directory is an easy but poor implementation of object-backed storage for databases. Object storage is durable (most data loss is due to bugs, not concurrent hardware failures), cheap (can be 5-10x cheaper) and ubiquitous. A huge number of modern systems are object-storage-on
I do not think we need this CEP, honestly. I don't want to diss this
unnecessarily, but if you mount remote storage locally (e.g. mounting an s3
bucket as if it were any other directory on the node's machine), then what is
this CEP good for?
Not talking about the necessity to put all dependencies to be a
26 February 2025 14:54
To: dev@cassandra.apache.org
Subject: Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external
storage locations
Is anyone else interested in continuing to discuss this topic?
guo Maxwell
I’d love to see this implemented — where “this” is a proxy for some notion of support for remote object storage, perhaps usable by compaction strategies like TWCS to migrate data older than a threshold from a local filesystem to remote object.It’s not an area where I can currently dedicate enginee
Is anyone else interested in continuing to discuss this topic?
guo Maxwell wrote on Friday, September 20, 2024 at 09:44:
> I discussed this offline with Claude; he is no longer working on this.
>
> It's a pity. I think this is a very valuable thing. Commitlog's archiving
> and restore may be able to use the relevant c
I discussed this offline with Claude; he is no longer working on this.
It's a pity. I think this is a very valuable thing. Commitlog archiving
and restore may be able to use the relevant code if it is completed.
Patrick McFadin wrote on Friday, September 20, 2024 at 2:01 AM:
> Thanks for reviving this one!
>
> On Wed
Thanks for reviving this one!
On Wed, Sep 18, 2024 at 12:06 AM guo Maxwell wrote:
> Is there any update on this topic? It seems that things can make big
> progress if Jake Luciani can find someone who can make the
> FileSystemProvider code accessible.
>
> Jon Haddad wrote on Saturday, December 16, 2023 at 05:29:
Is there any update on this topic? It seems that things can make big
progress if Jake Luciani can find someone who can make the
FileSystemProvider code accessible.
Jon Haddad wrote on Saturday, December 16, 2023 at 05:29:
> At a high level I really like the idea of being able to better leverage
> cheaper storage
At a high level I really like the idea of being able to better leverage
cheaper storage, especially object stores like S3.
One important thing though - I feel pretty strongly that there's a big,
deal-breaking downside. Backups, disk failure policies, snapshots and
possibly repairs would get more
Is there still interest in this? Can we get some points down on electrons so
that we all understand the issues?
While it is fairly simple to redirect the read/write to something other than
the local system for a single node, this will not solve the problem for tiered
storage.
Tiered storage w
@henrik, Have you made any progress on this? I would like to help drive
it forward but I am waiting to see what your code looks like and figure out
what I need to do. Any update on timeline would be appreciated.
On Mon, Oct 23, 2023 at 9:07 PM Jon Haddad
wrote:
> I think this is great, and more
I think this is great, and more generally useful than the two scenarios you've
outlined. I think it could / should be possible to use an object store as the
primary storage for sstables and rely on local disk as a cache for reads.
I don't know the roadmap for TCM, but imo if it allowed for more
nd yesterday.
>>>>>
>>>>> Henrik, How does your system work? What is the design strategy?
>>>>> Also is your code available somewhere?
>>>>>
>>>>> After looking at the code some more I think that the best solution is
>
Also
>>>> is your code available somewhere?
>>>>
>>>> After looking at the code some more I think that the best solution is
>>>> not a FileChannelProxy but to modify the Cassandra File class to get a
>>>> FileSystem object for a Factory
that this makes it a very small change that will pick up
>>> 90+% of the cases. We then just need to find the edge cases.
>>>
>>> On Fri, Sep 29, 2023 at 1:14 AM German Eichberger via dev <
>>> dev@cassandra.
>> On Fri, Sep 29, 2023 at 1:14 AM German Eichberger via dev <
>> dev@cassandra.apache.org> wrote:
>>
>>> Super excited about this as well. Happy to help test with Azure and any
>>> other way needed.
>>>
>>> Thanks,
>>> German
>> --
>> *From:* guo Maxwell
>> *Sent:* Wednesday, September 27, 2023 7:38 PM
>> *To:* dev@cassandra.apache.org
>> *Subject:* [EXTERNAL] Re: [DISCUSS] CEP-36: A Configurable ChannelProxy
>> to alias external storage locations
>>
>
> *Subject:* [EXTERNAL] Re: [DISCUSS] CEP-36: A Configurable ChannelProxy
> to alias external storage locations
>
> Thanks. So I think a Jira can be created now. And I'd be happy to provide
> some help with this as well if needed.
>
> Henrik Ingo wrote on Thursday, September 28, 2023
ChannelProxy to alias
external storage locations
Thanks. So I think a Jira can be created now. And I'd be happy to provide some
help with this as well if needed.
Henrik Ingo <henrik.i...@datastax.com>
wrote on Thursday, September 28, 2023 at 00:21:
It seems I was volunteered to rebase the Astra implementat
___
From: Jake Luciani <jak...@gmail.com>
Sent: Tuesday, September 26, 2023 19:03
To: dev@cassandra.apache.org
Subject: Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations
Thanks. So I think a Jira can be created now. And I'd be happy to provide
some help with this as well if needed.
Henrik Ingo wrote on Thursday, September 28, 2023 at 00:21:
> It seems I was volunteered to rebase the Astra implementation of this
> functionality (FileSystemProvider) onto Cassandra trunk. (And publish it,
It seems I was volunteered to rebase the Astra implementation of this
functionality (FileSystemProvider) onto Cassandra trunk. (And publish it,
of course) I'll try to get going today or tomorrow, so that this
discussion can then benefit from having that code available for inspection.
And potentiall
we might "compact" "TWCS tables"
> automatically after so-and-so period by moving them there.
>
>
> From: Jake Luciani
> Sent: Tuesday, September 26, 2023 19:03
> To: dev@cassandra.apache.org
> Subject: Re: [DISCUSS]
___
From: Jake Luciani
Sent: Tuesday, September 26, 2023 19:03
To: dev@cassandra.apache.org
Subject: Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external
storage locations
We (DataStax) have a FileSystemProvider for Astra we can provide.
Works with S3/GCS/Azure.
I'll ask someone on our end to make it accessible.
This would work by having a bucket prefix per node. But there are lots
of details needed to support things like out-of-band compaction
(mentioned in the CEP).
I agree with Ariel, the more suitable insertion point is probably the JDK-level
FileSystemProvider and FileSystem abstraction.
It might also be that we can reuse existing work here in some cases?
> On 26 Sep 2023, at 17:49, Ariel Weisberg wrote:
>
>
> Hi,
>
> Support for multiple storage ba
Hi,
Support for multiple storage backends, including remote storage backends, is a
pretty high-value piece of functionality. I am happy to see there is interest
in that.
I think that `ChannelProxyFactory` as an integration point is going to quickly
turn into a dead end as we get into really usin
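To make the contrast concrete, a hedged sketch of the JDK-level seam: once a
provider for an "s3" scheme is installed, Paths.get(URI) resolves through it
and every Files.* call on the resulting Path is routed to that provider, with
no per-call-site proxying; the URI below is invented:

    import java.net.URI;
    import java.nio.channels.SeekableByteChannel;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    // Sketch: ordinary NIO code, remote-capable purely via the installed
    // FileSystemProvider for the "s3" scheme (URI invented for illustration).
    final class ProviderPathSketch {
        static long remoteSize() throws Exception {
            Path remote = Paths.get(URI.create("s3://bucket/ks1/tbl1/nb-1-big-Data.db"));
            try (SeekableByteChannel ch = Files.newByteChannel(remote, StandardOpenOption.READ)) {
                return ch.size();
            }
        }
    }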
Yeah, there are so many things to do, as Cassandra (shared-nothing) is
different from some other systems like HBase. So I think we can break the
final goal into multiple steps; the first is what Claude proposed. But I
suggest that this design make the interface more scalable, and we can
consider the im
> it may be better to support most cloud storage
> It simply only supports S3, which feels a bit customized for a certain user
> and is not universal enough. Am I right?
I agree w/ the eventual goal (and constraint on design now) of supporting most
popular cloud storage vendors, but if we have som
The intention of the CEP is to lay the groundwork to allow development of
ChannelProxyFactories that are pluggable in Cassandra. In this way any
storage system can be a candidate for Cassandra storage, provided
FileChannels can be created for the system.
As I stated before, I think that there may b
In my mind, it may be better to support most cloud storage: AWS,
Azure, GCP, Aliyun and so on. We may make it pluggable. But in that way, it
seems we may need a filesystem interface layer for object storage. And
should we support distributed systems like HDFS, or something else? We
should firs
My intention is to develop an S3 storage system using
https://github.com/carlspring/s3fs-nio
There are several issues yet to be solved:
1. There are some internal calls that create files in the table
directory that do not use the channel proxy. I believe that these are
making calls on F
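For reference, a hedged sketch of driving s3fs-nio through the standard NIO
API; the URI shape and credential property names are assumptions here, so
check the library's documentation for the exact spellings:

    import java.net.URI;
    import java.nio.file.FileSystem;
    import java.nio.file.FileSystems;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Map;

    // Sketch: the library registers a FileSystemProvider, so a bucket can
    // be opened as a FileSystem and used with plain Files.* calls.
    final class S3FsNioSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystems.newFileSystem(
                    URI.create("s3://s3.amazonaws.com/"),    // assumed URI form
                    Map.of("s3fs.access.key", "AKIA...",     // assumed property names
                           "s3fs.secret.key", "..."));
            Path dataDir = fs.getPath("/my-bucket/cassandra/data");
            Files.createDirectories(dataDir);
        }
    }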
"Rather than building this piece by piece, I think it'd be awesome if
someone drew up an end-to-end plan to implement tiered storage, so we can
make sure we're discussing the whole final state, and not an implementation
detail of one part of the final state?"
I do agree with Jeff on this ~~~ If the
- I think this is a great step forward.
- Being able to move sstables around between tiers of storage is a feature
Cassandra desperately needs, especially if one of those tiers is some sort
of object storage
- This looks like it's a foundational piece that enables that. Perhaps by a
team that's alr
external storage can be any storage that you can produce a FileChannel
for. There is an S3 library that does this, so S3 is a definite
possibility for storage in this solution. My example code only writes to a
different directory on the same system. And there are a couple of places
where I did no
Great suggestion. Can external storage only be local storage media, or can
data be stored in any storage medium, such as object storage (S3)?
We have previously implemented a tiered storage capability, that is, there
are multiple storage media on one node (SSD, HDD), with data placement based
on reques
I have just filed CEP-36 [1] to allow for keyspace/table storage outside of
the standard storage space.
There are two desires driving this change:
1. The ability to temporarily move some keyspaces/tables to storage
outside the normal directory tree to another disk so that compaction can
o