[DISCUSS] Snapshots outside of Cassandra data directory

2025-01-12 Thread Štefan Miklošovič
Hi,

I would like to run this through the ML to gather feedback, as we are
contemplating making this happen.

Currently, snapshots are just hard links, located in a snapshot directory,
to files in the live data directory. That is super handy, as a snapshot
occupies virtually zero disk space (as long as the underlying SSTables are
not compacted away; once they are, their size "materializes").

On the other hand, because they are hard links, it is not possible to make
them across block devices (the infamous "Invalid cross-device link"
error). That means snapshots can only ever be located on the very same disk
Cassandra has its data dirs on.

Imagine there is a company ABC which has a 10 TiB disk (or NFS share)
mounted to a Cassandra node, and they would like to use that as cheap /
cold storage for snapshots. They do not care about the speed of such
storage, nor about how much space snapshots occupy on it. On the other
hand, they do not want snapshots occupying disk space where Cassandra has
its data, because they consider that a waste of space. They would like to
utilize the fast disk and its space for production data to the max, and
snapshots might eat a lot of that space unnecessarily.

There might be a configuration property like "snapshot_root_dir:
/mnt/nfs/cassandra"; if a snapshot is taken, Cassandra would copy the
SSTables there, but we need to be a little bit smart about it. (By default,
it would all work as it does now - hard links to snapshot directories
located under Cassandra's data_file_directories.)

Because it is a copy, it occupies disk space. But if we took 100 snapshots
of the same SSTables, we would not want to copy the same files 100 times.
There is a very handy way to prevent this - unique SSTable identifiers
(under the already existing uuid_sstable_identifiers_enabled property). We
could have a flat destination hierarchy where all SSTables are located in
the same directory, and we would just check whether an SSTable is already
there before copying it. Snapshot manifests (currently manifest.json) would
then list all SSTables a logical snapshot consists of.
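
To make the deduplication idea more concrete, here is a rough sketch in
plain Java of what the copy-if-absent step could look like. This is only an
illustration under the assumptions above - the class name, method name and
the "sstables" subdirectory are made up, not existing Cassandra APIs:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;

    // Illustrative sketch only. Relies on unique (UUID-based) SSTable file names,
    // so that a file name alone identifies its content across snapshots.
    final class DedupSnapshotCopier
    {
        private final Path snapshotRoot; // e.g. /mnt/nfs/cassandra (hypothetical snapshot_root_dir)

        DedupSnapshotCopier(Path snapshotRoot)
        {
            this.snapshotRoot = snapshotRoot;
        }

        // Copies SSTable components into one flat directory under the snapshot root,
        // skipping files already copied by a previous snapshot. The per-snapshot
        // manifest would then just list the file names belonging to that snapshot.
        void copyIfAbsent(List<Path> sstableComponents) throws IOException
        {
            Path flatDir = snapshotRoot.resolve("sstables");
            Files.createDirectories(flatDir);

            for (Path component : sstableComponents)
            {
                Path destination = flatDir.resolve(component.getFileName());
                if (!Files.exists(destination))          // already there from an earlier snapshot?
                    Files.copy(component, destination);  // a plain copy works across devices / mounts
            }
        }
    }

With something like this, taking 100 snapshots of the same SSTables would
copy each file only once; the other 99 snapshots would merely add entries
to their manifests.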

This would be possible only for _user snapshots_. All snapshots taken by
Cassandra itself (diagnostic snapshots, snapshots upon repairs, snapshots
against all system tables, ephemeral snapshots) would continue to be hard
links and it would not be possible to locate them outside of live data
dirs.

The advantages / characteristics of this approach for user snapshots:

1. Cassandra will be able to create snapshots located on different devices.
2. From an implementation perspective it would be totally transparent;
there would be no specific code about "where" we copy. From a Java
perspective, we would just copy as we copy anywhere else.
3. All the tooling would work as it does now - nodetool listsnapshots /
clearsnapshot / snapshot. Same outputs, same behavior.
4. No need for external tools copying SSTables to the desired destination,
custom scripts, manual synchronisation ...
5. Snapshots located outside of Cassandra's live data dirs would behave the
same when it comes to snapshot TTL (a TTL on a snapshot means it is
automatically removed after a given period of time). This logic would stay
the same, so there is no need to reinvent the wheel for removing expired
snapshots from the operator's perspective; see the small sketch after this
list.
6. Such a solution would deduplicate SSTables, so it would be as
space-efficient as possible (though not as efficient as hard links, for the
obvious reasons mentioned above).
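
As a small illustration of point 5: whether a snapshot is expired stays a
pure timestamp comparison, completely independent of where its files
physically live. A minimal sketch with made-up names, not actual Cassandra
code:

    import java.time.Duration;
    import java.time.Instant;

    // Illustration only: TTL expiry does not care whether the snapshot consists of
    // hard links in the data dir or copies on another device - it is purely time-based.
    final class SnapshotExpiry
    {
        static boolean isExpired(Instant createdAt, Duration ttl, Instant now)
        {
            return ttl != null && createdAt.plus(ttl).isBefore(now);
        }
    }

The expiration logic Cassandra already has would therefore apply unchanged
to snapshots under a snapshot_root_dir.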

It seems to me that there has recently been a "push" to add more logic to
Cassandra that was previously delegated to external tooling; for example,
the CEP around automatic repairs basically does what external tooling does,
we just move it under Cassandra. We would love to get rid of a lot of
tooling and custom-written logic around copying snapshot SSTables. From the
implementation perspective it would be just plain Java, without any
external dependencies etc. There seems to be a lot to gain from relatively
straightforward additions to the snapshotting code.

We did serious housekeeping in CASSANDRA-18111, where we consolidated and
centralized everything related to snapshot management, so we feel
comfortable building logic like this on top of it. In fact, CASSANDRA-18111
was a prerequisite for this, because we did not want to base this work on
the pre-18111 state of snapshots (logic that was spread all over the code
base, fragmented and duplicated, etc).

WDYT?

Regards


Re: [DISCUSS] Snapshots outside of Cassandra data directory

2025-01-12 Thread Jon Haddad
Sounds like part of a backup strategy. Probably worth chiming in on the
sidecar issue: https://issues.apache.org/jira/browse/CASSSIDECAR-148.

IIRC, Medusa and Tablesnap both uploaded a manifest and don't upload
multiple copies of the same SSTables.  I think this should definitely be
part of our backup system.

Jon




Re: [DISCUSS] Snapshots outside of Cassandra data directory

2025-01-12 Thread Štefan Miklošovič
Oh yeah, I knew Sidecar would be mentioned, so let's dive into that.

Sidecar has a lot of endpoints / functionality; backup / restore is just
part of it.

What I proposed also has these advantages:

1) Every time you want to upload to some cloud storage provider, you need
to add all the dependencies to Sidecar to do that. In the case of S3, we
need to add the S3 libs. What about Azure? We need to add a library which
knows how to talk to Azure. Then GCP ... This is probably the reason why
this "cloud specific" functionality was never part of Cassandra itself: by
adding all the libraries with all their dependencies, we would bloat the
tarball unnecessarily and have to track dependencies which might be
incompatible etc.

However, you can also mount an S3 bucket into the system and it acts like
any other native data dir. You can do the same with Azure (1) etc., and you
do not need to depend on any library. Cassandra would just copy files;
that's it. That means we are ready for whatever storage there might be, as
long as it can be mounted locally. We would have the same code for Azure,
S3, NFS, a local disk ... anything.
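
To make that concrete, here is a tiny hedged sketch: the copy call is
identical no matter what backs the destination path, only the configured
root differs. The class name and paths below are examples for illustration,
not anything Cassandra ships:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Illustration only: from Java's point of view the destination is just a Path,
    // whether it is backed by a local disk, NFS, or a FUSE-mounted S3 / Azure bucket.
    final class MountAgnosticCopy
    {
        static void copyTo(Path sstable, Path snapshotRoot) throws IOException
        {
            // snapshotRoot could be /mnt/nfs/cassandra, /mnt/s3/cassandra or
            // /mnt/blobfuse/cassandra - the copying code does not know or care.
            Files.copy(sstable, snapshotRoot.resolve(sstable.getFileName()));
        }
    }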

2) I am not sure we should _force_ people to use Sidecar if there are far
simpler ways to do the job. If we just enabled snapshots to be taken
outside of the Cassandra data dir, there would be no reason to use Sidecar
just to be able to back up snapshots, because Cassandra could do it itself.
I think we should strive to do as much as possible with the least amount of
effort, and I do not think that taking care of a Sidecar for each node in a
cluster, configuring it and learning it, should be mandatory. What if a
business is simply not interested in running Sidecar and just wants to copy
directly from Cassandra and be done with it? If we force people to use
Sidecar, then somebody has to take care of all of that.

I am not saying that Sidecar is not suitable for backup / restore, but I do
not see anything wrong with having options.

(1)
https://learn.microsoft.com/en-us/azure/storage/blobs/blobfuse2-configuration


Re: [DISCUSS] Snapshots outside of Cassandra data directory

2025-01-12 Thread Štefan Miklošovič
C) Let's just enable backing up to a local filesystem.

To make things simpler and more user-friendly, the data would be stored in
the target destination in the same layout Sidecar would upload it, so when
people decide to start using Sidecar and incorporate it into their
deployments / workflows, these backups will be digestible by Sidecar and
they can restore with it too.

It was not my intention to provide restoration capabilities directly in
Cassandra; there is no problem having that logic in Sidecar only. I think
that in practice people back up way more often than they restore, and if
Sidecar is not an absolute must for backing up, they shouldn't be forced to
use it.

On Mon, Jan 13, 2025 at 12:31 AM Jon Haddad wrote:

> Are you proposing that we manage backups in the DB instead of Sidecar, or
> that we have the same functionality in both C* proper and the sidecar?  Or
> that we ship C* with backups to a local filesystem only?
>
> Where should the line be on what goes into sidecar and what goes into C*
> proper?
>

Re: [DISCUSS] Snapshots outside of Cassandra data directory

2025-01-12 Thread Jon Haddad
Are you proposing that we manage backups in the DB instead of Sidecar, or
that we have the same functionality in both C* proper and the sidecar?  Or
that we ship C* with backups to a local filesystem only?

Where should the line be on what goes into sidecar and what goes into C*
proper?

Jon


