[DISCUSS] CASSANDRA-16767, CASSANDRA-16768, and CASSANDRA-16769 for 3.11.x

Scott Carey Tue, 29 Jun 2021 15:49:21 -0700

I'd like to discuss including the above tickets in a 3.11.x release.
They are not pure bug fixes, so I'll need a waiver to get them into
3.11.x (and implicitly, 4.0.x).


The first two are straightforward oversights:  neither *nodetool
garbagecollect* nor *nodetool scrub* currently accepts a *--user-defined*
parameter list of SSTables in the same way that *nodetool compact* does.

This is an operational problem for large tables.

I often need to scrub just one file that is corrupted for some reason, and
not scrub an entire 1TB+ of data for a table on a node.  This renders
'nodetool scrub' operationally useless for large tables.

For *garbagecollect* it is often operationally easy to identify which
tables are likely to be full of bloat, and operationally useful to do this
task in small increments.  The existing order in which garbagecollect
processes SSTables prevents it from being useful in any incremental
fashion -- if you stop it and later restart, it will first process the
SSTables you just garbage collected.

The third ticket adds an option for *nodetool garbagecollect*,
*--oldest-fraction*, that selects a fraction of the oldest table data,
measured in bytes, and garbagecollects only the SSTables that 'cover' that
fraction of the data.  Operationally, this lends itself to easy automation
-- for example, running this once a week on 10% of a table's data would
imply that no overwritten data lingers on disk for more than roughly 10
weeks.  This caps data bloat in ways neither LCS nor STCS can currently
achieve without regular major compactions or full-pass garbagecollect.
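
To make the selection rule concrete, here is a minimal sketch of how
"oldest fraction by bytes" could be computed.  It is an illustration only,
not the code in the patch: the SSTable class, its fields, and the function
name are hypothetical, and it assumes that a smaller generation number
means an older SSTable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SSTable:
    # Hypothetical stand-in for a live SSTable; not Cassandra's own class.
    generation: int   # assumed: smaller generation == older file
    size_bytes: int


def select_oldest_fraction(sstables, fraction):
    """Return the oldest SSTables whose combined size covers `fraction`
    of the total bytes, walking from the oldest generation upward."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    target = fraction * sum(s.size_bytes for s in sstables)
    selected, covered = [], 0
    for s in sorted(sstables, key=lambda s: s.generation):
        if covered >= target:
            break  # the requested share of bytes is already covered
        selected.append(s)
        covered += s.size_bytes
    return selected


tables = [SSTable(3, 100), SSTable(1, 60), SSTable(2, 60), SSTable(4, 780)]
oldest = select_oldest_fraction(tables, 0.1)
print([s.generation for s in oldest])
```

With these made-up sizes the two oldest files (120 of 1000 bytes) are
chosen, since a single file would cover only 6% of the data.  Counting
files instead of bytes would give a very different answer on tables with
uneven SSTable sizes.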

I have a large LCS table that has existed in steady state for about two
years.  Its oldest SSTable files were about 20 months old.  These old
SSTables were 95% bloat by that time -- 'garbagecollect' was able to shrink
them to 5% of their original size.

Being able to automate garbagecollect on a small fraction of the older data
would be a big disk space and performance win, without the downsides of a
major compaction.

The overall risk of these additions is low:

   - They do not modify any existing behavior, only add new options.
   - They re-use existing machinery for most of the work, and only add
   logic in areas that are already well tested.  The areas that need the most
   scrutiny in review have good test coverage.
   - Scripts that worked with nodetool before should continue to work,
   except for the case where a keyspace is named --user-defined or
   --oldest-fraction, but this flaw already exists with other nodetool
   commands.
   - There is no modification to sensitive areas like the read, write, or
   autocompaction path.  This merely does the same thing that is already done,
   just on a subset of SSTables rather than all of them.



Thanks for considering this proposal,

-Scott Carey


P.S.
You might wonder why the --oldest-fraction is necessary when one can use
--user-defined and some OS level scripting.

   1. --oldest-fraction calculates the SSTable fraction based on the total
   data size, not file count.
   2. nodetool can avoid race conditions with autocompaction on SSTable
   selection.
   3. nodetool has access to the current state of active SSTables; a script
   just sees files on disk, including files that might be scheduled for
   deletion or files that are actively being written to.
   4. Even if used at a 100% fraction, it processes from oldest to newest
   by the SSTable generation number, meaning that if it is interrupted half
   way through, then re-started, it won't immediately work on the files that
   were just processed, as those will have the largest generation number.
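
Point 4 can be illustrated with a toy simulation.  This is a sketch under
simplifying assumptions, not a model of the actual implementation: it
assumes each rewritten SSTable is replaced by exactly one new SSTable with
a fresh, larger generation number, and the function name and budget
parameter are hypothetical.

```python
def run_pass(live_generations, next_generation, budget):
    """Process up to `budget` SSTables, oldest generation first.

    Assumption: every processed SSTable is replaced by one new SSTable
    whose generation is larger than any generation seen so far.
    Returns (processed, new_live_set, new_next_generation).
    """
    ordered = sorted(live_generations)
    processed = ordered[:budget]
    untouched = ordered[budget:]
    rewritten = list(range(next_generation, next_generation + len(processed)))
    return processed, untouched + rewritten, next_generation + len(processed)


live = [1, 2, 3, 4]
# First run is interrupted after two SSTables.
first, live, next_gen = run_pass(live, next_generation=5, budget=2)
# The restarted run picks up the SSTables the first run never reached.
second, live, next_gen = run_pass(live, next_generation=next_gen, budget=2)
print(first, second, live)
```

In this toy model the restarted run starts with generations 3 and 4 rather
than revisiting the output of the first run, which is the incremental
behavior described above.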
