Hello Raymond,
Do you have draft changes to look at?
I'd suggest a more general approach, as some of the existing interfaces
seem to overlap. There is the FSErrorHandler and the
JVMStabilityInspector, neither of which can currently be set through
user configuration. I think we could introduce a single public
interface and let users plug in their own handlers via
configuration:
public interface FailureHandler
{
    public boolean onFailure(Component type, FailureHandlerContext context);
}
It seems to me that the JVMStabilityInspector is a good candidate for
the default implementation of the FailureHandler API as it already
handles OOM, CommitLog errors, and disk errors as far as I can see.
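To make the idea a bit more concrete, here is a rough, self-contained sketch of how such a handler could be loaded by class name and how a poison-pill implementation might look on top of it. Everything in it is an assumption for illustration only: the Component enum values, the fields of FailureHandlerContext, the meaning of the boolean return value, the marker file name, and the FailureHandlers loader are all hypothetical and do not exist in Cassandra today.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical: the kinds of failure a handler may be asked about.
enum Component { COMMIT_LOG, SSTABLE, DISK, JVM }

// Hypothetical context object: the triggering error plus a directory to write state to.
final class FailureHandlerContext {
    final Throwable cause;
    final Path dataDirectory;

    FailureHandlerContext(Throwable cause, Path dataDirectory) {
        this.cause = cause;
        this.dataDirectory = dataDirectory;
    }
}

interface FailureHandler {
    // Assumed semantics: true means the failure was handled by this handler.
    boolean onFailure(Component type, FailureHandlerContext context);
}

// Writes a marker file on corruption so that a later startup can refuse to proceed.
final class PoisonPillFailureHandler implements FailureHandler {
    static final String MARKER = "non_transient_error.marker"; // hypothetical file name

    @Override
    public boolean onFailure(Component type, FailureHandlerContext context) {
        // Only corruption-style failures are treated as non-transient in this sketch.
        if (type != Component.COMMIT_LOG && type != Component.SSTABLE) {
            return false;
        }
        try {
            Path marker = context.dataDirectory.resolve(MARKER);
            Files.writeString(marker, type + ": " + context.cause + System.lineSeparator());
            return true;
        } catch (IOException e) {
            // The marker could not be written, so report the failure as unhandled.
            return false;
        }
    }

    // Would be called early during startup; throws while the marker file is present.
    static void checkOnStartup(Path dataDirectory) {
        if (Files.exists(dataDirectory.resolve(MARKER))) {
            throw new IllegalStateException(
                "Non-transient error marker found; manual intervention required");
        }
    }
}

// Hypothetical loader: instantiates the handler class named in configuration.
final class FailureHandlers {
    static FailureHandler fromClassName(String className) {
        try {
            return Class.forName(className)
                        .asSubclass(FailureHandler.class)
                        .getDeclaredConstructor()
                        .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot load failure handler " + className, e);
        }
    }
}
```

With something like this, the configured class name would be passed to FailureHandlers.fromClassName once at startup, and checkOnStartup would run before the commit log is replayed. Clearing the error would then mean an operator deleting the marker file by hand, which matches the "manual intervention" requirement described in the original proposal.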
On Sat, 16 Dec 2023 at 03:43, Josh McKenzie <[email protected]> wrote:
>
> Adding a poison-pill error option on finding of corrupt data makes sense to
> me. Not sure if there's enough demand / other customization being done in
> this space to justify the user customizable aspect; any immediate other
> approaches come to mind? If not, this isn't an area of the code that's
> changed all that much, so just adding a new option seems surgical and minimal
> to me.
>
> On Tue, Dec 12, 2023, at 4:21 AM, Claude Warren, Jr via dev wrote:
>
> I can see this as a strong improvement in Cassandra management and support it.
>
> +1 non binding
>
> On Mon, Dec 11, 2023 at 8:28 PM Raymond Huffman <[email protected]>
> wrote:
>
> Hello All,
>
> On our fork of Cassandra, we've implemented some custom behavior for handling
> CommitLog and SSTable Corruption errors. Specifically, if a node detects one
> of those errors, we want the node to stop itself, and if the node is
> restarted, we want initialization to fail. This is useful in Kubernetes when
> you expect nodes to be restarted frequently and makes our corruption
> remediation workflows less error-prone. I think we could make this behavior
> more pluggable by allowing users to provide custom implementations of the
> FSErrorHandler, and the error handler that's currently implemented at
> org.apache.cassandra.db.commitlog.CommitLog#handleCommitError via config in
> the same way one can provide custom Partitioners and
> Authenticators/Authorizers.
>
> Would you take as a contribution one of the following?
> 1. user provided implementations of FSErrorHandler and CommitLogErrorHandler,
> set via config; and/or
> 2. new commit failure and disk failure policies that write a poison pill file
> to disk and fail on startup if that file exists
>
> The poison pill implementation is what we currently use - we call this a "Non
> Transient Error" and we want these states to always require manual
> intervention to resolve, including manual action to clear the error. I'd be
> happy to contribute this if other users would find it beneficial. I had
> initially shared this question in Slack, but I'm now sharing it here for
> broader visibility.
>
> -Raymond Huffman
>
>