Gil Ganz created CASSANDRA-21217:
------------------------------------
Summary: Race condition between hints and decommission
Key: CASSANDRA-21217
URL: https://issues.apache.org/jira/browse/CASSANDRA-21217
Project: Apache Cassandra
Issue Type: Bug
Reporter: Gil Ganz
I am running a decommission on a 4.1.5 cluster, and it fails to complete
because of an error related to hints. The decommission runs for quite some
time, streaming what appears to be all of the data, but then fails with the
error below (which happened quite early in the decommission process).
I hit this in 3 separate cases (this is an environment that is spread across
the world, so network hiccups are common).
I was able to work around this by setting transfer_hints_on_decommission: false,
but I think the code that handles hints in the decommission path should not
fail on a missing file; it could just log a warning, rather than requiring me
to not transfer any hints.
INFO [NonPeriodicTasks:1] 2026-03-12 18:13:47,148 StreamResultFuture.java:252 - [Stream #24728df0-1e20-11f1-be78-9bd75fd01983] All sessions completed
ERROR [RMI TCP Connection(3338753)-127.0.0.1] 2026-03-12 18:13:47,149 StorageService.java:5017 - Error while decommissioning node
java.lang.RuntimeException: java.nio.file.NoSuchFileException: /var/lib/cassandra/data/disk1/hints/5da9d583-259e-425f-a0ad-18b7e744dabc-1744041563101-2.hints
    at org.apache.cassandra.io.util.ChannelProxy.openChannel(ChannelProxy.java:54)
    at org.apache.cassandra.io.util.ChannelProxy.<init>(ChannelProxy.java:65)
    at org.apache.cassandra.hints.ChecksummedDataInput.open(ChecksummedDataInput.java:76)
    at org.apache.cassandra.hints.HintsReader.open(HintsReader.java:78)
    at org.apache.cassandra.hints.HintsDispatcher.create(HintsDispatcher.java:79)
    at org.apache.cassandra.hints.HintsDispatchExecutor$DispatchHintsTask.deliver(HintsDispatchExecutor.java:290)
    at org.apache.cassandra.hints.HintsDispatchExecutor$DispatchHintsTask.dispatch(HintsDispatchExecutor.java:277)
    at org.apache.cassandra.hints.HintsDispatchExecutor$DispatchHintsTask.dispatch(HintsDispatchExecutor.java:255)
    at org.apache.cassandra.hints.HintsDispatchExecutor$DispatchHintsTask.run(HintsDispatchExecutor.java:234)
    at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
    at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
    at java.base/java.util.concurrent.ConcurrentHashMap$ValueSpliterator.forEachRemaining(ConcurrentHashMap.java:3603)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
    at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
    at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497)
    at org.apache.cassandra.hints.HintsDispatchExecutor$TransferHintsTask.transfer(HintsDispatchExecutor.java:196)
    at org.apache.cassandra.hints.HintsDispatchExecutor$TransferHintsTask.run(HintsDispatchExecutor.java:169)
Proposed fix
1. Catch missing file in \{{deliver()}} (\{{HintsDispatchExecutor.java}} ~line 290)
Wrap the \{{HintsDispatcher.create()}} call to catch \{{RuntimeException}}
caused by \{{NoSuchFileException}}. If the file is gone, the hints were already
successfully dispatched, so delete the descriptor from the store and return
\{{true}}. As a first step, an existence check at the top of \{{deliver()}}
handles the common case:
\{code:java}
private boolean deliver(HintsDescriptor descriptor, InetAddressAndPort address)
{
    File file = descriptor.file(hintsDirectory);
    if (!file.exists())
    {
        logger.info("Hints file {} was already dispatched, skipping", file);
        store.cleanUp(descriptor);
        return true;
    }
    // ... existing code
}
\{code}
Note: a simple \{{file.exists()}} pre-check narrows the window but does not
eliminate the TOCTOU race. The \{{try-catch}} around
\{{HintsDispatcher.create()}} is still needed as a backstop:
\{code:java}
try (HintsDispatcher dispatcher = HintsDispatcher.create(file, rateLimiter, address, descriptor.hostId, shouldAbort))
{
    // ... existing dispatch logic
}
catch (RuntimeException e)
{
    if (Throwables.getRootCause(e) instanceof NoSuchFileException)
    {
        logger.info("Hints file {} disappeared during dispatch, treating as already dispatched", file);
        store.cleanUp(descriptor);
        return true;
    }
    throw e;
}
\{code}
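To illustrate the root-cause check outside of Cassandra, here is a minimal,
JDK-only sketch. The class name MissingHintsFileDemo, the open() helper and the
simplified deliver(Path) signature are hypothetical and exist only for this
example; rootCause() is a hand-rolled stand-in for Guava's
Throwables.getRootCause, and the wrapping of the IOException in a
RuntimeException is an assumption modelled on the stack trace above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

final class MissingHintsFileDemo
{
    // Walks the cause chain to its last element (stand-in for Throwables.getRootCause).
    static Throwable rootCause(Throwable t)
    {
        Throwable current = t;
        while (current.getCause() != null)
            current = current.getCause();
        return current;
    }

    // Hypothetical opener: wraps the checked IOException in a RuntimeException,
    // which is the shape of the exception seen in the stack trace.
    static FileChannel open(Path file)
    {
        try
        {
            return FileChannel.open(file, StandardOpenOption.READ);
        }
        catch (IOException e)
        {
            throw new RuntimeException(e);
        }
    }

    // Returns true if the file was opened or had already vanished;
    // any other failure is rethrown unchanged.
    static boolean deliver(Path file)
    {
        try (FileChannel channel = open(file))
        {
            return true; // real dispatch logic would go here
        }
        catch (RuntimeException e)
        {
            if (rootCause(e) instanceof NoSuchFileException)
                return true; // treat a missing file as already dispatched
            throw e;
        }
        catch (IOException e)
        {
            // only reachable from channel.close()
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args)
    {
        Path missing = Paths.get("no-such-dir-21217", "missing.hints");
        System.out.println("missing file handled: " + deliver(missing));
    }
}
```

Because the check is made on the exception thrown by the open itself, it does
not depend on a separate existence test having been done first, which is why
it can serve as the backstop for the TOCTOU window.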
2. Prevent the race in \{{transferHints()}} (\{{HintsService.java}} ~line 440)
After \{{completeDispatchBlockingly()}} at line 444, call \{{pauseDispatch()}}
again before starting the transfer at line 446. This prevents
\{{HintsDispatchTrigger}} from scheduling new normal dispatch tasks that race
with the transfer:
\{code:java}
// current code at line 441:
resumeDispatch();
catalog.stores().forEach(dispatchExecutor::completeDispatchBlockingly);
// add:
pauseDispatch();
return dispatchExecutor.transfer(catalog, hostIdSupplier);
\{code}
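The ordering argument can be modelled with a small single-threaded toy. This
is only a sketch of the reasoning, not Cassandra code: DispatchPauseSketch,
triggerTick() and ticksScheduledDuringTransfer() are hypothetical names, and
it assumes that the periodic trigger consults a shared paused flag before
scheduling a normal dispatch and that dispatch was paused earlier in the
decommission.

```java
import java.util.concurrent.atomic.AtomicBoolean;

final class DispatchPauseSketch
{
    private final AtomicBoolean paused = new AtomicBoolean(false);
    private int normalDispatchesScheduled = 0;

    void pauseDispatch()  { paused.set(true); }
    void resumeDispatch() { paused.set(false); }

    // One tick of the hypothetical periodic trigger: schedules a normal
    // dispatch only while dispatch is not paused.
    void triggerTick()
    {
        if (!paused.get())
            normalDispatchesScheduled++;
    }

    // Replays the transfer sequence with one trigger tick landing after the
    // blocking dispatch has finished, and returns how many normal dispatches
    // that tick scheduled.
    static int ticksScheduledDuringTransfer(boolean pauseAgain)
    {
        DispatchPauseSketch service = new DispatchPauseSketch();
        service.pauseDispatch();   // assumed: paused earlier in decommission
        service.resumeDispatch();  // resumed so pending hints can be flushed
        // completeDispatchBlockingly(...) would run here
        if (pauseAgain)
            service.pauseDispatch(); // the proposed extra call
        service.triggerTick();     // trigger fires while the transfer runs
        return service.normalDispatchesScheduled;
    }

    public static void main(String[] args)
    {
        System.out.println("without re-pause: " + ticksScheduledDuringTransfer(false));
        System.out.println("with re-pause:    " + ticksScheduledDuringTransfer(true));
    }
}
```

In this model, a tick that lands in the window schedules a competing dispatch
only when the second pauseDispatch() is omitted. Real code would still need
the catch in deliver() for dispatch tasks that were already queued before the
pause took effect.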