[jira] [Created] (CASSANDRA-21217) Race condition between hints and decommission

Gil Ganz (Jira) Sun, 15 Mar 2026 01:54:08 -0700

Gil Ganz created CASSANDRA-21217:
------------------------------------

             Summary: Race condition between hints and decommission
                 Key: CASSANDRA-21217
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21217
             Project: Apache Cassandra
          Issue Type: Bug
            Reporter: Gil Ganz



I am running a decommission on a 4.1.5 cluster, and it fails to complete 
because of an error related to hints. It runs for quite some time, streaming 
what appears to be all data, but then fails due to this error (which happened 
quite early in the decommission process).
I hit this in 3 separate cases (this is an environment that is spread across 
the world, so network hiccups are common).
I was able to work around it by setting transfer_hints_on_decommission: false, 
but I think the code that handles hints in the decommission path should not 
fail on a missing file. It could log a warning and continue, rather than 
requiring me to skip transferring hints altogether.


INFO  [NonPeriodicTasks:1] 2026-03-12 18:13:47,148 StreamResultFuture.java:252 - [Stream #24728df0-1e20-11f1-be78-9bd75fd01983] All sessions completed
ERROR [RMI TCP Connection(3338753)-127.0.0.1] 2026-03-12 18:13:47,149 StorageService.java:5017 - Error while decommissioning node
java.lang.RuntimeException: java.nio.file.NoSuchFileException: /var/lib/cassandra/data/disk1/hints/5da9d583-259e-425f-a0ad-18b7e744dabc-1744041563101-2.hints
        at org.apache.cassandra.io.util.ChannelProxy.openChannel(ChannelProxy.java:54)
        at org.apache.cassandra.io.util.ChannelProxy.<init>(ChannelProxy.java:65)
        at org.apache.cassandra.hints.ChecksummedDataInput.open(ChecksummedDataInput.java:76)
        at org.apache.cassandra.hints.HintsReader.open(HintsReader.java:78)
        at org.apache.cassandra.hints.HintsDispatcher.create(HintsDispatcher.java:79)
        at org.apache.cassandra.hints.HintsDispatchExecutor$DispatchHintsTask.deliver(HintsDispatchExecutor.java:290)
        at org.apache.cassandra.hints.HintsDispatchExecutor$DispatchHintsTask.dispatch(HintsDispatchExecutor.java:277)
        at org.apache.cassandra.hints.HintsDispatchExecutor$DispatchHintsTask.dispatch(HintsDispatchExecutor.java:255)
        at org.apache.cassandra.hints.HintsDispatchExecutor$DispatchHintsTask.run(HintsDispatchExecutor.java:234)
        at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
        at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
        at java.base/java.util.concurrent.ConcurrentHashMap$ValueSpliterator.forEachRemaining(ConcurrentHashMap.java:3603)
        at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
        at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
        at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
        at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
        at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497)
        at org.apache.cassandra.hints.HintsDispatchExecutor$TransferHintsTask.transfer(HintsDispatchExecutor.java:196)
        at org.apache.cassandra.hints.HintsDispatchExecutor$TransferHintsTask.run(HintsDispatchExecutor.java:169)


Proposed fix

  1. Catch missing file in \{{deliver()}} (\{{HintsDispatchExecutor.java}} ~line 290)

  Wrap the \{{HintsDispatcher.create()}} call to catch a \{{RuntimeException}} 
caused by \{{NoSuchFileException}}. If the file is gone, the hints were 
presumably already dispatched by a concurrent regular dispatch task, so delete 
the descriptor from the store and return \{{true}}. As a first step, check for 
the file before opening it:

  \{code:java}
  private boolean deliver(HintsDescriptor descriptor, InetAddressAndPort address)
  {
      File file = descriptor.file(hintsDirectory);
      if (!file.exists())
      {
          logger.info("Hints file {} was already dispatched, skipping", file);
          store.cleanUp(descriptor);
          return true;
      }
      // ... existing code
  }
  \{code}

  Note: a simple \{{file.exists()}} pre-check narrows the window but does not 
eliminate the TOCTOU race. The \{{try-catch}} around 
\{{HintsDispatcher.create()}} is still
  needed as a backstop:

  \{code:java}
  try (HintsDispatcher dispatcher = HintsDispatcher.create(file, rateLimiter, address, descriptor.hostId, shouldAbort))
  {
      // ... existing dispatch logic
  }
  catch (RuntimeException e)
  {
      if (Throwables.getRootCause(e) instanceof NoSuchFileException)
      {
          logger.info("Hints file {} disappeared during dispatch, treating as already dispatched", file);
          store.cleanUp(descriptor);
          return true;
      }
      throw e;
  }
  \{code}
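
  The backstop pattern can be tried outside Cassandra. The following is a 
minimal, self-contained sketch, not Cassandra code: the class 
\{{MissingFileSketch}} and its methods are hypothetical stand-ins, and the 
cause-chain walk is a simplified substitute for Guava's 
\{{Throwables.getRootCause}}. It deletes a file after an existence check has 
passed and shows that the open still fails, which is why the catch is needed 
in addition to the pre-check.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical stand-in for the hints delivery path; none of this is Cassandra code.
final class MissingFileSketch
{
    // Simplified root-cause lookup: follows getCause() to the end of the chain
    // (assumes the chain has no cycles).
    static Throwable rootCause(Throwable t)
    {
        Throwable current = t;
        while (current.getCause() != null)
            current = current.getCause();
        return current;
    }

    static boolean isMissingFile(RuntimeException e)
    {
        return rootCause(e) instanceof NoSuchFileException;
    }

    // Opens the file and wraps any IOException in a RuntimeException, mimicking
    // how the stack trace in this report surfaces NoSuchFileException.
    static long openAndMeasure(Path file)
    {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ))
        {
            return channel.size();
        }
        catch (IOException e)
        {
            throw new RuntimeException(e);
        }
    }

    // Returns true if the file was read or had already vanished; rethrows anything else.
    static boolean deliver(Path file)
    {
        try
        {
            openAndMeasure(file);
            return true;
        }
        catch (RuntimeException e)
        {
            if (isMissingFile(e))
                return true; // someone else already consumed and deleted the file
            throw e;
        }
    }

    // Simulates the race: the existence check passes, then the file disappears
    // before it is opened. Returns what deliver() reports in that situation.
    static boolean deliverAfterConcurrentDelete()
    {
        try
        {
            Path file = Files.createTempFile("sketch", ".hints");
            boolean existedAtCheck = Files.exists(file); // pre-check passes
            Files.delete(file);                          // "other task" removes the file
            return existedAtCheck && deliver(file);      // open fails, catch absorbs it
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args)
    {
        System.out.println("delivered despite missing file: " + deliverAfterConcurrentDelete());
    }
}
```

  In this sketch only a missing file is absorbed; any other failure still 
propagates, which matches the \{{throw e}} in the proposed change above.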

  2. Prevent the race in \{{transferHints()}} (\{{HintsService.java}} ~line 440)

  After \{{completeDispatchBlockingly()}} at line 444, call 
\{{pauseDispatch()}} again before starting the transfer at line 446. This 
prevents \{{HintsDispatchTrigger}}
  from scheduling new normal dispatch tasks that race with the transfer:

  \{code:java}
  // current code at line 441:
  resumeDispatch();
  catalog.stores().forEach(dispatchExecutor::completeDispatchBlockingly);

  // add:
  pauseDispatch();

  return dispatchExecutor.transfer(catalog, hostIdSupplier);
  \{code}
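
  The intended ordering can be illustrated with a small single-threaded model. 
This is only a sketch under assumptions: \{{DispatchGateSketch}} and all of its 
methods are hypothetical names, not Cassandra APIs, and the model assumes that 
the transfer itself is not gated by the pause flag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical model of the ordering in step 2; none of these names are Cassandra APIs.
final class DispatchGateSketch
{
    private final AtomicBoolean paused = new AtomicBoolean(true);
    private final List<String> events = new ArrayList<>();

    void pause()  { paused.set(true); }
    void resume() { paused.set(false); }

    // Stand-in for a periodic trigger: schedules regular dispatch only while not paused.
    boolean triggerRegularDispatch()
    {
        if (paused.get())
        {
            events.add("trigger skipped");
            return false;
        }
        events.add("regular dispatch scheduled");
        return true;
    }

    // Stand-in for the decommission-time transfer. ASSUMPTION: not gated by the pause flag.
    void transfer()
    {
        events.add("transfer");
    }

    // Runs resume -> (drain) -> optional pause -> trigger fires -> transfer,
    // and returns the recorded events in order.
    static List<String> transferSequence(boolean pauseBeforeTransfer)
    {
        DispatchGateSketch gate = new DispatchGateSketch();
        gate.resume();                 // let in-flight dispatch run to completion
        // draining of in-flight dispatch would happen here
        if (pauseBeforeTransfer)
            gate.pause();              // the proposed extra step
        gate.triggerRegularDispatch(); // a trigger firing in the gap before the transfer
        gate.transfer();
        return gate.events;
    }

    public static void main(String[] args)
    {
        System.out.println("without pause: " + transferSequence(false));
        System.out.println("with pause:    " + transferSequence(true));
    }
}
```

  Without the extra pause, the model records a regular dispatch scheduled just 
before the transfer, which is the window described above. With it, the trigger 
is skipped. If the real transfer path also consults the paused state, pausing 
before the transfer would stall the transfer as well, so that would need to be 
verified against the dispatch code before adopting this change.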



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
