Iurii Gerzhedovich created IGNITE-24895:
-------------------------------------------
Summary: AI3. Cluster became broken after a few restarts
Key: IGNITE-24895
URL: https://issues.apache.org/jira/browse/IGNITE-24895
Project: Ignite
Issue Type: Improvement
Reporter: Iurii Gerzhedovich
The scenario is straightforward, though the number of restarts needed to reproduce it varies.
I simply ran org.apache.ignite.internal.benchmark.TpchBenchmark with a TPC-H SF 0.1 dataset,
using a fixed working directory so that the persistent data is kept between runs.
In other words, the scenario boils down to:
1. Create 3 node cluster.
2. Load some data.
3. Run SQL RO loads.
4. Restart the cluster.
5. Go to step 3.
After an indeterminate number of restarts the cluster becomes broken, and the logs fill with
errors. Restarting the cluster again on the same persistent data reproduces the
same issue.
The first Exception in logs:
{code:java}
[2025-03-21T10:12:29,195][WARN ][%node_3345%common-scheduler-0][FailureManager] Possible failure suppressed according to a configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=true, timeout=60000, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=SYSTEM_WORKER_BLOCKED]
org.apache.ignite.lang.IgniteException: A critical thread is blocked for 524 ms that is more than the allowed 500 ms, it is "%node_3345%MessagingService-inbound-Default-0-0" prio=10 Id=292 WAITING on java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@24577734 owned by "%node_3345%metastorage-compaction-executor-0" Id=595
    at [email protected]/jdk.internal.misc.Unsafe.park(Native Method)
    - waiting on java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@24577734
    at [email protected]/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
    at [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
    at [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:1009)
    at [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1324)
    at [email protected]/java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:738)
    at app//org.apache.ignite.internal.metastorage.server.AbstractKeyValueStorage.getCompactionRevision(AbstractKeyValueStorage.java:258)
    at app//org.apache.ignite.internal.metastorage.impl.MetaStorageManagerImpl.withTrackReadOperationFromLeaderFuture(MetaStorageManagerImpl.java:1260)
    at app//org.apache.ignite.internal.metastorage.impl.MetaStorageManagerImpl.lambda$getAll$49(MetaStorageManagerImpl.java:916)
    at app//org.apache.ignite.internal.metastorage.impl.MetaStorageManagerImpl$$Lambda$1794/0x0000000800bc2c40.get(Unknown Source)
    at app//org.apache.ignite.internal.util.IgniteUtils.inBusyLock(IgniteUtils.java:868)
    at app//org.apache.ignite.internal.metastorage.impl.MetaStorageManagerImpl.getAll(MetaStorageManagerImpl.java:914)
    at app//org.apache.ignite.internal.table.distributed.TableManager.lambda$writeTableAssignmentsToMetastore$51(TableManager.java:1089)
    at app//org.apache.ignite.internal.table.distributed.TableManager$$Lambda$1887/0x0000000800bf7440.apply(Unknown Source)
    at [email protected]/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)
    at [email protected]/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
    at [email protected]/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2079)
    at app//org.apache.ignite.internal.raft.RaftGroupServiceImpl.lambda$sendWithRetry$49(RaftGroupServiceImpl.java:624)
    at app//org.apache.ignite.internal.raft.RaftGroupServiceImpl$$Lambda$1341/0x0000000800a4f840.accept(Unknown Source)
    at [email protected]/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
    at [email protected]/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
    at [email protected]/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
    at [email protected]/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2079)
    at app//org.apache.ignite.internal.network.DefaultMessagingService.onInvokeResponse(DefaultMessagingService.java:584)
    at app//org.apache.ignite.internal.network.DefaultMessagingService.handleInvokeResponse(DefaultMessagingService.java:475)
    at app//org.apache.ignite.internal.network.DefaultMessagingService.lambda$handleMessageFromNetwork$4(DefaultMessagingService.java:409)
    at app//org.apache.ignite.internal.network.DefaultMessagingService$$Lambda$1545/0x0000000800ad9040.run(Unknown Source)
    at [email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at [email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at [email protected]/java.lang.Thread.run(Thread.java:829)
Number of locked synchronizers = 1
    - java.util.concurrent.ThreadPoolExecutor$Worker@71945073 {code}
After that, the logs contain many more such exceptions, with the reported blocked times
growing to about half a minute (possibly more, but waiting longer takes too much time).
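For anyone triaging this: the stack trace shows the inbound messaging thread parked on the read side of the metastorage's ReentrantReadWriteLock while the compaction executor holds the write side. The following standalone sketch (plain JDK, not Ignite code; the thread roles and the 600 ms hold time are purely illustrative) reproduces just that blocking pattern, where a reader is parked in WAITING state longer than the 500 ms watchdog threshold:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReadWriteLockBlockingSketch {
    /** Holds the write lock for {@code holdMs}, then measures how long a reader is blocked. */
    static long simulateBlockedReader(long holdMs) throws InterruptedException {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        CountDownLatch writeHeld = new CountDownLatch(1);

        // Stand-in for the metastorage compaction executor: takes the write lock and keeps it.
        Thread writer = new Thread(() -> {
            lock.writeLock().lock();
            writeHeld.countDown();
            try {
                Thread.sleep(holdMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.writeLock().unlock();
            }
        });
        writer.start();
        writeHeld.await();

        // Stand-in for the MessagingService inbound thread: parks on the read lock in
        // WAITING state, which is exactly what the watchdog flags as SYSTEM_WORKER_BLOCKED.
        long start = System.nanoTime();
        lock.readLock().lock();
        long blockedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        lock.readLock().unlock();
        writer.join();
        return blockedMs;
    }

    public static void main(String[] args) throws InterruptedException {
        // Hold the write lock slightly longer than the 500 ms watchdog threshold.
        System.out.println("reader blocked for ~" + simulateBlockedReader(600) + " ms");
    }
}
```

This only demonstrates the blocking mechanism the watchdog reports; the actual bug is presumably that the compaction executor holds the metastorage write lock for an unbounded time after restart, not the lock usage itself.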
--
This message was sent by Atlassian Jira
(v8.20.10#820010)