[
https://issues.apache.org/jira/browse/OAK-11477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17932627#comment-17932627
]
Stefan Egli commented on OAK-11477:
-----------------------------------
{quote}However, this is only similar see the exception text.{quote}
Agree, it's similar similar but different. The logs say
{noformat}
27.02.2025 16:56:10.119 *INFO* [main]
org.apache.jackrabbit.oak.plugins.document.DocumentDiscoveryLiteServiceIT Case
0: creating instance
27.02.2025 16:56:10.120 *INFO* [main]
org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore Server time
difference: 0ms (max allowed: 2000ms)
27.02.2025 16:56:10.120 *DEBUG* [main]
org.apache.jackrabbit.oak.plugins.document.ClusterNodeInfo Cannot acquire 1:
leaseEnd 1740675575279 > 1740675455291 - 119988ms in the fu
ture
27.02.2025 16:56:10.127 *INFO* [main]
org.apache.jackrabbit.oak.plugins.document.ClusterNodeInfo Found an existing
possibly active cluster node info (2) for this instance: mac
:78e3b5173576/22bb2b9d-055a-47b0-aed2-5d91bc8b546f, will try use it.
27.02.2025 16:56:10.132 *INFO* [main]
org.apache.jackrabbit.oak.plugins.document.ClusterNodeInfo Waiting for cluster
node 2's lease to expire: 2s left
27.02.2025 16:56:10.132 *DEBUG* [main]
org.apache.jackrabbit.oak.plugins.document.ClusterNodeInfo Cannot acquire 1:
leaseEnd 1740675575279 > 1740675460295 - 114984ms in the fu
ture
27.02.2025 16:56:10.135 *DEBUG* [main]
org.apache.jackrabbit.oak.plugins.document.LastRevRecoveryAgent No recovery
needed for clusterId 2
27.02.2025 16:56:10.136 *INFO* [main]
org.apache.jackrabbit.oak.plugins.document.ClusterNodeInfo Collision while
acquiring clusterId 2. Retrying...
27.02.2025 16:56:10.136 *DEBUG* [main]
org.apache.jackrabbit.oak.plugins.document.ClusterNodeInfo Cannot acquire 1:
leaseEnd 1740675575279 > 1740675460298 - 114981ms in the fu
ture
27.02.2025 16:56:10.136 *DEBUG* [main]
org.apache.jackrabbit.oak.plugins.document.LastRevRecoveryAgent No recovery
needed for clusterId 2
27.02.2025 16:56:10.136 *INFO* [main]
org.apache.jackrabbit.oak.plugins.document.ClusterNodeInfo Collision while
acquiring clusterId 2. Retrying...
27.02.2025 16:56:10.137 *DEBUG* [main]
org.apache.jackrabbit.oak.plugins.document.ClusterNodeInfo Cannot acquire 1:
leaseEnd 1740675575279 > 1740675460301 - 114978ms in the fu
ture
(then shortly afterwards)
org.apache.jackrabbit.oak.plugins.document.DocumentStoreException: Could not
get cluster node info (tried 10 times)
at
org.apache.jackrabbit.oak.plugins.document.ClusterNodeInfo.getInstance(ClusterNodeInfo.java:557)
at
org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore.<init>(DocumentNodeStore.java:616)
{noformat}
which looks like it wanted to just get a random clusterId (it passed 0 there)
but ClusterNodeInfo failed to provide one: it seems to try ids 2 and 1, both
unsuccessful for different reasons. Looking at those reasons in detail (during
those 10 retries, picking only log lines about individual clusterIds) show:
For clusterId 2 it seems to first be a crashed instance so it first waits for
it - then it says no recovery needed - but still doesn't pick it "due to
collision acquiring the lease" :
{noformat}
27.02.2025 16:56:10.127 *INFO* [main]
org.apache.jackrabbit.oak.plugins.document.ClusterNodeInfo Found an existing
possibly active cluster node info (2) for this instance:
mac:78e3b5173576/22bb2b9d-055a-47b0-aed2-5d91bc8b546f, will try use it.
27.02.2025 16:56:10.132 *INFO* [main]
org.apache.jackrabbit.oak.plugins.document.ClusterNodeInfo Waiting for cluster
node 2's lease to expire: 2s left
27.02.2025 16:56:10.135 *DEBUG* [main]
org.apache.jackrabbit.oak.plugins.document.LastRevRecoveryAgent No recovery
needed for clusterId 2
27.02.2025 16:56:10.136 *INFO* [main]
org.apache.jackrabbit.oak.plugins.document.ClusterNodeInfo Collision while
acquiring clusterId 2. Retrying...
27.02.2025 16:56:10.136 *DEBUG* [main]
org.apache.jackrabbit.oak.plugins.document.LastRevRecoveryAgent No recovery
needed for clusterId 2
...
{noformat}
For clusterId 1 it rejects it as the lease is [in the
future|https://github.com/apache/jackrabbit-oak/blob/f82e1b833243b7b021065390efb34fa8f539aa01/oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/ClusterNodeInfo.java#L630-L631]
(initially just still valid for 120000ms) :
{noformat}
27.02.2025 16:56:10.120 *DEBUG* [main]
org.apache.jackrabbit.oak.plugins.document.ClusterNodeInfo Cannot acquire 1:
leaseEnd 1740675575279 > 1740675455291 - 119988ms in the future
27.02.2025 16:56:10.132 *DEBUG* [main]
org.apache.jackrabbit.oak.plugins.document.ClusterNodeInfo Cannot acquire 1:
leaseEnd 1740675575279 > 1740675460295 - 114984ms in the future
27.02.2025 16:56:10.136 *DEBUG* [main]
org.apache.jackrabbit.oak.plugins.document.ClusterNodeInfo Cannot acquire 1:
leaseEnd 1740675575279 > 1740675460298 - 114981ms in the future
...
{noformat}
So clusterId 1 makes sense, that is still in the lease time - but why it can't
grab clusterId 2 is to be rabbit holed next ...
> Flaky test: DocumentDiscoveryLiteServiceIT.testLargeStartStopFiesta
> -------------------------------------------------------------------
>
> Key: OAK-11477
> URL: https://issues.apache.org/jira/browse/OAK-11477
> Project: Jackrabbit Oak
> Issue Type: Test
> Components: documentmk
> Reporter: Julian Reschke
> Assignee: Manfred Baedke
> Priority: Major
> Fix For: 1.78.0
>
>
> This test appears to be flaky:
> DocumentDiscoveryLiteServiceIT.testLargeStartStopFiesta
> Last failure:
> https://ci-builds.apache.org/job/Jackrabbit/job/jackrabbit-oak-trunk/org.apache.jackrabbit$oak-store-document/2023/testReport/junit/org.apache.jackrabbit.oak.plugins.document/DocumentDiscoveryLiteServiceIT/testLargeStartStopFiesta/
> {noformat}
> Fehlermeldung
> did not renew clusterid lease
> Stacktrace
> java.lang.AssertionError: did not renew clusterid lease
> at org.junit.Assert.fail(Assert.java:89)
> at
> org.apache.jackrabbit.oak.plugins.document.BaseDocumentDiscoveryLiteServiceTest$SimplifiedInstance.crash(BaseDocumentDiscoveryLiteServiceTest.java:288)
> at
> org.apache.jackrabbit.oak.plugins.document.BaseDocumentDiscoveryLiteServiceTest.doStartStopFiesta(BaseDocumentDiscoveryLiteServiceTest.java:872)
> at
> org.apache.jackrabbit.oak.plugins.document.DocumentDiscoveryLiteServiceIT.testLargeStartStopFiesta(DocumentDiscoveryLiteServiceIT.java:37)
> at
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base/java.lang.reflect.Method.invoke(Method.java:566)
> at
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
> at
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
> at
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> at
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54)
> at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61)
> at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61)
> at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> at
> org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
> at
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
> at
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
> at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
> at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
> at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
> {noformat}
> Maybe a timing issue?
--
This message was sent by Atlassian Jira
(v8.20.10#820010)