I have to admit that I haven’t tried the SafepointTimeout (I just noticed that
it was actually a production VM option in the JVM code, after my initial
suggestions below for debugging without it).
There doesn’t seem to be an obvious bug in SafepointTimeout, though I may not
be looking at the sa
Well I tried the SafepointTimeout option, but unfortunately it seems like
the long safepoint syncs don't actually trigger the SafepointTimeout
mechanism, so we didn't get any logs on it. It's possible I'm just doing it
wrong; I used the following options:
JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticV
Excellent, thanks for the tips, Graham. I'll give SafepointTimeout a try
and see if that gives us anything to act on.
On Fri, Oct 24, 2014 at 3:52 PM, graham sanderson wrote:
And -XX:SafepointTimeoutDelay=xxx
to set how long before it dumps output (defaults to 1 I believe)…
Note it doesn’t actually time out by default; it just prints the problematic
threads after that time and keeps on waiting
> On Oct 24, 2014, at 2:44 PM, graham sanderson wrote:
Actually - there is
-XX:+SafepointTimeout
which will print out offending threads (assuming you reach a 10 second pause)…
That is probably your best bet.
> On Oct 24, 2014, at 2:38 PM, graham sanderson wrote:
This certainly sounds like a JVM bug.
We are running C* 2.0.9 on pretty high-end machines with pretty large heaps,
and don’t seem to have seen this (note we are on 7u67, so that might be an
interesting data point, though since the old thread predated that, probably not).
1) From the app/java side
I'm also curious to know if this was ever resolved or if there are any other
recommended steps to take to continue to track it down. I'm seeing the same
issue in our production cluster, which is running Cassandra 2.0.10 and JVM
1.7u71, using the CMS collector. Just as described above, the issue is lo
Searching my list archives shows this thread evaporated. Was a root
cause ever found? Very curious.
On Mon, Feb 3, 2014 at 11:52 AM, Benedict Elliott Smith <
belliottsm...@datastax.com> wrote:
>>>>> [Truncated quoted G1 GC log excerpt. The recoverable pieces show small
>>>>> per-phase times (Scan RS ~23 ms, Object Copy ~112 ms, Ref Proc 1.1 ms,
>>>>> Other 2.1 ms, Free CSet 0.4 ms), region size 4096K with 17 young / 17
>>>>> survivor regions, perm gen 27428K used of 28672K total, and no shared
>>>>> spaces configured, yet the pause summary reads:]
>>>>> [Times: user=0.35 sys=0.00, real=27.58 secs]
>>>>> 222346.219: G1IncCollectionPause [ 111 0
CMS behaves in a similar manner. We thought it would be GC waiting for
mmapped files being read from disk (the thread cannot reach a safepoint during
this operation), but that doesn't explain the huge time.

We'll try jhiccup to see if it provides any additional information. The
test was done on a mixed aws/openstack environment, openjdk 1.7.0_45,
cassandra 1.2.11. Upgrading to 2.0.x is not an option for us.

regards,

ondrej cernos
chance to file a JIRA ticket. We have not been
>>> able to resolve the issue. But since Joel mentioned that upgrading to
>>> Cassandra 2.0.X solved it for them, we may need to upgrade. We are
>>> currently on Java 1.7 and Cassandra 1.2.8
>> On Thu, Feb 13, 2014 at 12:40 PM, Keith Wright wrote:
>>
>>> You’re running 2.0.* in production? May I ask what C* version and OS?
>
as well. Thx!
>>
>> From: Joel Samuelsson
>> Reply-To: "user@cassandra.apache.org"
>> Date: Thursday, February 13, 2014 at 11:39 AM
>>
>> To: "user@cassandra.apache.org"
>> Subject: Re: Intermittent long application pauses on nodes
>>
> To: "user@cassandra.apache.org"
> Subject: Re: Intermittent long application pauses on nodes
>
> We have had similar issues and upgrading C* to 2.0.x and Java to 1.7 seems
> to have helped our issues.
>
>
> 2014-02-13 Keith Wright :
>
>> Frank did you
nks
>
> From: Robert Coli
> Reply-To: "user@cassandra.apache.org"
> Date: Monday, February 3, 2014 at 6:10 PM
> To: "user@cassandra.apache.org"
> Subject: Re: Intermittent long application pauses on nodes
>
> On Mon, Feb 3, 2014 at 8:52 AM, Bene
On Mon, Feb 3, 2014 at 8:52 AM, Benedict Elliott Smith <
belliottsm...@datastax.com> wrote:
>
> It's possible that this is a JVM issue, but if so there may be some
> remedial action we can take anyway. There are some more flags we should
> add, but we can discuss that once you open a ticket. If yo
Hi Frank,
The "9391" under RevokeBias is the number of milliseconds spent
synchronising on the safepoint prior to the VM operation, i.e. the time it
took to ensure all application threads were stopped. So this is the
culprit. Notice that the time spent spinning/blocking for the threads we
are supp
I was able to send SafePointStatistics to another log file via the
additional JVM flags and recently noticed a pause of 9.3936600 seconds.
Here are the log entries:
GC Log file:
---
2014-01-31T07:49:14.755-0500: 137460.842: Total time for which application
threads were stopped: 0.1
>
>
> I never figured out what kills stdout for C*. It's a library we depend on,
> didn't try too hard to figure out which one.
>
Nah, it's Cassandra itself (in
org.apache.cassandra.service.CassandraDaemon.activate()), but you can pass
-f (for 'foreground') to not do it.
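The foreground mode mentioned above is just the stock startup script's -f
switch; a minimal usage sketch, assuming a tarball-style install directory:

bin/cassandra -f    # stay in the foreground so VM output keeps going to the console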
Add some more flags: -XX:+UnlockDiagnosticVMOptions -XX:LogFile=${path}
-XX:+LogVMOutput
I never figured out what kills stdout for C*. It's a library we depend on,
didn't try too hard to figure out which one.
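A sketch of how those flags might be added to cassandra-env.sh (the log path
is an assumption; LogVMOutput and LogFile are diagnostic options, hence the
unlock flag):

JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
# Write everything the VM would print to stdout into a file that survives daemonization
JVM_OPTS="$JVM_OPTS -XX:+LogVMOutput -XX:LogFile=/var/log/cassandra/vm.log"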
On 29 January 2014 21:07, Frank Ng wrote:
Benedict,
Thanks for the advice. I've tried turning on PrintSafepointStatistics.
However, that info is only sent to the STDOUT console. The cassandra
startup script closes the STDOUT when it finishes, so nothing is shown for
safepoint statistics once it's done starting up. Do you know how to
sta
Frank,
The same advice for investigating holds: add the VM flags
-XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1
(you could put something above 1 there, to reduce the amount of
logging, since a pause of 52s will be pretty obvious even if
aggregated with lots of other safe points
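Spelled out as cassandra-env.sh lines, a sketch of the flags being suggested
(a count of 1 prints statistics per safepoint rather than aggregated; combine
with the LogVMOutput/LogFile flags mentioned elsewhere in the thread if you
want them captured in a file rather than on stdout):

JVM_OPTS="$JVM_OPTS -XX:+PrintSafepointStatistics"
JVM_OPTS="$JVM_OPTS -XX:PrintSafepointStatisticsCount=1"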
Thanks for the update. Our logs indicated that there were 0 pending for
CompactionManager at that time. Also, there were no nodetool repairs
running at that time. The log statements above state that the application
had to stop to reach a safepoint. Yet, it doesn't say what is requesting
the saf
We had similar latency spikes when pending compactions couldn't keep up or
repair/streaming was taking too many cycles.
On Wed, Jan 29, 2014 at 10:07 AM, Frank Ng wrote:
All,
We've been having intermittent long application pauses (version 1.2.8) and
we're not sure if it's a cassandra bug. During these pauses, there are dropped
messages in the cassandra log file along with the node seeing other nodes
as down. We've turned on gc logging and the following is an example o
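The GC logging mentioned here is typically enabled with flags along these
lines (a sketch of common HotSpot 7 options, not necessarily the exact set
used on this cluster):

JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
# Produces the 'Total time for which application threads were stopped' lines quoted above
JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"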