Hi All,
I am summarizing the setup, problem & key observations till now:
Setup: Cassandra 2.0.14. 2 DCs with 3 nodes each connected via 10Gbps VPN. We
run repair with the -par and -pr options.
Problem: Repair hangs. Merkle tree responses are not received from one or more
nodes in the remote DC.
Observations till now:
1. Repair hangs intermittently on one node of DC2. Only on
via its public IP.
Thanks
Anuj
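For reference, the repair invocation under discussion can be sketched as below. Because -pr repairs only each node's primary token ranges, it has to be run on every node in turn; the hostnames here are hypothetical placeholders:

```shell
# Hypothetical node list; -par runs validation in parallel across replicas,
# -pr limits repair to each node's primary ranges (so run it on every node).
nodes="dc1-n1 dc1-n2 dc1-n3 dc2-n1 dc2-n2 dc2-n3"
for host in $nodes; do
  echo "would run: nodetool -h $host repair -par -pr"
  # nodetool -h "$host" repair -par -pr   # uncomment against a live cluster
done
```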
On Tue, 24/11/15, Paulo Motta wrote:
Subject: Re: Repair Hangs while requesting Merkle Trees
To: "user@cassandra.apache.org", "Anuj Wadehra"
Date: Tuesday, 24 November, 2015, 12:38 AM
The is
k team to capture netstats and tcpdump
> too..
>
> Thanks
> Anuj
>
>
>
> On Wed, 18/11/15, Anuj Wadehra wrote:
>
> Subject: Re: Repair Hangs while requesting Merkle Trees
> To: "user@cassandra.apache.org"
: Repair Hangs while requesting Merkle Trees
To: "user@cassandra.apache.org"
Date: Wednesday, 18 November, 2015, 7:57 AM
Thanks Bryan !!
Connection
is in ESTABLISHED state on one end and completely missing at
the other end (in another DC).
Yes,
we can revisit TCP tuning. But the probl
Cheng"
Date:Wed, 18 Nov, 2015 at 2:04 am
Subject:Re: Repair Hangs while requesting Merkle Trees
Ah OK, might have misunderstood you. Streaming socket should not be in play
during merkle tree generation (validation compaction). They may come in play
during merkle tree exchange - that I'm not sure
> Anuj
> Sent from Yahoo Mail on Android
> <https://overview.mail.yahoo.com/mobile/?.src=Android>
> --
> *From*:"Bryan Cheng"
> *Date*:Tue, 17 Nov, 2015 at 5:54 am
>
> *Subject*:Re: Repair Hangs while requesting Merkle Tr
different?
Thanks
Anuj
From:"Bryan Cheng"
Date:Tue, 17 Nov, 2015 at 5:54 am
Subject:Re: Repair Hangs while requesting Merkle Trees
Hi Anuj,
Did you mean streaming_socket_timeout_in_ms? If not, then you definitely want
that set. Even the be
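For context, the setting Bryan refers to lives in cassandra.yaml; a sketch follows (the timeout value is illustrative, not a recommendation):

```yaml
# cassandra.yaml (2.0.x) - illustrative value only.
# Close streaming connections that sit idle instead of letting them hang
# forever; 0 (the default in this era) means never time out.
streaming_socket_timeout_in_ms: 3600000   # e.g. 1 hour
```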
> Anuj
>
> --
> *From*:"Anuj Wadehra"
> *Date*:Sat, 14 Nov, 2015 at 11:59 pm
>
> *Subject*:Re: Repair Hangs while requesting Merkle Trees
From:"Anuj Wadehra"
Date:Sat, 14 Nov, 2015 at 11:59 pm
Subject:Re: Repair Hangs while requesting Merkle Trees
Thanks Daemeon !!
I will capture the output of netstat and share it in the next few days. We were
thinking of taking tcpdump captures also. If it's a network issue and increasing
request
. Is it related some how?
Thanks
Anuj
From:"daemeon reiydelle"
Date:Thu, 12 Nov, 2015 at 10:34 am
Subject:Re: Repair Hangs while requesting Merkle Trees
Have you checked the network statistics on that machine? (netstat -tas) while
attempting to rep
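Daemeon's suggestion can be paired with a packet trace on both ends; a sketch, assuming Linux and the default inter-node storage port 7000 (check cassandra.yaml):

```shell
# Snapshot TCP state on the storage port while the repair appears hung.
port=7000   # assumed default inter-node (storage) port
if command -v netstat >/dev/null 2>&1; then
  # local address is column 4, foreign address column 5 in `netstat -tan`
  netstat -tan | awk -v p=":$port" '$4 ~ p || $5 ~ p {print}'
fi
# A packet capture on the same port (root required) can show whether the
# merkle tree reply ever left the remote node:
# tcpdump -i eth0 -w repair.pcap "tcp port $port"
```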
5 nodes. On only
> one node in DC2, we are unable to complete repair as it always hangs. Node
> sends Merkle Tree requests, but one or more nodes in DC1 (remote) never
> show that they sent the merkle tree reply to requesting node.
> Repair hangs infinitely.
>
> After increasing req
show that
they sent the merkle tree reply to requesting node.
Repair hangs infinitely.
After increasing request_timeout_in_ms on affected node, we were able to
successfully run repair on one of the two occasions.
Any comments, why this is happening on just one node? In
OutboundTcpConnection.java, when isTimeOut
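The timeout Anuj raised is also a cassandra.yaml setting; a sketch with an illustrative value (the 2.0.x default is 10000 ms):

```yaml
# cassandra.yaml - illustrative value. Raising it gives slow cross-DC merkle
# tree responses more time before the coordinator gives up on the request.
request_timeout_in_ms: 30000
```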
> Thanks Robert, I believe this is a good idea but I was doing it already.
>
> "If you are really overprovisioned and on real hardware and network and
> SSD, it might work sometimes."
>
> I am on AWS and was on m1.small to m1.xlarge from Cassandra 0.8 to 1.2.18,
> that
about the most?"
Thanks Robert, I believe this is a good idea but I was doing it already.
"If you are really overprovisioned and on real hardware and network and
SSD, it might work sometimes."
I am on AWS and was on m1.small to m1.xlarge from Cassandra 0.8 to 1.2.18,
that's the
On Mon, Oct 20, 2014 at 5:45 AM, Alain RODRIGUEZ wrote:
> I know that 2.1 fixes all this. We are going to migrate to C* 2.0 soon
> (asap) and then to 2.1, but we first need to run some tests, which will
> take us some time. Is repair officially broken on 1.2.18 ? Is there any
> known workaround or
On Mon, Oct 20, 2014 at 5:45 AM, Alain RODRIGUEZ wrote:
> Using Cassandra 1.2.18, we are experiencing an issue in our 2 DC
> (EC2MultiRegionSnitch) C*1.2.18 cluster.
>
> We have 2 DC and I saw some weird* inconsistencies between our 2 DC. I
> tried to run repair on all the nodes of both DCs (We
Hi,
Using Cassandra 1.2.18, we are experiencing an issue in our 2 DC
(EC2MultiRegionSnitch) C*1.2.18 cluster.
We have 2 DC and I saw some weird* inconsistencies between our 2 DC. I
tried to run repair on all the nodes of both DCs (We tried running various
repair at the same time and also in a ro
> I changed logging to debug level, but still nothing is logged.
> Again - any help will be appreciated.
There is nothing at the ERROR level on any machine?
check nodetool compactionstats to see if a validation compaction is running,
the repair may be waiting on this.
check nodetool netstats
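Aaron's two checks, as a script one might keep on the seemingly hung node (the live commands are commented out since they need a running node; the grep patterns are assumptions about typical output):

```shell
# While repair appears hung, check for a still-running validation compaction
# (merkle tree build) and for pending inter-node transfers:
# nodetool compactionstats | grep -i validation   # merkle tree still building?
# nodetool netstats                               # pending streams/commands?
# Also worth scanning the log for repair session activity:
# grep -i 'AntiEntropy\|repair' /var/log/cassandra/system.log | tail -20
checks="compactionstats netstats system.log"
echo "checks to run: $checks"
```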
Update - I am still experiencing the above issues, but not all the time. I
was able to run repair (on this keyspace) from node 2 and from node 4, but
now a different keyspace hangs on these nodes, and I am still not able to
run repair on node 1. It seems random. I changed logging to debug level,
bu
Hi,
On AWS, we had a 2 node cluster with RF 2.
We added 2 more nodes, then changed RF to 3 on all our keyspaces.
Next step was to run nodetool repair, node by node.
(In the meantime, we found that we must use CL quorum, which is affecting
our application's performance).
Started with node 1, which
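The sequence described above (raise RF, repair node by node, reading at QUORUM in the meantime) might look like the following; the keyspace name and replication settings are hypothetical:

```shell
ks=myks   # hypothetical keyspace name
# 1) Raise the replication factor (NetworkTopologyStrategy shown; DC name assumed):
stmt="ALTER KEYSPACE $ks WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};"
echo "$stmt"
# cqlsh -e "$stmt"                  # needs a live cluster
# 2) Until every node is repaired, read/write at QUORUM so a stale replica
#    cannot satisfy a read alone; then repair one node at a time:
# nodetool -h node1 repair "$ks"
```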
> If I wait 24 hours, the repair command will return an error saying that the
> node died… but the node really didn't die, I watched it the whole time.
Can you include the error, it makes it easier to know what's going on.
You should see INFO messages on the node you are running repair on that say
> How does it fail?
If I wait 24 hours, the repair command will return an error saying that the
node died… but the node really didn't die, I watched it the whole time.
I have the DEBUG messages on in the log files, when the node I'm repairing
sends out a merkle tree request, I will normally see, {C
> A repair on a certain CF will fail, and I run it again and again, eventually
> it will succeed.
How does it fail?
Can you see the repair start on the other node?
If you are getting errors in the log about streaming failing because a node
died, and the FailureDetector is in the call stack, ch
Hello,
I have a cluster with 4 nodes, 2 nodes in 2 data centers. I had a hardware
failure in one DC and had to replace the nodes. I'm running 1.2.3 on all of the
nodes now. I was able to run nodetool rebuild on the two replacement nodes, but
now I cannot finish a repair on any of them. I have 1
Upgrading to 1.2.3 fixed the -pr Repair.. I'll just use that from now on
(which is what I prefer!)
Thanks,
Ryan
On Wed, Mar 27, 2013 at 9:11 AM, Ryan Lowe wrote:
> Marco,
>
> No there are no errors... the last line I see in my logs related to repair
> is :
>
> [repair #...] Sending completed m
Marco,
No there are no errors... the last line I see in my logs related to repair
is :
[repair #...] Sending completed merkle tree to /[node] for
(keyspace1,columnfamily1)
Ryan
On Wed, Mar 27, 2013 at 8:49 AM, Marco Matarazzo <
marco.matara...@hexkeep.com> wrote:
> > If I run `nodetool -h lo
> If I run `nodetool -h localhost repair`, then it will repair only the first
> Keyspace and then hang... I let it go for a week and nothing.
Does node logs show any error ?
> If I run `nodetool -h localhost repair -pr`, then it appears to only repair
> the first VNode range, but does do all ke
Has anyone else experienced this? After upgrading to VNodes, I am having
Repair issues.
If I run `nodetool -h localhost repair`, then it will repair only the first
Keyspace and then hang... I let it go for a week and nothing.
If I run `nodetool -h localhost repair -pr`, then it appears to only r
> /raid0/cassandra/data/OpsCenter/events_timeline/OpsCenter-events_timeline-hf-1-Data.db
> is not compatible with current version ib
> --
This can be fixed with a nodetool upgradesstables
Cheers
-
Aaron Morton
Freelance Cassandra Consultant
New Zealand
@aaronmorton
http://www.the
On Thu, Mar 14, 2013 at 6:34 AM, aaron morton wrote:
>> 1. is this a nodetool bug? is there any way to propagate the
>> java.io.IOException back to nodetool?
> The repair continues to work even if nodetool fails, it's a server side thing.
>
>> 2. network problems on EC2, I'm shocked! are there r
> 1. is this a nodetool bug? is there any way to propagate the
> java.io.IOException back to nodetool?
The repair continues to work even if nodetool fails, it's a server side thing.
> 2. network problems on EC2, I'm shocked! are there recommended
> network settings for EC2?
Streaming does not p
On Wed, Mar 13, 2013 at 12:39 PM, Wei Zhu wrote:
> My guess would be there is some exception during the repair and your session
> is aborted.
> Here is the code of doing repair:
>
>https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/AntiEntropyService.java
>
> loo
should give you a rough idea in which stage
the repair died.
-Wei
- Original Message -
From: "Dane Miller"
To: user@cassandra.apache.org, "Wei Zhu"
Sent: Wednesday, March 13, 2013 12:32:20 PM
Subject: Re: repair hangs
On Wed, Mar 13, 2013 at 11:44 AM, Wei Zhu wrote:
>Do you see anything related to "merkle" tree in your log?
>
>Also do a nodetool compactionstats, during merkle tree calculation, you will
>see validation there.
The last mention of "merkle" is 2 days old. compactionstats are:
$ nodetool compac
10:54:50 AM
Subject: repair hangs
Hi,
On one of my nodes, nodetool repair -pr has been running for 48 hours
and appears to be hung, with no output and no AntiEntropy messages in
system.log for 40+ hours. Load, cpu, etc are all near zero. There
are no other repair jobs running in my cluster.
What's the recommended way to deal wi