Re: Node goes AWOL briefly; failed replication does not report error to client, though consistency=ALL

2010-12-08 Thread Reverend Chip
On 12/8/2010 7:30 AM, Jonathan Ellis wrote: > On Tue, Dec 7, 2010 at 4:00 PM, Reverend Chip wrote: >> Full DEBUG level logs would be a space problem; I'm loading at least 1T per node (after 3x replication), and these events are rare. Can the DEBUG logs be limit

Re: Node goes AWOL briefly; failed replication does not report error to client, though consistency=ALL

2010-12-07 Thread Reverend Chip
st 1T per node (after 3x replication), and these events are rare. Can the DEBUG logs be limited to the specific modules helpful for this diagnosis of the gossip problem and, secondarily, the failure to report replication failure? > On Tue, Dec 7, 2010 at 2:37 PM, Reverend Chip wrote: >> No,
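
The per-module question has a reasonable answer in 0.7: logging goes through log4j, which supports per-package levels, so gossip DEBUG output can be enabled without turning on DEBUG everywhere. A minimal sketch of lines one might add to conf/log4j-server.properties, assuming the stock appender names (the gossip and failure-detector classes live under org.apache.cassandra.gms; the messaging layer, which accounts for dropped operations, lives under org.apache.cassandra.net):

    # keep the default level as shipped
    log4j.rootLogger=INFO,stdout,R
    # DEBUG only for gossip / failure detection
    log4j.logger.org.apache.cassandra.gms=DEBUG
    # optionally, the messaging layer as well, for the dropped-mutation side
    log4j.logger.org.apache.cassandra.net=DEBUG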

Re: Node goes AWOL briefly; failed replication does not report error to client, though consistency=ALL

2010-12-07 Thread Reverend Chip
e/CASSANDRA-1804 which is fixed in rc2. > On Mon, Dec 6, 2010 at 6:58 PM, Reverend Chip wrote: >> I'm running a big test -- ten nodes with 3T disk each. I'm using 0.7.0rc1. After some tuning help (thanks Tyler) lots of this is working as it should. H

Node goes AWOL briefly; failed replication does not report error to client, though consistency=ALL

2010-12-06 Thread Reverend Chip
I'm running a big test -- ten nodes with 3T disk each. I'm using 0.7.0rc1. After some tuning help (thanks Tyler) lots of this is working as it should. However, a serious event occurred as well -- the server froze up -- and though mutations were dropped, no error was reported to the client. Here'
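
For reference on what the client should have seen: with the 0.7-era Thrift interface, a write at ConsistencyLevel.ALL that is not acknowledged by every replica is supposed to surface as TimedOutException (or UnavailableException if a replica is already marked down). A rough Java sketch against that interface, using placeholder keyspace, column family, and key names; exact generated signatures may vary between 0.7 builds:

    import java.nio.ByteBuffer;
    import org.apache.cassandra.thrift.*;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;
    import org.apache.thrift.transport.TTransport;

    public class AllWriteCheck {
        public static void main(String[] args) throws Exception {
            TTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
            Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
            transport.open();
            client.set_keyspace("Keyspace1");                    // placeholder keyspace

            Column col = new Column(
                    ByteBuffer.wrap("col".getBytes()),           // column name
                    ByteBuffer.wrap("val".getBytes()),           // value
                    System.currentTimeMillis() * 1000);          // timestamp (microseconds by convention)
            try {
                // At ALL, every replica must acknowledge before the call returns.
                client.insert(ByteBuffer.wrap("key".getBytes()),
                              new ColumnParent("Standard1"),     // placeholder column family
                              col, ConsistencyLevel.ALL);
            } catch (UnavailableException e) {
                // a replica was already marked down when the write was attempted
                System.err.println("unavailable: " + e);
            } catch (TimedOutException e) {
                // replicas did not all acknowledge within rpc_timeout;
                // the error one would expect when mutations are dropped
                System.err.println("timed out: " + e);
            } finally {
                transport.close();
            }
        }
    }

If neither exception reaches the client while a node is dropping mutations, that matches the behavior reported in this thread.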

Re: repair takes two days, and ends up stuck: stream at 1096% (yes, really)

2010-11-15 Thread Reverend Chip
On 11/15/2010 2:01 PM, Jonathan Ellis wrote: > On Mon, Nov 15, 2010 at 3:05 PM, Reverend Chip wrote: >> There are a lot of non-tmps that were not included in the load figure. Having stopped the server and deleted tmp files, the data are still using way mor

Re: repair takes two days, and ends up stuck: stream at 1096% (yes, really)

2010-11-15 Thread Reverend Chip
On 11/15/2010 12:09 PM, Jonathan Ellis wrote: > On Mon, Nov 15, 2010 at 1:03 PM, Reverend Chip wrote: >> I find X.21's data disk is full. "nodetool ring" says that X.21 has a load of only 326.2 GB, but the 1T partition is full. > Load only tracks live data --

Re: Gossip yoyo under write load

2010-11-15 Thread Reverend Chip
On 11/15/2010 12:13 PM, Rob Coli wrote: > On 11/15/10 12:08 PM, Reverend Chip wrote: >>> logger_.warn("Unable to lock JVM memory (ENOMEM)." or logger.warn("Unknown mlockall error " + errno(e));

Re: Gossip yoyo under write load

2010-11-15 Thread Reverend Chip
On 11/15/2010 11:34 AM, Rob Coli wrote: > On 11/13/10 11:59 AM, Reverend Chip wrote: >> Swapping could conceivably be a factor; the JVM is 32G out of 72G, but the machine is 2.5G into swap anyway. I'm going to disable swap and see if the gossip issues resolve.

Re: repair takes two days, and ends up stuck: stream at 1096% (yes, really)

2010-11-15 Thread Reverend Chip
On 11/15/2010 10:30 AM, Jonathan Ellis wrote: > Is X.20 spewing these errors constantly now? Yes. > Did X.21 log anything when/before the errors started on X.20? I find X.21's data disk is full. "nodetool ring" says that X.21 has a load of only 326.2 GB, but the 1T partition is full. When I tra

Re: repair takes two days, and ends up stuck: stream at 1096% (yes, really)

2010-11-15 Thread Reverend Chip
Did I answer the question sufficiently? I need repair to work, and the cluster is sick. On 11/14/2010 2:17 PM, Jonathan Ellis wrote: > What exception is causing it to fail/retry? > > On Sun, Nov 14, 2010 at 3:49 PM, Chip Salzenberg wrote: >> My by-now infamous eight-node cluster running 0.7.0bet

Re: Cluster fragility

2010-11-13 Thread Reverend Chip
streaming, both source and target > - DEBUG level logs > - instructions for how to reproduce > On Thu, Nov 11, 2010 at 7:46 PM, Reverend Chip wrote: >> I've been running tests with a first four-node, then eight-node cluster. I started with 0.7.0 beta3, but ha

Re: Gossip yoyo under write load

2010-11-13 Thread Reverend Chip
On 11/12/2010 6:46 PM, Jonathan Ellis wrote: > On Fri, Nov 12, 2010 at 3:19 PM, Chip Salzenberg wrote: >> After rebooting my 0.7.0beta3+ cluster to increase threads (read=100 write=200 ... they're beefy machines) and putting them under load again, I find gossip reporting yoyo up-down-up-do
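
For anyone following along, the "read=100 write=200" thread counts correspond to the request-stage settings, which in 0.7 live in cassandra.yaml. A sketch with the values mentioned above (illustrative only; the stock yaml comments tie concurrent_reads to the number of data disks and concurrent_writes to the number of cores rather than to fixed numbers this large, and both settings take effect only on restart):

    # cassandra.yaml (0.7): request stage sizing
    concurrent_reads: 100
    concurrent_writes: 200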

Cluster fragility

2010-11-11 Thread Reverend Chip
I've been running tests with a first four-node, then eight-node cluster. I started with 0.7.0 beta3, but have since updated to a more recent Hudson build. I've been happy with a lot of things, but I've had some really surprisingly unpleasant experiences with operational fragility. For example, w

Re: node won't leave

2010-11-07 Thread Reverend Chip
On 11/6/2010 8:26 PM, Jonathan Ellis wrote: > On Sat, Nov 6, 2010 at 4:51 PM, Reverend Chip wrote: >> Am I to understand that ring maintenance requests can just fail when partially complete, in the same manner as a regular insert might fail, perhaps due to inter-node

loadbalance kills gossip?

2010-11-06 Thread Reverend Chip
More weirdness with my four-or-five-node cluster of 0.7 beta3. Having brought up all five nodes, including the one that didn't loadbalance right, I tried loadbalancing it again. (This is under completely idle conditions - no external reads or writes.) The result is a cluster where each node thi

Re: node won't leave

2010-11-06 Thread Reverend Chip
On 11/6/2010 1:48 PM, Jonathan Ellis wrote: > On Fri, Nov 5, 2010 at 8:03 PM, Chip Salzenberg wrote: >> In the below "nodetool ring" output, machine 18 was told to loadbalance over an hour ago. It won't actually leave the ring. When I first told it to loadbalance, the cluster was under hea