Re: Corrupted data

2011-07-11 Thread Jonathan Ellis
That looks a lot like what I've seen from machines with bad RAM. 2011/7/8 Héctor Izquierdo Seliva : > Hi everyone, > > I'm having thousands of these errors: > >  WARN [CompactionExecutor:1] 2011-07-08 16:36:45,705 > CompactionManager.java (line 737) Non-fatal error reading row > (stacktrace follow

Re: Corrupted data

2011-07-10 Thread Yan Chunlu
It has already been running for about 20 hours... On Mon, Jul 11, 2011 at 1:36 AM, aaron morton wrote: > 1) Do I need to treat every node as failed and do a rolling replacement, > since there might be inconsistencies in the cluster that I have no way to > find out about? > > see > http://wiki.apache.org/cassand

Re: Corrupted data

2011-07-10 Thread Yan Chunlu
Oh, the error seems to come from JMX. Sorry, but I don't seem to have any more error messages; the node repair just never ends... and running strace on the process shows nothing; it is not doing anything. Is there any way to get more information about this? Do I need to do a major compaction on every column family? thank

Re: Corrupted data

2011-07-10 Thread aaron morton
> 1) Do I need to treat every node as failed and do a rolling replacement, > since there might be inconsistencies in the cluster that I have no way to > find out about? see http://wiki.apache.org/cassandra/Operations#Dealing_with_the_consequences_of_nodetool_repair_not_running_within_GCGraceSecond

Re: Corrupted data

2011-07-10 Thread Yan Chunlu
I am running RF=2 (I changed it from 2 to 3 and back to 2) on 3 nodes, and had not run node repair for more than 10 days; I was not aware that this is critical. I ran node repair recently and one of the nodes always hangs... from the log it seems to be doing nothing related to the repair. So I have two problems:

Re: Corrupted data

2011-07-09 Thread Héctor Izquierdo Seliva
All the important stuff is using QUORUM. Normal operation uses around 3-4 GB of heap out of 6. I've also tried running repair on a per-CF basis, and still no luck. I've found it's faster to bootstrap a node again than to repair it. Once I have the cluster in a sane state I'll try running a repair

Re: Corrupted data

2011-07-09 Thread Jonathan Ellis
Sounds like your non-repair workload is using too much of the heap. Alternatively, you could have a very large supercolumn that causes the OOM when it is read. 2011/7/9 Héctor Izquierdo Seliva : > Hi Peter. > >  I have a problem with repair, and it's that it always brings the node > doing the rep

Re: Corrupted data

2011-07-09 Thread aaron morton
> Nop, only when something breaks Unless you've been working at QUORUM, life is about to get trickier. Repair is an essential part of running a Cassandra cluster; without it you risk data loss and dead data coming back to life. If you have been writing at QUORUM, so have a reasonable expectatio
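A note on the QUORUM point above: a quorum is a majority of the replicas, i.e. floor(RF / 2) + 1, and a read is guaranteed to overlap a prior write on at least one replica when the read and write replica counts together exceed RF. The following is a minimal sketch of that arithmetic only; the function names are hypothetical and are not part of any Cassandra API.

```python
def quorum(rf: int) -> int:
    """Number of replicas that make up a quorum for replication factor rf."""
    return rf // 2 + 1


def read_sees_write(rf: int, read_replicas: int, write_replicas: int) -> bool:
    """True when any read set must share at least one replica with any write set."""
    return read_replicas + write_replicas > rf


# RF values mentioned in this thread: RF=3 (6 nodes) and RF=2 (3 nodes).
for rf in (2, 3):
    q = quorum(rf)
    print(f"RF={rf}: quorum={q}, "
          f"QUORUM/QUORUM overlaps={read_sees_write(rf, q, q)}, "
          f"ONE/ONE overlaps={read_sees_write(rf, 1, 1)}")
```

Note that with RF=2 a quorum is both replicas, so QUORUM operations cannot tolerate a single replica being down.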

Re: Corrupted data

2011-07-09 Thread Héctor Izquierdo Seliva
Hi Peter. My problem with repair is that it always brings down the node doing the repair. I've tried setting index_interval to 5000, and it still dies with OutOfMemory errors, or even worse, it generates thousands of tiny sstables before dying. I've tried like 20 repairs during thi

Re: Corrupted data

2011-07-09 Thread Peter Schuller
>> - Have you been running repair consistently ? > > Nop, only when something breaks This is unrelated to the problem you were asking about, but if you never run delete, make sure you are aware of: http://wiki.apache.org/cassandra/Operations#Frequency_of_nodetool_repair http://wiki.apache.org/cas

Re: Corrupted data

2011-07-08 Thread Héctor Izquierdo Seliva
Hi Aaron, On Fri, 08-07-2011 at 14:47 -0700, aaron morton wrote: > You may not lose data. > > - What version and what's the upgrade history? All versions from 0.7.1 to 0.8.1. All CFs were in 0.8.1 format though > - What RF / node count / CL ? RF=3, node count = 6 > - Have you been runni

Re: Corrupted data

2011-07-08 Thread aaron morton
You may not lose data. - What version and what's the upgrade history? - What RF / node count / CL ? - Have you been running repair consistently ? - Is this on a single node or all nodes ? Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.c

Corrupted data

2011-07-08 Thread Héctor Izquierdo Seliva
Hi everyone, I'm having thousands of these errors: WARN [CompactionExecutor:1] 2011-07-08 16:36:45,705 CompactionManager.java (line 737) Non-fatal error reading row (stacktrace follows) java.io.IOError: java.io.IOException: Impossible row size 6292724931198053 at org.apache.cassandra.db.
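An "Impossible row size" message is the kind of error a reader raises when a length prefix decoded from a data file is larger than the data that could possibly follow it, which is what a flipped bit or a corrupted block tends to produce. The following is a minimal sketch of such a sanity check, not Cassandra's actual code: the record layout (an 8-byte big-endian length followed by the payload) and the name `read_record` are assumptions made purely for illustration.

```python
import io
import struct


def read_record(stream, total_size):
    """Read one length-prefixed record, rejecting sizes that cannot be valid."""
    header = stream.read(8)
    if len(header) < 8:
        raise EOFError("truncated length prefix")
    (size,) = struct.unpack(">q", header)  # signed 64-bit, big-endian
    remaining = total_size - stream.tell()
    # A declared size that is negative or exceeds the bytes left in the file
    # can only come from corrupted data, so fail instead of reading garbage.
    if size < 0 or size > remaining:
        raise IOError(f"Impossible row size {size}")
    return stream.read(size)


# A well-formed record, and one whose length prefix has been corrupted.
good = struct.pack(">q", 5) + b"hello"
bad = struct.pack(">q", 6292724931198053) + b"hello"

print(read_record(io.BytesIO(good), len(good)))
try:
    read_record(io.BytesIO(bad), len(bad))
except IOError as err:
    print("rejected:", err)
```

A check like this only detects corruption; it does not say whether the cause is bad RAM, a bad disk, or a software bug.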