Re: Occasional 10s Timeouts on Read

2010-06-24 Thread Jonathan Ellis
Glad you tracked that down! On Wed, Jun 23, 2010 at 6:14 PM, AJ Slater wrote: > This issue is caused by my network. > > Cassandra maintains multiple gossip connections per node pair. One of > these connections is used for heartbeat and load broadcasting traffic. > It's quite talky. Another one is

Re: Occasional 10s Timeouts on Read

2010-06-23 Thread AJ Slater
This issue is caused by my network. Cassandra maintains multiple gossip connections per node pair. One of these connections is used for heartbeat and load broadcasting traffic. It's quite talky. Another one is used for distributed key reads. It's idle unless distributed keys are actively being sough

Re: Occasional 10s Timeouts on Read

2010-06-19 Thread AJ Slater
The only indication I have that Cassandra realized something was wrong during this period was this INFO message: 10.33.2.70:/var/log/cassandra/output.log DEBUG 20:00:35,841 get_slice DEBUG 20:00:35,841 weakreadremote reading SliceFromReadCommand(table='jolitics.com', key='4c43228354b38f14a1a015d

Re: Occasional 10s Timeouts on Read

2010-06-19 Thread AJ Slater
Agreed. But those connection errors were happening at a sort of random time. Not the time when I was seeing the problem. Now I am seeing the problem and here are some logs without ConnectionExceptions. Here we're asking 10.33.2.70 for key: 52e86817a577f75e545cdecd174d8b17 which resides only on 10.

Re: Occasional 10s Timeouts on Read

2010-06-19 Thread Jonathan Ellis
This is definitely not a Cassandra bug, something external is causing those connection failures. On Sat, Jun 19, 2010 at 3:12 PM, AJ Slater wrote: > Logging with TRACE reveals immediate problems with no client requests > coming in to the servers. The problem was immediate and persisted over > the

Re: Occasional 10s Timeouts on Read

2010-06-19 Thread AJ Slater
tcpdump shows bidirectional communication with ACKs during a known problem period. I did not have TRACE logging going during the period I have tcpdump logs, but I assume that an 'INFO error connecting to' is probably caused by ConnectExceptions. For instance... lpc03:~$ telnet fs02 7000 ...conne
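
A minimal sketch of the manual `telnet fs02 7000` check above as a repeatable probe. It is not from the thread: the function name `probe_tcp` and the 3-second timeout are assumptions made for illustration. It only reports whether one TCP connect attempt succeeds, is refused, or times out, which is the same distinction the ConnectException in the logs points at.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Try one TCP connect and report 'open', 'refused', 'timeout', or 'error: ...'."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The peer answered with a reset: nothing is listening on that port.
        return "refused"
    except socket.timeout:
        # No answer within the timeout: packets are likely being dropped.
        return "timeout"
    except OSError as exc:
        # Anything else (DNS failure, unreachable network, ...).
        return f"error: {exc}"


if __name__ == "__main__":
    # Host and port taken from the telnet example in the message above.
    print(probe_tcp("fs02", 7000))
```

Running it from each node against each peer during a problem period would show whether the refusals are one-directional or tied to a particular pair of hosts.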

Re: Occasional 10s Timeouts on Read

2010-06-19 Thread Peter Schuller
> TRACE 14:42:06,248 unable to connect to /10.33.3.20 > java.net.ConnectException: Connection refused >        at java.net.PlainSocketImpl.socketConnect(Native Method) So that's interesting since it is a clear failure that comes from the operating system and indicates something which can be observ

Re: Occasional 10s Timeouts on Read

2010-06-19 Thread AJ Slater
Logging with TRACE reveals immediate problems with no client requests coming in to the servers. The problem was immediate and persisted over the course of half an hour: 10.33.2.70 lpc03 10.33.3.10 fs01 10.33.3.20 fs02 a...@lpc03:~$ grep unable /var/log/cassandra/output.log TRACE 14:07:52,1

Re: Occasional 10s Timeouts on Read

2010-06-19 Thread AJ Slater
I shall do just that. I did a bunch of tests this morning and the situation appears to be this: I have three nodes A, B and C, with RF=2. I understand now why this issue wasn't apparent with RF=3. If there are regular internode column requests going on (e.g. I set up a pinger to get remote column

Re: Occasional 10s Timeouts on Read

2010-06-18 Thread Jonathan Ellis
Set log level to TRACE and see if the OutboundTcpConnection is going bad. That would explain the message never arriving. On Fri, Jun 18, 2010 at 10:39 AM, AJ Slater wrote: > To summarize: > > If a request for a column comes in *after a period of several hours > with no requests*, then the node s

Re: Occasional 10s Timeouts on Read

2010-06-18 Thread AJ Slater
To summarize: If a request for a column comes in *after a period of several hours with no requests*, then the node servicing the request hangs while looking for its peer rather than servicing the request like it should. It then throws either a TimedOutException or a (wrong) NotFoundException. And

Re: Occasional 10s Timeouts on Read

2010-06-17 Thread AJ Slater
These are physical machines. storage-conf.xml.fs03 is here: http://pastebin.com/weL41NB1 Diffs from that for the other two storage-confs are inline here: a...@worm:../Z3/cassandra/conf/dev$ diff storage-conf.xml.lpc03 storage-conf.xml.fs01 185c185 > 71603818521973537678586548668074777838 229

Re: Occasional 10s Timeouts on Read

2010-06-17 Thread AJ Slater
The machines in question have 8GB of RAM each and generally never touch swap. I shall try to monitor memory/swap overnight and see if something strange happens. Would swapping really take 10s? AJ On Thu, Jun 17, 2010 at 1:54 PM, Jonathan Ellis wrote: > The explanation that best fits the symptom
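
One way to do the overnight swap monitoring mentioned above, as a rough sketch rather than anything the poster actually ran. It assumes a Linux host where `/proc/meminfo` exposes `SwapTotal` and `SwapFree` in kB; the function name `parse_swap_used_kb` and the 60-second interval are made up for illustration.

```python
import time


def parse_swap_used_kb(meminfo_text):
    """Return swap in use (kB) computed from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            # The first token after the colon is the numeric value.
            fields[name.strip()] = int(parts[0])
    return fields["SwapTotal"] - fields["SwapFree"]


if __name__ == "__main__":
    # Log swap usage once a minute; stop with Ctrl-C.
    while True:
        with open("/proc/meminfo") as f:
            used = parse_swap_used_kb(f.read())
        print(time.strftime("%Y-%m-%d %H:%M:%S"), "swap used kB:", used)
        time.sleep(60)
```

Lining up the timestamps with the times of the 10s read timeouts would show whether swap usage changes when the problem appears.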

Re: Occasional 10s Timeouts on Read

2010-06-17 Thread AJ Slater
The behavior was seen with row caching off. I now have row caching on. Key cache hit rate is 0.75-0.8; row cache hit rate is 0 (row cache capacity=1, RowsCached="100%"). Looks like I should try another format for RowsCached, like "0.8" or "90%" or something. On Thu, Jun 17, 2010 at 1:47 PM, aaron

Re: Occasional 10s Timeouts on Read

2010-06-17 Thread Benjamin Black
Are these physical machines or virtuals? Did you post your cassandra.in.sh and storage-conf.xml someplace? On Thu, Jun 17, 2010 at 10:31 AM, AJ Slater wrote: > Total data size in the entire cluster is about twenty 12k images. With > no other load on the system. I just ask for one column and I ge

Re: Occasional 10s Timeouts on Read

2010-06-17 Thread Jonathan Ellis
The explanation that best fits the symptoms you describe is that you are swapping. On Thu, Jun 17, 2010 at 10:12 AM, AJ Slater wrote: > I'm seeing 10s timeouts on reads a few times a day. It's hard to reproduce > consistently but seems to happen most often after it's been a long time > between reads.

Re: Occasional 10s Timeouts on Read

2010-06-17 Thread aaron morton
Do you have Row Caching enabled? You can check in the JMX console to see if you're hitting the cache. Try turning on DEBUG level logging and look at the log on the machine you connect to for the read. Aaron On 18 Jun 2010, at 05:31, AJ Slater wrote: > Total data size in the entire cluster i

Re: Occasional 10s Timeouts on Read

2010-06-17 Thread AJ Slater
Total data size in the entire cluster is about twenty 12k images. With no other load on the system. I just ask for one column and I get these timeouts. Performing multiple gets on the columns leads to multiple timeouts for a period of a few seconds or minutes and then the situation magically resolv

Re: Occasional 10s Timeouts on Read

2010-06-17 Thread AJ Slater
Cassandra 0.6.2 from the Apache Debian source. Ubuntu Jaunty. Sun Java6 JVM. All nodes in separate racks at 365 Main. On Thu, Jun 17, 2010 at 10:12 AM, AJ Slater wrote: > I'm seeing 10s timeouts on reads a few times a day. It's hard to reproduce > consistently but seems to happen most often after i

Occasional 10s Timeouts on Read

2010-06-17 Thread AJ Slater
I'm seeing 10s timeouts on reads a few times a day. It's hard to reproduce consistently but seems to happen most often after it's been a long time between reads. After presenting itself for a couple minutes the problem then goes away. I've got a three node cluster with replication factor 2, reading at