No, no full GC.

The JVM does nothing during the outages: no CPU activity and no GC, as verified with jvisualvm and htop.

Harald.

On 08.07.2014 10:12, Heyde, Ralf wrote:
My first assumption: full GC.
Can you please tell us about your JVM setup, and maybe trace what happens
in the JVMs?
On Jul 8, 2014 9:54 AM, "Harald Kirsch" <harald.kir...@raytion.com> wrote:

Hi all,

This is what happens when I run a regular wget query to log the current
number of documents indexed:

2014-07-08:07:23:28 QTime=20 numFound="5720168"
2014-07-08:07:24:28 QTime=12 numFound="5721126"
2014-07-08:07:25:28 QTime=19 numFound="5721126"
2014-07-08:07:27:18 QTime=50071 numFound="5721126"
2014-07-08:07:29:08 QTime=50058 numFound="5724494"
2014-07-08:07:30:58 QTime=50033 numFound="5730710"
2014-07-08:07:31:58 QTime=13 numFound="5730710"
2014-07-08:07:33:48 QTime=50065 numFound="5734069"
2014-07-08:07:34:48 QTime=16 numFound="5737742"
2014-07-08:07:36:38 QTime=50037 numFound="5737742"
2014-07-08:07:37:38 QTime=12 numFound="5738190"
2014-07-08:07:38:38 QTime=23 numFound="5741208"
2014-07-08:07:40:29 QTime=50034 numFound="5742067"
2014-07-08:07:41:29 QTime=12 numFound="5742067"
2014-07-08:07:42:29 QTime=17 numFound="5742067"
2014-07-08:07:43:29 QTime=20 numFound="5745497"
2014-07-08:07:44:29 QTime=13 numFound="5745981"
2014-07-08:07:45:29 QTime=23 numFound="5746420"

As you can see, the QTime is just over 50 seconds at irregular intervals.

This happens independently of whether I am indexing documents at around 20
docs per second or not. At first I suspected a connection to the 5-minute
auto-commit, but the 50-second hits are too irregular for that.
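For reference, here is a minimal sketch of the kind of polling loop that produces log lines like the ones above. The URL, core name, and XML response layout are assumptions about this setup, not the actual script:

```shell
#!/bin/sh
# Hypothetical reconstruction of the monitoring loop behind the log above.
# The URL and core name are assumptions; adjust to the actual setup.
SOLR_URL='http://localhost:8983/solr/collection1/select?q=*:*&rows=0'

# Extract the QTime value (ms) and the numFound attribute
# from a Solr XML response.
parse_qtime()    { grep -o 'QTime">[0-9]*' | grep -o '[0-9]*$'; }
parse_numfound() { grep -o 'numFound="[0-9]*"'; }

poll_once() {
  ts=$(date +%Y-%m-%d:%H:%M:%S)
  resp=$(wget -qO- "$SOLR_URL")
  echo "$ts QTime=$(echo "$resp" | parse_qtime) $(echo "$resp" | parse_numfound)"
}

# Run once a minute, e.g.:  while sleep 60; do poll_once; done
```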

Furthermore, and this is *really strange*: when attaching strace to the Solr
process, the 50-second QTimes disappear completely and consistently. A real
Heisenbug.
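For completeness, a sketch of how strace can be attached here. The pgrep pattern assumes Solr was launched via Jetty's start.jar, which may not match this setup; since attaching needs root (or the Solr user), the helper only prints the command:

```shell
#!/bin/sh
# Hypothetical helper that builds the strace invocation for the trace below.
# The pgrep pattern (start.jar) is an assumption about a Jetty-based launch.
# -f follows threads, -tt/-T add wall-clock timestamps and syscall durations.
solr_strace_cmd() {
  pid=$(pgrep -f start.jar | head -n1)
  echo "strace -f -tt -T -p ${pid:-<solr-pid>} -o /tmp/solr.strace"
}
solr_strace_cmd   # prints the command; run it manually as root to attach
```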

Nevertheless, strace shows that a socket timeout of 50 seconds (50000 ms) is
used in calls like this:

[pid  1253] 09:09:37.857413 poll([{fd=96, events=POLLIN|POLLERR}], 1,
50000) = 1 ([{fd=96, revents=POLLIN}]) <0.000040>

where the fd=96 is the result of

[pid 25446] 09:09:37.855235 accept(122, {sa_family=AF_INET,
sin_port=htons(57236), sin_addr=inet_addr("ip address of local host")},
[16]) = 96 <0.000054>

where fd=122 is in turn the listening socket for the TCP port on which Solr was started.

My hunch is that this is the communication between the cores of Solr, i.e. the distributed search requests between the shards.
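If that hunch is right, the 50 seconds would be the inter-shard socket timeout, which in Solr 4.x can be set on the shard handler in solr.xml. A sketch only; the values shown are assumptions to match the observed timeout, not the shipped defaults:

```xml
<!-- solr.xml (Solr 4.x core-discovery format), sketch only -->
<solr>
  <shardHandlerFactory name="shardHandlerFactory"
                       class="HttpShardHandlerFactory">
    <!-- read timeout (ms) for inter-shard requests; 50000 would match
         the poll() timeout seen in the strace output above -->
    <int name="socketTimeout">50000</int>
    <!-- connect timeout (ms) -->
    <int name="connTimeout">5000</int>
  </shardHandlerFactory>
</solr>
```

Checking whether this value is set anywhere in the config would at least confirm where the 50000 comes from.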

I searched the internet for such a strange connection between socket
timeouts and strace, but could not find anything (the Stack Overflow entry
from yesterday is my own :-( ).


This smells a bit like a race condition or deadlock that is broken up by
the timing differences introduced by stracing the process.

Any hints appreciated.

For completeness, here is my setup:
- Solr 4.8.1
- running in SolrCloud mode
- 10 shards on 10 cores in one instance
- hosted on SUSE Linux Enterprise Server 11 (x86_64), VERSION 11, PATCHLEVEL 2
- hosted on a VMware VM with 4 CPU cores and 16 GB RAM
- single-digit millions of documents indexed; the exact number does not matter
- zero query load


Harald.



--
Harald Kirsch
Raytion GmbH
Kaiser-Friedrich-Ring 74
40547 Duesseldorf
Fon +49 211 53883-216
Fax +49-211-550266-19
http://www.raytion.com
