Local disks or shared network disks? --wunder
On Jul 8, 2014, at 11:43 AM, Shawn Heisey <s...@elyograg.org> wrote: > On 7/8/2014 1:53 AM, Harald Kirsch wrote: >> Hi all, >> >> This is what happens when I run a regular wget query to log the >> current number of documents indexed: >> >> 2014-07-08:07:23:28 QTime=20 numFound="5720168" >> 2014-07-08:07:24:28 QTime=12 numFound="5721126" >> 2014-07-08:07:25:28 QTime=19 numFound="5721126" >> 2014-07-08:07:27:18 QTime=50071 numFound="5721126" >> 2014-07-08:07:29:08 QTime=50058 numFound="5724494" >> 2014-07-08:07:30:58 QTime=50033 numFound="5730710" >> 2014-07-08:07:31:58 QTime=13 numFound="5730710" >> 2014-07-08:07:33:48 QTime=50065 numFound="5734069" >> 2014-07-08:07:34:48 QTime=16 numFound="5737742" >> 2014-07-08:07:36:38 QTime=50037 numFound="5737742" >> 2014-07-08:07:37:38 QTime=12 numFound="5738190" >> 2014-07-08:07:38:38 QTime=23 numFound="5741208" >> 2014-07-08:07:40:29 QTime=50034 numFound="5742067" >> 2014-07-08:07:41:29 QTime=12 numFound="5742067" >> 2014-07-08:07:42:29 QTime=17 numFound="5742067" >> 2014-07-08:07:43:29 QTime=20 numFound="5745497" >> 2014-07-08:07:44:29 QTime=13 numFound="5745981" >> 2014-07-08:07:45:29 QTime=23 numFound="5746420" >> >> As you can see, the QTime is just over 50 seconds at irregular intervals. >> >> This happens independent of whether I am indexing documents with >> around 20 dps or not. First I thought about a dependence on the >> auto-commit of 5 minutes, but the the 50 seconds hits are too irregular. >> >> Furthermore, and this is *really strange*: when hooking strace on the >> solr process, the 50 seconds QTimes disappear completely and >> consistently --- a real Heisenbug. >> >> Nevertheless, strace shows that there is a socket timeout of 50 >> seconds defined in calls like this: >> >> [pid 1253] 09:09:37.857413 poll([{fd=96, events=POLLIN|POLLERR}], 1, >> 50000) = 1 ([{fd=96, revents=POLLIN}]) <0.000040> >> >> where the fd=96 is the result of >> >> [pid 25446] 09:09:37.855235 accept(122, {sa_family=AF_INET, >> sin_port=htons(57236), sin_addr=inet_addr("ip address of local >> host")}, [16]) = 96 <0.000054> >> >> where again fd=122 is the TCP port on which solr was started. >> >> My hunch is that this is communication between the cores of solr. >> >> I tried to search the internet for such a strange connection between >> socket timeouts and strace, but could not find anything (the >> stackoverflow entry from yesterday is my own :-( >> >> >> This smells a bit like a race condition/deadlock kind of thing which >> is broken up by timing differences introduced by stracing the process. >> >> Any hints appreciated. >> >> For completeness, here is my setup: >> - solr-4.8.1, >> - cloud version running >> - 10 shards on 10 cores in one instance >> - hosted on SUSE Linux Enterprise Server 11 (x86_64), VERSION 11, >> PATCHLEVEL 2 >> - hosted on a vmware, 4 CPU cores, 16 GB RAM >> - single digit million docs indexed, exact number does not matter >> - zero query load > > Long GC pauses would also be my first guess. DNS problems on the > inter-server communication for SolrCloud would be a second guess. If > it's not one of these, then I really have no idea. > > http://wiki.apache.org/solr/SolrPerformanceProblems#GC_pause_problems > http://serverfault.com/questions/339791/5-second-resolving-delay > > Thanks, > Shawn >