Right, so here's what I'd check for.

Your logs should show a replication happening roughly coincident with each
spike. Note that the replication should complete just before the spike
starts.

Or you can turn automatic replication off and fire it manually to force the
situation at will; see
http://wiki.apache.org/solr/SolrReplication#HTTP_API. Note that you'll
have to wait until the index has changed on the master to see any action.
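A quick, untested sketch of what that could look like from Python (standard
library only; the host, port and core name here are made up, so adjust them
to your slave). The disablepoll/fetchindex/enablepoll commands are the ones
documented on the wiki page above:

    import urllib.request

    # Hypothetical slave URL; point it at your own host/port/core.
    SLAVE = "http://varnish01:8983/solr/yourcore"

    def replication(cmd):
        # Hit the slave's ReplicationHandler and print the raw response.
        url = "%s/replication?command=%s" % (SLAVE, cmd)
        with urllib.request.urlopen(url) as resp:
            print(resp.read().decode("utf-8"))

    replication("disablepoll")   # stop the slave polling the master
    # ...wait until the index has actually changed on the master...
    replication("fetchindex")    # pull the new index now, then watch for the spike
    # replication("enablepoll")  # turn polling back on when you're done testing

That way you control exactly when the slave replicates, which should make
the spike reproducible on demand.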

So you should be able to create your spike at will, and this is pretty
normal. When replication happens, a new searcher is opened, caches are
filled, autowarming is done, and so on. During this period the _old_
searcher is still open, which both keeps the CPU busier and requires
additional memory. Once the new searcher is warmed, new queries go to it;
when the old searcher has finished serving its in-flight queries, it shuts
down and its resources are freed. This is why commits are expensive
operations.
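If you want to see (or tune) how much work that warm-up actually does, the
relevant knobs live in solrconfig.xml. Something like the following, with
purely illustrative values rather than a recommendation for your index:

    <query>
      <!-- autowarmCount controls how many entries the old caches seed into the new ones -->
      <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="64"/>
      <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>

      <!-- limit how many warming searchers can pile up at once -->
      <maxWarmingSearchers>2</maxWarmingSearchers>

      <!-- static warming queries run against every new searcher before it serves traffic -->
      <listener event="newSearcher" class="solr.QuerySenderListener">
        <arr name="queries">
          <lst><str name="q">a typical query for your app goes here</str></lst>
        </arr>
      </listener>
    </query>

The bigger the autowarm counts and the heavier the warming queries, the
longer and busier that window after each replication will be.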

All of which means that so far I don't think there's a problem; this is
just normal Solr operation. If you're seeing responsiveness problems while
serving queries, you probably want to throw more hardware (particularly
memory) at the problem.

But when thinking about how much memory to allocate to the JVM, _really_
read Uwe's post here:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
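The short version, as I read it: keep the heap only as big as Solr actually
needs and leave the rest of the RAM to the OS so it can cache the
memory-mapped index files. On a 64-bit JVM, recent Solr versions should
already end up with an mmap-based directory by default; if you want to make
it explicit in solrconfig.xml it would look something like this
(illustrative only, verify against your version):

    <!-- Force the mmap-based directory implementation (64-bit JVMs only). -->
    <directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory"/>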

Best
Erick


On Thu, Nov 29, 2012 at 2:39 AM, John Nielsen <j...@mcb.dk> wrote:

> Yup you read it right.
>
> We originally intended to do all our indexing to varnish02, replicate to
> varnish01 and then search from varnish01 (through a fail-over IP which
> would switch the reader to varnish02 in case of trouble).
>
> When I saw the spikes, I tried to eliminate possibilities by searching from
> varnish02 instead, leaving varnish01 with nothing to do but receive
> replication data. This did not remove the spikes. As soon as this spike is
> fixed, I will start searching from varnish01 again. These debugging antics
> are only possible because, although we do have customers using this, we are
> still in our beta phase.
>
> Varnish01 never receives any manual commit orders. Varnish02 does from time
> to time.
>
> Oh, and I accidentally misinformed you before (damn secondary language). We
> are actually seeing the spikes on both servers; I was just focusing on
> varnish01 because I use it to eliminate possibilities.
>
> It just occurred to me now: we tried switching off our feeder/index tool
> for 24 hours and didn't see any spikes during that period, so receiving
> replication data certainly has something to do with it.
>
> Med venlig hilsen / Best regards
>
> *John Nielsen*
> Programmer
>
>
>
> *MCB A/S*
> Enghaven 15
> DK-7500 Holstebro
>
> Kundeservice: +45 9610 2824
> p...@mcb.dk
> www.mcb.dk
>
>
>
> On Thu, Nov 29, 2012 at 3:20 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> > Am I reading this right? All you're doing on varnish1 is replicating to it?
> > You're not searching or indexing? I'm sure I'm misreading this.....
> >
> >
> > "The spike, which only lasts for a couple of minutes, sends the disks
> > racing" This _sounds_ suspiciously like segment merging, especially the
> > "disks racing" bit. Or possibly replication. Neither of which make much
> > sense. But is there any chance that somehow multiple commits are being
> > issued? Of course if varnish1 is a slave, that shouldn't be happening
> > either.
> >
> > And the whole bit about nothing going to the logs is just bizarre. I'm
> > tempted to claim hardware gremlins, especially if you see nothing similar
> > on varnish2. Or some other process is pegging the machine. All of which
> is
> > a way of saying "I have no idea"....
> >
> > Yours in bewilderment,
> > Erick
> >
> >
> >
> > On Wed, Nov 28, 2012 at 6:15 AM, John Nielsen <j...@mcb.dk> wrote:
> >
> > > I apologize for the late reply.
> > >
> > > The query load is more or less stable during the spikes. There are always
> > > fluctuations, but nothing on the order of magnitude that could explain
> > > this spike. In fact, the latest spike occurred last night when there was
> > > almost no one using it.
> > >
> > > To test a hunch of mine, I tried to deactivate all caches by commenting
> > > out all cache entries in solrconfig.xml. It still spikes, so I don't think
> > > it has anything to do with cache warming or hits/misses or anything of
> > > the sort.
> > >
> > > One interesting thing about GC, though. This is our latest spike with CPU
> > > load (this server has 8 cores, so a load higher than 8 is potentially
> > > troublesome):
> > >
> > > 2012.Nov.27 19:58:18    2.27
> > > 2012.Nov.27 19:57:17    4.06
> > > 2012.Nov.27 19:56:18    8.95
> > > 2012.Nov.27 19:55:17    19.97
> > > 2012.Nov.27 19:54:17    32.27
> > > 2012.Nov.27 19:53:18    1.67
> > > 2012.Nov.27 19:52:17    1.6
> > > 2012.Nov.27 19:51:18    1.77
> > > 2012.Nov.27 19:50:17    1.89
> > >
> > > This is what the GC was doing around that time:
> > >
> > > 2012-11-27T19:50:04.933+0100: [GC [PSYoungGen: 4777586K->277641K(4969216K)] 8887542K->4387693K(9405824K), 0.0856360 secs] [Times: user=0.54 sys=0.00, real=0.09 secs]
> > > 2012-11-27T19:50:30.785+0100: [GC [PSYoungGen: 4749769K->325171K(5068096K)] 8859821K->4435320K(9504704K), 0.0992160 secs] [Times: user=0.63 sys=0.00, real=0.10 secs]
> > > 2012-11-27T19:51:12.293+0100: [GC [PSYoungGen: 4911603K->306181K(5071168K)] 9021752K->4416617K(9507776K), 0.0957890 secs] [Times: user=0.62 sys=0.00, real=0.09 secs]
> > > 2012-11-27T19:51:52.817+0100: [GC [PSYoungGen: 4892613K->376175K(5075328K)] 9003049K->4486755K(9511936K), 0.1099830 secs] [Times: user=0.79 sys=0.01, real=0.11 secs]
> > > 2012-11-27T19:52:29.454+0100: [GC [PSYoungGen: 4972847K->271468K(4868160K)] 9083427K->4383520K(9304768K), 0.0699660 secs] [Times: user=0.48 sys=0.01, real=0.07 secs]
> > > 2012-11-27T19:53:08.176+0100: [GC [PSYoungGen: 4868140K->336421K(5090944K)] 8980192K->4448572K(9527552K), 0.0824350 secs] [Times: user=0.56 sys=0.01, real=0.08 secs]
> > > 2012-11-27T19:54:53.534+0100: [GC [PSYoungGen: 4950373K->340513K(5092864K)] 9062524K->4468215K(9529472K), 0.1016770 secs] [Times: user=0.71 sys=0.00, real=0.10 secs]
> > > 2012-11-27T19:55:02.906+0100: [GC [PSYoungGen: 4954465K->480488K(4952000K)] 9082167K->4684537K(9388608K), 0.1813290 secs] [Times: user=1.23 sys=0.09, real=0.19 secs]
> > > 2012-11-27T19:55:09.114+0100: [GC [PSYoungGen: 4951976K->560434K(5031936K)] 9156025K->5075285K(9547072K), 0.3511090 secs] [Times: user=2.32 sys=0.12, real=0.35 secs]
> > > 2012-11-27T19:55:09.465+0100: [Full GC [PSYoungGen: 560434K->0K(5031936K)] [PSOldGen: 4514851K->2793342K(5047296K)] 5075285K->2793342K(10079232K) [PSPermGen: 35285K->35285K(44864K)], 5.2310820 secs] [Times: user=5.23 sys=0.00, real=5.23 secs]
> > > 2012-11-27T19:55:21.359+0100: [GC [PSYoungGen: 4471488K->87249K(4647168K)] 7264830K->2880592K(9694464K), 0.0426640 secs] [Times: user=0.27 sys=0.00, real=0.04 secs]
> > > 2012-11-27T19:55:35.399+0100: [GC [PSYoungGen: 4173969K->171306K(4839552K)] 6967312K->2964649K(9886848K), 0.0493570 secs] [Times: user=0.34 sys=0.00, real=0.05 secs]
> > > 2012-11-27T19:55:41.397+0100: [GC [PSYoungGen: 4258026K->156193K(4802368K)] 7051369K->2949536K(9849664K), 0.0643190 secs] [Times: user=0.37 sys=0.00, real=0.06 secs]
> > > 2012-11-27T19:55:55.609+0100: [GC [PSYoungGen: 4254753K->156028K(4826880K)] 7048096K->2949371K(9874176K), 0.0590410 secs] [Times: user=0.35 sys=0.00, real=0.06 secs]
> > > 2012-11-27T19:56:20.519+0100: [GC [PSYoungGen: 4254588K->413148K(4860288K)] 7047931K->3206491K(9907584K), 0.1189810 secs] [Times: user=0.91 sys=0.01, real=0.12 secs]
> > > 2012-11-27T19:56:56.649+0100: [GC [PSYoungGen: 4564892K->412801K(4866432K)] 7358235K->3206144K(9913728K), 0.1089870 secs] [Times: user=0.82 sys=0.00, real=0.11 secs]
> > > 2012-11-27T19:57:24.304+0100: [GC [PSYoungGen: 4564545K->434702K(4870592K)] 7357888K->3228045K(9917888K), 0.1172090 secs] [Times: user=0.86 sys=0.00, real=0.12 secs]
> > > 2012-11-27T19:57:41.489+0100: [GC [PSYoungGen: 4591822K->500171K(4657344K)] 7385165K->3293513K(9704640K), 0.1663600 secs] [Times: user=1.12 sys=0.00, real=0.16 secs]
> > > 2012-11-27T19:58:24.062+0100: [GC [PSYoungGen: 4657291K->453204K(4849536K)] 7450633K->3246547K(9896832K), 0.1099200 secs] [Times: user=0.83 sys=0.01, real=0.11 secs]
> > > 2012-11-27T19:58:46.081+0100: [GC [PSYoungGen: 4560404K->533101K(4640320K)] 7353747K->3326443K(9687616K), 0.1252060 secs] [Times: user=0.96 sys=0.00, real=0.13 secs]
> > > 2012-11-27T19:59:01.488+0100: [GC [PSYoungGen: 4640301K->480969K(4822720K)] 7433643K->3274312K(9870016K), 0.1495860 secs] [Times: user=0.99 sys=0.01, real=0.15 secs]
> > > 2012-11-27T19:59:25.155+0100: [GC [PSYoungGen: 4534281K->572295K(4625664K)] 7327624K->3365638K(9672960K), 0.1492270 secs] [Times: user=1.10 sys=0.01, real=0.15 secs]
> > > 2012-11-27T19:59:39.923+0100: [GC [PSYoungGen: 4625607K->486964K(4800640K)] 7418950K->3280306K(9847936K), 0.1244440 secs] [Times: user=0.94 sys=0.00, real=0.12 secs]
> > >
> > > So the GC was doing a full collection around that time, but that only
> > > accounts for around 5 seconds, not the couple of minutes the spike lasts.
> > >
> > > I am really at a complete loss as to what this could be. Google has not
> > > given me any clues.
> > >
> > >
> > > Med venlig hilsen / Best regards
> > >
> > > *John Nielsen*
> > > Programmer
> > >
> > >
> > >
> > > *MCB A/S*
> > > Enghaven 15
> > > DK-7500 Holstebro
> > >
> > > Kundeservice: +45 9610 2824
> > > p...@mcb.dk
> > > www.mcb.dk
> > >
> > >
> > >
> > > On Fri, Nov 23, 2012 at 1:56 PM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:
> > >
> > > > Strange indeed. What about query load/ayes during that time? What about
> > > > GC? And does cache hit rate drop?
> > > >
> > > > Otis
> > > > --
> > > > SOLR Performance Monitoring - http://sematext.com/spm
> > > > On Nov 23, 2012 2:45 AM, "John Nielsen" <j...@mcb.dk> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > We are seeing a strange CPU spike on one of our solr4 servers which we
> > > > > are unable to explain. The spike, which only lasts for a couple of
> > > > > minutes, sends the disks racing. This happens a few times a day. This
> > > > > is what the load looks like:
> > > > >
> > > > > 2012.Nov.14 13:37:17    2.77
> > > > > 2012.Nov.14 13:36:17    3.65
> > > > > 2012.Nov.14 13:35:18    3.92
> > > > > 2012.Nov.14 13:34:17    3.95
> > > > > 2012.Nov.14 13:33:18    6.56
> > > > > 2012.Nov.14 13:32:17    10.79
> > > > > 2012.Nov.14 13:31:17    24.38
> > > > > 2012.Nov.14 13:30:17    63.35
> > > > > 2012.Nov.14 13:29:17    24.68
> > > > > 2012.Nov.14 13:28:17    2.44
> > > > > 2012.Nov.14 13:27:18    3.51
> > > > > 2012.Nov.14 13:26:17    5.26
> > > > > 2012.Nov.14 13:25:18    5.71
> > > > > 2012.Nov.14 13:24:17    2.7
> > > > >
> > > > > The problem is that out of a 3-minute spike, I get about 40 seconds of
> > > > > silence in the logs. This log usually adds around a thousand lines
> > > > > every second. Not being able to communicate with the server for this
> > > > > long breaks our use case.
> > > > >
> > > > > We have two servers, varnish01 and varnish02. We used to feed data to
> > > > > varnish02, replicate it to varnish01, and then read the data from
> > > > > varnish01. When we discovered this issue, we moved all traffic to
> > > > > varnish02, so data is still being replicated to varnish01 but, other
> > > > > than that, varnish01 gets zero traffic. The spike did not disappear.
> > > > >
> > > > > The spike we are seeing is on varnish01 only.
> > > > >
> > > > > Please note that our use case requires us to continuously feed large
> > > > > amounts of data from our main system, on the order of up to 1,000
> > > > > registers every minute. Our solrconfig.xml is attached.
> > > > >
> > > > > Has anyone seen this phenomenon before?
> > > > >
> > > > > Med venlig hilsen / Best regards
> > > > >
> > > > > *John Nielsen*
> > > > > Programmer
> > > > >
> > > > >
> > > > >
> > > > > *MCB A/S*
> > > > > Enghaven 15
> > > > > DK-7500 Holstebro
> > > > >
> > > > > Kundeservice: +45 9610 2824
> > > > > p...@mcb.dk
> > > > > www.mcb.dk
> > > > >
> > > > >
> > > >
> > >
> >
>
