Hi,

The cluster is running on EC2 with 5x r3.xlarge instances, and the disks are
1TB gp2 EBS volumes.

I will try to get the logs that Susheel requested, but it's not an easy task.

When indexing there's very little IO.

Solr is started with the following flags:
```
/usr/lib/jvm/java-8-oracle/bin/java
  -server
  -Xms15340m
  -Xmx15340m
  -XX:NewRatio=3
  -XX:SurvivorRatio=4
  -XX:TargetSurvivorRatio=90
  -XX:MaxTenuringThreshold=8
  -XX:+UseConcMarkSweepGC
  -XX:+UseParNewGC
  -XX:ConcGCThreads=4
  -XX:ParallelGCThreads=4
  -XX:+CMSScavengeBeforeRemark
  -XX:PretenureSizeThreshold=64m
  -XX:+UseCMSInitiatingOccupancyOnly
  -XX:CMSInitiatingOccupancyFraction=50
  -XX:CMSMaxAbortablePrecleanTime=6000
  -XX:+CMSParallelRemarkEnabled
  -XX:+ParallelRefProcEnabled
  -XX:CompressedClassSpaceSize=250m
  -verbose:gc
  -XX:+PrintHeapAtGC
  -XX:+PrintGCDetails
  -XX:+PrintGCDateStamps
  -XX:+PrintGCTimeStamps
  -XX:+PrintTenuringDistribution
  -XX:+PrintGCApplicationStoppedTime
  -Xloggc:/data/solr/logs/solr_gc.log
  -XX:+UseGCLogFileRotation
  -XX:NumberOfGCLogFiles=9
  -XX:GCLogFileSize=20M
  -Dcom.sun.management.jmxremote
  -Dcom.sun.management.jmxremote.local.only=false
  -Dcom.sun.management.jmxremote.ssl=false
  -Dcom.sun.management.jmxremote.authenticate=false
  -Dcom.sun.management.jmxremote.port=18983
  -Dcom.sun.management.jmxremote.rmi.port=18983
  -DzkClientTimeout=15000
  -DzkHost=zk1,zk2,zk3
  -Dsolr.log.dir=/data/solr/logs
  -Djetty.port=8983
  -DSTOP.PORT=7983
  -DSTOP.KEY=solrrocks
  -Duser.timezone=UTC
  -Djetty.home=/home/solr/solr/server
  -Dsolr.solr.home=/data/solr/data
  -Dsolr.install.dir=/home/solr/solr
  -Dlog4j.configuration=file:/data/solr/log4j.properties
  -Xss256k
  -Dsolr.log.muteconsole
  -XX:OnOutOfMemoryError=/home/solr/solr/bin/oom_solr.sh 8983 /data/solr/logs
  -jar start.jar
  --module=http
```
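
Since those flags also open JMX on port 18983 (no SSL/auth), one thing I can do while indexing is poll the GarbageCollector MXBeans on each node and see whether ParNew/CMS time grows as replicas are added. A minimal sketch (the class name and host argument are just placeholders, not part of the setup):
```
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Connects to a Solr node over the JMX port opened above (18983) and prints
// cumulative GC counts/times, to check whether GC pauses track the write slowdown.
public class GcCheck {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost"; // node to inspect
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":18983/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // One MBean per collector, e.g. "ParNew" and "ConcurrentMarkSweep"
            for (ObjectName gc : conn.queryNames(
                    new ObjectName("java.lang:type=GarbageCollector,*"), null)) {
                Long count = (Long) conn.getAttribute(gc, "CollectionCount");
                Long millis = (Long) conn.getAttribute(gc, "CollectionTime");
                System.out.printf("%s: %d collections, %d ms total%n",
                        gc.getKeyProperty("name"), count, millis);
            }
        }
    }
}
```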

Not sure if it's related, but when the batches get replicated to the
replicas, they don't seem to respect the batch size used on the primary.

This is the insert on the primary (batch size is 50k):
```
2017-06-07 09:46:00.629 INFO  (qtp592179046-260) [c:collection1 s:shard1
r:core_node17 x:collection1_shard1_replica6] o.a.s.h.d.DocBuilder Import
completed successfully
2017-06-07 09:46:00.638 INFO  (qtp592179046-260) [c:collection1 s:shard1
r:core_node17 x:collection1_shard1_replica6] o.a.s.h.d.DocBuilder Time
taken = 0:0:27.717
2017-06-07 09:46:00.655 INFO  (qtp592179046-260) [c:collection1 s:shard1
r:core_node17 x:collection1_shard1_replica6]
o.a.s.u.p.LogUpdateProcessorFactory [collection1_shard1_replica6]
 webapp=/solr path=/dataimport
params={optimize=false&startId=489247153&synchronous=true&limit=50000&commit=false&clean=false&command=full-import&entity=instagram-users-incremental}{add=[489247178,
489247179, 489247191, 489247238, 489247256, 489247260, 489247279,
489247325, 489247368, 489247369, ... (50000 adds)]} 0 27743
```

And that gets replicated to the replicas as many small batches like the ones
below. I thought that turning 50k-row batches on the primary into 10-20 row
batches on the replicas might be the problem, but I couldn't find a way to
tune that behaviour (and I'm not sure there is one).
```
2017-06-07 09:45:23.640 INFO  (qtp592179046-1808) [c:collection1 s:shard1
r:core_node13 x:collection1_shard1_replica3]
o.a.s.u.p.LogUpdateProcessorFactory [collection1_shard1_replica3]
 webapp=/solr path=/update params={update.distrib=FROMLEADER&distrib.from=
http://10.0.0.159:8983/solr/collection1_shard1_replica2/&wt=javabin&version=2}{add=[488928327
(1569538675702759424), 488928344 (1569538675703808000), 488928391
(1569538675703808001), 488928406 (1569538675704856576), 488928418
(1569538675704856577), 488928451 (1569538675705905152), 488928456
(1569538675706953728), 488928495 (1569538675706953729), 488928538
(1569538675708002304), 488928548 (1569538675708002305), ... (15 adds)]} 0 1
2017-06-07 09:45:23.671 INFO  (qtp592179046-1832) [c:collection1 s:shard2
r:core_node14 x:collection1_shard2_replica2]
o.a.s.u.p.LogUpdateProcessorFactory [collection1_shard2_replica2]
 webapp=/solr path=/update params={update.distrib=FROMLEADER&distrib.from=
http://10.0.0.159:8983/solr/collection1_shard2_replica1/&wt=javabin&version=2}{add=[488928306
(1569538675703808000), 488928329 (1569538675717439488), 488928331
(1569538675718488064), 488928332 (1569538675719536640), 488928378
(1569538675734216704), 488928383 (1569538675735265280), 488928399
(1569538675735265281), 488928426 (1569538675736313856), 488928438
(1569538675742605312), 488928471 (1569538675743653888), ... (13 adds)]} 0 1
2017-06-07 09:45:23.686 INFO  (qtp592179046-1811) [c:collection1 s:shard1
r:core_node13 x:collection1_shard1_replica3]
o.a.s.u.p.LogUpdateProcessorFactory [collection1_shard1_replica3]
 webapp=/solr path=/update params={update.distrib=FROMLEADER&distrib.from=
http://10.0.0.159:8983/solr/collection1_shard1_replica2/&wt=javabin&version=2}{add=[488928827
(1569538675750993920), 488928833 (1569538675753091072), 488928842
(1569538675754139648), 488928888 (1569538675754139649), 488928914
(1569538675755188224), 488928953 (1569538675755188225), 488928958
(1569538675756236800), 488928969 (1569538675756236801), 488928977
(1569538675757285376), 488928996 (1569538675757285377), ... (15 adds)]} 0 1
2017-06-07 09:45:23.706 INFO  (qtp592179046-1861) [c:collection1 s:shard2
r:core_node14 x:collection1_shard2_replica2]
o.a.s.u.p.LogUpdateProcessorFactory [collection1_shard2_replica2]
 webapp=/solr path=/update params={update.distrib=FROMLEADER&distrib.from=
http://10.0.0.159:8983/solr/collection1_shard2_replica1/&wt=javabin&version=2}{add=[488929020
(1569538675758333952), 488929023 (1569538675761479680), 488929027
(1569538675762528256), 488929032 (1569538675763576832), 488929035
(1569538675763576833), 488929046 (1569538675764625408), 488929051
(1569538675764625409), 488929131 (1569538675765673984), 488929141
(1569538675766722560), 488929145 (1569538675766722561), ... (21 adds)]} 0 18
2017-06-07 09:45:25.213 INFO  (qtp592179046-1845) [c:collection1 s:shard2
r:core_node14 x:collection1_shard2_replica2]
o.a.s.u.p.LogUpdateProcessorFactory [collection1_shard2_replica2]
 webapp=/solr path=/update params={update.distrib=FROMLEADER&distrib.from=
http://10.0.0.159:8983/solr/collection1_shard2_replica1/&wt=javabin&version=2}{add=[488929441
(1569538675793985536), 488929535 (1569538675795034112), 488929540
(1569538675795034113), 488929560 (1569538675796082688), 488929597
(1569538675796082689), 488929639 (1569538677352169472), 488929684
(1569538677352169473), 488929713 (1569538677353218048), 488929777
(1569538677353218049), 488929791 (1569538677353218050), ... (19 adds)]} 0 9
```
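
To take DIH out of the equation, I might also push the same rows through SolrJ in explicit 50k batches and compare throughput as replicas are added. A minimal sketch, assuming SolrJ 6.x (the id range and field mapping are placeholders; only the ZK hosts and collection name come from the setup above):
```
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

// Indexes documents in explicit 50k batches through the leader-aware
// CloudSolrClient, so throughput can be compared against DIH per replica count.
public class BatchIndexTest {
    public static void main(String[] args) throws Exception {
        CloudSolrClient client = new CloudSolrClient.Builder()
                .withZkHost("zk1,zk2,zk3")   // same ZK ensemble as -DzkHost
                .build();
        client.setDefaultCollection("collection1");

        List<SolrInputDocument> batch = new ArrayList<>(50_000);
        for (long id = 489_247_000L; id < 489_297_000L; id++) {  // placeholder id range
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Long.toString(id));
            // ... remaining fields would be filled in from the source rows ...
            batch.add(doc);
            if (batch.size() == 50_000) {
                long start = System.currentTimeMillis();
                client.add(batch);   // one 50k add request, routed to the leaders
                System.out.println("50k docs in "
                        + (System.currentTimeMillis() - start) + " ms");
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            client.add(batch);
        }
        client.commit();   // DIH above runs with commit=false; commit once at the end here
        client.close();
    }
}
```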

On Wed, Jun 7, 2017 at 10:00 AM, Toke Eskildsen <t...@kb.dk> wrote:

> On Tue, 2017-06-06 at 10:51 +0200, Isart Montane wrote:
> > We are using SolrCloud with 5 nodes, 2 collections, 2 shards each.
> > The problem we are seeing is a huge drop on writes when the number of
> > replicas increase.
> >
> > When we index (using DIH and batches) a collection with no replicas,
> > we are able to index at 1800 inserts/sec. That number decreases to
> > 1200 with 1 replica, 800 with 2 replicas and 400 with 3 replicas and
> > it keeps getting worst when more replicas are added.
>
> That is, as Susheel says, not expected behaviour. If you are running
> everything on a single physical machine that could be an explanation.
> What is your hardware-setup?
> --
> Toke Eskildsen, Royal Danish Library
>
