Hi,

Did you say you have 150 servers in this cluster?  And 10 shards for just
90M docs?  If so, 150 hosts sounds like too much for all the other numbers
I see here.  I'd love to see some metrics.  E.g., what happens with disk
I/O around those commits?  How about GC time/size info?  Are the JVM memory
pools close to full, and is the CPU jumping around like crazy?  Can you
share more info to give us a more complete picture of your system?  SPM
for Solr <http://sematext.com/spm/> will help if you don't already capture
these kinds of metrics.
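If you don't have GC logs yet, a minimal set of standard HotSpot flags to
start capturing them would be something like this (adjust the log path for
your environment):

    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
    -Xloggc:/var/log/solr/gc.log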
Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Thu, Feb 12, 2015 at 11:07 AM, Vijay Sekhri <sekhrivi...@gmail.com>
wrote:

> Hi Erick,
> We have the following configuration for our SolrCloud setup:
>
>    1. 10 Shards
>    2. 15 replicas per shard
>    3. 9 GB of index size per shard
>    4. around 90 million documents in total
>    5. 2 collections, viz. search1 serving live traffic and search2 for
>    indexing. We swap the collections when indexing finishes
>    6. On each of the 150 hosts we have 2 JVMs running, one for the search1
>    collection and the other for the search2 collection
>    7. Each JVM has 12 GB of heap assigned to it, while the host has 50 GB
>    in total
>    8. Each host has 16 processors
>    9. Linux XXXXXXX 2.6.32-431.5.1.el6.x86_64 #1 SMP Wed Feb 12 00:41:43
>    UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
>    10. We have two ways to index data:
>       1. Bulk indexing: all 90 million docs are pumped in from 14 parallel
>       processes (on 14 different client hosts). This is done on the
>       collection that is not serving live traffic.
>       2. Incremental indexing: only delta changes (ranging from 100K to 5
>       million docs) every two hours. This is done on the collection that
>       is also serving live traffic.
>    11. The request rate on the live collection is around 300 requests per
>    second
>    12. Hard commits are every 30 seconds with openSearcher=false, and soft
>    commits are every 15 minutes (see the config sketch after this list).
>    We have tried a lot of different settings here, BTW.
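> 
> For reference, a rough sketch of these commit settings as they sit inside
> <updateHandler> in our solrconfig.xml (representative values, not the
> exact file):
> 
>    <autoCommit>
>      <maxTime>30000</maxTime>            <!-- hard commit every 30 s -->
>      <openSearcher>false</openSearcher>  <!-- do not open a new searcher -->
>    </autoCommit>
>    <autoSoftCommit>
>      <maxTime>900000</maxTime>           <!-- soft commit every 15 min -->
>    </autoSoftCommit>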
>
> Now we have two issues with indexing.
> 1) Solr just cannot keep up with bulk indexing when the replicas are also
> active. We concluded this by changing the number of replicas to just 2,
> then 4, and then 15: as the number of replicas increases, the bulk
> indexing time increases almost exponentially.
> We seem to have encountered the same issue reported here
> https://issues.apache.org/jira/browse/SOLR-6816
> It gets to the point where the cluster takes 300 seconds to index even 100
> docs. It starts off indexing 100 docs in 55 milliseconds, slowly degrades
> over time, and within an hour and a half just cannot keep up. We have a
> workaround: we stop all the replicas, do the bulk indexing, and bring the
> replicas back up one by one. This sort of defeats the purpose of SolrCloud,
> but we can live with it because bulk indexing happens on the collection
> that is not serving live traffic. However, we would love a solution from
> SolrCloud itself, e.g. an API we could call to stop replication before
> indexing and start it again at the end (a rough sketch of what we mean is
> below).
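> 
> As far as we know there is no single "pause replication" switch in
> SolrCloud, so the closest scripted equivalent of our workaround would
> presumably be dropping and re-adding replicas via the Collections API
> (assuming a Solr version that has ADDREPLICA/DELETEREPLICA; the shard and
> replica names below are placeholders for one of many such calls):
> 
>    /admin/collections?action=DELETEREPLICA&collection=search2&shard=shard1&replica=core_node5
>        (repeat per replica, run the bulk indexing, then re-add:)
>    /admin/collections?action=ADDREPLICA&collection=search2&shard=shard1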
>
> 2) This issue is related to soft commits during incremental indexing. When
> we do incremental indexing, it is done on the same collection serving live
> traffic at 300 requests per second. Everything is fine except when the
> soft commit happens. Each time a soft commit (autoSoftCommit in
> solrconfig.xml) fires, which BTW happens at almost the same time
> throughout the cluster, there is a spike in response times and throughput
> drops to almost 150 tps. The spike lasts for 2 minutes and then repeats at
> exactly the soft commit interval. We have monitored the logs and found a
> direct correlation between when the soft commit happens and when the
> response time tanks.
>
> Now the latter issue is quite disturbing, because that collection is
> serving live traffic and we cannot sustain these periodic degradations. We
> have played around with different soft commit settings: intervals ranging
> from 2 minutes to 30 minutes; autowarming half the cache, the full cache,
> and only 10% of it; running warm-up queries on every new searcher, and
> running no warm-up queries at all. All the different settings yield the
> same result: as soon as the soft commit happens, response time tanks and
> throughput decreases. The difference is almost 50% in response times and
> 50% in throughput. (A sketch of the knobs we have been varying is below.)
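> 
> For context, these are the kinds of solrconfig.xml knobs we have been
> varying (a representative sketch, not our exact values):
> 
>    <filterCache class="solr.FastLRUCache" size="4096" autowarmCount="2048"/>
>    <queryResultCache class="solr.LRUCache" size="4096" autowarmCount="2048"/>
> 
>    <listener event="newSearcher" class="solr.QuerySenderListener">
>      <arr name="queries">
>        <lst><str name="q">*:*</str></lst>  <!-- placeholder warm-up query -->
>      </arr>
>    </listener>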
>
>
> Our workaround for this issue is to also do the incremental delta indexing
> on the collection not serving live traffic and swap when it is done. As
> you can see, this also defeats the purpose of SolrCloud: we cannot do bulk
> indexing because the replicas cannot keep up, and we cannot do incremental
> indexing because of the soft commit performance.
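> 
> (The swap itself is just an atomic repointing of a collection alias,
> roughly like this, assuming an alias named "search" that the live traffic
> queries:
> 
>    /admin/collections?action=CREATEALIAS&name=search&collections=search2
> )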
>
> Is there a way to make the cluster not do the soft commit all at the same
> time, or is there a way to make the soft commit not cause this degradation?
> We are open to any ideas at this point.
>
> --
> *********************************************
> Vijay Sekhri
> *********************************************
>
