Comments inline:
On Fri, Aug 1, 2014 at 3:49 PM, anand.mahajan <an...@zerebral.co.in> wrote:

> Hello all,
>
> Struggling to get this going with SolrCloud -
>
> Requirement in brief :
> - Ingest about 4M Used Cars listings a day and track all unique cars for
> changes
> - 4M automated searches a day (during the ingestion phase, to check if a
> doc exists in the index (based on values of 4-5 key fields) or it is a
> new one or an updated version)
> - Of the 4M - about 3M updates to existing docs (for every non-key value
> change)
> - About 1M inserts a day (I'm assuming these many new listings come in
> every day)
> - Daily bulk CSV exports of inserts / updates in the last 24 hours of
> various snapshots of the data to various clients
>
> My current deployment :
> i) I'm using Solr 4.8 and have set up a SolrCloud with 6 dedicated
> machines - 24 Core + 96 GB RAM each.
> ii) There are over 190M docs in the SolrCloud at the moment (across all
> replicas it's consuming 2340GB of disk overall, which implies each doc is
> about 5-8KB in size.)
> iii) The docs are split into 36 shards - and 3 replicas per shard (in
> all, 108 Solr Jetty processes split over 6 servers, leaving about 18
> Jetty JVMs running on each host)
> iv) There are 60 fields per doc and all fields are stored at the moment :(
> (The backend is only Solr at the moment)
> v) The current shard/routing key is a combination of Car Year, Make and
> some other car-level attributes that help classify the cars
> vi) We are mostly using the default Solr config as of now - no heavy
> caching, as the search is pretty random in nature
> vii) Autocommit is on - with maxDocs = 1
>
> Current throughput & issues :
> With the above-mentioned deployment the daily throughput is only at about
> 1.5M on average (inserts + updates) - falling way short of what is
> required. Search is slow - some queries take about 15 seconds to return -
> and since each insert depends on at least one search, that degrades the
> write throughput too. (This is not a Solr issue - but the app demands it
> so)
>
> Questions :
>
> 1. Autocommit with maxDocs = 1 - is that a goof up and could that be
> slowing down indexing? It's a requirement that all docs are available as
> soon as indexed.

The autoCommit after every single document is definitely what's causing
those problems. You should set autoSoftCommit with a small time value, say
a minute or two, and set autoCommit with both a doc count and a time value
(say 10000 docs, 10 minutes). Lucene/Solr do a bunch of work on every
commit, and reducing the number of commits will increase throughput (see
the solrconfig.xml sketch below). Also see
http://searchhub.org/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

> 2. Should I have been better served had I deployed a single Jetty Solr
> instance per server with multiple cores running inside? The servers do
> start to swap after a couple of days of Solr uptime - right now we reboot
> the entire cluster every 4 days.

That's a lot of Jetty JVMs. I think you have oversharded your cluster. If
they're all sharing the same disk then it's even worse. We typically
deploy 3-4 Jetty JVMs per physical machine with 4-8GB of heap each
(depending on sorting/faceting requirements). Your write traffic is
actually quite small, so you probably don't need 36 shards.

You are swapping because the data is too large relative to the available
RAM: you are trying to serve 2340GB of index with 576GB of RAM in a
near-real-time environment. You basically need more RAM.
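To make that concrete, here is a minimal solrconfig.xml sketch along the
lines suggested above (the exact values are only a starting point - tune
them to your latency needs; openSearcher is set to false so that hard
commits only flush index changes to disk and don't open a new searcher,
leaving visibility to the soft commits):

  <!-- hard commit: writes index changes to disk, no new searcher -->
  <autoCommit>
    <maxDocs>10000</maxDocs>
    <maxTime>600000</maxTime>            <!-- 10 minutes -->
    <openSearcher>false</openSearcher>
  </autoCommit>

  <!-- soft commit: controls how soon new docs become visible to searches -->
  <autoSoftCommit>
    <maxTime>60000</maxTime>             <!-- 1 minute -->
  </autoSoftCommit>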
> 3. The routing key is not able to effectively balance the docs on the
> available shards - there are a few shards with just about 2M docs - and
> others over 11M docs. Shall I split the larger shards? But I do not have
> more nodes / hardware to allocate to this deployment. In such a case,
> would splitting up the large shards give better read-write throughput?

Your throughput problems are because of committing after every document
and too little RAM. Splitting shards won't help here. As far as the uneven
shards go, there's really only one option, which is custom sharding, where
you shard according to your own rules - but you already have a running
cluster, and changing the routing option on an existing collection is not
possible.

> 4. To remain with the current hardware - would it help if I remove 1
> replica each from a shard? But that would mean even when just 1 node goes
> down for a shard there would be only 1 live node left, and that would not
> serve the write requests.

It will eventually. The thing is that if there are just two replicas and
one goes down, then the other waits a while before it elects itself as the
leader. This is to avoid data loss.

> 5. Also, is there a way to control where the split-shard replicas would
> go? Is there a pattern / rule that Solr follows when it creates replicas
> for split shards?

Unfortunately, no. The split command doesn't accept the createNodeSet
param, but that would probably be a good improvement. The leader of the
split shard will remain on the same node as the leader of the parent
shard, but the replicas of the new sub-shards are allocated according to
the usual algorithm (basically, it avoids placing two replicas of the same
shard on the same box, and it tries to choose the JVM hosting the fewest
Solr cores). A sample split call is in the P.S. below.

> 6. I read somewhere that creating a core would cost the OS one thread and
> a file handle. Since a core represents an index in its entirety, would it
> not be allocated the configured number of write threads? (The default,
> that is, 8)

That's not true. The write threads are per core.

> 7. The Zookeeper cluster is deployed on the same boxes as the Solr
> instances - would separating the ZK cluster out help?

It is recommended practice, if only because a slow ZK can cause shards to
go into recovery and leader elections to fail. I doubt it will make things
faster in your case. However, if you can, you should move the ZK instances
to separate machines.

> Sorry for the long thread - I thought of asking these all at once rather
> than posting separate ones.
>
> Thanks,
> Anand
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/SolrCloud-Scale-Struggle-tp4150592.html
> Sent from the Solr - User mailing list archive at Nabble.com.

--
Regards,
Shalin Shekhar Mangar.
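P.S. In case you do try a shard split at some point, a basic Collections
API call looks like the one below (the host and collection name are just
placeholders) - and, as noted above, there is no createNodeSet parameter
to control where the new replicas land:

  http://anyhost:8983/solr/admin/collections?action=SPLITSHARD&collection=yourcollection&shard=shard1

The parent shard's data stays on disk and the shard only becomes inactive
once the two new sub-shards are active, so make sure that node has enough
free disk space before running it.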