Auto correct not good. Corrected below.

Bill Bell
Sent from mobile

> On Aug 2, 2014, at 11:11 AM, Bill Bell <billnb...@gmail.com> wrote:
>
> Seems way overkill. Are you using /get at all? If you need the docs
> available right away - why? How about after 30 seconds? How many docs do
> you get added per second during peak? Even Google has a delay when you
> do AdWords.
>
> One idea is to have an empty core that you insert into and then include
> as an extra shard in your queries. So one core would be called newdocs,
> and you would add that core into your query - see the sketch below.
> There are a couple of issues with this around scoring, but it works
> nicely. I would not even use SolrCloud for that core.
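> Roughly, the query side would look like this (host, port, and core
> names are made up for illustration; the shards param does a distributed
> search across plain cores, no SolrCloud needed):
>
>     http://host1:8983/solr/maincore/select?q=*:*
>         &shards=host1:8983/solr/maincore,host1:8983/solr/newdocs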
> Try to reduce the number of Java instances running. Reduce memory per
> instance and run one Java process per machine.
>
> Then, if you need faster availability of docs, you really need to ask
> why. Why not later? Do you need search, or just to show the user the
> info? If just for showing, maybe query an indexed table for the few not
> yet indexed? Or just store the info in a DB to show the user, and index
> later?
>
> Bill Bell
> Sent from mobile
>
>> On Aug 1, 2014, at 4:19 AM, "anand.mahajan" <an...@zerebral.co.in> wrote:
>>
>> Hello all,
>>
>> Struggling to get this going with SolrCloud -
>>
>> Requirement in brief:
>> - Ingest about 4M used-car listings a day and track all unique cars
>> for changes
>> - 4M automated searches a day (during the ingestion phase, to check
>> whether a doc already exists in the index - based on the values of 4-5
>> key fields - or is a new or updated version)
>> - Of the 4M, about 3M are updates to existing docs (for every non-key
>> value change)
>> - About 1M inserts a day (I'm assuming this many new listings come in
>> every day)
>> - Daily bulk CSV exports of the last 24 hours' inserts/updates, from
>> various snapshots of the data, to various clients
>>
>> My current deployment:
>> i) I'm using Solr 4.8 and have set up a SolrCloud on 6 dedicated
>> machines - 24 cores + 96 GB RAM each.
>> ii) There are over 190M docs in the SolrCloud at the moment (all
>> replicas together consume about 2340 GB of disk, which implies each
>> doc is about 5-8 KB in size).
>> iii) The docs are split into 36 shards, with 3 replicas per shard (108
>> Solr Jetty processes in all, spread over 6 servers, leaving 18 Jetty
>> JVMs running on each host).
>> iv) There are 60 fields per doc and all fields are stored at the
>> moment :( (the backend is only Solr at the moment).
>> v) The current shard/routing key is a combination of car year, make,
>> and some other car-level attributes that help classify the cars.
>> vi) We are mostly using the default Solr config as of now - no heavy
>> caching, as the search is pretty random in nature.
>> vii) Autocommit is on, with maxDocs = 1.
>>
>> Current throughput & issues:
>> With the above deployment the daily throughput is only about 1.5M on
>> average (inserts + updates) - falling way short of what is required.
>> Search is slow - some queries take about 15 seconds to return - and
>> since every insert depends on at least one search, that degrades the
>> write throughput too. (This is not a Solr issue - the app demands it.)
>>
>> Questions:
>>
>> 1. Autocommit with maxDocs = 1 - is that a goof-up, and could it be
>> slowing down indexing? It's a requirement that all docs are available
>> as soon as they are indexed.
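>> (For reference, a near-real-time setup in solrconfig.xml, inside
>> <updateHandler>, might look something like the following - the
>> intervals are illustrative, not a recommendation:
>>
>>     <autoCommit>
>>       <maxTime>60000</maxTime>
>>       <openSearcher>false</openSearcher>
>>     </autoCommit>
>>     <autoSoftCommit>
>>       <maxTime>1000</maxTime>
>>     </autoSoftCommit>
>>
>> Docs would become searchable within about a second via cheap soft
>> commits, while hard commits flush to disk only once a minute.)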
>> 2. Would I have been better served by deploying a single Jetty Solr
>> instance per server, with multiple cores running inside? The servers
>> do start to swap after a couple of days of Solr uptime - right now we
>> reboot the entire cluster every 4 days.
>>
>> 3. The routing key is not balancing the docs effectively across the
>> available shards - a few shards have just about 2M docs while others
>> have over 11M. Shall I split the larger shards? I do not have more
>> nodes/hardware to allocate to this deployment - in that case, would
>> splitting the large shards still give better read/write throughput?
>>
>> 4. To remain on the current hardware - would it help if I removed 1
>> replica from each shard? But that would mean that when just 1 node for
>> a shard goes down, only 1 live node would be left, and it would not
>> serve write requests.
>>
>> 5. Also, is there a way to control where the split-shard replicas go?
>> Is there a pattern/rule that Solr follows when it creates replicas for
>> split shards?
>>
>> 6. I read somewhere that creating a core costs the OS one thread and a
>> file handle. Since a core represents an index in its entirety, would
>> it not be allocated the configured number of write threads (the
>> default being 8)?
>>
>> 7. The ZooKeeper ensemble is deployed on the same boxes as the Solr
>> instances - would separating the ZK cluster out help?
>>
>> Sorry for the long thread - I thought of asking these all at once
>> rather than posting separate ones.
>>
>> Thanks,
>> Anand
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/SolrCloud-Scale-Struggle-tp4150592.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
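On questions 3 and 5 - shard splitting goes through the Collections API.
A minimal sketch, with a made-up collection name:

    http://host1:8983/solr/admin/collections?action=SPLITSHARD
        &collection=usedcars&shard=shard1

If I remember right, the sub-shard cores get created on the node hosting
the parent shard's leader, so the split by itself won't move data onto
new hardware.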