Thanks, guys. I will try two-level document routing for file_collection.
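For anyone following along: a minimal sketch of how two-level compositeId routing keys are built, per the Solr document-routing docs linked below. The helper name `routing_id` and the example values (`2014`, `file-123`, `cmd-42`) are made up for illustration; Solr does the actual hashing server-side, this only shows the id format (`shardKey!docId`, two-level `key1!key2!docId`, optional `/bits` suffix).

```python
# Sketch: constructing compositeId routing keys for SolrCloud.
# The "!" separators and the optional "/bits" suffix are Solr's documented
# compositeId syntax; everything else here is a hypothetical helper.
def routing_id(doc_id, shard_key=None, sub_key=None, bits=None):
    """Build a document id carrying routing information.

    shard_key -- first-level routing component (e.g. a file id or year)
    sub_key   -- optional second-level component (Solr supports up to two)
    bits      -- optional bit count the first component contributes to the hash
    """
    if shard_key is None:
        return doc_id  # no routing prefix: plain id
    first = shard_key if bits is None else "%s/%d" % (shard_key, bits)
    if sub_key is None:
        return "%s!%s" % (first, doc_id)
    return "%s!%s!%s" % (first, sub_key, doc_id)

print(routing_id("cmd-42", shard_key="file-123"))                  # file-123!cmd-42
print(routing_id("cmd-42", shard_key="2014", sub_key="file-123"))  # 2014!file-123!cmd-42
print(routing_id("cmd-42", shard_key="file-123", bits=2))          # file-123/2!cmd-42
```

All documents sharing the same prefix hash to the same shard, which is what makes expand/collapse per file work, but also what can leave a shard overloaded if one prefix dominates.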
I really don't understand why the index size is so high for file_collection when the same files are available in main_collection. (In main_collection each file is indexed as a single document containing all of its commands; in file_collection the same file is indexed as many documents, one Solr document per command.) Does index size grow mainly with the number of distinct terms, or with the number of documents even when the distinct terms are few? Let me know if I have not put the question clearly.

Thanks,
Anil

On 15 March 2016 at 01:00, Susheel Kumar <susheel2...@gmail.com> wrote:

> If you can find/know which fields (or combination) in your document divides
> / groups the data together would be the fields for custom routing. Solr
> supports up to two level.
>
> E.g. if you have field with say documentType or country or etc. would
> help. See the document routing at
>
> https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
>
> On Mon, Mar 14, 2016 at 3:14 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
> > Usually I just let the compositeId do its thing and only go for custom
> > routing when the default proves inadequate.
> >
> > Note: your 480M documents may very well be too many for three shards!
> > You really have to test....
> >
> > Erick
> >
> > On Mon, Mar 14, 2016 at 10:04 AM, Anil <anilk...@gmail.com> wrote:
> > > Hi Erick,
> > > In b/w, Do you recommend any effective shard distribution method ?
> > >
> > > Regards,
> > > Anil
> > >
> > > On 14 March 2016 at 22:30, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> > >
> > >> Try shards.info=true, but pinging the shard directly is the most
> > >> certain.
> > >>
> > >> Best,
> > >> Erick
> > >>
> > >> On Mon, Mar 14, 2016 at 9:48 AM, Anil <anilk...@gmail.com> wrote:
> > >> > HI Erik,
> > >> >
> > >> > we have used document routing to balance the shards load and for
> > >> > expand/collapse. it is mainly used for main_collection which holds one to
> > >> > many relationship records.
> > >> > In file_collection, it is only for load
> > >> > distribution.
> > >> >
> > >> > 25GB for entire solr service. each machine will act as shard for some
> > >> > collections.
> > >> >
> > >> > we have not stress tested our servers at least for solr service. i have
> > >> > read the link you have shared, i will do something on it. thanks for
> > >> > sharing.
> > >> >
> > >> > i have checked other collections, where index size is max 90GB and 5 M as
> > >> > max number of documents. but for the particular file_collection_2014, i
> > >> > see total index size across replicas is 147 GB.
> > >> >
> > >> > Can we get any hints if we run the query with debugQuery=true ? what is
> > >> > the effective way of load distribution ? Please advice.
> > >> >
> > >> > Regards,
> > >> > Anil
> > >> >
> > >> > On 14 March 2016 at 20:32, Erick Erickson <erickerick...@gmail.com>
> > >> wrote:
> > >> >
> > >> >> bq: The slowness is happening for file_collection. though it has 3 shards,
> > >> >> documents are available in 2 shards. shard1 - 150M docs and shard2 has 330M
> > >> >> docs , shard3 is empty.
> > >> >>
> > >> >> Well, this collection terribly balanced. Putting 330M docs on a single
> > >> >> shard is pushing the limits, the only time I've seen that many docs on a
> > >> >> shard, particularly with 25G of ram, they were very small records. My
> > >> >> guess is that you will find the queries you send to that shard
> > >> >> substantially slower than the 150M shard, although 150M could also be
> > >> >> pushing your limits. You can measure this by sending the query to the
> > >> >> specific core (something like
> > >> >>
> > >> >> solr/files_shard1_replica1/query?(your query here)&distrib=false
> > >> >>
> > >> >> My bet is that your QTime will be significantly different with the two
> > >> >> shards.
> > >> >>
> > >> >> It also sounds like you're using implicit routing where you control where
> > >> >> the files go, it's easy to have unbalanced shards in that case, why did
> > >> >> you decide to do it this way? There are valid reasons, but...
> > >> >>
> > >> >> In short, my guess is that you've simply overloaded your shard with
> > >> >> 330M docs. It's not at all clear that even 150 will give you satisfactory
> > >> >> performance, have you stress tested your servers? Here's the long form of
> > >> >> sizing:
> > >> >>
> > >> >> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> > >> >>
> > >> >> Best,
> > >> >> Erick
> > >> >>
> > >> >> On Mon, Mar 14, 2016 at 7:05 AM, Susheel Kumar <susheel2...@gmail.com>
> > >> >> wrote:
> > >> >> > For each of the solr machines/shards you have. Thanks.
> > >> >> >
> > >> >> > On Mon, Mar 14, 2016 at 10:04 AM, Susheel Kumar <susheel2...@gmail.com>
> > >> >> > wrote:
> > >> >> >
> > >> >> >> Hello Anil,
> > >> >> >>
> > >> >> >> Can you go to Solr Admin Panel -> Dashboard and share all 4 memory
> > >> >> >> parameters under System / share the snapshot. ?
> > >> >> >>
> > >> >> >> Thanks,
> > >> >> >> Susheel
> > >> >> >>
> > >> >> >> On Mon, Mar 14, 2016 at 5:36 AM, Anil <anilk...@gmail.com> wrote:
> > >> >> >>
> > >> >> >>> HI Toke and Jack,
> > >> >> >>>
> > >> >> >>> Please find the details below.
> > >> >> >>>
> > >> >> >>> * How large are your 3 shards in bytes? (total index across replicas)
> > >> >> >>> -- *146G. i am using CDH (cloudera), not sure how to check the
> > >> >> >>> index size of each collection on each shard*
> > >> >> >>> * What storage system do you use (local SSD, local spinning drives,
> > >> >> >>> remote storage...)? *Local (hdfs) spinning drives*
> > >> >> >>> * How much physical memory does your system have?
*we have 15 data
> > >> >> >>> nodes. multiple services installed on each data node (252 GB RAM for
> > >> >> >>> each data node). 25 gb RAM allocated for solr service.*
> > >> >> >>> * How much memory is free for disk cache? *i could not find.*
> > >> >> >>> * How many concurrent queries do you issue? *very less. i dont see any
> > >> >> >>> concurrent queries to this file_collection for now.*
> > >> >> >>> * Do you update while you search? *Yes.. its very less.*
> > >> >> >>> * What does a full query (rows, faceting, grouping, highlighting,
> > >> >> >>> everything) look like? *for the file_collection, rows - 100,
> > >> >> >>> highlights = false, no facets, expand = false.*
> > >> >> >>> * How many documents does a typical query match (hitcount)? *it varies
> > >> >> >>> with each file. i have sort on int field to order commands in the query.*
> > >> >> >>>
> > >> >> >>> we have two sets of collections on solr cluster ( 17 data nodes)
> > >> >> >>>
> > >> >> >>> 1. main_collection - collection created per year. each collection uses 8
> > >> >> >>> shards 2 replicas ex: main_collection_2016, main_collection_2015 etc
> > >> >> >>>
> > >> >> >>> 2. file_collection (where files having commands are indexed) - collection
> > >> >> >>> created per 2 years. it uses 3 shards and 2 replicas. ex :
> > >> >> >>> file_collection_2014, file_collection_2016
> > >> >> >>>
> > >> >> >>> The slowness is happening for file_collection. though it has 3 shards,
> > >> >> >>> documents are available in 2 shards. shard1 - 150M docs and shard2 has
> > >> >> >>> 330M docs , shard3 is empty.
> > >> >> >>>
> > >> >> >>> main_collection is looks good.
> > >> >> >>>
> > >> >> >>> please let me know if you need any additional details.
> > >> >> >>> > > >> >> >>> Regards, > > >> >> >>> Anil > > >> >> >>> > > >> >> >>> > > >> >> >>> On 13 March 2016 at 21:48, Anil <anilk...@gmail.com> wrote: > > >> >> >>> > > >> >> >>> > Thanks Toke and Jack. > > >> >> >>> > > > >> >> >>> > Jack, > > >> >> >>> > > > >> >> >>> > Yes. it is 480 million :) > > >> >> >>> > > > >> >> >>> > I will share the additional details soon. thanks. > > >> >> >>> > > > >> >> >>> > > > >> >> >>> > Regards, > > >> >> >>> > Anil > > >> >> >>> > > > >> >> >>> > > > >> >> >>> > > > >> >> >>> > > > >> >> >>> > > > >> >> >>> > On 13 March 2016 at 21:06, Jack Krupansky < > > >> jack.krupan...@gmail.com> > > >> >> >>> > wrote: > > >> >> >>> > > > >> >> >>> >> (We should have a wiki/doc page for the "usual list of > > suspects" > > >> >> when > > >> >> >>> >> queries are/appear slow, rather than need to repeat the same > > >> >> mantra(s) > > >> >> >>> for > > >> >> >>> >> every inquiry on this topic.) > > >> >> >>> >> > > >> >> >>> >> > > >> >> >>> >> -- Jack Krupansky > > >> >> >>> >> > > >> >> >>> >> On Sun, Mar 13, 2016 at 11:29 AM, Toke Eskildsen < > > >> >> >>> t...@statsbiblioteket.dk> > > >> >> >>> >> wrote: > > >> >> >>> >> > > >> >> >>> >> > Anil <anilk...@gmail.com> wrote: > > >> >> >>> >> > > i have indexed a data (commands from files) with 10 > fields > > >> and > > >> >> 3 of > > >> >> >>> >> them > > >> >> >>> >> > is > > >> >> >>> >> > > text fields. collection is created with 3 shards and 2 > > >> >> replicas. I > > >> >> >>> >> have > > >> >> >>> >> > > used document routing as well. > > >> >> >>> >> > > > >> >> >>> >> > > Currently collection holds 47,80,01,405 records. > > >> >> >>> >> > > > >> >> >>> >> > ...480 million, right? Funny digit grouping in India. > > >> >> >>> >> > > > >> >> >>> >> > > text search against text field taking around 5 sec. 
solr is query just and
> > >> >> >>> >> > > of two terms with fl as 7 fields
> > >> >> >>> >> > >
> > >> >> >>> >> > > fileId:"file unique id" AND command_text:(system login)
> > >> >> >>> >> >
> > >> >> >>> >> > While not an impressive response time, it might just be that your
> > >> >> >>> >> > hardware is not enough to handle that amount of documents. The usual
> > >> >> >>> >> > culprit is IO speed, so chances are you have a system with spinning
> > >> >> >>> >> > drives and not enough RAM: Switch to SSD and/or add more RAM.
> > >> >> >>> >> >
> > >> >> >>> >> > To give better advice, we need more information.
> > >> >> >>> >> >
> > >> >> >>> >> > * How large are your 3 shards in bytes?
> > >> >> >>> >> > * What storage system do you use (local SSD, local spinning drives,
> > >> >> >>> >> > remote storage...)?
> > >> >> >>> >> > * How much physical memory does your system have?
> > >> >> >>> >> > * How much memory is free for disk cache?
> > >> >> >>> >> > * How many concurrent queries do you issue?
> > >> >> >>> >> > * Do you update while you search?
> > >> >> >>> >> > * What does a full query (rows, faceting, grouping, highlighting,
> > >> >> >>> >> > everything) look like?
> > >> >> >>> >> > * How many documents does a typical query match (hitcount)?
> > >> >> >>> >> >
> > >> >> >>> >> > - Toke Eskildsen
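Erick's suggestion of pinging each core directly with distrib=false can be scripted. A minimal sketch below builds the per-core query URL; the host, port, and core names are placeholders for your own cluster. Fetch each URL (e.g. with urllib.request) and compare responseHeader.QTime per shard; a much larger QTime on the 330M-doc shard would confirm it as the bottleneck.

```python
# Sketch: build direct, non-distributed query URLs for individual Solr cores.
# Host and core names are placeholders; distrib=false makes the core answer
# from its own index only, skipping the distributed fan-out.
from urllib.parse import urlencode

def core_query_url(host, core, query, rows=0):
    """URL that queries one core only, for per-shard QTime comparison."""
    params = urlencode({"q": query, "rows": rows,
                        "distrib": "false", "wt": "json"})
    return "http://%s/solr/%s/query?%s" % (host, core, params)

for core in ("files_shard1_replica1", "files_shard2_replica1"):
    print(core_query_url("solr-host:8983", core,
                         'fileId:"file unique id" AND command_text:(system login)'))
```

Running the same query with shards.info=true against the collection gives similar per-shard timings in one response, but the direct distrib=false request removes the aggregation step entirely.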