Thanks, guys.

I will try two-level document routing for file_collection.
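
As a rough sketch of what I am planning (customer, fileId and commandId
are illustrative names; compositeId supports up to two levels, as
key1!key2!docId):

    one level:  file_123!cmd_456
    two level:  cust_1!file_123!cmd_456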

I really don't understand why the index size is so high for
file_collection when the same file is also available in main_collection.

(Each file is indexed as one document containing all its commands in
main_collection, while the same file is indexed as many documents in
file_collection, one Solr document per command.)

Does index size grow more with more distinct words, or with fewer
distinct words across a larger number of documents? Let me know if I have
not put the question correctly.

Thanks,
Anil

On 15 March 2016 at 01:00, Susheel Kumar <susheel2...@gmail.com> wrote:

> If you can identify which field (or combination of fields) in your
> documents divides/groups the data together, those would be the fields for
> custom routing. Solr supports up to two levels.
>
> E.g., a field such as documentType or country would help. See the
> document routing documentation at
>
> https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
>
>
>
> On Mon, Mar 14, 2016 at 3:14 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
> > Usually I just let the compositeId do its thing and only go for custom
> > routing when the default proves inadequate.
> >
> > Note: your 480M documents may very well be too many for three shards!
> > You really have to test....
> >
> > Erick
> >
> >
> > On Mon, Mar 14, 2016 at 10:04 AM, Anil <anilk...@gmail.com> wrote:
> > > Hi Erick,
> > > In the meantime, do you recommend any effective shard distribution method?
> > >
> > > Regards,
> > > Anil
> > >
> > > On 14 March 2016 at 22:30, Erick Erickson <erickerick...@gmail.com>
> > > wrote:
> > >
> > >> Try shards.info=true, but pinging the shard directly is the most
> > >> certain.
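> > >>
> > >> E.g., something along these lines (collection and core names are
> > >> illustrative):
> > >>
> > >> solr/file_collection_2014/query?q=...&shards.info=true
> > >> solr/file_collection_2014_shard2_replica1/query?q=...&distrib=false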
> > >>
> > >>
> > >> Best,
> > >> Erick
> > >>
> > >> On Mon, Mar 14, 2016 at 9:48 AM, Anil <anilk...@gmail.com> wrote:
> > >> > Hi Erick,
> > >> >
> > >> > We have used document routing to balance the shard load and for
> > >> > expand/collapse. It is mainly used for main_collection, which holds
> > >> > one-to-many relationship records. In file_collection, it is only for
> > >> > load distribution.
> > >> >
> > >> > 25 GB is allocated for the entire Solr service. Each machine acts as
> > >> > a shard for some collections.
> > >> >
> > >> > We have not stress tested our servers, at least for the Solr service.
> > >> > I have read the link you shared and will act on it. Thanks for
> > >> > sharing.
> > >> >
> > >> > I have checked other collections, where the index size is at most
> > >> > 90 GB with at most 5M documents, but for this particular
> > >> > file_collection_2014 I see a total index size across replicas of
> > >> > 147 GB.
> > >> >
> > >> > Can we get any hints if we run the query with debugQuery=true? What
> > >> > is the effective way of load distribution? Please advise.
> > >> >
> > >> > Regards,
> > >> > Anil
> > >> >
> > >> > On 14 March 2016 at 20:32, Erick Erickson <erickerick...@gmail.com>
> > >> > wrote:
> > >> >
> > >> >> bq: The slowness is happening for file_collection. though it has 3
> > >> >> shards, documents are available in 2 shards. shard1 - 150M docs and
> > >> >> shard2 has 330M docs, shard3 is empty.
> > >> >>
> > >> >> Well, this collection is terribly balanced. Putting 330M docs on a
> > >> >> single shard is pushing the limits; the only time I've seen that many
> > >> >> docs on a shard, particularly with 25G of RAM, they were very small
> > >> >> records. My guess is that you will find the queries you send to that
> > >> >> shard substantially slower than the 150M shard, although 150M could
> > >> >> also be pushing your limits. You can measure this by sending the query
> > >> >> to the specific core, something like
> > >> >>
> > >> >> solr/files_shard1_replica1/query?q=(your query here)&distrib=false
> > >> >>
> > >> >> My bet is that your QTime will be significantly different between the
> > >> >> two shards.
> > >> >>
> > >> >> It also sounds like you're using implicit routing, where you control
> > >> >> where the files go; it's easy to end up with unbalanced shards in that
> > >> >> case. Why did you decide to do it this way? There are valid reasons,
> > >> >> but...
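> > >> >>
> > >> >> For reference, with the implicit router the indexing client chooses
> > >> >> the target shard itself, e.g. via the _route_ parameter (collection
> > >> >> and shard names illustrative):
> > >> >>
> > >> >> solr/file_collection_2014/update?_route_=shard1
> > >> >>
> > >> >> or via a router.field set when the collection is created, so nothing
> > >> >> rebalances documents across shards for you.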
> > >> >>
> > >> >> In short, my guess is that you've simply overloaded your shard with
> > >> >> 330M docs. It's not at all clear that even 150M will give you
> > >> >> satisfactory performance. Have you stress tested your servers? Here's
> > >> >> the long form of sizing:
> > >> >>
> > >> >>
> > >> >>
> > >> >> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> > >> >>
> > >> >> Best,
> > >> >> Erick
> > >> >>
> > >> >> On Mon, Mar 14, 2016 at 7:05 AM, Susheel Kumar <susheel2...@gmail.com>
> > >> >> wrote:
> > >> >> > For each of the Solr machines/shards you have.  Thanks.
> > >> >> >
> > >> >> > On Mon, Mar 14, 2016 at 10:04 AM, Susheel Kumar <susheel2...@gmail.com>
> > >> >> > wrote:
> > >> >> >
> > >> >> >> Hello Anil,
> > >> >> >>
> > >> >> >> Can you go to Solr Admin Panel -> Dashboard and share all four
> > >> >> >> memory parameters under System, or share a snapshot?
> > >> >> >>
> > >> >> >> Thanks,
> > >> >> >> Susheel
> > >> >> >>
> > >> >> >> On Mon, Mar 14, 2016 at 5:36 AM, Anil <anilk...@gmail.com> wrote:
> > >> >> >>
> > >> >> >>> Hi Toke and Jack,
> > >> >> >>>
> > >> >> >>> Please find the details below.
> > >> >> >>>
> > >> >> >>> * How large are your 3 shards in bytes? (total index across
> > >> >> >>> replicas) -- *146 GB. I am using CDH (Cloudera); I am not sure how
> > >> >> >>> to check the index size of each collection on each shard.*
> > >> >> >>> * What storage system do you use (local SSD, local spinning
> > >> >> >>> drives, remote storage...)? *Local (HDFS) spinning drives.*
> > >> >> >>> * How much physical memory does your system have? *We have 15 data
> > >> >> >>> nodes, with multiple services installed on each data node (252 GB
> > >> >> >>> RAM per data node). 25 GB RAM is allocated for the Solr service.*
> > >> >> >>> * How much memory is free for disk cache? *I could not find out.*
> > >> >> >>> * How many concurrent queries do you issue? *Very few. I don't see
> > >> >> >>> any concurrent queries to this file_collection for now.*
> > >> >> >>> * Do you update while you search? *Yes, but very little.*
> > >> >> >>> * What does a full query (rows, faceting, grouping, highlighting,
> > >> >> >>> everything) look like? *For the file_collection: rows = 100,
> > >> >> >>> highlights = false, no facets, expand = false. A sketch follows
> > >> >> >>> this list.*
> > >> >> >>> * How many documents does a typical query match (hitcount)? *It
> > >> >> >>> varies with each file. I have a sort on an int field to order the
> > >> >> >>> commands in the query.*
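> > >> >> >>>
> > >> >> >>> A rough sketch of the full query (the sort field name and the fl
> > >> >> >>> list are illustrative; the q clause is the one quoted further down
> > >> >> >>> this thread):
> > >> >> >>>
> > >> >> >>> /solr/file_collection_2014/select
> > >> >> >>>   ?q=fileId:"file unique id" AND command_text:(system login)
> > >> >> >>>   &rows=100&sort=command_index asc&fl=(your 7 fields)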
> > >> >> >>>
> > >> >> >>> We have two sets of collections on the Solr cluster (17 data
> > >> >> >>> nodes):
> > >> >> >>>
> > >> >> >>> 1. main_collection - a collection is created per year. Each
> > >> >> >>> collection uses 8 shards and 2 replicas, e.g. main_collection_2016,
> > >> >> >>> main_collection_2015, etc.
> > >> >> >>>
> > >> >> >>> 2. file_collection (where files containing commands are indexed) -
> > >> >> >>> a collection is created per 2 years. It uses 3 shards and 2
> > >> >> >>> replicas, e.g. file_collection_2014, file_collection_2016.
> > >> >> >>>
> > >> >> >>> The slowness is happening for file_collection. Though it has 3
> > >> >> >>> shards, documents are present in only 2 of them: shard1 has 150M
> > >> >> >>> docs, shard2 has 330M docs, and shard3 is empty.
> > >> >> >>>
> > >> >> >>> main_collection looks good.
> > >> >> >>>
> > >> >> >>> Please let me know if you need any additional details.
> > >> >> >>>
> > >> >> >>> Regards,
> > >> >> >>> Anil
> > >> >> >>>
> > >> >> >>>
> > >> >> >>> On 13 March 2016 at 21:48, Anil <anilk...@gmail.com> wrote:
> > >> >> >>>
> > >> >> >>> > Thanks Toke and Jack.
> > >> >> >>> >
> > >> >> >>> > Jack,
> > >> >> >>> >
> > >> >> >>> > Yes, it is 480 million :)
> > >> >> >>> >
> > >> >> >>> > I will share the additional details soon. Thanks.
> > >> >> >>> >
> > >> >> >>> >
> > >> >> >>> > Regards,
> > >> >> >>> > Anil
> > >> >> >>> >
> > >> >> >>> >
> > >> >> >>> >
> > >> >> >>> >
> > >> >> >>> >
> > >> >> >>> > On 13 March 2016 at 21:06, Jack Krupansky <jack.krupan...@gmail.com>
> > >> >> >>> > wrote:
> > >> >> >>> >
> > >> >> >>> >> (We should have a wiki/doc page for the "usual list of
> > >> >> >>> >> suspects" when queries are/appear slow, rather than needing to
> > >> >> >>> >> repeat the same mantra(s) for every inquiry on this topic.)
> > >> >> >>> >>
> > >> >> >>> >>
> > >> >> >>> >> -- Jack Krupansky
> > >> >> >>> >>
> > >> >> >>> >> On Sun, Mar 13, 2016 at 11:29 AM, Toke Eskildsen <t...@statsbiblioteket.dk>
> > >> >> >>> >> wrote:
> > >> >> >>> >>
> > >> >> >>> >> > Anil <anilk...@gmail.com> wrote:
> > >> >> >>> >> > > I have indexed data (commands from files) with 10 fields,
> > >> >> >>> >> > > and 3 of them are text fields. The collection is created
> > >> >> >>> >> > > with 3 shards and 2 replicas. I have used document routing
> > >> >> >>> >> > > as well.
> > >> >> >>> >> >
> > >> >> >>> >> > > Currently collection holds 47,80,01,405 records.
> > >> >> >>> >> >
> > >> >> >>> >> > ...480 million, right? Funny digit grouping in India.
> > >> >> >>> >> >
> > >> >> >>> >> > > A text search against a text field takes around 5 sec. The
> > >> >> >>> >> > > Solr query is just an AND of two terms, with fl as 7 fields:
> > >> >> >>> >> >
> > >> >> >>> >> > > fileId:"file unique id" AND command_text:(system login)
> > >> >> >>> >> >
> > >> >> >>> >> > While not an impressive response time, it might just be that
> > >> >> >>> >> > your hardware is not enough to handle that number of
> > >> >> >>> >> > documents. The usual culprit is IO speed, so chances are you
> > >> >> >>> >> > have a system with spinning drives and not enough RAM: switch
> > >> >> >>> >> > to SSD and/or add more RAM.
> > >> >> >>> >> >
> > >> >> >>> >> > To give better advice, we need more information.
> > >> >> >>> >> >
> > >> >> >>> >> > * How large are your 3 shards in bytes?
> > >> >> >>> >> > * What storage system do you use (local SSD, local spinning
> > >> >> >>> >> > drives, remote storage...)?
> > >> >> >>> >> > * How much physical memory does your system have?
> > >> >> >>> >> > * How much memory is free for disk cache?
> > >> >> >>> >> > * How many concurrent queries do you issue?
> > >> >> >>> >> > * Do you update while you search?
> > >> >> >>> >> > * What does a full query (rows, faceting, grouping,
> > >> >> >>> >> > highlighting, everything) look like?
> > >> >> >>> >> > * How many documents does a typical query match (hitcount)?
> > >> >> >>> >> >
> > >> >> >>> >> > - Toke Eskildsen
> > >> >> >>> >> >
> > >> >> >>> >>
> > >> >> >>> >
> > >> >> >>> >
> > >> >> >>>
> > >> >> >>
> > >> >> >>
> > >> >>
> > >>
> >
>
