Re: Strategy for handling large (and growing) index: horizontal partitioning?

Kevin Lewandowski Mon, 03 Mar 2008 14:34:04 -0800

How many documents are in the index?

If you haven't already done this I'd take a really close look at your
schema and make sure you're only storing the things that should really
be stored, same with the indexed fields. I drastically reduced my
index size just by changing some indexed/stored options on a few
fields.


On Thu, Feb 28, 2008 at 10:54 PM, Otis Gospodnetic
<[EMAIL PROTECTED]> wrote:
> James,
>
>  I can't comment more on the SN's arch choices.
>
>  Here is the story about your questions
>  - 1 Solr instance can hold 1+ indices, either via JNDI (see Wiki) or via the
>    new multi-core support which works, but is still being hacked on
>  - See SOLR-303 in JIRA for distributed search.  Yonik committed it just the
>    other day, so now that's in nightly builds if you want to try it.  There are
>    2 Wiki pages about that, too, see Recent changes log on the Wiki to quickly
>    find them.
>
>
>  Otis
>  --
>  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>  ----- Original Message ----
>  > From: James Brady <[EMAIL PROTECTED]>
>  > To: solr-user@lucene.apache.org
>  > Sent: Friday, February 29, 2008 1:11:07 AM
>  > Subject: Re: Strategy for handling large (and growing) index: horizontal partitioning?
>  >
>  > Hi Otis,
>  > Thanks for your comments -- I didn't realise the wiki is open to
>  > editing; my apologies. I've put in a few words to try and clear
>  > things up a bit.
>  >
>  > So determining n will probably be a best guess followed by trial and
>  > error, that's fine. I'm still not clear about whether single Solr
>  > servers can operate across several indices, however.. can anyone give
>  > me some pointers here?
>  > An alternative would be to have 1 index per instance, and n instances
>  > per server, where n is small. This might actually be a practical
>  > solution -- I'm spending ~20% of my time committing, so I should
>  > probably only have 3 or 4 indices in total per server to avoid two
>  > committing at the same time.
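
A rough way to sanity-check the "3 or 4 indices" figure above is to treat each index as committing during an independent 20% of the time and ask how often two or more would overlap. The independence assumption is mine, not from the thread (real commits probably cluster around update bursts), and the function name is made up for illustration.

```python
def p_overlapping_commits(num_indices: int, commit_fraction: float = 0.2) -> float:
    """Probability that two or more indices are committing at the same moment.

    Assumption: each index commits independently for `commit_fraction`
    of the time. Real commit schedules may well be correlated.
    """
    still = 1 - commit_fraction
    p_none = still ** num_indices
    p_exactly_one = num_indices * commit_fraction * still ** (num_indices - 1)
    return 1 - p_none - p_exactly_one


for n in range(2, 6):
    print(n, "indices:", round(p_overlapping_commits(n), 3))
```

Under that assumption the overlap chance is about 10% with 3 indices and about 18% with 4, so the estimate looks reasonable as a starting point for trial and error.
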
>  >
>  > Your mention of The Large Social Network was interesting! A social
>  > network's data is by definition pretty poorly partitioned by user id,
>  > so unless they've done something extremely clever like co-locating
>  > social cliques in the same indices, I would have thought it would be a
>  > sub-optimal architecture. If my friends and I are scattered around
>  > different indices, each search would have to be federated massively.
>  >
>  > James
>  >
>  >
>  > On 28 Feb 2008, at 20:49, Otis Gospodnetic wrote:
>  >
>  > > James,
>  > >
>  > > Regarding your questions about n users per index - this is a fine
>  > > approach.  The largest Social Network that you know of uses the
>  > > same approach for various things, including full-text indices (not
>  > > Solr, but close).  You'd have to maintain user->shard/index mapping
>  > > somewhere, of course.  What should the n be, you ask?  Look at the
>  > > overall index size, I'd say, against server capabilities (RAM,
>  > > disk, CPU), increase n up to a point where you're maximizing your
>  > > hardware at some target query rate.
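
The user->shard/index mapping mentioned above could be sketched roughly as follows. This is only an illustration under my own assumptions: the class and method names are hypothetical, the mapping lives in an in-memory dict where a real setup would need a persistent store, and users are simply packed into the newest index until it holds n of them.

```python
class ShardDirectory:
    """Assigns each user to an index (shard) holding at most `users_per_shard` users."""

    def __init__(self, users_per_shard: int):
        if users_per_shard < 1:
            raise ValueError("users_per_shard must be at least 1")
        self.users_per_shard = users_per_shard
        self._shard_of = {}     # user_id -> shard number (hypothetical in-memory store)
        self._shard_sizes = []  # shard number -> number of users assigned

    def shard_for(self, user_id: str) -> int:
        """Return the user's shard, assigning one the first time the user is seen."""
        if user_id in self._shard_of:
            return self._shard_of[user_id]
        # Fill the newest shard first; open a new one once it is full.
        if not self._shard_sizes or self._shard_sizes[-1] >= self.users_per_shard:
            self._shard_sizes.append(0)
        shard = len(self._shard_sizes) - 1
        self._shard_sizes[shard] += 1
        self._shard_of[user_id] = shard
        return shard


directory = ShardDirectory(users_per_shard=2)
for user in ["alice", "bob", "carol", "alice"]:
    print(user, "-> index", directory.shard_for(user))
```

Keeping an explicit table rather than hashing the user id means n can be changed later without moving users who are already assigned, at the cost of one more lookup per request.
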
>  > >
>  > > Otis
>  > > --
>  > > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>  > >
>  > > ----- Original Message ----
>  > >> From: James Brady
>  > >> To: solr-user@lucene.apache.org
>  > >> Sent: Wednesday, February 27, 2008 10:08:02 PM
>  > >> Subject: Strategy for handling large (and growing) index:
>  > >> horizontal partitioning?
>  > >>
>  > >> Hi all,
>  > >> Our current setup is a master and slave pair on a single machine,
>  > >> with an index size of ~50GB.
>  > >>
>  > >> Query and update times are still respectable, but commits are taking
>  > >> ~20% of time on the master, while our daily index optimise can take up to
>  > >> 4 hours...
>  > >> Here's the most relevant part of solrconfig.xml:
>  > >>      true
>  > >>      10
>  > >>      1000
>  > >>      10000
>  > >>      10000
>  > >>
>  > >> I've given both master and slave 2.5GB of RAM.
>  > >>
>  > >> Does an index optimise read and re-write the whole thing? If so,
>  > >> taking about 4 hours is pretty good! However, the documentation here:
>  > >> http://wiki.apache.org/solr/CollectionDistribution?highlight=%28ten
>  > >> +minutes%29#head-cf174eea2524ae45171a8486a13eea8b6f511f8b
>  > >> states "Optimizations can take nearly ten minutes to run..." which
>  > >> leads me to think that we've grossly misconfigured something...
>  > >>
>  > >> Firstly, we would obviously love any way to reduce this optimise time
>  > >> - I have yet to experiment extensively with the settings above, and
>  > >> optimise frequency, but some general guidance would be great.
>  > >>
>  > >> Secondly, this index size is increasing monotonically over time and as
>  > >> we acquire new users. We need to take action to ensure we can scale
>  > >> in the future. The approach we're favouring at the moment is
>  > >> horizontal partitioning of indices by user id as our data suits this
>  > >> scheme well. A given index would hold the indexed data for n users,
>  > >> where n would probably be between 1 and 100 users, and we will have
>  > >> multiple indices per search server.
>  > >>
>  > >> Running a server per index is impractical, especially for a small n, so
>  > >> is a single Solr instance capable of managing multiple searchers and
>  > >> writers in this way? Following on from that, does anyone know of
>  > >> limiting factors in Solr or Lucene that would influence our decision
>  > >> on the value of n - the number of users per index?
>  > >>
>  > >> Thanks!
>  > >> James
>  > >>
>  > >>
>  > >>
>  > >>
>  > >
>  > >
>  >
>  >
>
>
>
