Hi Kevin,
Thanks for your suggestions - I've got about 6 million documents, and
I'm being quite stingy with my schema at the moment, I'm afraid.
If anything, the size of each document is going to go up, not down,
but I might be able to prune some older, unused data.
James
On 3 Mar 2008, at 14:33, Kevin Lewandowski wrote:
How many documents are in the index?
If you haven't already done this I'd take a really close look at your
schema and make sure you're only storing the things that should really
be stored, same with the indexed fields. I drastically reduced my
index size just by changing some indexed/stored options on a few
fields.
On Thu, Feb 28, 2008 at 10:54 PM, Otis Gospodnetic
<[EMAIL PROTECTED]> wrote:
James,
I can't comment more on the SN's arch choices.
Here is the story on your questions:
- 1 Solr instance can hold 1+ indices, either via JNDI (see the Wiki)
or via the new multi-core support, which works but is still being
hacked on.
- See SOLR-303 in JIRA for distributed search. Yonik committed it
just the other day, so it's now in the nightly builds if you want to
try it. There are two Wiki pages about that, too; see the Recent
changes log on the Wiki to quickly find them.
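For reference, the multi-core support being hacked on at the time was
configured through a solr.xml file in the Solr home directory listing
one <core> per index, all served by a single instance. A minimal
sketch - the core names and directories here are made up for
illustration:

```xml
<!-- solr.xml: each <core> is an independent index with its own
     conf/ and data/ under instanceDir, but all cores share one
     JVM and one Solr webapp. -->
<solr persistent="false">
  <cores adminPath="/admin/cores">
    <core name="users0" instanceDir="users0"/>
    <core name="users1" instanceDir="users1"/>
  </cores>
</solr>
```

Cores can then be queried independently, e.g. /solr/users0/select
versus /solr/users1/select.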
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
From: James Brady <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Friday, February 29, 2008 1:11:07 AM
Subject: Re: Strategy for handling large (and growing) index:
horizontal partitioning?
Hi Otis,
Thanks for your comments -- I didn't realise the wiki is open to
editing; my apologies. I've put in a few words to try and clear
things up a bit.
So determining n will probably be a best guess followed by trial and
error, that's fine. I'm still not clear about whether a single Solr
server can operate across several indices, however - can anyone give
me some pointers here?
An alternative would be to have 1 index per instance, and n instances
per server, where n is small. This might actually be a practical
solution - I'm spending ~20% of my time committing, so I should
probably have only 3 or 4 indices in total per server to avoid two
committing at the same time.
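To put a rough number on that: treating each index as independently
spending 20% of its time in a commit (a crude binomial model, not
anything measured from Solr itself), the chance of two commits
overlapping at any instant can be sketched like this:

```python
def p_overlap(k, duty=0.2):
    """Probability that two or more of k independent indices are
    committing at the same instant, given each spends `duty` of its
    time committing. 1 minus P(none committing) minus P(exactly one)."""
    none = (1 - duty) ** k
    one = k * duty * (1 - duty) ** (k - 1)
    return 1 - none - one

for k in (2, 3, 4, 6):
    print(k, round(p_overlap(k), 3))
```

Under those assumptions the overlap probability is about 4% at 2
indices, 10% at 3, and 18% at 4, which is consistent with keeping n
down around 3 or 4 per server.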
Your mention of The Large Social Network was interesting! A social
network's data is by definition pretty poorly partitioned by user id,
so unless they've done something extremely clever like co-locating
social cliques in the same indices, I would have thought it would be a
sub-optimal architecture. If my friends and I are scattered around
different indices, each search would have to be federated massively.
James
On 28 Feb 2008, at 20:49, Otis Gospodnetic wrote:
James,
Regarding your questions about n users per index - this is a fine
approach. The largest Social Network that you know of uses the
same approach for various things, including full-text indices (not
Solr, but close). You'd have to maintain user->shard/index mapping
somewhere, of course. What should n be, you ask? Weigh the overall
index size against server capabilities (RAM, disk, CPU), I'd say, and
increase n up to the point where you're maximizing your hardware at
some target query rate.
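The user->shard/index mapping mentioned above could be as simple as a
lookup table, or a stable hash if you never rebalance. A minimal
sketch of the hash variant - the function name and shard count are
made up for illustration, not anything Solr provides:

```python
import hashlib

NUM_SHARDS = 8  # hypothetical number of indices across all servers

def shard_for_user(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user id to a stable shard number via a hash, so the
    user -> index assignment is deterministic across processes
    (as long as num_shards never changes)."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# All of a user's documents land in one index, so their searches hit
# a single shard instead of being federated across every index.
print(shard_for_user("user-42"))
```

A table in a database instead of a hash costs a lookup per request but
lets you move hot users to their own shard later, which the pure hash
cannot do without rehashing everyone.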
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
From: James Brady
To: solr-user@lucene.apache.org
Sent: Wednesday, February 27, 2008 10:08:02 PM
Subject: Strategy for handling large (and growing) index:
horizontal partitioning?
Hi all,
Our current setup is a master and slave pair on a single machine,
with an index size of ~50GB.
Query and update times are still respectable, but commits are taking
~20% of time on the master, while our daily index optimise can take up
to 4 hours...
Here's the most relevant part of solrconfig.xml:
true
10
1000
10000
10000
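(The XML tags around the five values above appear to have been
stripped by the list archive. Purely as a guess - the tag names below
are assumptions matched against the stock solrconfig.xml of that era,
and only the values come from the message - the section may have read
something like:

```xml
<mainIndex>
  <useCompoundFile>true</useCompoundFile>
  <mergeFactor>10</mergeFactor>
  <maxBufferedDocs>1000</maxBufferedDocs>
  <maxMergeDocs>10000</maxMergeDocs>
  <maxFieldLength>10000</maxFieldLength>
</mainIndex>
```

)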
I've given both master and slave 2.5GB of RAM.
Does an index optimise read and re-write the whole thing? If so,
taking about 4 hours is pretty good! However, the documentation
here:
http://wiki.apache.org/solr/CollectionDistribution?highlight=%28ten+minutes%29#head-cf174eea2524ae45171a8486a13eea8b6f511f8b
states "Optimizations can take nearly ten minutes to run..." which
leads me to think that we've grossly misconfigured something...
Firstly, we would obviously love any way to reduce this optimise time
- I have yet to experiment extensively with the settings above and
with optimise frequency, but some general guidance would be great.
Secondly, this index size is increasing monotonically over time as we
acquire new users. We need to take action to ensure we can scale in
the future. The approach we're favouring at the moment is horizontal
partitioning of indices by user id, as our data suits this scheme
well. A given index would hold the indexed data for n users, where n
would probably be between 1 and 100, and we would have multiple
indices per search server.
Running a server per index is impractical, especially for a small n,
so is a single Solr instance capable of managing multiple searchers
and writers in this way? Following on from that, does anyone know of
limiting factors in Solr or Lucene that would influence our decision
on the value of n - the number of users per index?
Thanks!
James