Re: Strategy for handling large (and growing) index: horizontal partitioning?

James Brady Thu, 28 Feb 2008 22:11:54 -0800

Hi Otis,

Thanks for your comments -- I didn't realise the wiki is open to editing; my apologies. I've put in a few words to try and clear things up a bit.

So determining n will probably be a best guess followed by trial and error, that's fine. I'm still not clear about whether a single Solr server can operate across several indices, however. Can anyone give me some pointers here?

An alternative would be to have 1 index per instance, and n instances per server, where n is small. This might actually be a practical solution -- I'm spending ~20% of my time committing, so I should probably only have 3 or 4 indices in total per server to avoid two committing at the same time.
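
As a rough check on that "3 or 4 indices" figure, here is a minimal sketch. It assumes each index independently spends 20% of its time committing, which is a simplification: smaller partitioned indices may well commit faster, and commit schedules are unlikely to be truly independent. The function name p_overlap is made up for illustration.

```python
def p_overlap(k: int, p: float = 0.2) -> float:
    """Probability that two or more of k indices are committing at the same instant.

    Assumes each index is committing a fraction p of the time, independently
    of the others (an assumption, not a measurement).
    """
    none_committing = (1 - p) ** k
    exactly_one = k * p * (1 - p) ** (k - 1)
    return 1 - none_committing - exactly_one


if __name__ == "__main__":
    # Print the overlap probability for 1 to 6 indices on one server.
    for k in range(1, 7):
        print(k, round(p_overlap(k), 3))
```

Under these assumptions the overlap probability climbs steadily with each index added, so the ceiling depends on how much concurrent committing the server can tolerate.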

Your mention of The Large Social Network was interesting! A social network's data is by definition pretty poorly partitioned by user id, so unless they've done something extremely clever like co-locating social cliques in the same indices, I would have thought it would be a sub-optimal architecture. If my friends and I are scattered around different indices, each search would have to be federated massively.
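
To put a rough number on "federated massively", here is a back-of-envelope sketch. It assumes friends are spread over the indices uniformly at random, which is purely an assumption for illustration and not a description of how any real network places its users. The function name expected_indices_hit is hypothetical.

```python
def expected_indices_hit(num_indices: int, num_friends: int) -> float:
    """Expected number of distinct indices that hold at least one friend.

    Assumes each friend lands in one of num_indices indices uniformly at
    random and independently (an illustrative assumption).
    """
    if num_indices <= 0:
        raise ValueError("num_indices must be positive")
    # Probability that a given index holds none of the friends.
    p_empty = (1 - 1 / num_indices) ** num_friends
    return num_indices * (1 - p_empty)


if __name__ == "__main__":
    # How many of 100 indices a search over 10, 50 or 150 friends touches.
    for friends in (10, 50, 150):
        print(friends, round(expected_indices_hit(100, friends), 1))
```

With random placement, a search over even a modest friend list touches a large share of the indices, which is why co-locating cliques would matter.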


James


On 28 Feb 2008, at 20:49, Otis Gospodnetic wrote:

James,

Regarding your questions about n users per index - this is a fine approach. The largest Social Network that you know of uses the same approach for various things, including full-text indices (not Solr, but close). You'd have to maintain user->shard/index mapping somewhere, of course. What should the n be, you ask? Look at the overall index size, I'd say, against server capabilities (RAM, disk, CPU), increase n up to a point where you're maximizing your hardware at some target query rate.
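
One way the user->shard/index mapping could be kept is sketched below. It is a minimal in-memory illustration, not how any particular system does it: the class ShardMap, its fields and the fill-the-first-shard-with-room policy are all hypothetical, and a real deployment would persist the mapping somewhere durable.

```python
from dataclasses import dataclass, field


@dataclass
class ShardMap:
    """Explicit user -> shard lookup with at most users_per_shard users each."""

    users_per_shard: int
    user_to_shard: dict = field(default_factory=dict)
    shard_sizes: list = field(default_factory=list)

    def assign(self, user_id):
        """Return the shard for user_id, allocating one on first sight."""
        if user_id in self.user_to_shard:
            return self.user_to_shard[user_id]
        # Use the first shard that still has room, else open a new one.
        for shard, size in enumerate(self.shard_sizes):
            if size < self.users_per_shard:
                break
        else:
            self.shard_sizes.append(0)
            shard = len(self.shard_sizes) - 1
        self.shard_sizes[shard] += 1
        self.user_to_shard[user_id] = shard
        return shard


if __name__ == "__main__":
    shards = ShardMap(users_per_shard=2)
    for user in ["alice", "bob", "carol", "alice"]:
        print(user, shards.assign(user))
```

An explicit table like this, rather than hashing the user id, lets n be changed later without moving existing users.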


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----

From: James Brady <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Wednesday, February 27, 2008 10:08:02 PM

Subject: Strategy for handling large (and growing) index: horizontal partitioning?


Hi all,
Our current setup is a master and slave pair on a single machine,
with an index size of ~50GB.

Query and update times are still respectable, but commits are taking
~20% of time on the master, while our daily index optimise can take up to
4 hours...
Here's the most relevant part of solrconfig.xml:
     true
     10
     1000
     10000
     10000

I've given both master and slave 2.5GB of RAM.

Does an index optimise read and re-write the whole thing? If so,
taking about 4 hours is pretty good! However, the documentation here:
http://wiki.apache.org/solr/CollectionDistribution?highlight=%28ten+minutes%29#head-cf174eea2524ae45171a8486a13eea8b6f511f8b
states "Optimizations can take nearly ten minutes to run..." which
leads me to think that we've grossly misconfigured something...

Firstly, we would obviously love any way to reduce this optimise time
- I have yet to experiment extensively with the settings above, and
optimise frequency, but some general guidance would be great.

Secondly, this index size is increasing monotonically over time and as
we acquire new users. We need to take action to ensure we can scale
in the future. The approach we're favouring at the moment is
horizontal partitioning of indices by user id as our data suits this
scheme well. A given index would hold the indexed data for n users,
where n would probably be between 1 and 100 users, and we will have
multiple indices per search server.

Running a server per index is impractical, especially for a small n, so
is a single Solr instance capable of managing multiple searchers and
writers in this way? Following on from that, does anyone know of
limiting factors in Solr or Lucene that would influence our decision
on the value of n - the number of users per index?

Thanks!
James
