This may be slightly off topic, for which I apologize, but it is related to the question of searching several indexes, as Lance describes below, quoting:

"We also found that searching a few smaller indexes via the Solr 1.3 Distributed Search feature is actually faster than searching one large
index, YMMV."

The wiki page describing distributed search lists several limitations, two of which in particular set me wondering what their impact is, mainly with respect to scoring:

1) No distributed idf

Does this mean that the Lucene score is computed without the idf factor, i.e. that we effectively get term-frequency-only scoring?
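
My naive guess, sketched below, is that each shard instead computes idf from its own local docFreq and numDocs rather than from corpus-wide statistics, so scores from different shards are not strictly comparable. The numbers and the DefaultSimilarity-style formula in this Python sketch are purely illustrative, not Solr code:

  import math

  def idf(num_docs, doc_freq):
      # Classic Lucene DefaultSimilarity-style idf: 1 + ln(N / (df + 1))
      return 1.0 + math.log(num_docs / (doc_freq + 1.0))

  # A term that is common on one shard but rare on the other:
  shards = [
      {"num_docs": 1000000, "doc_freq": 50000},  # term is common here
      {"num_docs": 1000000, "doc_freq": 10},     # term is rare here
  ]

  # Without distributed idf, each shard scores with its *local* statistics...
  local_idfs = [idf(s["num_docs"], s["doc_freq"]) for s in shards]

  # ...instead of the corpus-wide value a single big index would use.
  global_idf = idf(sum(s["num_docs"] for s in shards),
                   sum(s["doc_freq"] for s in shards))

  print("local idfs:", [round(v, 2) for v in local_idfs])   # [4.0, 12.42]
  print("global idf:", round(global_idf, 2))                # 4.69

Is that roughly right, or is idf dropped entirely?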

2) Doesn't support consistency between stages, e.g. a shard index can be changed between STAGE_EXECUTE_QUERY and STAGE_GET_FIELDS

What does this mean or where can I find out what it means?

Thanks!

Phil




Lance Norskog wrote:
Yes, I've done this split-by-delete several times. The halved index still
uses as much disk space until you optimize it.

As to splitting policy: we use an MD5 signature as our unique ID. This has
the lovely property that we can wildcard.  'contentid:f*' denotes 1/16 of
the whole index. This 1/16 is a very random sample of the whole index. We
use this for several things. If we use this for shards, we have a query that
matches a shard's contents.

The Solr/Lucene query syntax does not support modular arithmetic, and so it will
not let you query for the subset of documents that matches one of your shards.
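
To make that concrete, here's a rough sketch of the scheme (Python purely for illustration; 'contentid' is our field name, the rest is hypothetical):

  import hashlib

  def content_id(doc_body):
      # Our unique key is the MD5 of the content, as a 32-char hex string.
      return hashlib.md5(doc_body.encode("utf-8")).hexdigest()

  def shard_query(hex_prefix):
      # Each leading hex digit selects 1/16 of the corpus; two digits 1/256, etc.
      return "contentid:%s*" % hex_prefix

  print(content_id("some document text"))   # a 32-char hex string
  print(shard_query("f"))                   # contentid:f* -> ~1/16 of the index

  # By contrast, a modulo-based scheme (id % 16 == 3) has no equivalent
  # query in the Solr/Lucene syntax, which is the limitation above.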

We also found that searching a few smaller indexes via the Solr 1.3
Distributed Search feature is actually faster than searching one large
index, YMMV. So for us a large pile of shards will be optimal anyway, and we
have no need to "rebalance".

It sounds like you're not storing the data in a backing store, but are
storing all data in the index itself. We have found this "challenging".

Cheers,

Lance Norskog

-----Original Message-----
From: Jeremy Hinegardner [mailto:[EMAIL PROTECTED]]
Sent: Friday, June 13, 2008 3:36 PM
To: solr-user@lucene.apache.org
Subject: Re: scaling / sharding questions

Sorry for not keeping this thread alive; let's see what we can do...

One option I've thought of for 'resharding' would be splitting an index into
two by just copying it, then deleting half the documents from one copy, doing a
commit, and deleting the other half from the other copy and committing.  That is:

  1) Take original index
  2) copy to b1 and b2
  3) delete docs from b1 that match a particular query A
  4) delete docs from b2 that do not match a particular query A
  5) commit b1 and b2

Has anyone tried something like that?
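
For concreteness, here's roughly what I have in mind for steps 3-5 against the
two copies, using Solr's XML update handler (the URLs, core names, and the
'contentid' field are invented for the example; only the <delete>/<commit>
commands themselves are standard):

  import urllib.request

  SOLR_B1 = "http://localhost:8983/solr/b1"   # copy 1
  SOLR_B2 = "http://localhost:8983/solr/b2"   # copy 2

  def post_xml(solr_url, xml):
      # Send an XML command to the core's update handler.
      req = urllib.request.Request(
          solr_url + "/update",
          data=xml.encode("utf-8"),
          headers={"Content-Type": "text/xml; charset=utf-8"},
      )
      return urllib.request.urlopen(req).read()

  # Hypothetical "query A": with MD5-hex unique ids as described upthread,
  # the prefixes 0-7 select roughly half of the documents.
  query_a = " OR ".join("contentid:%x*" % d for d in range(8))

  # Step 3: delete the docs matching A from b1.
  post_xml(SOLR_B1, "<delete><query>%s</query></delete>" % query_a)
  # Step 4: delete the docs NOT matching A from b2.
  post_xml(SOLR_B2, "<delete><query>*:* -(%s)</query></delete>" % query_a)
  # Step 5: commit both halves (an optimize later reclaims the disk space).
  for url in (SOLR_B1, SOLR_B2):
      post_xml(url, "<commit/>")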

As for how to know where each document is stored, generally we're considering
unique_document_id % N.  If we rebalance we change N and redistribute, but that
probably will take too much time.  That makes us move more towards a staggered,
age-based approach where the most recent docs filter down to "permanent"
indexes based upon time.
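
A quick back-of-the-envelope check of why changing N is so expensive
(hypothetical numbers, just hashing made-up ids):

  import hashlib

  def shard(doc_id, n):
      # Stand-in for unique_document_id % N.
      return int(hashlib.md5(doc_id.encode("utf-8")).hexdigest(), 16) % n

  doc_ids = ["doc-%d" % i for i in range(100000)]
  moved = sum(1 for d in doc_ids if shard(d, 16) != shard(d, 17))
  print("%.0f%% of docs change shards" % (100.0 * moved / len(doc_ids)))  # ~94%

Going from 16 to 17 shards relocates roughly 94% of the documents, which is why
we lean toward the age-based approach instead.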

Another thought we've had recently is to have many, many physical shards on
the indexing/writer side, but then merge groups of them into logical shards
which are snapshotted out to the reader Solrs on a frequent basis.  I haven't
done any testing along these lines, but logically it seems like an idea worth
pursuing.
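
One possible way to do the merge step offline would be Lucene's IndexMergeTool
from the misc contrib jar; a rough sketch (all paths and jar names here are
made up):

  import glob
  import subprocess

  # Merge a group of small "physical" writer shards into one "logical" shard
  # whose snapshot gets pushed to the reader Solrs.
  physical = sorted(glob.glob("/data/writer/shard-*/index"))
  merged = "/data/logical/shard-A/index"

  subprocess.check_call(
      ["java", "-cp", "lucene-core.jar:lucene-misc.jar",
       "org.apache.lucene.misc.IndexMergeTool", merged] + physical
  )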

enjoy,

-jeremy

On Fri, Jun 06, 2008 at 03:14:10PM +0200, Marcus Herou wrote:
Cool sharding technique.

We are likewise thinking about how to "move" docs from one index to another,
because we need to re-balance the docs when we add new nodes to the cluster.
We only store ids in the index; otherwise we could have moved things around
with IndexReader.document(x) or similar.  Luke (http://www.getopt.org/luke/)
is able to reconstruct the indexed Document data, so it should be doable.
However, I'm thinking of just deleting the docs from the old index and adding
new Documents on the new node.  It would be nice not to waste CPU cycles
reindexing already-indexed content, but...

And we will have data amounts in the range you are talking about as well.
Perhaps we could share ideas?

How do you plan to record where each document is located?  You probably need
to store info about each Document and its location somewhere, perhaps in a
clustered DB?  We will probably go with HBase for this.

I think the number of documents is less important than the actual data size
(just speculating).  We currently search 10M indexed blog entries (this will
get much, much larger) on one machine where the JVM has a 1G heap; the index
size is 3G and response times are still quite fast.  This is a read-only node,
though, and it is updated every morning with a freshly optimized index.
Someone told me that you probably need twice the RAM if you plan to both index
and search at the same time.  If I were you I would index X entries of your
data, then search the index with lower JVM heap settings each round; when
response times get too slow or you hit an OutOfMemoryError, you have a rough
estimate of the bare minimum RAM needed for X entries.

I think we will manage with something like 2G per 50M docs, but I will need to
test that.
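
As a crude way to extrapolate from one measured point (pure speculation, same
caveats as above):

  # Linear extrapolation from a measured heap/doc-count point.
  measured_docs = 50000000
  measured_heap_gb = 2.0

  def heap_estimate_gb(target_docs):
      # Assumes heap scales roughly linearly with doc count, which is only a
      # first approximation (sorting, faceting and caches change the picture).
      return measured_heap_gb * target_docs / measured_docs

  print(heap_estimate_gb(1000000000))   # ~40 GB for a billion docs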

If you get an answer in this matter please let me know.

Kindly

//Marcus


On Fri, Jun 6, 2008 at 7:21 AM, Jeremy Hinegardner <[EMAIL PROTECTED]>
wrote:

Hi all,

This may be a bit rambling, but let's see how it goes.  I'm not a Lucene or Solr guru by any means, but I have been prototyping with Solr and getting a feel for how all the pieces and parts fit together.

We are migrating our current document storage infrastructure to a decent-sized
Solr cluster, using 1.3 snapshots right now.  Eventually this will be in the
billion-plus document range, with about 1M new documents added per day.

Our main sticking point right now is that a significant number of our documents will be updated, at least once, but possibly more than once. The volatility of a document decreases over time.

With this in mind, we've been considering using a cascading series of shard clusters.  That is:

  1) A cluster of shards holding recent data (the most recent week or two):
     smaller indexes that take a small amount of time to commit updates and
     optimise, since this tier will hold the most volatile documents.

  2) Following that, another cluster of shards holding relatively recent
     (3-6 months?) but not super-volatile documents; these are items that
     could potentially receive updates, but generally do not.

  3) A final set of 'archive' shards holding the final resting place for
     documents.  These would not receive updates, and would be online for
     searching and analysis "forever".

We are not sure if this is the best way to go, but it is the approach we are leaning toward right now.  I would like some feedback from the folks here on whether you think that is a reasonable approach.
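
If we do go this route, one thing in its favour is that a single distributed
query can still span all the tiers via the standard shards parameter; something
like the following (host and core names invented, Python only for illustration):

  import json
  import urllib.parse
  import urllib.request

  # Hypothetical hosts: two "recent" shards, two mid-age shards, one archive.
  SHARDS = ",".join([
      "recent1:8983/solr", "recent2:8983/solr",
      "mid1:8983/solr", "mid2:8983/solr",
      "archive1:8983/solr",
  ])

  params = urllib.parse.urlencode({
      "q": "body:foo",
      "shards": SHARDS,   # Solr 1.3 distributed search across all tiers
      "wt": "json",
      "rows": 10,
  })
  url = "http://recent1:8983/solr/select?" + params
  response = json.load(urllib.request.urlopen(url))
  print(response["response"]["numFound"])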

One of the other things I'm wondering about is how to manipulate indexes.
We'll need to roll documents around between indexes over time, or at least
migrate indexes from one set of shards to another as the documents 'age', and
merge/aggregate them with more 'stable' indexes.  I know about merging
complete indexes together, but what about migrating a subset of documents
from one index into another index?

In addition, what is generally considered a 'manageable' large index size?  I was attempting to find some information on the relationship between search response times, the amount of memory used for a search, and the number of documents in an index, but I wasn't having much luck.

I'm not sure if I'm making sense here, but I just thought I would throw this out there and see what people think.  There is the distinct possibility that I am not asking the right questions or considering the right parameters, so feel free to correct me, or ask questions as you see fit.

And yes, I will report how we are doing things when we get this all figured out, and if there are items that we can contribute back to Solr, we will.  If nothing else there will be a nice article on how we manage terabytes of data with Solr.

enjoy,

-jeremy

--
========================================================================
 Jeremy Hinegardner                              [EMAIL PROTECTED]



--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
[EMAIL PROTECTED]
http://www.tailsweep.com/
http://blogg.tailsweep.com/

--
========================================================================
Jeremy Hinegardner [EMAIL PROTECTED]
