Hi,

I agree, there is definitely no generic answer. The best sources I have found
so far relating to performance are:

  http://wiki.apache.org/solr/SolrPerformanceData
  http://wiki.apache.org/solr/SolrPerformanceFactors
  http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
  http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
  http://lucene.apache.org/java/docs/benchmarks.html

Most of the items discussed on those pages, though, relate directly to
speeding up searching and indexing. The relationship I am looking for is how
index size affects searching and indexing performance, and that particular
question doesn't appear to be answered. If no one has any information on that
front, I guess I'll just have to dive in and figure it out :-).

As for storing the fields, our initial testing shows that we get better
performance overall by storing the data in Solr and returning it with the
results, instead of using the results to go look up the original documents
elsewhere. Is there something I am missing here?
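For reference, here is roughly what the retrieval path looks like when the
fields are stored in Solr — a minimal SolrJ sketch; the field names, query,
and server URL are illustrative, not our actual setup:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class StoredFieldFetch {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

            SolrQuery query = new SolrQuery("body:lucene");
            // ask Solr to return the stored fields with each hit, so no
            // second round-trip to an external document store is needed
            query.setFields("id", "title", "body");
            query.setRows(10);

            QueryResponse rsp = server.query(query);
            for (SolrDocument doc : rsp.getResults()) {
                System.out.println(doc.getFieldValue("id") + ": "
                                 + doc.getFieldValue("title"));
            }
        }
    }

The alternative — storing only ids and dereferencing every hit against the
original document store — costs an extra lookup per result, which is exactly
the round trip we are seeing the savings from.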
enjoy,

-jeremy

On Fri, Jun 06, 2008 at 09:01:14AM -0700, Otis Gospodnetic wrote:
> Hola,
>
> That's a pretty big and open question, but here is some info.
>
> Jeremy's sharding approach sounds OK. We did something similar at
> Technorati, where a document/blog timestamp was the main sharding factor.
> You can't really move individual docs without reindexing (i.e. delete docX
> from shard1 and index docX to shard2), unless all your fields are stored,
> which you will not want to do with the data volumes you are describing.
>
> As for how much can be handled by a single machine, this is a FAQ and we
> really need to put it on the Lucene/Solr FAQ wiki page if it's not there
> already. The answer is: it depends on many factors (size of index, # of
> concurrent searches, complexity of queries, number of searchers, type of
> disk, amount of RAM, cache settings, # of CPUs...).
>
> The questions are right; it's just that there is no single non-generic
> answer.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
> ----- Original Message ----
> > From: Marcus Herou <[EMAIL PROTECTED]>
> > To: solr-user@lucene.apache.org; [EMAIL PROTECTED]
> > Sent: Friday, June 6, 2008 9:14:10 AM
> > Subject: Re: scaling / sharding questions
> >
> > Cool sharding technique.
> >
> > We as well are thinking of how to "move" docs from one index to another,
> > because we need to re-balance the docs when we add new nodes to the
> > cluster. We only store ids in the index; otherwise we could have moved
> > documents around with IndexReader.document(x) or the like. Luke
> > (http://www.getopt.org/luke/) is able to reconstruct the indexed
> > document data, so it should be doable. However, I'm thinking of actually
> > just deleting the docs from the old index and adding new documents to
> > the new node. It would be cool not to waste CPU cycles by reindexing
> > already indexed stuff, but...
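The delete-and-re-add move that Otis and Marcus describe could look roughly
like this in SolrJ. It is only a sketch: it assumes every field is stored
(which, as Otis notes, may be impractical at these volumes), the shard URLs
are placeholders, and docX stands in for a real unique key:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrInputDocument;

    public class MoveDoc {
        public static void main(String[] args) throws Exception {
            SolrServer shard1 = new CommonsHttpSolrServer("http://shard1:8983/solr");
            SolrServer shard2 = new CommonsHttpSolrServer("http://shard2:8983/solr");

            // fetch the stored copy of the document from the old shard
            SolrDocument old = shard1.query(new SolrQuery("id:docX"))
                                     .getResults().get(0);

            // copy each stored field into a new input document
            // (multi-valued fields may need getFieldValues() instead)
            SolrInputDocument copy = new SolrInputDocument();
            for (String name : old.getFieldNames()) {
                copy.addField(name, old.getFieldValue(name));
            }
            shard2.add(copy);
            shard2.commit();

            // remove it from the old shard only after the new shard has it
            shard1.deleteById("docX");
            shard1.commit();
        }
    }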
> > And we as well will have data amounts in the range you are talking
> > about. Perhaps we could share ideas?
> >
> > How do you plan to store where each document is located? I mean, you
> > probably need to store info about each document and its location
> > somewhere, perhaps in a clustered DB? We will probably go with HBase
> > for this.
> >
> > I think the number of documents is less important than the actual data
> > size (just speculating). We currently search 10M indexed blog entries
> > (this will get much, much larger) on one machine where the JVM has a 1G
> > heap; the index size is 3G and response times are still quite fast.
> > This is a read-only node, though, and is updated every morning with a
> > freshly optimized index. Someone told me that you probably need twice
> > the RAM if you plan to both index and search at the same time. If I
> > were you, I would index X entries of your data and then search the
> > index with lower JVM heap settings each round; when response times get
> > too slow or you hit an OutOfMemoryError, you have a rough estimate of
> > the bare minimum RAM needed for Y entries.
> >
> > I think we will do with something like 2G per 50M docs, but I will need
> > to test it out.
> >
> > If you get an answer in this matter, please let me know.
> >
> > Kindly
> >
> > //Marcus
> >
> >
> > On Fri, Jun 6, 2008 at 7:21 AM, Jeremy Hinegardner wrote:
> >
> > > Hi all,
> > >
> > > This may be a bit rambling, but let's see how it goes. I'm not a
> > > Lucene or Solr guru by any means; I have been prototyping with Solr
> > > and working out how all the pieces and parts fit together.
> > >
> > > We are migrating our current document storage infrastructure to a
> > > decent-sized Solr cluster, using 1.3-snapshots right now. Eventually
> > > this will be in the billion+ document range, with about 1M new
> > > documents added per day.
> > >
> > > Our main sticking point right now is that a significant number of our
> > > documents will be updated, at least once, but possibly more than
> > > once. The volatility of a document decreases over time.
> > >
> > > With this in mind, we've been considering using a cascading series of
> > > shard clusters. That is:
> > >
> > > 1) a cluster of shards holding recent data (the most recent week or
> > > two): smaller indexes that take a small amount of time to commit
> > > updates and optimise, since these will hold the most volatile
> > > documents.
> > >
> > > 2) following that, another cluster of shards holding relatively
> > > recent (3-6 months?), but not super volatile, documents; these are
> > > items that could potentially receive updates, but generally will not.
> > >
> > > 3) a final set of 'archive' shards holding the final resting place
> > > for documents. These would not receive updates, but would stay online
> > > for searching and analysis "forever".
> > >
> > > We are not sure if this is the best way to go, but it is the approach
> > > we are leaning toward right now. I would like some feedback from the
> > > folks here on whether you think that is a reasonable approach.
> > >
> > > One of the other things I'm wondering about is how to manipulate
> > > indexes. We'll need to roll documents around between indexes over
> > > time, or at least migrate indexes from one set of shards to another
> > > as the documents 'age' and merge/aggregate them with more 'stable'
> > > indexes. I know about merging complete indexes together, but what
> > > about migrating a subset of documents from one index into another
> > > index?
> > >
> > > In addition, what is generally considered a 'manageable' index size
> > > at this scale? I was attempting to find some information on the
> > > relationship between search response times, the amount of memory used
> > > for a search, and the number of documents in an index, but I wasn't
> > > having much luck.
> > >
> > > I'm not sure if I'm making sense here, but I thought I would throw
> > > this out there and see what people think. There is the distinct
> > > possibility that I am not asking the right questions or considering
> > > the right parameters, so feel free to correct me, or ask questions as
> > > you see fit.
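To make the cascading layout concrete: assuming the distributed search
support in the 1.3 snapshots, a single request can treat all three tiers as
one logical index by listing every shard in the shards parameter. A rough
sketch — the host names and query field are invented for illustration:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class TieredSearch {
        public static void main(String[] args) throws Exception {
            // any node can coordinate; it fans the query out to every
            // shard listed and merges the results
            SolrServer server = new CommonsHttpSolrServer("http://recent1:8983/solr");

            SolrQuery q = new SolrQuery("body:foo");
            q.set("shards",
                  "recent1:8983/solr,recent2:8983/solr,"    // tier 1: volatile
                + "midterm1:8983/solr,midterm2:8983/solr,"  // tier 2: 3-6 months
                + "archive1:8983/solr");                    // tier 3: read-only

            QueryResponse rsp = server.query(q);
            System.out.println("total hits: " + rsp.getResults().getNumFound());
        }
    }

Aging a document from one tier to the next would then be the
delete-and-re-add move sketched earlier, since each tier is just another set
of shards as far as the query side is concerned.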
> > > And yes, I will report how we are doing things when we get this all
> > > figured out, and if there are items we can contribute back to Solr,
> > > we will. If nothing else, there will be a nice article on how we
> > > manage TBs of data with Solr.
> > >
> > > enjoy,
> > >
> > > -jeremy
> > >
> > > --
> > > ========================================================================
> > > Jeremy Hinegardner [EMAIL PROTECTED]