Hi. We use MD5 as the uid as well.
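In case it helps, roughly like this, a quick Java sketch just to illustrate (class, method and example values are made up, not our actual code):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class Md5Uid {

        // Hex-encoded MD5 of the document content (or its URL), used as the unique id.
        static String md5Hex(String content) throws NoSuchAlgorithmException {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(32);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }

        // The first hex character splits the id space into 16 even, pseudo-random
        // buckets, which is what makes a wildcard like contentid:f* select ~1/16
        // of the index. With a power-of-two shard count you just give each shard
        // a range of leading characters (16 shards: 1 char each, 8 shards: 2 chars
        // each, 4 shards: 4 chars each, and so on).
        static int bucket(String md5HexUid) {
            return Character.digit(md5HexUid.charAt(0), 16);  // 0..15
        }

        public static void main(String[] args) throws NoSuchAlgorithmException {
            String uid = md5Hex("http://example.com/some-blog-entry");
            System.out.println(uid + " -> bucket " + bucket(uid));
        }
    }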
I guess the "each 1/16th" is because the MD5 is hex (0-f), right? Thinking about MD5 sharding:

1 shard:  0-f
2 shards: 0-7, 8-f
3 shards: problem!
4 shards: 0-3, ...

This technique would require that you double the number of shards each time you split, right?

Split-by-delete sounds really smart, damn, why didn't I think of that :) Still, over time the technique of moving the whole index to a new shard and then deleting would probably be more than challenging.

I will never store the data in Lucene, mainly because of bad experiences and because I want to create modules which are fast, scalable and flexible, and storing the data alongside the index does not match that, for me at least. So yes, I will need a "for each id in ids, get document" approach in the searcher code, but at least I can optimize the retrieval of docs myself and let Lucene do what it is good at: indexing and searching, not storage.

I am more and more thinking in terms of having different levels of searching instead of searching all shards at the same time. Let's say you start with 4 shards where each document is replicated 4 times, based on publish date. Since all shards have the same data you can load-balance the query to any of the 4 shards. One day you find that 4 shards are not enough because of search performance, so you add 4 new shards. You then index only these 4 new shards with the new documents, making the old ones read-only. The searcher would prioritize the new shards and only start querying the old shards if the query returns fewer than X results. This has a nice side effect: the most relevant/recent entries live in the index which is searched the most. Since the old shards will be mostly idle, you can also convert 2 of the old shards into "new" shards, reducing the need to buy new servers.

What I'm trying to say is that you end up with an architecture which has many nodes at the top, each holding few documents, and fewer and fewer nodes as you go down, where each node stores more and more documents, since search speed gets less and less relevant the further down you go. Something like this (search top-down; the numbers are just speculative):

xxxxxxxx - Primary: 10M docs per shard, make sure 95% of the results come from here.
yyyy     - Standby: 100M docs per shard - merges of 10 primary indices.
zz       - Archive: 1000M docs per shard - merges of 10 standby indices.

The drawback of this architecture is that you get no indexing benefit at all if the layout drawn above is the same one you use for indexing. Personally I think you should use X indexers which then merge indices (MapReduce-style) for max performance, and lay them out as described above. I think Google does something like this.

Kindly
//Marcus

On Sat, Jun 14, 2008 at 2:27 AM, Lance Norskog <[EMAIL PROTECTED]> wrote:

> Yes, I've done this split-by-delete several times. The halved index still uses as much disk space until you optimize it.
>
> As to splitting policy: we use an MD5 signature as our unique ID. This has the lovely property that we can wildcard. 'contentid:f*' denotes 1/16 of the whole index. This 1/16 is a very random sample of the whole index. We use this for several things. If we use it for shards, we have a query that matches a shard's contents.
>
> The Solr/Lucene syntax does not support modular arithmetic, and so it will not let you query a subset that matches one of your shards.
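(A quick aside on that point: since the query syntax cannot express "uid mod N", that kind of routing has to be done on the indexer/client side instead. A minimal Java sketch of what that could look like; the class name is made up and this is not anyone's actual code:)

    import java.math.BigInteger;

    // Client-side mod-N routing for hex MD5 uids. Unlike the hex-prefix trick,
    // this works for any shard count, but the assignment cannot later be
    // reproduced as a Solr query, so it has to be applied when documents are
    // indexed, or kept in an external uid -> shard lookup.
    public class ModNRouter {

        private final int numShards;

        public ModNRouter(int numShards) {
            this.numShards = numShards;
        }

        public int shardFor(String md5HexUid) {
            return new BigInteger(md5HexUid, 16)
                    .mod(BigInteger.valueOf(numShards))
                    .intValue();
        }
    }

Changing numShards afterwards means redistributing documents, which is exactly the re-balancing problem discussed further down in the thread.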
>
> We also found that searching a few smaller indexes via the Solr 1.3 Distributed Search feature is actually faster than searching one large index, YMMV. So for us a large pile of shards will be optimal anyway, so we have no need to "rebalance".
>
> It sounds like you're not storing the data in a backing store, but are storing all data in the index itself. We have found this "challenging".
>
> Cheers,
>
> Lance Norskog
>
> -----Original Message-----
> From: Jeremy Hinegardner [mailto:[EMAIL PROTECTED]
> Sent: Friday, June 13, 2008 3:36 PM
> To: solr-user@lucene.apache.org
> Subject: Re: scaling / sharding questions
>
> Sorry for not keeping this thread alive, let's see what we can do...
>
> One option I've thought of for 'resharding' would be splitting an index into two by just copying it, then deleting half the documents from one copy, doing a commit, and deleting the other half from the other copy and committing. That is:
>
> 1) Take the original index
> 2) Copy it to b1 and b2
> 3) Delete docs from b1 that match a particular query A
> 4) Delete docs from b2 that do not match query A
> 5) Commit b1 and b2
>
> Has anyone tried something like that?
>
> As for how to know where each document is stored, generally we're considering unique_document_id % N. If we rebalance we change N and redistribute, but that will probably take too much time. That pushes us more towards a staggered, age-based approach where the most recent docs filter down to "permanent" indexes over time.
>
> Another thought we've had recently is to have many, many, many physical shards on the indexing/writer side, but then merge groups of them into logical shards which are snapshotted to reader Solrs on a frequent basis. I haven't done any testing along these lines, but logically it seems like an idea worth pursuing.
>
> enjoy,
>
> -jeremy
>
> On Fri, Jun 06, 2008 at 03:14:10PM +0200, Marcus Herou wrote:
> > Cool sharding technique.
> >
> > We are also thinking about how to "move" docs from one index to another, because we need to re-balance the docs when we add new nodes to the cluster. We only store ids in the index, otherwise we could have moved stuff around with IndexReader.document(x) or so. Luke (http://www.getopt.org/luke/) is able to reconstruct the indexed Document data, so it should be doable. However, I'm thinking of actually just deleting the docs from the old index and adding new Documents to the new node. It would be cool not to waste CPU cycles by reindexing already-indexed stuff, but...
> >
> > And we will also have data volumes in the range you are talking about. Perhaps we could share ideas?
> >
> > How do you plan to store where each document is located? I mean, you probably need to store info about the Document and its location somewhere, perhaps in a clustered DB? We will probably go for HBase for this.
> >
> > I think the number of documents is less important than the actual data size (just speculating). We currently search 10M (will get much, much larger) indexed blog entries on one machine where the JVM has a 1G heap, the index size is 3G, and response times are still quite fast. This is a read-only node though, and it is updated every morning with a freshly optimized index. Someone told me that you probably need twice the RAM if you plan to both index and search at the same time.
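(For what it's worth, the rough arithmetic behind those figures, assuming they generalize at all:

    1G heap / 10M docs            ~ 100 bytes of heap per doc, search-only
    x2 if the node also indexes   ~ 200 bytes of heap per doc

This is obviously schema- and workload-dependent, so treat it only as a starting point for the kind of test described next.)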
> > If I were you I would just test to index X entries of your data and then start to search the index with lower JVM settings each round; when response times get too slow or you hit OOM errors, you have a rough estimate of the bare minimum RAM needed for that many entries.
> >
> > I think we will manage with something like 2G per 50M docs, but I will need to test it out.
> >
> > If you get an answer on this matter, please let me know.
> >
> > Kindly
> >
> > //Marcus
> >
> > On Fri, Jun 6, 2008 at 7:21 AM, Jeremy Hinegardner <[EMAIL PROTECTED]> wrote:
> >
> > > Hi all,
> > >
> > > This may be a bit rambling, but let's see how it goes. I'm not a Lucene or Solr guru by any means; I have been prototyping with Solr and understanding how all the pieces and parts fit together.
> > >
> > > We are migrating our current document storage infrastructure to a decent-sized Solr cluster, using 1.3 snapshots right now. Eventually this will be in the billion+ documents range, with about 1M new documents added per day.
> > >
> > > Our main sticking point right now is that a significant number of our documents will be updated, at least once, but possibly more than once. The volatility of a document decreases over time.
> > >
> > > With this in mind, we've been considering using a cascading series of shard clusters. That is:
> > >
> > > 1) A cluster of shards holding recent data (most recent week or two): smaller indexes that take a small amount of time to commit updates and optimise, since these will hold the most volatile documents.
> > >
> > > 2) Following that, another cluster of shards that holds relatively recent (3-6 months?), but not super volatile, documents; these are items that could potentially receive updates, but generally do not.
> > >
> > > 3) A final set of 'archive' shards, the final resting place for documents. These would not receive updates and would be online for searching and analysis "forever".
> > >
> > > We are not sure if this is the best way to go, but it is the approach we are leaning toward right now. I would like some feedback from the folks here on whether you think that is a reasonable approach.
> > >
> > > One of the other things I'm wondering about is how to manipulate indexes. We'll need to roll documents around between indexes over time, or at least migrate indexes from one set of shards to another as the documents 'age', and merge/aggregate them with more 'stable' indexes. I know about merging complete indexes together, but what about migrating a subset of documents from one index into another index?
> > >
> > > In addition, what is generally considered a 'manageable' index of large size? I was attempting to find some information on the relationship between search response times, the amount of memory used for a search, and the number of documents in an index, but I wasn't having much luck.
> > >
> > > I'm not sure if I'm making sense here, but just thought I would throw this out there and see what people think. There is the distinct possibility that I am not asking the right questions or considering the right parameters, so feel free to correct me, or ask questions as you see fit.
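(To make the cascading / tiered search concrete: the searcher-side fallback could look roughly like the sketch below, querying the newest tier first and only falling through to the older tiers when fewer than X hits come back. It assumes the Solr 1.3 SolrJ client; the host names, the query and the threshold are made up, and merging/re-scoring results across tiers is glossed over.)

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrDocument;

    public class TieredSearcher {

        private final List<SolrServer> tiers;  // index 0 = newest/primary, then standby, then archive
        private final int minResults;          // the "X" threshold before falling through to an older tier

        public TieredSearcher(List<SolrServer> tiers, int minResults) {
            this.tiers = tiers;
            this.minResults = minResults;
        }

        public List<SolrDocument> search(String queryString, int rows) throws Exception {
            List<SolrDocument> results = new ArrayList<SolrDocument>();
            SolrQuery query = new SolrQuery(queryString);
            query.setRows(rows);
            for (SolrServer tier : tiers) {
                results.addAll(tier.query(query).getResults());
                if (results.size() >= minResults) {
                    break;  // enough hits from the newer tiers, skip the (mostly idle) older ones
                }
            }
            return results;
        }

        public static void main(String[] args) throws Exception {
            List<SolrServer> tiers = new ArrayList<SolrServer>();
            tiers.add(new CommonsHttpSolrServer("http://primary:8983/solr"));  // hypothetical hosts
            tiers.add(new CommonsHttpSolrServer("http://standby:8983/solr"));
            tiers.add(new CommonsHttpSolrServer("http://archive:8983/solr"));
            TieredSearcher searcher = new TieredSearcher(tiers, 10);
            System.out.println(searcher.search("title:lucene", 10).size() + " hits");
        }
    }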
> > >
> > > And yes, I will report how we are doing things when we get this all figured out, and if there are items that we can contribute back to Solr, we will. If nothing else there will be a nice article on how we manage terabytes of data with Solr.
> > >
> > > enjoy,
> > >
> > > -jeremy
> > >
> > > --
> > > ========================================================================
> > > Jeremy Hinegardner                              [EMAIL PROTECTED]
> >
> > --
> > Marcus Herou CTO and co-founder Tailsweep AB
> > +46702561312
> > [EMAIL PROTECTED]
> > http://www.tailsweep.com/
> > http://blogg.tailsweep.com/
>
> --
> ========================================================================
> Jeremy Hinegardner                              [EMAIL PROTECTED]

--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
[EMAIL PROTECTED]
http://www.tailsweep.com/
http://blogg.tailsweep.com/