No, this isn't what sharding is all about. Sharding means taking a single logical index and splitting it across a number of separate physical indexes, often on individual machines. "Loading and unloading partitions dynamically" doesn't make sense when you're talking about shards.
So let's back up. You could create your own _cores_ that you load/unload, and take over the distribution of the incoming queries manually. By that I mean that for your "once in 10,000 queries" case, you send your queries to the older cores and then unload them when you're done. You could even fire off a query to one core, unload it, fire off the query to the next core, unload it, etc. Of course that query would be very slow, but in such a rare case this may be acceptable. Or you could get some more memory/machines and just live with some unused resources.

Best
Erick

On Wed, May 9, 2012 at 5:08 AM, Yuval Dotan <yuvaldo...@gmail.com> wrote:
> Thanks Lance
>
> There is already a clear partition - as you assumed, by date.
>
> My requirement is for the best setup for:
> 1. A *single machine*
> 2. A quickly changing index - so I need the option to load and unload partitions dynamically
>
> Do you think that the sharding model Solr offers is the most suitable for this setup?
> What about the Solr multi-core model?
>
> On Wed, May 9, 2012 at 12:23 AM, Lance Norskog <goks...@gmail.com> wrote:
>
>> Lucene does not support more than ~2^31 unique documents in a single index, so you need to partition. In Solr this is done with Distributed Search:
>>
>> http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/DistributedSearch
>>
>> First, you have to decide a policy for which documents go to which 'shard'. It is common to make a hash code of the unique id and distribute the documents modulo this value. This gives a roughly equal distribution of documents. If there is already a clear partition, like the date of the document (as with newspaper articles), you could use that as well.
>>
>> You have new documents and existing documents. For new documents you need code for this policy to get all new documents to the right index. This could be one master program that passes them out, or each indexer could know which documents it gets.
>>
>> If you want to split up your current index, that's different. I have done this: for each shard, make a copy of the full index, delete-by-query all of the documents that are NOT in that shard, and optimize. We had to do this in sequence, so it took a few days :) You don't need a full optimize - use 'maxSegments=50' or '100' to suppress that last giant merge.
>>
>> On Tue, May 8, 2012 at 12:02 AM, Yuval Dotan <yuvaldo...@gmail.com> wrote:
>> > Hi
>> > Can someone please guide me to the right way to partition the Solr index?
>> >
>> > On Mon, May 7, 2012 at 11:41 AM, Yuval Dotan <yuvaldo...@gmail.com> wrote:
>> >
>> >> Hi All
>> >> Jan, thanks for the reply - answers to your questions are located below.
>> >> Please update me if you have ideas that can solve my problems.
>> >>
>> >> First, some corrections to my previous mail:
>> >>
>> >> > Hi All
>> >> > We have an index of ~2,000,000,000 documents and the query and facet times
>> >> > are too slow for us
>> >> - our index in fact will be much larger
>> >>
>> >> > Most of our queries will be limited by time, hence we want to partition the
>> >> > data by date/time - even when unlimited
>> >> - which is mostly what will happen: we have results in the recent records, and querying the whole dataset is redundant
>> >>
>> >> > We want to partition the data because the index size is too big and doesn't
>> >> > fit into memory (80 GB)
>> >> - our data actually grows continuously over time; it will never fit into memory, but it has to be available for queries in case results are found in older records or a full facet is required
>> >>
>> >> > 1. Is multi-core the best way to implement my requirement?
>> >> > 2. I noticed there are LOAD / UNLOAD actions on a core - should I use these
>> >> > actions when managing my cores? If so, how can I LOAD a core that I have
>> >> > unloaded?
>> >> > For example:
>> >> > I have 7 partitions / cores - one for each day of the week
>> >> - we might have 2000 per day
>> >> > In most cases I will search for documents only on the last day core.
>> >> > Once every 10,000 queries I need documents from all cores.
>> >> > Question: Do I need to unload all of the old cores and then load them on
>> >> > demand (when I see I need data from these cores)?
>> >> > 3. If the answer to the last question is no, how do I ensure that the only
>> >> > cores loaded into memory are the ones I want?
>> >> >
>> >> > Thanks
>> >> > Yuval
>> >>
>> >> Answers to Jan:
>> >>
>> >> Hi,
>> >>
>> >> First you need to investigate WHY faceting and querying are too slow. What exactly do you mean by slow? Can you please tell us more about your setup?
>> >>
>> >> * How large are the documents, and how many fields?
>> >> Small records, ~200 bytes, ~20 fields on average, most of them not stored - schema and config file attached.
>> >>
>> >> * What kind of queries? How many hits? How many facets? Have you studied &debugQuery=true output?
>> >> The problem is not with queries being slow per se; it is with getting 50 matches out of billions of matching docs.
>> >>
>> >> * Do you use filter queries (fq) extensively?
>> >> The queries are user-generated; fq would not reduce the dataset for some of our use cases.
>> >>
>> >> * What data do you facet on? Many unique values per field? Text or ranges? What facet.method?
>> >> The problem is not just faceting, it's with queries - let's start there.
>> >>
>> >> * What kind of hardware? RAM/CPU?
>> >> HP DL180 G6, 2x E5645 (12 cores), 48 GB RAM
>> >>
>> >> * How have you configured your JVM? How much memory? GC?
>> >> java -Xms512M -Xmx40960M -jar start.jar
>> >>
>> >> As you see, you will have to provide a lot more information on your use case and setup in order for us to judge the correct action to take. You might need to adjust your config, optimize your queries or caches, slim down your schema, buy some more RAM, or an SSD :)
>> >>
>> >> Normally, going multi-core on one box will not necessarily help in itself, as there is overhead in sharding across multiple cores as well. However, it COULD be a solution since you say that most of the time you only need to consider 1/7 of your data.
>> >> I would perhaps consider one "hot" core for the last 24h, and one "archive" core for older data. You could then tune these differently regarding caches etc.
>> >>
>> >> Can you get back with some more details?
>> >>
>> >> --
>> >> Jan Høydahl, search solution architect
>> >> Cominvent AS - www.cominvent.com
>> >> Solr Training - www.solrtraining.com
>>
>> --
>> Lance Norskog
>> goks...@gmail.com
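
A minimal sketch of the hash-modulo routing policy Lance describes above. It assumes the unique key is a string id; the shard count and the "shardN" core naming are hypothetical and only for illustration:

    // Sketch only: route each document to a core by hashing its unique id.
    public class ShardRouter {
        private final int numShards;

        public ShardRouter(int numShards) {
            this.numShards = numShards;
        }

        // Index of the core/shard this document belongs to (0 .. numShards-1).
        public int shardFor(String uniqueId) {
            // Mask the sign bit so the modulo result is never negative.
            return (uniqueId.hashCode() & Integer.MAX_VALUE) % numShards;
        }

        public String coreNameFor(String uniqueId) {
            return "shard" + shardFor(uniqueId);
        }

        public static void main(String[] args) {
            ShardRouter router = new ShardRouter(7);
            System.out.println(router.coreNameFor("doc-1234"));
        }
    }

Splitting an existing index along the same lines would then mean, for each shard copy, deleting the documents that don't belong there (which requires a field you can actually query on, e.g. an indexed hash value or the date) and optimizing with maxSegments, as Lance describes.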
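And one way Erick's "load the old cores only for the rare query" idea could be wired up, combined with the distributed-search shards parameter from the wiki page Lance links to. This is a sketch only: the host, the day_* core names and the instanceDir layout are made up, and it assumes a Solr version where CoreAdmin CREATE against an existing instanceDir re-registers a previously UNLOADed core (there is no separate LOAD action):

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Sketch: re-register the old day cores, run one distributed query across
    // all of them plus the hot core, then unload the old cores again.
    public class RareFullSearch {
        private static final String SOLR = "http://localhost:8983/solr";

        private static void get(String url) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            try (InputStream in = conn.getInputStream()) {
                in.readAllBytes(); // drain; real code would check status and parse the response
            }
        }

        public static void main(String[] args) throws Exception {
            String hotCore = "day_sun";
            String[] oldCores = {"day_mon", "day_tue", "day_wed", "day_thu", "day_fri", "day_sat"};

            // 1. Re-register the unloaded cores; CREATE points at the existing index dirs.
            for (String core : oldCores) {
                get(SOLR + "/admin/cores?action=CREATE&name=" + core + "&instanceDir=" + core);
            }

            // 2. One distributed query across the hot core plus the older cores.
            StringBuilder shards = new StringBuilder("localhost:8983/solr/" + hotCore);
            for (String core : oldCores) {
                shards.append(",localhost:8983/solr/").append(core);
            }
            get(SOLR + "/" + hotCore + "/select?q=*:*&rows=50&shards=" + shards);

            // 3. Unload the old cores again so they stop holding resources.
            for (String core : oldCores) {
                get(SOLR + "/admin/cores?action=UNLOAD&core=" + core);
            }
        }
    }

Whether this beats simply keeping every core registered and letting the OS page cache do its job is worth measuring: the CoreAdmin calls themselves are cheap, but the first query against a freshly loaded core will be slow because its caches are cold.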