No, this isn't what sharding is all about. Sharding is taking a single
logical index and splitting it up amongst a number of physical
units, often on individual machines. "Load and unload partitions
dynamically" doesn't make any sense when talking about shards.

So let's back up. You could create your own _cores_ that you load/unload
and take over the distribution of the incoming queries manually. By that I mean
that for your "once in 10,000 queries" case, you send your queries to the
older cores yourself and then unload them when you're done. You could even
fire off a query to one core, unload it, fire off the query to the next core,
unload it, and so on.

Of course your query would be very slow, but in such a rare case this may
be acceptable.
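
For what it's worth, a rough sketch of that sequential approach in SolrJ
(assuming SolrJ 4.x - on 3.x substitute CommonsHttpSolrServer - with core
names and instanceDir paths made up):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;
import org.apache.solr.client.solrj.response.QueryResponse;

public class RareFullSearch {
    public static void main(String[] args) throws Exception {
        // CoreAdmin requests go to the container root, not to a core.
        HttpSolrServer admin = new HttpSolrServer("http://localhost:8983/solr");
        String[] dayCores = {"day1", "day2", "day3", "day4", "day5", "day6", "day7"};
        for (String core : dayCores) {
            // "Load" the core by re-creating it over its existing instance dir.
            CoreAdminRequest.createCore(core, "/data/solr/" + core, admin);
            HttpSolrServer day = new HttpSolrServer("http://localhost:8983/solr/" + core);
            QueryResponse rsp = day.query(new SolrQuery("some_field:some_value"));
            System.out.println(core + ": " + rsp.getResults().getNumFound() + " hits");
            // Unload again so only the current day's core stays resident.
            CoreAdminRequest.unloadCore(core, admin);
        }
    }
}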

Or you could get some more memory/machines and just have some unused
resources.

Best
Erick

On Wed, May 9, 2012 at 5:08 AM, Yuval Dotan <yuvaldo...@gmail.com> wrote:
> Thanks Lance
>
> There is already a clear partition - as you assumed, by date.
>
> My requirement is for the best setup for:
> 1. A *single machine*
> 2. A quickly changing index - so I need the option to load and unload
> partitions dynamically
>
> Do you think the sharding model that Solr offers is the most suitable
> for this setup?
> What about the Solr multi-core model?
>
> On Wed, May 9, 2012 at 12:23 AM, Lance Norskog <goks...@gmail.com> wrote:
>
>> Lucene does not support more than about 2^31 unique documents per index,
>> so you need to partition. In Solr this is done with Distributed Search:
>>
>> http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/DistributedSearch
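>>
>> For example, a distributed query is just a normal query plus a shards
>> parameter listing the cores to fan out to - a sketch with SolrJ 4.x and
>> placeholder host names:
>>
>> import org.apache.solr.client.solrj.SolrQuery;
>> import org.apache.solr.client.solrj.impl.HttpSolrServer;
>>
>> public class DistributedQuery {
>>     public static void main(String[] args) throws Exception {
>>         // Any shard can act as the aggregator for the others.
>>         HttpSolrServer server = new HttpSolrServer("http://host1:8983/solr");
>>         SolrQuery q = new SolrQuery("text:foo");
>>         q.set("shards", "host1:8983/solr,host2:8983/solr");
>>         System.out.println(server.query(q).getResults().getNumFound() + " total hits");
>>     }
>> }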
>>
>> First, you have to decide a policy for which documents go to which
>> 'shard'. It is common to hash the unique id and distribute documents by
>> that hash modulo the number of shards. This gives a roughly equal
>> distribution of documents. If there is already a clear partition, like
>> the date of the document (as with newspaper articles), you could use
>> that instead.
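>>
>> A minimal sketch of that hashing policy in plain Java (the method name
>> and shard count are just examples):
>>
>> // Mask off the sign bit so the modulo result is never negative.
>> static int shardFor(String uniqueId, int numShards) {
>>     return (uniqueId.hashCode() & 0x7fffffff) % numShards;
>> }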
>>
>> You have new documents and existing documents. For new documents you
>> need code that applies this policy so every new document reaches the
>> right index. This could be one master program that routes them, or each
>> indexer could know which documents belong to it.
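>>
>> E.g. a master indexer applying that policy could look roughly like this
>> (SolrJ, with placeholder shard URLs):
>>
>> import org.apache.solr.client.solrj.impl.HttpSolrServer;
>> import org.apache.solr.common.SolrInputDocument;
>>
>> public class ShardRouter {
>>     // One client per shard, in shard order.
>>     private final HttpSolrServer[] shards = {
>>         new HttpSolrServer("http://host1:8983/solr"),
>>         new HttpSolrServer("http://host2:8983/solr")
>>     };
>>
>>     public void index(SolrInputDocument doc) throws Exception {
>>         String id = (String) doc.getFieldValue("id");
>>         // Same hash policy as above picks the destination shard.
>>         int shard = (id.hashCode() & 0x7fffffff) % shards.length;
>>         shards[shard].add(doc);
>>     }
>> }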
>>
>> If you want to split up your current index, that's different. I have
>> done this: for each shard, make a copy of the full index,
>> delete-by-query all of the documents that are NOT in that shard, and
>> optimize. We had to do this in sequence, so it took a few days :) You
>> don't need a full optimize. Use 'maxSegments=50' or '100' to suppress
>> that final giant merge.
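>>
>> With SolrJ the pruning step per shard could look roughly like this,
>> assuming a date-based partition you can express as a query (adjust the
>> delete query to whatever your policy is):
>>
>> import org.apache.solr.client.solrj.impl.HttpSolrServer;
>>
>> public class PruneShardCopy {
>>     public static void main(String[] args) throws Exception {
>>         // Points at the copy of the full index that becomes this shard.
>>         HttpSolrServer copy = new HttpSolrServer("http://localhost:8983/solr/shard1");
>>         // Delete everything NOT in this shard's date range.
>>         copy.deleteByQuery("-timestamp:[2012-01-01T00:00:00Z TO *]");
>>         copy.commit();
>>         // Partial optimize: stop at 50 segments, skipping the giant final merge.
>>         copy.optimize(true, true, 50);
>>     }
>> }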
>>
>> On Tue, May 8, 2012 at 12:02 AM, Yuval Dotan <yuvaldo...@gmail.com> wrote:
>> > Hi
>> > Can someone please guide me to the right way to partition the Solr index?
>> >
>> > On Mon, May 7, 2012 at 11:41 AM, Yuval Dotan <yuvaldo...@gmail.com>
>> wrote:
>> >
>> >> Hi All
>> >> Jan, thanks for the reply - answers to your questions are below.
>> >> Please let me know if you have ideas that can solve my problems.
>> >>
>> >> First, some corrections to my previous mail:
>> >>
>> >> > Hi All
>> >> > We have an index of ~2,000,000,000 documents and the query and facet
>> >> > times are too slow for us - our index will in fact be much larger
>> >>
>> >> > Most of our queries will be limited by time, hence we want to
>> >> > partition the data by date/time. Even when a query is unlimited -
>> >> > which is what will mostly happen - the results are in the recent
>> >> > records, so querying the whole dataset is redundant
>> >>
>> >> > We want to partition the data because the index is too big and doesn't
>> >> > fit into memory (80 GB) - our data actually grows continuously over
>> >> > time, so it will never fit into memory, but it has to stay available
>> >> > for queries in case results are found in older records or a full
>> >> > facet is required
>> >>
>> >> >
>> >> > 1. Is multi core the best way to implement my requirement?
>> >> > 2. I noticed there are LOAD / UNLOAD actions on a core - should I use
>> >> > these actions when managing my cores? If so, how can I LOAD a core
>> >> > that I have unloaded?
>> >> > For example:
>> >> > I have 7 partitions / cores - one for each day of the week - we might
>> >> > have 2000 per day
>> >>
>> >> > In most cases I will search for documents only on the last day's core.
>> >> > Once every 10,000 queries I need documents from all cores.
>> >> > Question: Do I need to unload all of the old cores and then load them
>> >> > on demand (when I see I need data from these cores)?
>> >> > 3. If the answer to the last question is no, how do I ensure that the
>> >> > only cores loaded into memory are the ones I want?
>> >> >
>> >> > Thanks
>> >> > Yuval
>> >>
>> >> *Answers to Jan:*
>> >>
>> >> Hi,
>> >>
>> >> First you need to investigate WHY faceting and querying are too slow.
>> >> What exactly do you mean by slow? Can you please tell us more about your
>> >> setup?
>> >>
>> >> * How large are the documents, and how many fields?
>> >> Small records, ~200 bytes, 20 fields on average, most of them not
>> >> stored - schema and config file attached
>> >>
>> >> * What kind of queries? How many hits? How many facets? Have you
>> >> studied the &debugQuery=true output?
>> >> The problem is not with queries being slow per se, it is with getting
>> >> 50 matches out of billions of matching docs
>> >>
>> >> * Do you use filter queries (fq) extensively?
>> >> User-generated queries; fq would not reduce the dataset for some of our
>> >> use cases
>> >>
>> >> * What data do you facet on? Many unique values per field? Text or
>> >> ranges? What facet.method?
>> >> The problem is not just faceting, it's with queries - let's start there
>> >>
>> >> * What kind of hardware? RAM/CPU
>> >> HP DL180 G6, 2x E5645 (12 cores)
>> >> 48 GB RAM
>> >> * How have you configured your JVM? How much memory? GC?
>> >> java -Xms512M -Xmx40960M -jar start.jar
>> >>
>> >> As you see, you will have to provide a lot more information on your use
>> >> case and setup in order for us to judge the correct action to take. You
>> >> might need to adjust your config, optimize your queries or caches, slim
>> >> down your schema, buy some more RAM, or an SSD :)
>> >>
>> >> Normally, going multi-core on one box will not necessarily help in
>> >> itself, as there is overhead in sharding across cores as well. However,
>> >> it COULD be a solution, since you say that most of the time you only
>> >> need to consider 1/7 of your data. I would perhaps consider one "hot"
>> >> core for the last 24h and one "archive" core for older data. You could
>> >> then tune these differently regarding caches etc.
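>> >>
>> >> E.g. the application could query the "hot" core first and only fall
>> >> back to the archive core when needed - a sketch with assumed core names:
>> >>
>> >> import org.apache.solr.client.solrj.SolrQuery;
>> >> import org.apache.solr.client.solrj.impl.HttpSolrServer;
>> >> import org.apache.solr.client.solrj.response.QueryResponse;
>> >>
>> >> public class HotThenArchive {
>> >>     public static void main(String[] args) throws Exception {
>> >>         HttpSolrServer hot = new HttpSolrServer("http://localhost:8983/solr/hot");
>> >>         HttpSolrServer archive = new HttpSolrServer("http://localhost:8983/solr/archive");
>> >>         SolrQuery q = new SolrQuery("some_field:some_value");
>> >>         q.setRows(50);
>> >>         QueryResponse rsp = hot.query(q);
>> >>         // Only hit the big archive core when the last 24h cannot
>> >>         // fill the requested page of 50 results.
>> >>         if (rsp.getResults().getNumFound() < 50) {
>> >>             rsp = archive.query(q);
>> >>         }
>> >>         System.out.println(rsp.getResults().getNumFound() + " hits");
>> >>     }
>> >> }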
>> >>
>> >> Can you get back with some more details?
>> >>
>> >> --
>> >> Jan Høydahl, search solution architect
>> >> Cominvent AS - www.cominvent.com
>> >> Solr Training - www.solrtraining.com
>> >>
>> >>
>>
>>
>>
>> --
>> Lance Norskog
>> goks...@gmail.com
>>
