Lucene does not support more than about 2^31 (~2.1 billion) unique documents
in a single index, so you need to partition. In Solr this is done with
Distributed Search:
http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/DistributedSearch
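
If it helps, here is a rough SolrJ sketch of what a distributed query looks
like (the host names, core URLs, and client class are placeholders; use
whatever matches your SolrJ version):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class DistributedQuery {
        public static void main(String[] args) throws Exception {
            // Any one node can act as the aggregator for the request.
            HttpSolrServer solr = new HttpSolrServer("http://host1:8983/solr");
            SolrQuery q = new SolrQuery("text:foo");
            // 'shards' tells Solr which cores to fan the query out to
            // and merge results from.
            q.set("shards", "host1:8983/solr,host2:8983/solr");
            QueryResponse rsp = solr.query(q);
            System.out.println("Total hits: " + rsp.getResults().getNumFound());
        }
    }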

First, you have to decide on a policy for which documents go to which
'shard'. A common approach is to take a hash of the unique id and assign
each document to a shard by that hash modulo the number of shards. This
gives a roughly even distribution of documents. If there is already a
natural partition key, like the date of the document (newspaper articles,
for example), you could use that instead.
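
Something like this (just a sketch; the shard count is up to you):

    // Pick a shard by hashing the document's unique id.
    public class ShardPolicy {
        private final int numShards;

        public ShardPolicy(int numShards) {
            this.numShards = numShards;
        }

        public int pickShard(String uniqueId) {
            // Mask the sign bit so the index is always non-negative.
            return (uniqueId.hashCode() & 0x7fffffff) % numShards;
        }
    }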

You have new documents and existing documents. For new documents you need
code that applies this policy so every new document lands in the right index.
This could be one master program that routes them, or each indexer could
know which documents it owns.
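
A rough sketch of the "master program" approach with SolrJ (the 'id' field
name and the HttpSolrServer class are assumptions; adjust to your schema and
client version):

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class ShardRouter {
        private final HttpSolrServer[] shards;

        public ShardRouter(String[] shardUrls) {
            shards = new HttpSolrServer[shardUrls.length];
            for (int i = 0; i < shardUrls.length; i++) {
                shards[i] = new HttpSolrServer(shardUrls[i]);
            }
        }

        // Send each new document only to the shard its unique id hashes to.
        public void index(SolrInputDocument doc) throws Exception {
            String id = (String) doc.getFieldValue("id");
            int shard = (id.hashCode() & 0x7fffffff) % shards.length;
            shards[shard].add(doc);
        }
    }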

If you want to split up your current index, that's different. I have
done this: for each shard, make a copy of the full index, delete-by-query
all of the documents that do NOT belong to that shard, and optimize. We had
to do this in sequence, so it took a few days :) You don't need a full
optimize. Use 'maxSegments=50' or '100' to suppress that final giant merge.
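
Roughly, per shard copy (the 'shard_id' field, core URL, and segment count
are placeholders; I am assuming SolrJ's deleteByQuery/optimize calls here):

    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class ShardCleanup {
        public static void main(String[] args) throws Exception {
            // This copy of the full index will become shard 3.
            HttpSolrServer shard =
                new HttpSolrServer("http://localhost:8983/solr/shard3");
            // Delete everything that does NOT belong to this shard.
            shard.deleteByQuery("-shard_id:3");
            shard.commit();
            // Partial optimize: merge down to at most 50 segments instead
            // of 1, which avoids the final giant merge.
            shard.optimize(true, true, 50);
        }
    }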

On Tue, May 8, 2012 at 12:02 AM, Yuval Dotan <yuvaldo...@gmail.com> wrote:
> Hi
> Can someone please guide me to the right way to partition the Solr index?
>
> On Mon, May 7, 2012 at 11:41 AM, Yuval Dotan <yuvaldo...@gmail.com> wrote:
>
>> Hi All
>> Jan, thanks for the reply - answers for your questions are located below
>> Please update me if you have ideas that can solve my problems.
>>
>> First, some corrections to my previous mail:
>>
>> > Hi All
>> > We have an index of ~2,000,000,000 documents and the query and facet
>> times
>> > are too slow for us - our index in fact will be much larger
>>
>> > Most of our queries will be limited by time, hence we want to partition
>> the
>> > data by date/time - even when a query is unlimited (which is mostly what
>> will happen), we have results in the recent records, so querying the whole
>> dataset is redundant
>>
>> > We want to partition the data because the index size is too big and
>> doesn't
>> > fit into memory (80 GB) - our data actually grows continuously over
>> time; it will never fit into memory, but has to be available for queries in
>> case results are found in older records or a full facet is required
>>
>> >
>> > 1. Is multi core the best way to implement my requirement?
>> > 2. I noticed there are some LOAD / UNLOAD actions on a core - should I
>> use
>> > these actions when managing my cores? If so, how can I LOAD a core that I
>> > have unloaded?
>> > for example:
>> > I have 7 partitions / cores - one for each day of the week - we might
>> have 2000 per day
>>
>> > In most cases I will search for documents only on the last day core.
>> > Once every 10000 queries I need documents from all cores.
>> > Question: Do I need to unload all of the old cores and then load them on
>> > demand (when I see I need data from these cores)?
>> > 3. If the answer to the last question is no, how do I ensure that the only
>> > cores loaded into memory are the ones I want?
>> >
>> > Thanks
>> > Yuval
>> *Answers to Jan:*
>>
>> Hi,
>>
>> First you need to investigate WHY faceting and querying are too slow.
>> What exactly do you mean by slow? Can you please tell us more about your
>> setup?
>>
>> * How large are the documents, and how many fields?
>> small records ~200bytes, 20 fields avg most of them are not stored -
>> attached schema and config file
>>
>> * What kind of queries? How many hits? How many facets? Have you studied
>> the &debugQuery=true output?
>> the problem is not with queries being slow per se, it is with getting 50
>> matches out of billions of matching docs
>>
>> * Do you use filter queries (fq) extensively?
>> user-generated queries; fq would not reduce the dataset for some of our
>> use cases
>>
>> * What data do you facet on? Many unique values per field? Text or ranges?
>> What facet.method?
>> the problem is not just faceting, it’s with queries – let’s start there
>>
>> * What kind of hardware? RAM/CPU
>> HP DL180 G6, 2x E5645 (12 cores total)
>> 48 GB RAM
>>  * How have you configured your JVM? How much memory? GC?
>> java -Xms512M -Xmx40960M -jar start.jar
>>
>> As you see, you will have to provide a lot more information on your use
>> case and setup in order for us to judge the correct action to take. You might
>> need to adjust your config, optimize your queries or caches, slim down
>> your schema, buy some more RAM, or get an SSD :)
>>
>> Normally, going multi-core on one box will not necessarily help in itself,
>> as there is overhead in sharding across multiple cores as well. However, it
>> COULD be a solution, since you say that most of the time you only need to
>> consider 1/7 of your data. I would perhaps consider one "hot" core for the
>> last 24h and one "archive" core for older data. You could then tune these
>> differently regarding caches etc.
>>
>> Can you get back with some more details?
>>
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> Solr Training - www.solrtraining.com
>>
>>



-- 
Lance Norskog
goks...@gmail.com
