Hi Shawn

It was great of you to explain the architecture in such detail. I
enjoyed reading it multiple times.

I have a question here:

You mentioned that we can use crc32(DocumentId) % NumServers. I am
actually using that in my data-config.xml, in the SQL query itself, something like:

For documents to be indexed on Server 1:
select DocumentId, PNum, ... from Sample where crc32(DocumentId) % 2 = 0;

For documents to be indexed on Server 2:
select DocumentId, PNum, ... from Sample where crc32(DocumentId) % 2 = 1;
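For reference, the corresponding DIH entity definition might look roughly like this (the datasource URL, credentials, and field list are illustrative, not from my actual setup):

```xml
<!-- data-config.xml on Server 1; Server 2 would use ... % 2 = 1 -->
<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://dbhost/mydb"
              user="solr" password="..."/>
  <document>
    <entity name="doc"
            query="select DocumentId, PNum from Sample
                   where crc32(DocumentId) % 2 = 0">
      <field column="DocumentId" name="DocumentId"/>
      <field column="PNum" name="PNum"/>
    </entity>
  </document>
</dataConfig>
```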

Is that the right way to do it? Won't it be a slow query?

Thanks once again.



-----Original Message-----
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Monday, November 21, 2011 7:47 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Performance/Architecture

On 11/21/2011 12:41 AM, Husain, Yavar wrote:
> Number of rows in SQL Table (Indexed till now using Solr): 1 million
> Total Size of Data in the table: 4GB
> Total Index Size: 3.5 GB
>
> Total Number of Rows that I have to index: 20 Million (approximately 100 GB 
> Data) and growing
>
> What is the best practices with respect to distributing the index? What I 
> mean to say here is when should I distribute and what is the magic number 
> that I can have for index size per instance?
>
> For 1 million itself Solr instance running on a VM is taking roughly 2.5 hrs 
> to index for me. So for 20 million roughly it would take 60 -70 hrs. That 
> would be too much.
>
> What would be the best distributed architecture for my case? It will be great 
> if people may share their best practices and experience.

I have a MySQL database with 66 million rows at the moment, always 
growing.  My Solr index is split into six large shards and a small shard 
with the newest data.  The small (incremental) shard's boundary is 
chosen by looking at hourly document counts between 3.5 and 7 days old 
and picking either a boundary that yields fewer than 500,000 documents 
or the 3.5 day boundary.  This index is usually about 1GB in size.

The rest of the documents are split between the other six shards using 
crc32(did) % 6.  The did field is a MySQL BIGINT auto-increment column.  
These large shards are very close to 11 million records and 20GB each.  
By indexing all six at once, I can complete a full index rebuild in 
about 3.5 hours.
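The shard assignment above can be sketched in a few lines of Python (this is only an illustration of the arithmetic, not my actual code; note that MySQL's CRC32() hashes the string form of its argument):

```python
import zlib

NUM_SHARDS = 6

def shard_for(did: int) -> int:
    """Mimic MySQL's CRC32(did) % 6 for a numeric did column."""
    # MySQL's CRC32() operates on the decimal string representation,
    # so hash str(did), not the raw integer bytes.
    return zlib.crc32(str(did).encode("ascii")) % NUM_SHARDS

# Each document lands on exactly one shard, and the same did
# always maps to the same shard, so rebuilds are deterministic.
for did in (1, 42, 100000, 66000000):
    s = shard_for(did)
    assert 0 <= s < NUM_SHARDS
```

Because the mapping is stable, each shard's DIH query can simply filter on `crc32(did) % 6 = N` and the six imports can run in parallel without overlapping.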

Each full index chain lives on two 64GB Dell servers with dual quad-core 
processors.  Each server contains a Solr instance with 8GB of heap, 
running three large shards.  One server contains the incremental index, 
the other server runs the load balancer.  Both servers run an index-free 
Solr core that we call the broker.  Its search handlers have the shards 
parameter in solrconfig.xml, pointed at the appropriate cores for that 
index chain.
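A broker core's search handler looks something like this (host and core names here are invented for illustration):

```xml
<!-- solrconfig.xml of the index-free "broker" core -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- shards entries are host:port/path/corename, no http:// prefix -->
    <str name="shards">idxa:8983/solr/shard0,idxa:8983/solr/shard1,idxa:8983/solr/shard2,idxb:8983/solr/shard3,idxb:8983/solr/shard4,idxb:8983/solr/shard5,idxb:8983/solr/inc</str>
  </lst>
</requestHandler>
```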

To keep index size down and search speed up, it's important that your 
index only contain the fields needed for two purposes: Searching 
(indexed fields) and displaying a results grid (stored fields).  Any 
other information should be excluded from your schema.xml and/or DIH 
config.  Full item details should be populated from the database or 
other information store (possibly a filesystem), using the unique 
identifier from the search results.
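In schema.xml terms, that separation of concerns looks roughly like this (field and type names are made up):

```xml
<!-- Search-only field: indexed for matching, never returned -->
<field name="body" type="text" indexed="true" stored="false"/>
<!-- Results-grid field: returned in results, not searched -->
<field name="title" type="string" indexed="false" stored="true"/>
<!-- Unique id, used to fetch full details from the database -->
<field name="did" type="long" indexed="true" stored="true" required="true"/>
```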

If you are aggregating data from more than one table, see if you can 
have your database get the information into one SELECT statement with 
JOINs, rather than having more than one entity in your DIH config.  
Alternatively, if your secondary tables are small, try using the 
CachedSqlEntityProcessor on them so they are loaded entirely into RAM 
for the import.  Your database software is usually much better at 
combining tables than Solr, so take advantage of it.
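As a sketch, a small lookup table cached in RAM for the import might be configured like this (table and column names invented; the "where" attribute is the older DIH syntax for the cache key):

```xml
<entity name="doc" query="select did, title, category_id from item">
  <!-- Small secondary table loaded once into RAM, keyed by id -->
  <entity name="cat" processor="CachedSqlEntityProcessor"
          query="select id, category_name from category"
          where="id=doc.category_id">
    <field column="category_name" name="category"/>
  </entity>
</entity>
```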

If you have multivalued search fields from secondary entities in DIH, 
you can often get your database software to CONCAT them together into a 
single field, then use an appropriate tokenizer to split them into 
separate terms.  I have one such field that is semicolon separated by a 
database JOIN that's specified in a view, then I use a pattern tokenizer 
that splits it at index time.
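A schema.xml field type along those lines might look like this (names are illustrative; the pattern splits on semicolons with optional surrounding whitespace):

```xml
<fieldType name="semicolon_list" class="solr.TextField">
  <analyzer>
    <!-- Turn one concatenated value into separate terms at index time -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*;\s*"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```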

I hope this is helpful.

Thanks,
Shawn
