Hi Shawn That was so great of you to explain the architecture in such a detail. I enjoyed reading it multiple times.
I have a question here: You mentioned that we can use crc32(DocumentId)% NumServers. Now actually I am using that in my data-config.xml in the sql query itself, something like: For Documents to be indexed on Server 1: select DocumentId,PNum,... from Sample where crc32(DocumentId)%2=0; For Documents to be indexed on Server 2: select DocumentId,PNum,... from Sample where crc32(DocumentId)%2=1; Will that be a right way? Will it not be a slow query? Thanks once again. -----Original Message----- From: Shawn Heisey [mailto:s...@elyograg.org] Sent: Monday, November 21, 2011 7:47 PM To: solr-user@lucene.apache.org Subject: Re: Solr Performance/Architecture On 11/21/2011 12:41 AM, Husain, Yavar wrote: > Number of rows in SQL Table (Indexed till now using Solr): 1 million > Total Size of Data in the table: 4GB > Total Index Size: 3.5 GB > > Total Number of Rows that I have to index: 20 Million (approximately 100 GB > Data) and growing > > What is the best practices with respect to distributing the index? What I > mean to say here is when should I distribute and what is the magic number > that I can have for index size per instance? > > For 1 million itself Solr instance running on a VM is taking roughly 2.5 hrs > to index for me. So for 20 million roughly it would take 60 -70 hrs. That > would be too much. > > What would be the best distributed architecture for my case? It will be great > if people may share their best practices and experience. I have a MySQL database with 66 million rows at the moment, always growing. My Solr index is split into six large shards and a small shard with the newest data. The small shard (incremental) is calculated by looking at counts of data in hourly increments between 7 and 3.5 days old, and either choosing a boundary that results in less than 500,000 documents or the 3.5 day boundary. This index is usually about 1GB in size. The rest of the documents are split between the other six shards using crc32(did) % 6. The did field is a mysql bigint autoincrement field. These large shards are very close to 11 million records and 20GB each. By indexing all six at once, I can complete a full index rebuild in about 3.5 hours. Each full index chain lives on two 64GB Dell servers with dual quad-core processors. Each server contains a Solr instance with 8GB of heap, running three large shards. One server contains the incremental index, the other server runs the load balancer. Both servers run an index-free Solr core that we call the broker. Its search handlers have the shards parameter in solrconfig.xml, pointed at the appropriate cores for that index chain. To keep index size down and search speed up, it's important that your index only contain the fields needed for two purposes: Searching (indexed fields) and displaying a results grid (stored fields). Any other information should be excluded from your schema.xml and/or DIH config. Full item details should be populated from the database or other information store (possibly a filesystem), using the unique identifier from the search results. If you are aggregating data from more than one table, see if you can have your database get the information into one SELECT statement with JOINs, rather than having more than one entity in your DIH config. Alternatively, if your secondary tables are small, try using the CachedSQLEntityProcessor on them so they are loaded entirely into RAM for the import. Your database software is usually much better at combining tables than Solr, so take advantage of it. If you have multivalued search fields from secondary entities in DIH, you can often get your database software to CONCAT them together into a single field, then use an appropriate tokenizer to split them into separate terms. I have one such field that is semicolon separated by a database JOIN that's specified in a view, then I use a pattern tokenizer that splits it at index time. I hope this is helpful. Thanks, Shawn ****************************************************************************************** This message may contain confidential or proprietary information intended only for the use of the addressee(s) named above or may contain information that is legally privileged. If you are not the intended addressee, or the person responsible for delivering it to the intended addressee, you are hereby notified that reading, disseminating, distributing or copying this message is strictly prohibited. If you have received this message by mistake, please immediately notify us by replying to the message and delete the original message and any copies immediately thereafter. Thank you.- ****************************************************************************************** FAFLD