> As a data point, I routinely see clients index 5M items on normal hardware in approx. 1 hour (give or take 30 minutes).

Our master Solr machine is running 64-bit RHEL 5.4 on a dedicated machine with 4 cores and 16 GB of RAM, so I think we are good on the hardware. Our DB is MySQL version 5.0.67 (I don't know the exact stats off the top of my head).

> When you say "quite large", what do you mean? Are we talking books here or maybe a couple pages of text or just a couple KB of data?

Our item descriptions are very similar to an eBay listing and can include HTML. We are talking about a couple of pages of text.

> How long does it take you to get that data out (and, from the sounds of it, merge it with your item) w/o going to Solr?

I'll have to get back to you on that one.

> DataImportHandler now supports multiple threads.

When you say "now", what do you mean? I am running version 1.4.

> The absolute fastest way that I know of to index is via multiple threads sending batches of documents at a time (at least 100)

Is there a wiki page explaining how this multithreaded process works? Which batch size would work best? I am currently using a batch size of -1.

> You may want to write your own multithreaded client to index.

This sounds like a viable option. Can you point me in the right direction on where to begin (what classes to look at, prior examples, etc.)?

Here is the field type I am using for the item description. Maybe it's not the best?

<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumber="1" catenateAll="1" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Here is an overview of my data-config.xml. Thoughts?

<entity name="item" dataSource="datasource1" query="select * from items">
  ...
  <entity name="item_description" dataSource="datasource2" query="select description from item_descriptions where id=${item.id}"/>
</entity>

I appreciate the help.

-- 
View this message in context: http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865091.html
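
For reference, the "multiple threads sending batches of documents" idea quoted above might look roughly like the sketch below. It is only an illustration under stated assumptions: it is written in Python rather than against any Solr client library, and `send_batch` is a hypothetical callable standing in for whatever actually posts a batch to Solr. The names `batches` and `index_in_parallel`, the worker count of 4, and the default batch size of 100 (taken from the "at least 100" in the quote) are likewise made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batches(docs, size):
    """Yield successive lists of up to `size` documents from any iterable."""
    it = iter(docs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def index_in_parallel(docs, send_batch, batch_size=100, workers=4):
    """Send batches of documents from several threads.

    `send_batch` is a hypothetical stand-in for the real call that posts
    one batch to the indexer. Returns the number of documents handed off.
    """
    def send(batch):
        send_batch(batch)
        return len(batch)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Note: map() queues every batch up front. For millions of rows a
        # real client would want to bound how many batches are in flight.
        return sum(pool.map(send, batches(docs, batch_size)))


if __name__ == "__main__":
    import threading

    sizes = []
    lock = threading.Lock()

    def fake_send(batch):
        # Pretend to index: just record how large each batch was.
        with lock:
            sizes.append(len(batch))

    total = index_in_parallel(({"id": i} for i in range(250)), fake_send)
    print(total, sorted(sizes))
```

The sender is passed in as a plain callable so the batching and threading can be tried out without a running Solr instance. Which batch size and thread count work best would still have to be measured against the real data.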