As a data point, I routinely see clients index 5M items on normal
hardware in approx. 1 hour (give or take 30 minutes).

Our master Solr machine is running 64-bit RHEL 5.4 on a dedicated machine with
4 cores and 16 GB of RAM, so I think we are good on the hardware. Our DB is MySQL
version 5.0.67 (I don't know the exact specs off the top of my head).


When you say "quite large", what do you mean?  Are we talking books here or
maybe a couple pages of text or just a couple KB of data?

Our item descriptions are very similar to an ebay listing and can include
HTML. We are talking about a couple of pages of text.


How long does it take you to get that data out (and, from the sounds of it,
merge it with your item) w/o going to Solr? 

I'll have to get back to you on that one.


DataImportHandler now supports multiple threads. 

When you say "now", what do you mean? I am running version 1.4.


The absolute fastest way that I know of to index is via multiple threads
sending batches of documents at a time (at least 100)

Is there a wiki page explaining how this multithreaded process works? Which
batch size would work best? I am currently using a batch size of -1.


You may want to write your own multithreaded client to index. 

This sounds like a viable option. Can you point me in the right direction on
where to begin (what classes to look at, prior examples, etc)?
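
To make sure I understand the suggestion, here is a rough sketch of what I
take "multiple threads sending batches of documents" to mean. It is only an
illustration in plain Python, not SolrJ code. The update URL, the function
names (batches, to_add_xml, post_xml, index_all), and the thread count of 4
are all my own assumptions; the batch size of 100 comes from the advice quoted
above. It posts standard XML <add> messages to Solr's update handler.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen
from xml.sax.saxutils import escape, quoteattr

# Assumed location of the update handler; adjust to the real master.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def batches(docs, size=100):
    """Yield lists of up to `size` documents from any iterable."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def to_add_xml(batch):
    """Render a batch of dicts as one Solr <add> update message."""
    parts = ["<add>"]
    for doc in batch:
        parts.append("<doc>")
        for name, value in doc.items():
            # escape() protects HTML in descriptions (&, <, >).
            parts.append("<field name=%s>%s</field>"
                         % (quoteattr(name), escape(str(value))))
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts)


def post_xml(xml, url=SOLR_UPDATE_URL):
    """POST one XML update message to Solr."""
    req = Request(url, data=xml.encode("utf-8"),
                  headers={"Content-Type": "text/xml; charset=utf-8"})
    with urlopen(req) as resp:
        resp.read()


def index_all(docs, send=post_xml, threads=4, batch_size=100):
    """Send batches from a pool of worker threads; return the batch count."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(send, to_add_xml(b))
                   for b in batches(docs, batch_size)]
        for f in futures:
            f.result()  # re-raise any error from a worker thread
    return len(futures)
```

After index_all(...) returns, a single post_xml("<commit/>") would make the
documents searchable. Note that this sketch queues every batch up front, so
for millions of rows a real loader would need to bound the queue.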

Here is the field type I am using for the item description. Maybe it's not the
best?

  <fieldType name="text" class="solr.TextField" omitNorms="false">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1"
              generateNumberParts="1"
              catenateWords="1"
              catenateNumber="1"
              catenateAll="1"
              splitOnCaseChange="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>

Here is an overview of my data-config.xml. Thoughts?

  <entity name="item"
          dataSource="datasource1"
          query="select * from items">
    ...
    <entity name="item_description"
            dataSource="datasource2"
            query="select description from item_descriptions where id=${item.id}"/>
  </entity>

I appreciate the help.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865091.html
Sent from the Solr - User mailing list archive at Nabble.com.
