On Jul 10, 2013, at 9:16am, Shawn Heisey <s...@elyograg.org> wrote:

> On 7/10/2013 9:59 AM, Tom Burton-West wrote:
>> The Javadoc for NRTCachingDirectory (
>> http://lucene.apache.org/core/4_3_1/core/org/apache/lucene/store/NRTCachingDirectory.html?is-external=true)
>>  says:
>> 
>>  "This class is likely only useful in a near real-time context, where
>> indexing rate is lowish but reopen rate is highish, resulting in many tiny
>> files being written..."
>> 
>> It seems like we have exactly the opposite use case, so we would like
>> advice on what directory implementation to use instead.
>> 
>> We are doing offline batch indexing, so no searches are being done.  So we
>> don't need NRT.  We also have a high indexing rate as we are trying to
>> index 3 billion pages as quickly as possible.
>> 
>> I am not clear what determines the reopen rate.   Is it only related to
>> searching or is it involved in indexing as well?
>> 
>>  Does the NRTCachingDirectory have any benefit for indexing under the use
>> case noted above?
>> 
>> I'm guessing we should just use the solr.StandardDirectoryFactory instead.
>>  Is this correct?
> 
> The NRT directory object in Solr uses the MMap implementation as its default 
> delegate.  

The code I see seems to be using an FSDirectory, or is there another layer of 
wrapping going on here?

    return new NRTCachingDirectory(FSDirectory.open(new File(path)),
        maxMergeSizeMB, maxCachedMB);
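
If FSDirectory.open() picks its concrete implementation by platform, that 
would reconcile the two observations: the code asks for an FSDirectory, but 
on a 64-bit server JVM it may get an MMapDirectory back. Below is a minimal 
sketch of that selection as I understand it for Lucene 4.x. It is not the 
Lucene source: the DirectoryChoice class, the pick() method and its 
parameters are hypothetical names of mine, and the unmap-support check is 
reduced to a boolean flag.

```java
import java.util.Locale;

// Hypothetical model of platform-based directory selection.
// NOT Lucene source; class, method and parameter names are illustrative only.
final class DirectoryChoice {

    /** Returns the name of the implementation this sketch would select. */
    static String pick(String osName, boolean jre64Bit, boolean unmapSupported) {
        String os = osName.toLowerCase(Locale.ROOT);
        boolean windows = os.startsWith("windows");
        // Assumption: memory mapping is preferred on 64-bit Windows, Linux and Solaris.
        boolean mmapPlatform = windows || os.startsWith("linux") || os.startsWith("sunos");

        if (mmapPlatform && jre64Bit && unmapSupported) {
            return "MMapDirectory";
        } else if (windows) {
            return "SimpleFSDirectory";
        } else {
            return "NIOFSDirectory";
        }
    }

    public static void main(String[] args) {
        String os = System.getProperty("os.name", "");
        // "sun.arch.data.model" is JVM-specific, so fall back to os.arch.
        String model = System.getProperty("sun.arch.data.model", "");
        boolean is64 = model.equals("64") || System.getProperty("os.arch", "").contains("64");
        // Assumption: unmapping is supported on this JVM.
        System.out.println(os + " -> " + pick(os, is64, true));
    }
}
```

Running main() on the indexing box prints which branch the sketch would 
take there. Checking the class of the object FSDirectory.open() actually 
returns would be the definitive test.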

> I would use MMapDirectoryFactory (the default for most of the 3.x releases) 
> for testing whether you can get any improvement from moving away from the 
> default.  The advantages of memory mapping are not something you'd want to 
> give up.

Tom - did you ever get any useful results from testing here? I'm also curious 
about the impact of various xxxDirectoryFactory implementations for batch 
indexing.

Thanks,

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr




