On Jul 10, 2013, at 9:16am, Shawn Heisey <[email protected]> wrote:
> On 7/10/2013 9:59 AM, Tom Burton-West wrote:
>> The Javadoc for NRTCachingDirectory (
>> http://lucene.apache.org/core/4_3_1/core/org/apache/lucene/store/NRTCachingDirectory.html?is-external=true)
>> says:
>>
>> "This class is likely only useful in a near real-time context, where
>> indexing rate is lowish but reopen rate is highish, resulting in many tiny
>> files being written..."
>>
>> It seems like we have exactly the opposite use case, so we would like
>> advice on what directory implementation to use instead.
>>
>> We are doing offline batch indexing, so no searches are being done. So we
>> don't need NRT. We also have a high indexing rate as we are trying to
>> index 3 billion pages as quickly as possible.
>>
>> I am not clear what determines the reopen rate. Is it only related to
>> searching or is it involved in indexing as well?
>>
>> Does the NRTCachingDirectory have any benefit for indexing under the use
>> case noted above?
>>
>> I'm guessing we should just use the solr.StandardDirectoryFactory instead.
>> Is this correct?
>
> The NRT directory object in Solr uses the MMap implementation as its default
> delegate.

The code I see seems to be using an FSDirectory, or is there another layer of
wrapping going on here?

    return new NRTCachingDirectory(FSDirectory.open(new File(path)),
                                   maxMergeSizeMB, maxCachedMB);

> I would use MMapDirectoryFactory (the default for most of the 3.x releases)
> for testing whether you can get any improvement from moving away from the
> default. The advantages of memory mapping are not something you'd want to
> give up.
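
For reference, a minimal sketch of how that swap might look in solrconfig.xml,
assuming the stock directoryFactory element from the example config. It is a
config fragment rather than a complete file, and a starting point for testing,
not a tuned setup:

```xml
<!-- Assumption: this replaces the existing <directoryFactory> element
     in solrconfig.xml; nothing else in the file is changed. -->
<directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory"/>
```

Changing the class attribute to solr.StandardDirectoryFactory or
solr.NRTCachingDirectoryFactory would select the other implementations
mentioned in this thread, which makes a side-by-side comparison possible.
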
Tom - did you ever get any useful results from testing here? I'm also curious
about the impact of various xxxDirectoryFactory implementations for batch
indexing.
Thanks,
-- Ken
--------------------------
Ken Krugler