Hi,

I have a huge Lucene index, which I'd like to split between machines ("Grid").

E.g. say I have a chain of book-stores, in different countries, and I'm aiming 
for the following:
- Each country has its own index file, on its own machine (e.g. books from 
Japan are indexed on machine "japan1")
- Most users search only within their own country (e.g. search only the 
"japan1" index)
- But sometimes, they might ask to search the entire chain (all countries), 
meaning some sort of "map/reduce" (=collect data from all countries).
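
To make the "collect data from all countries" step concrete, here is a rough sketch of the merge I have in mind. It is plain Java, not Lucene API: the `Hit` type, the `mergeTopK` method and the shard names are all hypothetical, and it assumes each machine has already returned its own best hits with scores that are comparable across machines (which is exactly the part I'm unsure about).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class ChainSearchSketch {

    /** One hit as returned by a single country's index (hypothetical type). */
    static final class Hit {
        final String shard;   // e.g. "japan1"
        final int docId;      // document id, local to that shard
        final double score;   // assumed comparable across shards

        Hit(String shard, int docId, double score) {
            this.shard = shard;
            this.docId = docId;
            this.score = score;
        }
    }

    /** Merges the per-shard hit lists into one global top-k, best score first. */
    static List<Hit> mergeTopK(Map<String, List<Hit>> perShard, int k) {
        Comparator<Hit> byScore = Comparator.comparingDouble(h -> h.score);
        // Min-heap on score: the weakest of the current best k sits on top.
        PriorityQueue<Hit> heap = new PriorityQueue<>(byScore);
        for (List<Hit> hits : perShard.values()) {
            for (Hit h : hits) {
                heap.offer(h);
                if (heap.size() > k) {
                    heap.poll(); // drop the weakest hit
                }
            }
        }
        List<Hit> merged = new ArrayList<>(heap);
        merged.sort(byScore.reversed());
        return merged;
    }
}
```

A single-country search would skip the merge and query only that country's machine. An "entire chain" search would ask every machine for its top k and pass the lists through `mergeTopK`.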


The main challenge is the "entire chain search", especially if I want 
reasonable ranking.

After some investigation (and great help from the Hibernate Search forum), I've 
seen the following suggestions:


1) Implement a LuceneDirectory that transparently spreads across several 
machines.

I'm not sure how the Search would work - can I ask each index for *relevant* 
data only?
Or would I need to maintain one huge combined file, allowing "random access" 
for the Searcher?


2) Run an IndexReader on each machine.

They tell me each reader can report its relevant term-frequencies, and based on 
that I can fetch relevant results from each machine.
Apparently the ranking won't be perfect (for the overall result), but bearable.
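
As far as I understand the ranking issue, the weight of a term depends on how rare it is in the index that scores it, so the same term gets a different weight on each machine. Below is a small sketch of combining the per-machine statistics into one global weight. Everything in it is my own assumption: the `ShardStats` type and method names are hypothetical, and the formula is a classic TF-IDF style `1 + ln(numDocs / (docFreq + 1))`, which may not match what a given Lucene version actually uses. I believe the inputs correspond to what `IndexReader.docFreq(Term)` and `IndexReader.numDocs()` report on each machine.

```java
import java.util.List;

class GlobalIdfSketch {

    /** Statistics for one term as reported by one machine (hypothetical type). */
    static final class ShardStats {
        final String shard;   // e.g. "japan1"
        final long docFreq;   // documents on this shard containing the term
        final long numDocs;   // total documents on this shard

        ShardStats(String shard, long docFreq, long numDocs) {
            this.shard = shard;
            this.docFreq = docFreq;
            this.numDocs = numDocs;
        }
    }

    /** Classic TF-IDF style inverse document frequency (an assumed formula). */
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    /** IDF computed as if all shards were one combined index. */
    static double globalIdf(List<ShardStats> stats) {
        long totalDocFreq = 0;
        long totalDocs = 0;
        for (ShardStats s : stats) {
            totalDocFreq += s.docFreq;
            totalDocs += s.numDocs;
        }
        return idf(totalDocFreq, totalDocs);
    }
}
```

If every machine scored its hits with the value from `globalIdf` instead of its local `idf`, the scores should become comparable across countries, at the cost of one extra round trip to collect the statistics before the search itself.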

Now, I'm not familiar with Lucene internals, and would really appreciate your 
views on it.
- Any good articles on Lucene "Gridding"?
- Does approach #1 make any sense? (IMHO it's not very sensible if I'd need to 
merge everything into a single huge file.)
- Any good implementations (of either approach)? So far I've found Hibernate 
Search 4 and Solandra.


Thanks very much.
