Lucene Grid question

2011-09-13 Thread sol myr
Hi,

I have a huge Lucene index, which I'd like to split between machines ("Grid").

E.g. say I have a chain of book-stores, in different countries, and I'm aiming 
for the following:
- Each country has its own index file, on its own machine (e.g. books from 
Japan are indexed on machine "japan1")
- Most users search only within their own country (e.g. search only the 
"japan1" index)
- But sometimes, they might ask to search the entire chain (all countries), 
meaning some sort of "map/reduce" (=collect data from all countries).


The main challenge is the "entire chain search", especially if I want 
reasonable ranking.

After some investigation (+great help from Hibernate Search forum), I've seen 
the following suggestions:


1) Implement a LuceneDirectory that transparently spreads across several 
machines.

I'm not sure how the Search would work - can I ask each index for *relevant* 
data only?
Or would I need to maintain one huge combined file, allowing "random access" 
for the Searcher?


2) Run an IndexReader on each machine.

They tell me each reader can report its relevant term-frequencies, and based on 
that I can fetch relevant results from each machine.
Apparently the ranking won't be perfect (for the overall result), but bearable.

Now, I'm not familiar with Lucene internals, and would really appreciate your 
views on it.
- Any good articles on Lucene "Gridding"?
- Any idea whether approach #1 makes any sense (IMHO it's not very sensible if 
I need to merge everything to a single huge file).
- Any good implementations (of either approach)? So far I found Hibernate 
Search 4, and Solandra.


Thanks very much.
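
A note on approach #2: the reason per-machine term frequencies matter is that a score computed from one index's local statistics is not directly comparable with a score from another index. The sketch below is only an illustration of that idea, not Lucene code. It assumes each machine can report its document count and the document frequency of a term, and it uses an idf of the same shape as Lucene's classic TF-IDF formula, 1 + ln(numDocs / (docFreq + 1)). The function names, the "france1" machine, and all the numbers are hypothetical.

```python
import math


def idf(num_docs, doc_freq):
    # Same shape as the classic TF-IDF idf: 1 + ln(numDocs / (docFreq + 1)).
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def global_idf(shard_stats, term):
    """Combine per-machine statistics into one idf for `term`.

    shard_stats maps a machine name to (num_docs, {term: doc_freq}).
    All names and figures here are hypothetical.
    """
    total_docs = sum(num_docs for num_docs, _ in shard_stats.values())
    total_df = sum(freqs.get(term, 0) for _, freqs in shard_stats.values())
    return idf(total_docs, total_df)


# Made-up statistics for two country indexes.
stats = {
    "japan1": (1000, {"lucene": 9}),
    "france1": (3000, {"lucene": 90}),
}

# Each machine, scoring on its own, would weight the term differently ...
local = {name: idf(n, freqs.get("lucene", 0)) for name, (n, freqs) in stats.items()}

# ... while a shared weight makes scores from both machines comparable.
shared = global_idf(stats, "lucene")
```

If the machines score with their local idf instead, the merged ranking is skewed toward whichever index happens to contain the term less often, which is the "not perfect, but bearable" trade-off mentioned above.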



Re: Lucene Grid question

2011-10-03 Thread sol myr
Thank you very much (sorry for the delayed reply).




From: Chris Hostetter 
To: solr-users ; sol myr 
Sent: Wednesday, September 21, 2011 4:15 AM
Subject: Re: Lucene Grid question


: E.g. say I have a chain of book-stores, in different countries, and I'm 
: aiming for the following:
: - Each country has its own index file, on its own machine (e.g. books from 
: Japan are indexed on machine "japan1")
: - Most users search only within their own country (e.g. search only the 
: "japan1" index)
: - But sometimes, they might ask to search the entire chain (all countries), 
: meaning some sort of "map/reduce" (=collect data from all countries).

what you're describing is one possible use case of "Distributed Search"

http://wiki.apache.org/solr/DistributedSearch

as long as each of the individual "country" indexes has a schema that 
overlaps with the others (ie: shares some common fields) and the same 
uniqueKey field, with an id space that does *not* overlap between countries 
(ie: document "1" can only be in one index, not in any others) then you can 
do a distributed query that is sent out to all of the individual indexes, 
with the results merged together to generate aggregate results.


-Hoss
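
For illustration, the distributed query described in the reply above is requested in Solr by adding a "shards" parameter that lists the indexes to search. The sketch below only builds such a request URL and checks the non-overlapping id requirement on sample data; it does not contact a server. The host "france1", the port, the "/solr" path, the field name "title" and the helper names are assumptions made for the example, not details from this thread.

```python
from urllib.parse import urlencode


def distributed_query_url(coordinator, shards, query, rows=10):
    """Build a Solr select URL that fans the query out to every shard.

    `coordinator` and each entry of `shards` are "host:port/path" strings
    (hypothetical values in this example).
    """
    params = {"q": query, "rows": rows, "shards": ",".join(shards)}
    # Keep ':', '/' and ',' unescaped so the URL stays readable.
    return "http://" + coordinator + "/select?" + urlencode(params, safe=":/,")


def overlapping_ids(ids_by_shard):
    """Return {doc_id: set of shards} for ids found in more than one shard."""
    first_seen = {}
    clashes = {}
    for shard, ids in ids_by_shard.items():
        for doc_id in ids:
            if doc_id in first_seen and first_seen[doc_id] != shard:
                clashes.setdefault(doc_id, {first_seen[doc_id]}).add(shard)
            else:
                first_seen[doc_id] = shard
    return clashes


shards = ["japan1:8983/solr", "france1:8983/solr"]
url = distributed_query_url(shards[0], shards, "title:lucene")

# Prefixing ids with the country is one simple way to keep id spaces apart;
# the bare id "1" below shows the kind of clash to avoid.
clashes = overlapping_ids({
    "japan1": ["jp-1", "jp-2", "1"],
    "france1": ["fr-1", "1"],
})
```

A search within a single country would simply omit the "shards" parameter and query that country's machine directly.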