I just downloaded Solr to try out; it seems like it will replace
a ton of code I've written. I saw a few posts about the
FederatedSearch and skimmed the ideas at
http://wiki.apache.org/solr/FederatedSearch. The project I am
working on has several Lucene indexes 20-40GB in size spread among a
few machines. I've also run into problems figuring out how to
work with Lucene in a distributed fashion, though all of my
difficulties were in indexing; searching with a MultiSearcher and
a few custom classes on top of the hits was not that difficult.
Indexing involved using a SQL database as a master db, so you
could find documents by their unique ID, and a JMS server to
distribute additions, deletions, and updates to each of the
indexing servers. I eventually replaced the JMS server with
something custom I wrote that is much more lightweight and less
prone to bogging down.
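In case it helps to see the shape of that setup, here is a rough
sketch of the kind of lightweight replacement I mean: a distributor
fans add/delete/update operations out to one queue per indexing
server, and each server drains its own queue. The names here
(UpdateDistributor, UpdateOp) are purely illustrative, not from
Solr or from my actual code:

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative sketch only: a minimal in-process stand-in for the
// JMS server, fanning index updates out to each indexing server.
public class UpdateDistributor {
    public enum Kind { ADD, DELETE, UPDATE }

    public static final class UpdateOp {
        public final Kind kind;
        public final String uniqueId;   // key into the master SQL db
        public UpdateOp(Kind kind, String uniqueId) {
            this.kind = kind;
            this.uniqueId = uniqueId;
        }
    }

    // one queue per indexing server, so a slow or down server
    // can't block updates headed for the others
    private final List<BlockingQueue<UpdateOp>> serverQueues =
        new CopyOnWriteArrayList<>();

    // each indexing server registers once and drains its own queue
    public BlockingQueue<UpdateOp> registerServer() {
        BlockingQueue<UpdateOp> q = new LinkedBlockingQueue<>();
        serverQueues.add(q);
        return q;
    }

    // fan the operation out to every registered indexing server;
    // offer() always succeeds on an unbounded LinkedBlockingQueue
    public void publish(UpdateOp op) {
        for (BlockingQueue<UpdateOp> q : serverQueues) {
            q.offer(op);
        }
    }
}
```

The real thing obviously also needs persistence and redelivery for
servers that are down, which is where JMS earns its weight.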
I'd be curious whether Yonik is still on the list and whether he
or anyone else has any new ideas for federated searching.
I'm also interested in this. For me, I don't need sorted output,
faceted browsing, or alternative output formats - so something along
the lines of the "Merge XML responses w/o Schema" proposal would be
just fine.
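To make concrete what I mean by a schema-less merge, here's
roughly the kind of thing I have in mind: the merger just
concatenates the <doc> elements from each sub-searcher's XML into
one combined response, without interpreting any fields. The
<response>/<doc> element names below are placeholders I picked for
illustration, not the actual Solr response format:

```java
import java.io.StringReader;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Sketch of "merge XML responses w/o schema": collect every <doc>
// from each sub-searcher response into a single <response>, with
// no knowledge of what fields the docs contain.
public class XmlResponseMerger {
    public static Document merge(List<String> responses) throws Exception {
        DocumentBuilder db =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document merged = db.newDocument();
        Element root = merged.createElement("response");
        merged.appendChild(root);
        for (String xml : responses) {
            Document sub = db.parse(new InputSource(new StringReader(xml)));
            NodeList docs = sub.getElementsByTagName("doc");
            for (int i = 0; i < docs.getLength(); i++) {
                // importNode deep-copies the <doc> into the merged tree
                root.appendChild(merged.importNode(docs.item(i), true));
            }
        }
        return merged;
    }
}
```

Since I don't need sorted output, simple concatenation like this
is enough; a merger that had to interleave by score would need to
peek into the docs.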
Open issues:
1. How much better (if at all) would it be to use Hadoop RPC
(versus HTTP) to call the sub-searchers? I'm assuming it has
better performance, and there might be fewer connectivity issues,
but then you aren't leveraging the work being done on embedded
Jetty, for example. Anybody have data points on relative
performance?
2. Is there one master schema on the "main" search server that could
get distributed to the remote searchers, or would that be part of a
snappuller-ish update mechanism?
Thanks,
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"