Jack L wrote:
This is a very interesting discussion. I have a few questions after
reading Tim's and Venkatesh's emails:
To Tim:
1. Is there any reason you don't want to use HTTP? Since Solr already
has an HTTP interface, I suppose HTTP is the simplest way for the
merger/search broker to communicate with the Solr servers. Hadoop and
Ice would both require some additional work - that is, if you are
using Solr and not Lucene directly.
2. "Do you broadcast to the slaves as to who owns a document?"
Do the searchers need to know who has what document?
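For what it's worth, here is a minimal sketch of what that HTTP approach could look like: a broker fans a query out to several Solr servers and merges the hits by score. The shard URLs, the `wt=json`/`fl=*,score` parameters, and the merge-by-score rule are my assumptions, not anyone's actual setup:

```python
# A hypothetical search broker: fan one query out to several Solr
# servers over plain HTTP and merge the hit lists by score.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SHARDS = ["http://solr1:8983/solr", "http://solr2:8983/solr"]  # made-up hosts

def merge(result_lists, rows=10):
    """Combine per-shard hit lists and keep the top `rows` by score."""
    merged = [doc for docs in result_lists for doc in docs]
    merged.sort(key=lambda d: d.get("score", 0.0), reverse=True)
    return merged[:rows]

def search(query, rows=10):
    """Query every shard, then merge (happy path only, no failover)."""
    params = urlencode({"q": query, "rows": rows, "fl": "*,score", "wt": "json"})
    results = []
    for base in SHARDS:
        with urlopen(f"{base}/select?{params}") as resp:
            results.append(json.load(resp)["response"]["docs"])
    return merge(results, rows)
```

The point is just that the merger only needs HTTP and JSON parsing; it doesn't need to know which shard owns which document.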
To Venkatesh:
1. I suppose Solr is OK handling 20 million documents - I hope I'm
right, because that's what I'm planning on doing :) Was it storage
capacity that led you to use multiple Solr servers?
An open question: what's the best way to manage server addition?
- If hash value-based partitioning is used, re-indexing all the
documents will be needed whenever a server is added.
- Otherwise, a database seems to be required to track the documents.
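To make the re-indexing cost concrete, here is a small sketch (the hash function and document IDs are made up) of how many documents change owners when a hash-mod-N layout grows from 3 to 4 servers:

```python
# Why hash partitioning makes server addition painful: with
# shard = hash(id) % N, bumping N from 3 to 4 reassigns most
# documents, so nearly the whole collection must be re-indexed.
import zlib

def shard(doc_id, num_servers):
    # crc32 is a stable stand-in for whatever hash the partitioner uses
    return zlib.crc32(doc_id.encode()) % num_servers

doc_ids = [f"doc-{i}" for i in range(1000)]
moved = sum(1 for d in doc_ids if shard(d, 3) != shard(d, 4))
# Roughly three quarters of the documents land on a different server.
```

This movement is what schemes like consistent hashing were designed to limit; the alternative, as noted above, is a database that tracks where each document lives.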
Jack,
My big stumbling blocks were with indexing more so than searching. I
did put together an RMI-based system to search multiple Lucene
servers, and the searchers don't need to know where everything is.
However, with indexing, at some point something needs to know where to
send a document for updating, or whom to tell to delete a document -
whether that is the server that does the processing or some sort of
broker. The processing machines could do the DB lookup and talk to
Solr over HTTP, no problem, and this is part of what I am considering
doing. However, I have some extra code on the indexing machines to
handle DB updates etc., though I might find a way to move this
elsewhere in the system so I can have a pretty much pure Solr server
with just a few custom items (like my own Similarity or QueryParser).
I suppose the DB could be moved from SQL to Lucene in the future as well.
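One way to picture that broker/DB role: a lookup table that remembers which server owns each document, so adds, updates, and deletes can all be routed without the searchers caring about the layout. This is only an illustrative sketch (the class, the least-loaded placement rule, and the in-memory dict standing in for the DB are all my inventions):

```python
# Hypothetical indexing broker: an owner table (standing in for the DB)
# records which server holds each document, so updates go back to the
# same server and deletes can be routed to the right place.
class IndexBroker:
    def __init__(self, servers):
        self.servers = list(servers)
        self.owner = {}  # doc_id -> server; this is the "DB" in the email

    def route_add(self, doc_id):
        # New documents go to the least-loaded server; known documents
        # go back to their current owner so the update replaces in place.
        if doc_id not in self.owner:
            counts = {s: 0 for s in self.servers}
            for s in self.owner.values():
                counts[s] += 1
            self.owner[doc_id] = min(self.servers, key=lambda s: counts[s])
        return self.owner[doc_id]

    def route_delete(self, doc_id):
        # Returns the server to notify, or None if we never had the doc.
        return self.owner.pop(doc_id, None)
```

Whether the table lives in SQL, in Lucene itself, or in the broker's memory is then an implementation detail hidden from both the indexers and the searchers.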