Look at http://issues.apache.org/jira/browse/SOLR-303
Please note that it is still a work in progress, so you may not be able to use it immediately.

On Jan 16, 2008 10:53 AM, Srikant Jakilinki <[EMAIL PROTECTED]> wrote:
> Hi All,
>
> There is a requirement in our group of indexing and searching several
> millions of documents (TREC) in real-time and millisecond responses.
> For the moment we are preferring scale-out (throw more commodity
> machines) approaches rather than scale-up (faster disks, more
> RAM). This is in turn inspired by the "Scale-out vs. Scale-up" paper
> (mail me if you want a copy) in which it was proven that this kind of
> distribution scales better and is more resilient.
>
> So, are there any resources available (Wiki, Tutorials, Slides, README
> etc.) that throw light and guide newbies on how to run Solr in a
> multi-machine scenario? I have gone through the mailing lists and site
> but could not really find any answers or hands-on stuff to do so. An
> ad hoc guideline to get things working with 2 machines might just be
> enough but for the sake of thinking out loud and soliciting responses
> from the list, here are my questions:
>
> 1) Solr that has to handle a fairly large index which has to be split
> up on multiple disks (using Multicore?)
> - Space is not a problem since we can use NFS but that is not
> recommended as we would only exploit 1 processor
> 2) Solr that has to handle a large collective index which has to be
> split up on multi-machines
> - The index is ever increasing (TB scale) and dynamic and all of it
> has to be searched at any point
> 3) Solr that has to exploit multi-machines because we have plenty of
> them in a tightly coupled P2P scenario
> - Machines are not a problem but will they be if they are of varied
> configurations (PIII to Core2; Linux to Vista; 32-bit to 64-bit; J2SE
> 1.1 to 1.6)
> 4) Solr that has to distribute load on several machines
> - The index(es) could be common though, say by using a distributed
> filesystem (Hadoop?)
>
> In each of the above cases (we might use all of these strategies in
> various use cases) the application should use Solr as a strict backend
> and named service (IP or host:port) so that we can expose this
> application (and the service) to the web or intranet. Machine failures
> should be tolerated too. Also, does Solr manage load balancing out of
> the box if it was indeed configured to work with multi-machines?
>
> Maybe it is superfluous but is Solr and/or Nutch the only way to use
> Lucene in a multi-machine environment? Or is there some hidden
> document/project somewhere that makes it possible by exposing a
> regular Lucene process over the network using RMI or something? It is
> my understanding (could be wrong) that Nutch and, to some extent, Solr
> do not perform well when there is a lot of indexing activity in
> parallel to search. Batch processing is also there and perhaps we can
> use Nutch/Solr there. Even so, we need multi-machine directions.
>
> I am sure that multi-machines make possible a lot of other ways
> which might solve the goal better and that others have practical
> experience with. So, any advice and tips are also very welcome. We
> intend to document things and do some benchmarking along the way in
> the open spirit.
>
> Really sorry for the length but I hope some answers are forthcoming.
>
> Cheers,
> Srikant
>

--
Regards,
Shalin Shekhar Mangar.