Federated Search

2007-02-27 Thread Tim Patton
I just downloaded Solr to try out; it seems like it will replace a ton 
of code I've written.  I saw a few posts about FederatedSearch and 
skimmed the ideas at http://wiki.apache.org/solr/FederatedSearch.  The 
project I am working on has several Lucene indexes 20-40GB in size 
spread among a few machines.  I've also run into problems figuring out 
how to work with Lucene in a distributed fashion, though all of my 
difficulties were in indexing; searching with a MultiSearcher and a few 
custom classes on top of the hits was not that difficult.


Indexing involved using a SQL database as a master DB so you could find 
documents by their unique ID, and a JMS server to distribute additions, 
deletions and updates to each of the indexing servers.  I eventually 
replaced the JMS server with something custom I wrote that is much more 
lightweight and less prone to bogging down.
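The master-DB-plus-queue arrangement described above can be sketched roughly as follows. This is a hypothetical reconstruction, not Tim's actual code: the class and message names are made up, a ConcurrentHashMap stands in for the SQL lookup of document owners, and bounded BlockingQueues stand in for the per-server destinations (a bounded queue naturally pauses producers when consumers back up, which is the behavior the home-grown replacement provided).

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

/**
 * Hypothetical sketch of the master-DB + message-server pattern: the map
 * plays the role of the SQL master (docId -> owning server) and the
 * bounded queues play the role of the per-server JMS destinations.
 */
public class IndexRouter {
    private final Map<String, Integer> docOwner = new ConcurrentHashMap<>();
    private final BlockingQueue<String>[] serverQueues;
    private int nextServer = 0;

    @SuppressWarnings("unchecked")
    public IndexRouter(int numServers) {
        serverQueues = new BlockingQueue[numServers];
        for (int i = 0; i < numServers; i++) {
            // A bounded queue blocks producers when consumers fall behind.
            serverQueues[i] = new LinkedBlockingQueue<>(10_000);
        }
    }

    /** Additions are handed out round-robin; the "DB" records the owner. */
    public synchronized void add(String docId) {
        int server = nextServer;
        nextServer = (nextServer + 1) % serverQueues.length;
        docOwner.put(docId, server);
        enqueue(server, "ADD:" + docId);
    }

    /** Deletes are routed via the DB lookup to the server owning the doc. */
    public void delete(String docId) {
        Integer server = docOwner.remove(docId);
        if (server != null) {
            enqueue(server, "DEL:" + docId);
        }
    }

    private void enqueue(int server, String message) {
        try {
            serverQueues[server].put(message);  // blocks if the queue is full
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public BlockingQueue<String> queueFor(int server) {
        return serverQueues[server];
    }
}
```

In a real deployment the map would be the SQL database and each queue a connection to an indexing server; the routing logic is the part this sketch is meant to show.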


I'd be curious whether Yonik is still on the list and whether he or 
anyone else has any new ideas for Federated Searching.


Tim P.



Casting Exception with Similarity

2007-03-01 Thread Tim Patton
I'm trying to convert some of my code over to Solr, but I keep getting 
class cast exceptions when I try to use my own similarity class, like this:


Caused by: java.lang.ClassCastException: 
dealcatcher.kolinka.lucene.similarity.TestSimilarity cannot be cast to 
org.apache.lucene.search.Similarity
	at org.apache.solr.schema.IndexSchema.readConfig(IndexSchema.java:363)
	... 21 more

Here is TestSimilarity:

package dealcatcher.kolinka.lucene.similarity;

import org.apache.lucene.search.DefaultSimilarity;

public class TestSimilarity extends DefaultSimilarity
{
}

And my schema.xml:




It works fine if I use:



The jar with my class is located in example/ext; I get a class not 
found error if I put it elsewhere.  Should I be locating this jar 
elsewhere?  Do I need to put lucene-nightly in the same directory?  I 
also get a class not found error if Lucene isn't located there, which 
seems strange since Solr should be able to find Lucene classes without 
my help.




Re: Casting Exception with Similarity

2007-03-02 Thread Tim Patton

Chris,

I figured out my problem.  My own jar must be in the example/solr/lib 
directory (which does not exist in the download).  I found a hint to 
this on the mailing list.  The docs don't indicate this anywhere 
prominent.  Perhaps the lib directory should exist in the default 
download in the future?


Tim

Chris Hostetter wrote:

: class cast exceptions when I try to use my own similarity class, like this:

: public class TestSimilarity extends DefaultSimilarity
: {
: }

I have two alternate guesses:

 1) this may be a misleading error message; the real problem may be that
without a default constructor, it can't instantiate your Similarity.
 2) this may be the behavior when the version of
Similarity/DefaultSimilarity you compile your class against doesn't
match the version loaded by the JVM Solr is running in (i think there
is a more specific JVM error when that happens though)

...to be safe, add a default constructor, and compile against the
lucene-core jar in the lib directory of the Solr release you are using.
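Hoss's first guess — that Solr loads the class by name and fails to instantiate it reflectively without a public no-arg constructor — can be demonstrated without any Solr or Lucene jars. The classes below are illustrative stand-ins (not the real DefaultSimilarity hierarchy); note that a class with no declared constructors, like Tim's TestSimilarity, already gets an implicit public no-arg constructor, so guess 1 would only bite if some other constructor had been declared.

```java
// Self-contained illustration of the reflective-instantiation requirement.
// Base stands in for DefaultSimilarity; the two subclasses show the
// difference between having and lacking a no-arg constructor.
public class ReflectionDemo {
    public static class Base {}

    public static class WithDefaultCtor extends Base {
        public WithDefaultCtor() {}
    }

    public static class WithoutDefaultCtor extends Base {
        public WithoutDefaultCtor(int unused) {}  // declaring this removes the implicit no-arg ctor
    }

    /** Mimics a plugin loader: instantiate the class via its no-arg constructor. */
    public static Base instantiate(Class<? extends Base> cls) {
        try {
            return cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            return null;  // a loader would surface this as a configuration error
        }
    }

    public static void main(String[] args) {
        System.out.println(instantiate(WithDefaultCtor.class) != null);    // true
        System.out.println(instantiate(WithoutDefaultCtor.class) != null); // false
    }
}
```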

: The jar with my class is located in example/ext, I get a class not found
: if I put it elsewhere.  Should I be locating this jar elsewhere?  Do I

you should create a "lib" directory in your solr.home directory and put
the jar there ... example/ext is where jars you want jetty to load in one
of the really low level class loaders live -- there's no need to put
anything Solr specific there.

more details are in the solr home readme (example/solr/README.txt) and on
the SolrPlugins wiki...

http://wiki.apache.org/solr/SolrPlugins

: need to put lucene-nightly in the same directory?  I also get a class

no, that's just because you tried to use example/ext, which is loaded well
before the solr.war; so in order for you to put your jar in example/ext,
everything else your jar references needs to be there as well.


-Hoss






Re: Federated Search

2007-03-05 Thread Tim Patton



Venkatesh Seetharam wrote:

Hi Tim,

Howdy. I saw your post on the Solr newsgroup and it caught my attention. 
I'm working on a similar problem for searching a vault of over 100 million 
XML documents. I already have the encoding part done using Hadoop and 
Lucene. It works like a charm. I create N index partitions and have 
been trying to wrap Solr to search each partition, with a search broker 
that merges the results and returns them.


I'm curious about how you have solved the distribution of additions, 
deletions and updates to each of the indexing servers. I use a 
partitioner based on a hash of the document id. Do you broadcast to the 
slaves as to who owns a document?
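A minimal sketch of the hash-based partitioner described above (the names are illustrative, not Venkatesh's actual code): because the same document id always hashes to the same partition, a document's owner can be computed directly instead of broadcast to the slaves.

```java
// Hypothetical sketch of a hash-based partitioner: docId -> partition.
// Assumes a fixed number of partitions, as in the setup described above.
public class HashPartitioner {
    private final int numPartitions;

    public HashPartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    public int partitionFor(String docId) {
        // Math.floorMod keeps the result non-negative even when
        // hashCode() is negative.
        return Math.floorMod(docId.hashCode(), numPartitions);
    }
}
```

This is also the "reverse" direction for free: given a docId, the owning partition is just `partitionFor(docId)`, with no lookup table required — as long as `numPartitions` never changes.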


Also, I'm looking at Hadoop RPC and ICE (www.zeroc.com) for 
distributing the search across these Solr servers. I'm not using HTTP.


Any ideas are greatly appreciated.

PS: I did subscribe to the Solr newsgroup just now but did not receive a 
confirmation, and hence am sending this to you directly.


--
Thanks,
Venkatesh

"Perfection (in design) is achieved not when there is nothing more to 
add, but rather when there is nothing more to take away."

- Antoine de Saint-Exupéry



I used a SQL database to keep track of which server had which document. 
Then I originally used JMS and would use a selector for which server 
number the document should go to.  I switched over to a home-grown, 
lightweight message server since JMS behaves really badly when it backs 
up and I couldn't find a server that would simply pause the producers if 
there was a problem with the consumers.  Additions are pretty much 
assigned randomly to whichever server gets them first.  At this point I 
am up to around 20 million documents.


The hash idea sounds really interesting, and if I had a fixed number of 
indexes it would be perfect.  But I don't know how big the index will 
grow, and I wanted to be able to add servers at any point.  I would like 
to eliminate any outside dependencies (SQL, JMS), which is why a 
distributed Solr would be appealing; it would let me focus on other areas.
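One standard way to reconcile hash partitioning with adding servers at any point — a technique neither poster names, offered here as an aside — is consistent hashing: servers are placed on a hash ring (with virtual nodes for load balance), and adding a server only remaps the documents that fall between it and its neighbor on the ring, rather than forcing a full re-index. A rough sketch, with illustrative names throughout:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative consistent-hash ring: each server appears at many points
// on the ring (virtual nodes), and a document belongs to the first server
// clockwise from its hash.  Adding a server moves only a slice of keys.
public class ConsistentHashRing {
    private static final int VNODES = 100;
    private final SortedMap<Integer, String> ring = new TreeMap<>();

    private static int hash(String key) {
        return key.hashCode() * 0x9E3779B9;  // cheap integer mix; illustrative only
    }

    public void addServer(String server) {
        for (int v = 0; v < VNODES; v++) {
            ring.put(hash(server + "#" + v), server);
        }
    }

    /** First server at or after the doc's hash, wrapping around the ring.
     *  Assumes at least one server has been added. */
    public String serverFor(String docId) {
        int h = hash(docId);
        SortedMap<Integer, String> tail = ring.tailMap(h);
        int key = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
        return ring.get(key);
    }
}
```

With plain modulo hashing, growing from N to N+1 servers remaps roughly all documents; with a ring, only about 1/(N+1) of them move, which is what makes "drop in a new machine" tractable without a tracking database.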


How did you work around not being able to update a Lucene index that is 
stored in Hadoop?  I know there were changes in Lucene 2.1 to support 
this, but I haven't looked that far into it yet; I've just been testing 
the new IndexWriter.  As an aside, I hope those features can be used by 
Solr soon (if they aren't already in the nightlies).


Tim



Re[2]: Federated Search

2007-03-05 Thread Tim Patton



Jack L wrote:

This is a very interesting discussion. I have a few questions while
reading Tim and Venkatesh's emails:

To Tim:
1. is there any reason you don't want to use HTTP? Since solr has
   an HTTP interface already, I suppose using HTTP is the simplest
   way to communicate with the solr servers from the merger/search broker.
   hadoop and ice would both require some additional work - this is
   if you are using solr and not Lucene directly.

2. "Do you broadcast to the slaves as to who owns a document?"
   Do the searchers need to know who has what document?
   
To Venkatesh:

1. I suppose solr is ok to handle 20 million documents - I hope I'm
   right because that's what I'm planning on doing :) Is it because
   of storage capacity that you chose to use multiple solr
   servers?

An open question: what's the best way to manage server addition?
- If hash-value-based partitioning is used, re-indexing all
  the documents will be needed.
- Otherwise, a database seems to be required to track the documents.



Jack,

My big stumbling blocks were with indexing more so than searching.  I 
did put together an RMI-based system to search multiple Lucene servers, 
and the searchers don't need to know where everything is.  However, with 
indexing, at some point something needs to know where to send the 
documents for updating, or who to tell to delete a document, whether it 
is the server that does the processing or some sort of broker.  The 
processing machines could do the DB lookup and talk to Solr over HTTP 
no problem, and this is part of what I am considering doing.  However, I 
have some extra code on the indexing machines to handle DB updates 
etc., though I might find a way to move this elsewhere in the system 
so I can have pretty much a pure Solr server with just a few custom 
items (like my own Similarity or QueryParser).
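The broker side of an RMI (or HTTP) setup like the one described above boils down to combining per-server top-k lists into one global top-k list. A hedged sketch with made-up types — the transport layer, and the question of whether scores from different servers are directly comparable (global idf), are deliberately glossed over:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative broker-side merge: each shard returns its hits already
// scored, and the broker sorts the union and keeps the global top k.
public class SearchBroker {
    public record ScoredDoc(String docId, float score) {}

    public static List<ScoredDoc> mergeTopK(List<List<ScoredDoc>> perShardHits, int k) {
        List<ScoredDoc> all = new ArrayList<>();
        for (List<ScoredDoc> hits : perShardHits) {
            all.addAll(hits);
        }
        // Assumes scores are comparable across shards; with very skewed
        // shards the idf statistics differ and ranking can drift.
        all.sort(Comparator.comparingDouble(ScoredDoc::score).reversed());
        return all.subList(0, Math.min(k, all.size()));
    }
}
```

In practice the shard queries would run in parallel and each shard would only return its own top k, which bounds the merge to N*k entries regardless of index size.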


I suppose the DB could be moved from SQL to Lucene in the future as well.



Re: Casting Exception with Similarity

2007-03-14 Thread Tim Patton



Chris Hostetter wrote:

: I figured out my problem.  My own jar must be in the example/solr/lib
: directory (which does not exist in the download).  I found a hint to
: this on the mailing list.  The docs don't indicate this anywhere
: prominent.  Perhaps the lib directory should exist in the default
: download in the future?

it's mentioned in both the plugin wiki I listed and the README for
solr.home (example/solr/README.txt) ... do you have any suggestions about
where else we should document it?

we don't include the lib directory in the example solr home because adding
"plugins" is considered a little above and beyond basic usage ... we
try to keep the example as simple as possible.


-Hoss




Makes sense.  I guess I was looking for a mention in the online 
documentation for the XML file, where it describes how to specify your 
own similarity.  Somehow I never stumbled on the other two spots.




Re: SPAM-LOW: Re: Federated Search

2007-03-14 Thread Tim Patton
I have several indexes now (4 at the moment, 20GB each, and I want to be 
able to drop in a new machine easily).  I'm using SQL Server as a DB and 
it scales well.  The DB doesn't get hit too hard, mostly doing location 
lookups, and the app does some checking to make sure a document has 
really changed before updating that back in the DB or the index.  When a 
new server is added it randomly picks up additions from the message 
server (it's approximately round-robin) and the rest of the system 
really doesn't even need to know about it.


I've realized partitioned indexing is a difficult but solvable problem. 
It could be a big project though.  I mean, we have all solved it in our 
own way, but no one has a general solution.  Distributed searching might 
be a better area to add to Solr, since that should basically be the same 
for everyone.  I'm going to mess around with Jini on my own indexes; 
there's finally a new book out to go with the newer versions.


How were you planning on using Solr with Hadoop?  Maybe I don't fully 
understand how hadoop works.


Tim

Venkatesh Seetharam wrote:

Hi Tim,

Thanks for your response. Interesting idea. Does the DB scale? Do you have
one single index which you plan to use Solr for, or do you have multiple
indexes?

> But I don't know how big the index will grow and I wanted to be able to
> add servers at any point.

I'm thinking of having N partitions with a max of 10 million documents per
partition. Adding a server should not be a problem, but the newly added
server would take time to grow so that the distribution of documents is
equal across the cluster. I've tested with 50 million documents of 10 size
each and it looks very promising.


> The hash idea sounds really interesting and if I had a fixed number of
> indexes it would be perfect.

I'm in fact looking around for a reverse-hash algorithm where, given a
docId, I should be able to find which partition contains the document, so I
can save cycles on broadcasting to slaves.

I mean, even if you use a DB, how have you solved the problem of
distribution when a new server is added into the mix?

We have the same problem, since we get daily updates to documents and
document metadata.


> How did you work around not being able to update a lucene index that is
> stored in Hadoop?

I do not use HDFS. I use a NetApp mounted on all the nodes in the cluster
and hence did not need any change to Lucene.

I plan to index using Lucene/Hadoop and use Solr as the partition searcher
and a broker which would merge the results and return 'em.

Thanks,
Venkatesh


Re: Casting Exception with Similarity

2007-03-14 Thread Tim Patton

Sweet, looks like someone beat me to it.

Tim

Chris Hostetter wrote:

:
: Makes sense, I guess I was looking for a mention in the online
: documentation for the xml file where it mentions how to specify your own
: similarity.  Somehow I never stumbled on the other two spots.


Hmmm... you mean http://wiki.apache.org/solr/SchemaXml right?

yeah i can see how that would be a little confusing ... i've made some
updates, feel free to edit the docs further if you think it's still not
clear.


-Hoss