Re: Index & search questions; special cases

2006-11-14 Thread Chris Hostetter

: > : Nutch has phrase pre-filtering which helps with this. It indexes the
: > : phrase fragments as separate terms and uses that set of matches to
: > : filter the set of matching documents.

: > That reminds me ... i seem to remember someone saying once that Nutch also
: > builds word based n-grams out of its stop words, so searches on "the"
: > or "on" won't match anything because those words are never indexed as
: > single tokens, but if a document contains "the dog in the house" it would
: > match a search on "in the" because the Analyzer would treat that as a
: > single token "in_the".

: This looks like exactly what I'm looking for. Is it related to the above
: 'nutch pre-filtering'? This way if I stopword single letters and
: numbers, it would still index 'hepatitis_a' as a single token, and match
: a search on 'hepatitis a' (non-phrase search) without hitting 'a patient
: has hepatitis'? I guess i'd have to apply the filter to the query too,
: so it turns the query into hepatitis_a?

right ... i think we were both talking about the same feature, which Otis
says is in Nutch's "CommonGrams" class...

http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/analysis/CommonGrams.java?view=markup
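
To make the idea concrete, here is a minimal sketch of the behavior as
recalled in this thread. It is not the Nutch code and has not been checked
against it: the class name, the whitespace tokenizing, and the rule "join
any adjacent pair that contains a common word" are all assumptions made
for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; NOT the Nutch CommonGrams implementation.
final class CommonGramsSketch {

    // Returns index terms for whitespace-separated text. Common (stop) words
    // are never emitted on their own, but every adjacent pair that contains
    // a common word is emitted as one joined term such as "in_the".
    static List<String> tokens(String text, Set<String> commonWords) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            boolean common = commonWords.contains(words[i]);
            if (!common) {
                out.add(words[i]); // ordinary words are kept as single terms
            }
            if (i + 1 < words.length
                    && (common || commonWords.contains(words[i + 1]))) {
                out.add(words[i] + "_" + words[i + 1]); // joined pair
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> common = new HashSet<>(Arrays.asList("the", "in", "on", "a"));
        System.out.println(tokens("the dog in the house", common));
        System.out.println(tokens("hepatitis a", common));
        System.out.println(tokens("a patient has hepatitis", common));
    }
}
```

With that word list, "the dog in the house" yields the_dog, dog, dog_in,
in_the, the_house, house: there is no bare "the" or "in" term, but "in_the"
is present. Likewise "hepatitis a" produces "hepatitis_a", while
"a patient has hepatitis" does not, provided the same analysis is applied
to the query.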

: Any chance at all this kind of filter gets implemented into solr? If
: not, indications on how to do it myself would be appreciated - I can't

CommonGrams itself seems to have some other dependencies on Nutch because
of other utilities in the same class, but based on a quick skim, what you
really want is the nested "private static class Filter extends
TokenFilter" which doesn't really have any external dependencies.  If you
extract that class into some more specifically named "CommonGramsFilter",
all you need after that to use it in Solr is a simple little
"FilterFactory" so you can reference it in your schema.xml ... you can use
the StopFilterFactory as a template since you'll need exactly the same
initialization (get the name of a word list file from the init params,
parse it, and build a word set out of it)...

http://svn.apache.org/viewvc/incubator/solr/trunk/src/java/org/apache/solr/analysis/StopFilterFactory.java?view=markup

...all you really need to change is that the "create" method should return
a new "CommonGramsFilter" instead of a StopFilter.
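
A rough sketch of that factory shape, written so it compiles on its own.
Everything here is a stand-in: the real classes would use Lucene's
TokenStream/TokenFilter and Solr's factory base class rather than
java.util.Iterator, and the "words" parameter name and the '#' comment
handling are assumptions, not something confirmed in this thread.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the filter extracted from Nutch; the real one
// would extend Lucene's TokenFilter and wrap a TokenStream.
final class CommonGramsFilterStandIn {
    final Iterator<String> input;
    final Set<String> commonWords;

    CommonGramsFilterStandIn(Iterator<String> input, Set<String> commonWords) {
        this.input = input;
        this.commonWords = commonWords;
    }
}

// Hypothetical factory following the steps described for StopFilterFactory.
final class CommonGramsFilterFactorySketch {
    private final Set<String> commonWords = new HashSet<>();

    // Looks up the word-list file name in the init params ("words" is an
    // assumed name), reads one word per line, and builds the word set.
    // Blank lines and lines starting with '#' are skipped (an assumption).
    void init(Map<String, String> params) {
        String wordsFile = params.get("words");
        if (wordsFile == null) {
            throw new IllegalArgumentException("missing required init param: words");
        }
        try {
            for (String line : Files.readAllLines(Paths.get(wordsFile), StandardCharsets.UTF_8)) {
                String word = line.trim();
                if (!word.isEmpty() && !word.startsWith("#")) {
                    commonWords.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // The one substantive difference from a stop-word factory: hand back
    // the common-grams filter instead of a stop filter.
    CommonGramsFilterStandIn create(Iterator<String> input) {
        return new CommonGramsFilterStandIn(input, commonWords);
    }

    public static void main(String[] args) throws IOException {
        Path list = Files.createTempFile("commonwords", ".txt");
        Files.write(list, Arrays.asList("# common words", "the", "in"), StandardCharsets.UTF_8);
        CommonGramsFilterFactorySketch factory = new CommonGramsFilterFactorySketch();
        factory.init(Collections.singletonMap("words", list.toString()));
        System.out.println(factory.create(Arrays.asList("in", "the").iterator()).commonWords);
        Files.delete(list);
    }
}
```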

Incidentally: most of the code in CommonGrams.Filter seems to be dealing
with the buffering of tokens ... it may be easier to reimplement the logic
with Solr's BufferedTokenStream as a base class.
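
The buffering boils down to one token of lookahead plus a small queue of
output that hasn't been returned yet. The sketch below shows just that,
with java.util.Iterator standing in for a token stream. It does not use
BufferedTokenStream (its API isn't shown in this thread), and the class
name and pairing rule are assumptions for illustration.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

// Hypothetical sketch: Iterator<String> stands in for a token stream.
final class LookaheadCommonGrams implements Iterator<String> {
    private final Iterator<String> input;
    private final Set<String> commonWords;
    private final Deque<String> pending = new ArrayDeque<>(); // queued output
    private String current; // next input word that has not been processed

    LookaheadCommonGrams(Iterator<String> input, Set<String> commonWords) {
        this.input = input;
        this.commonWords = commonWords;
        this.current = input.hasNext() ? input.next() : null;
    }

    // Consumes input words until at least one output term is queued
    // or the input is exhausted.
    private void fill() {
        while (pending.isEmpty() && current != null) {
            String next = input.hasNext() ? input.next() : null; // peek ahead
            boolean common = commonWords.contains(current);
            if (!common) {
                pending.add(current); // ordinary word: emit as-is
            }
            if (next != null && (common || commonWords.contains(next))) {
                pending.add(current + "_" + next); // pair containing a common word
            }
            current = next;
        }
    }

    @Override
    public boolean hasNext() {
        fill();
        return !pending.isEmpty();
    }

    @Override
    public String next() {
        fill();
        return pending.remove(); // throws NoSuchElementException when exhausted
    }

    public static void main(String[] args) {
        Set<String> common = new HashSet<>(Arrays.asList("the", "in"));
        Iterator<String> words = Arrays.asList("the", "dog", "in", "the", "house").iterator();
        Iterator<String> terms = new LookaheadCommonGrams(words, common);
        while (terms.hasNext()) {
            System.out.println(terms.next());
        }
    }
}
```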

-Hoss



Re: Index & search questions; special cases

2006-11-14 Thread Erik Hatcher


On Nov 14, 2006, at 2:00 PM, Chris Hostetter wrote:
> CommonGrams itself seems to have some other dependencies on nutch
> because of other utilities in the same class, but based on a quick
> skim, what you really want is the nested "private static class Filter
> extends TokenFilter" which doesn't really have any external
> dependencies.  If you extract that class into some more specifically
> named "CommonGramsFilter",...


Yeah, the Nutch code is highly intertwined with its unique  
configuration infrastructure and makes it hard to pull pieces of it  
out like this.


Erik



New Feature: ${solr.home}/lib/ dir for "plugins"

2006-11-14 Thread Chris Hostetter

Hey folks, I just wanted to let you all know about a new feature just
committed yesterday (now available in the solr-2006-11-15 nightly build).

While Solr has always had some really handy hooks for loading your own
code to do analysis, request handlers, output writers, field types, cache
implementations, and all sorts of other goodies, it wasn't until I
started trying to document how to *use* custom code in various Servlet
Containers that I realized it was pretty much impossible unless you:

   a) repackaged the Solr WAR with your code inside.
   or  b) used Caucho Resin.

(guess which servlet container I've been using)

Which brings us to the new functionality: It's now possible to create a
"lib/" directory inside of your Solr Home directory, and place JARs in
that lib directory containing custom code you'd like to use -- Solr will
look at JARs in this directory when resolving class references you may
have in your solrconfig.xml and schema.xml.
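
For anyone curious what "look at JARs in this directory" amounts to in
general terms, here is a rough sketch using the JDK's URLClassLoader. It
is not the code that was committed (see the Jira issue below for that);
the class name, method name, and the default "solr/lib" path are made up
for illustration.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of lib-directory class loading; NOT Solr's code.
final class LibDirLoaderSketch {

    // Builds a class loader over every *.jar directly inside libDir,
    // delegating to the given parent loader for everything else.
    static URLClassLoader forLibDir(File libDir, ClassLoader parent) {
        List<URL> urls = new ArrayList<>();
        File[] jars = libDir.listFiles((dir, name) -> name.endsWith(".jar"));
        if (jars != null) { // null when libDir is missing or not a directory
            for (File jar : jars) {
                try {
                    urls.add(jar.toURI().toURL());
                } catch (MalformedURLException e) {
                    throw new IllegalArgumentException(e);
                }
            }
        }
        return new URLClassLoader(urls.toArray(new URL[0]), parent);
    }

    public static void main(String[] args) throws Exception {
        // Both defaults are placeholders; pass a real directory and class name.
        File libDir = new File(args.length > 0 ? args[0] : "solr/lib");
        String className = args.length > 1 ? args[1] : "java.lang.String";
        try (URLClassLoader loader =
                 forLibDir(libDir, LibDirLoaderSketch.class.getClassLoader())) {
            // Resolve a class name (such as one written in a config file)
            // through the lib-dir loader, falling back to the parent.
            Class<?> resolved = Class.forName(className, true, loader);
            System.out.println("resolved " + resolved.getName());
        }
    }
}
```

A class name that isn't found in any of the JARs or in the parent loader
fails with ClassNotFoundException, which is the kind of detail worth
including in a bug report.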

This has been tested in a variety of different servlet containers, and
seems to work fine -- but I'd like to request that anyone currently
"repacking" the solr.war to include any other code try using this new
functionality instead, and report any bugs you may encounter (along with
info about your OS, the Servlet Container you are using, the relevant
configuration files, etc...)

Repacking the solr.war to include your code will always be an option
available to Solr users who want a single WAR containing everything they
care about, but the goal of this new functionality is to make it easier to
use custom plugins without needing to jump through the repacking hoops.

More information about Plugins can be found on the wiki...
   http://wiki.apache.org/solr/SolrPlugins

More information about this feature can be found in Jira...
   http://issues.apache.org/jira/browse/SOLR-68


-Hoss



Re: New Feature: ${solr.home}/lib/ dir for "plugins"

2006-11-14 Thread David Halsted

Extremely cool.  This is going to be a big help for some things I'm
working on and I'm sure for others.  Many thanks!!
