Re: Integrating external stemmer in Solr and pre-processing text

2008-09-30 Thread Jaco
Hi, The suggested approach with a TokenFilter extending the BufferedTokenStream class works fine, and performance is OK: the external stemmer is now invoked only once for the complete search text. Also, from a functional point of view, the approach is useful, because it allows for other filtering (i.

Re: Integrating external stemmer in Solr and pre-processing text

2008-09-26 Thread Jaco
Thanks for these suggestions, will try it in the coming days and post my findings in this thread.

Bye, Jaco.

2008/9/26 Grant Ingersoll <[EMAIL PROTECTED]>
>
> On Sep 26, 2008, at 12:05 PM, Jaco wrote:
>
> > Hi Grant,
>>
>> In reply to your questions:
>>
>> 1. Are you having to restart/initialize

Re: Integrating external stemmer in Solr and pre-processing text

2008-09-26 Thread Grant Ingersoll
On Sep 26, 2008, at 12:05 PM, Jaco wrote:

Hi Grant,

In reply to your questions:

1. Are you having to restart/initialize the stemmer every time for your "slow" approach? Does that really need to happen?

It is invoking a COM object in Windows. The object is instantiated once for a token

Re: Integrating external stemmer in Solr and pre-processing text

2008-09-26 Thread Jaco
The overhead is not in the instantiation, but in the actual call to the COM object. The approach with one-time instantiation in the TokenFilterFactory, and the use of that object in the TokenFilter, is exactly what I tried. There is a factor of 10 performance gain when being able to do a single call
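To illustrate the point made in this message, that the cost lies in each call rather than in instantiation, here is a minimal sketch in plain Java that contrasts one call per token with a single call for the buffered text. Everything in it is an assumption made for illustration: `ExternalStemmer`, `CountingStemmer` and `BatchStemming` are hypothetical names, the toy stemmer only strips a trailing "s", and none of this is the Solr/Lucene API or the code actually used in the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the external (e.g. COM-backed) stemmer.
// Assumption: it accepts space-separated words and returns one stem per word.
interface ExternalStemmer {
    String stem(String text);
}

// Toy implementation that strips a trailing "s" and counts how often it is called.
class CountingStemmer implements ExternalStemmer {
    int calls = 0;

    @Override
    public String stem(String text) {
        calls++;
        StringBuilder out = new StringBuilder();
        for (String word : text.split(" ")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            boolean plural = word.length() > 1 && word.endsWith("s");
            out.append(plural ? word.substring(0, word.length() - 1) : word);
        }
        return out.toString();
    }
}

class BatchStemming {
    // "Slow" variant: one external call per token.
    static List<String> stemPerToken(List<String> tokens, ExternalStemmer stemmer) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            result.add(stemmer.stem(token));
        }
        return result;
    }

    // Batched variant: buffer all tokens, make a single external call,
    // then map the stems back to the tokens by position.
    static List<String> stemBatched(List<String> tokens, ExternalStemmer stemmer) {
        if (tokens.isEmpty()) {
            return new ArrayList<>();
        }
        String stemmed = stemmer.stem(String.join(" ", tokens));
        List<String> result = new ArrayList<>(Arrays.asList(stemmed.split(" ")));
        if (result.size() != tokens.size()) {
            // The positional mapping only works if the stemmer keeps a 1:1 word correspondence.
            throw new IllegalStateException("stemmer changed the number of words");
        }
        return result;
    }
}
```

Both variants return the same stems; the batched one pays the per-call overhead once per text instead of once per token, at the price of relying on the stemmer returning exactly one output word per input word.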

Re: Integrating external stemmer in Solr and pre-processing text

2008-09-26 Thread Chris Hostetter
: It is invoking a COM object in Windows. The object is instantiated once for
: a token stream, and then invoked once for each token. The invoke always has
: an overhead, not much to do about that (sigh...)

I also know nothing about COM, but based on your comments it sounds like instantiating yo

Re: Integrating external stemmer in Solr and pre-processing text

2008-09-26 Thread Jaco
Hi Grant,

In reply to your questions:

1. Are you having to restart/initialize the stemmer every time for your "slow" approach? Does that really need to happen?

It is invoking a COM object in Windows. The object is instantiated once for a token stream, and then invoked once for each token. The i

Re: Integrating external stemmer in Solr and pre-processing text

2008-09-26 Thread Grant Ingersoll
On Sep 26, 2008, at 9:40 AM, Jaco wrote:

Hi,

Here's some of the code of my Tokenizer:

    public class MyTokenizerFactory extends BaseTokenizerFactory {
        public WhitespaceTokenizer create(Reader input) {
            String text, normalizedText;
            try {
                text = IOUtils.toString(in

Re: Integrating external stemmer in Solr and pre-processing text

2008-09-26 Thread Jaco
Hi,

Here's some of the code of my Tokenizer:

    public class MyTokenizerFactory extends BaseTokenizerFactory {
        public WhitespaceTokenizer create(Reader input) {
            String text, normalizedText;
            try {
                text = IOUtils.toString(input);
                normalizedText= *i

Re: Integrating external stemmer in Solr and pre-processing text

2008-09-26 Thread Grant Ingersoll
How are you creating the tokens? What are you setting for the offsets and the positions? One thing that is helpful is Solr's built-in Analysis tool via the Admin interface (http://localhost:8983/solr/admin/). From there, you can plug in verbose mode, and see what the position and offsets a
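As a rough illustration of what "offsets" and "positions" mean in the question above, the following plain-Java sketch tokenizes on whitespace and records, for each token, its start and end character offsets in the original text and a position increment of 1. `SimpleToken` and `OffsetDemo` are hypothetical names invented for this example; this is not Lucene's `Token` class, only a simplified model under those assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical token model: the term text plus where it came from.
class SimpleToken {
    final String term;
    final int startOffset;       // index of the first character in the original text
    final int endOffset;         // index one past the last character
    final int positionIncrement; // distance (in positions) from the previous token

    SimpleToken(String term, int startOffset, int endOffset, int positionIncrement) {
        this.term = term;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
        this.positionIncrement = positionIncrement;
    }
}

class OffsetDemo {
    // Splits on whitespace while tracking character offsets into the ORIGINAL text.
    static List<SimpleToken> tokenize(String text) {
        List<SimpleToken> tokens = new ArrayList<>();
        int i = 0;
        int n = text.length();
        while (i < n) {
            // Skip any run of whitespace before the next token.
            while (i < n && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume the token itself.
            while (i < n && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                // Each token directly follows the previous one, so the increment is 1.
                tokens.add(new SimpleToken(text.substring(start, i), start, i, 1));
            }
        }
        return tokens;
    }
}
```

If the text is normalized or stemmed before tokenizing, offsets computed this way refer to the rewritten text and no longer to the original input, which is one possible reason to check them in the analysis page.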