Re: Integrating external stemmer in Solr and pre-processing text

Grant Ingersoll Fri, 26 Sep 2008 07:53:16 -0700


On Sep 26, 2008, at 9:40 AM, Jaco wrote:

Hi,

Here's some of the code of my Tokenizer:

public class MyTokenizerFactory extends BaseTokenizerFactory
{
   public WhitespaceTokenizer create(Reader input)
   {
       String text, normalizedText;

       try {
           text  = IOUtils.toString(input);
           normalizedText    = *invoke my stemmer(text)*;

       }
       catch( IOException ex ) {
           throw new SolrException( SolrException.ErrorCode.SERVER_ERROR, ex );
       }

       StringReader stringReader = new StringReader(normalizedText);
       return new WhitespaceTokenizer(stringReader);
   }
}

I see what's going on in the analysis tool now, and I think I understand the
problem. For instance, take the text: abcdxxx defgxxx. Let's assume the stemmer
gets rid of xxx.

I would then see this in the analysis tool after the tokenizer stage:
- abcd - term position 1; start: 1; end: 3
- defg - term position 2; start: 4; end: 7

These positions are not in line with the initial search text - this must be
why the highlighting goes wrong. I guess my little trick to do this was a
bit too simple because it messes up the positions, basically because
something different from the original source text is tokenized.

Yes, this is exactly the problem. I don't know enough about com4J or your stemmer, but some things come to mind:

1. Are you having to restart/initialize the stemmer every time for your "slow" approach? Does that really need to happen?
2. Can the stemmer return something other than a String? Say a String array of all the stemmed words? Or maybe even some type of object that tells you the original word and the stemmed word?
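
To illustrate point 2, here is a rough, library-free sketch of stemming word by word while keeping each word's offsets in the original text, which is what highlighting needs. Everything in it is an assumption for illustration: the class OffsetPreservingStemSketch, the Token holder, and the stem() method (which just strips a trailing "xxx") are hypothetical stand-ins, not your stemmer and not the Lucene/Solr API. In a real analyzer the same idea would probably live in a TokenFilter that replaces the term text and leaves the offsets alone.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetPreservingStemSketch {

    /** A term plus the character span it came from in the ORIGINAL text. */
    static final class Token {
        final String term;
        final int start; // inclusive offset into the original text
        final int end;   // exclusive offset into the original text

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return term + " [" + start + "," + end + ")";
        }
    }

    /** Hypothetical stand-in for the external stemmer: strips a trailing "xxx". */
    static String stem(String word) {
        return word.endsWith("xxx") ? word.substring(0, word.length() - 3) : word;
    }

    /** Splits on whitespace, stems each word, and keeps the unstemmed word's offsets. */
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Skip whitespace between words.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume one word.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                // Stemmed term text, but offsets taken from the original input.
                tokens.add(new Token(stem(text.substring(start, i)), start, i));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("abcdxxx defgxxx")) {
            System.out.println(t);
        }
    }
}
```

For the input "abcdxxx defgxxx" this yields the terms abcd and defg with spans [0,7) and [8,15), so a highlighter working from those offsets would mark the full original words rather than positions in the shortened text.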


-Grant
