On Sep 26, 2008, at 9:40 AM, Jaco wrote:
Hi,

Here's some of the code of my Tokenizer:

public class MyTokenizerFactory extends BaseTokenizerFactory
{
   public WhitespaceTokenizer create(Reader input)
   {
       String text, normalizedText;

       try {
           text  = IOUtils.toString(input);
           normalizedText    = *invoke my stemmer(text)*;

       }
       catch( IOException ex ) {
           throw new SolrException( SolrException.ErrorCode.SERVER_ERROR, ex );
       }

       StringReader stringReader = new StringReader(normalizedText);
       return new WhitespaceTokenizer(stringReader);
   }
}

I see what's going on in the analysis tool now, and I think I understand the problem. Take, for instance, the text: abcdxxx defgxxx. Let's assume the stemmer
gets rid of xxx.

I would then see this in the analysis tool after the tokenizer stage:
- abcd - term position 1; start: 1; end:  3
- defg - term position 2; start: 4; end: 7

These positions are not in line with the original text, which must be why the highlighting goes wrong. I guess my little trick was a bit too
simple: it messes up the positions because something different from the
original source text is tokenized.

Yes, this is exactly the problem. I don't know enough about com4J or
your stemmer, but some things come to mind:
1. Are you having to restart/initialize the stemmer every time for  
your "slow" approach?  Does that really need to happen?
2. Can the stemmer return something other than a String?  Say a String  
array of all the stemmed words?  Or maybe even some type of object  
that tells you the original word and the stemmed word?
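
To illustrate point 2, here is a minimal sketch in plain Java (8 or later) of a result object that carries the original word, the stemmed word, and the character offsets of the original word in the source text. It is not Solr or Lucene code, and every name in it (OffsetPreservingStemDemo, StemmedToken, tokenize, stripXxx) is hypothetical. stripXxx only stands in for the real stemmer, which I am assuming can be called one word at a time. Offsets here are 0-based with an exclusive end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

public class OffsetPreservingStemDemo {

    /** One token: the stemmed term plus its offsets in the ORIGINAL text. */
    public static final class StemmedToken {
        public final String original;
        public final String stemmed;
        public final int startOffset; // inclusive, 0-based
        public final int endOffset;   // exclusive

        public StemmedToken(String original, String stemmed, int startOffset, int endOffset) {
            this.original = original;
            this.stemmed = stemmed;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /**
     * Splits the original text on whitespace and stems each word separately,
     * so the offsets always refer to the original text, not the stemmed text.
     */
    public static List<StemmedToken> tokenize(String text, UnaryOperator<String> stemmer) {
        List<StemmedToken> tokens = new ArrayList<>();
        int i = 0;
        int n = text.length();
        while (i < n) {
            // Skip whitespace between words.
            while (i < n && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i >= n) {
                break;
            }
            int start = i;
            // Consume one word.
            while (i < n && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            String word = text.substring(start, i);
            tokens.add(new StemmedToken(word, stemmer.apply(word), start, i));
        }
        return tokens;
    }

    /** Hypothetical stand-in for the real stemmer: drops a trailing "xxx". */
    public static String stripXxx(String word) {
        return word.endsWith("xxx") ? word.substring(0, word.length() - 3) : word;
    }

    public static void main(String[] args) {
        for (StemmedToken t : tokenize("abcdxxx defgxxx", OffsetPreservingStemDemo::stripXxx)) {
            System.out.println(t.stemmed + " <- " + t.original
                    + " [" + t.startOffset + "," + t.endOffset + ")");
        }
    }
}
```

For the example text this prints "abcd <- abcdxxx [0,7)" and "defg <- defgxxx [8,15)". The indexed term is the stem, but the offsets still cover the original words, which is what the highlighter needs.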
-Grant
