On Sep 26, 2008, at 9:40 AM, Jaco wrote:
Hi,
Here's some of the code of my Tokenizer:
public class MyTokenizerFactory extends BaseTokenizerFactory
{
    public WhitespaceTokenizer create(Reader input)
    {
        String text, normalizedText;
        try {
            text = IOUtils.toString(input);
            normalizedText = *invoke my stemmer(text)*;
        }
        catch( IOException ex ) {
            throw new SolrException( SolrException.ErrorCode.SERVER_ERROR, ex );
        }
        StringReader stringReader = new StringReader(normalizedText);
        return new WhitespaceTokenizer(stringReader);
    }
}
I see what's going on in the analysis tool now, and I think I
understand the problem. Take, for instance, the text: abcdxxx defgxxx.
Let's assume the stemmer gets rid of xxx.
I would then see this in the analysis tool after the tokenizer stage:
- abcd - term position 1; start: 1; end: 3
- defg - term position 2; start: 4; end: 7
These positions are not in line with the original text, which must be
why the highlighting goes wrong. I guess my little trick was a bit too
simple: it messes up the positions because something different from
the original source text is tokenized.
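To make the offset drift concrete, here is a minimal stand-alone sketch
in plain Java. It does not use Lucene or Solr classes; OffsetDemo, Tok,
and tokenize are hypothetical names made up for illustration, and the
"stemmer" is assumed to simply strip xxx. It tokenizes the original
text and the pre-stemmed text and prints the offsets each one yields.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetDemo {
    /** A token plus the character offsets it was read from (end is exclusive). */
    public static final class Tok {
        public final String text;
        public final int start;
        public final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + " [" + start + "," + end + ")";
        }
    }

    /** Splits on whitespace and records where each token sits in the given input. */
    public static List<Tok> tokenize(String input) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        int n = input.length();
        while (i < n) {
            // Skip any run of whitespace before the next token.
            while (i < n && Character.isWhitespace(input.charAt(i))) i++;
            int start = i;
            // Consume the token itself.
            while (i < n && !Character.isWhitespace(input.charAt(i))) i++;
            if (i > start) out.add(new Tok(input.substring(start, i), start, i));
        }
        return out;
    }

    public static void main(String[] args) {
        String original = "abcdxxx defgxxx";
        // Assumption: what a stemmer that strips "xxx" would hand back.
        String normalized = "abcd defg";

        // Offsets are relative to whichever string was tokenized.
        System.out.println(tokenize(original));   // abcdxxx [0,7), defgxxx [8,15)
        System.out.println(tokenize(normalized)); // abcd [0,4), defg [5,9)
    }
}
```

The second token starts at 8 in the original text but at 5 in the
normalized text, so anything that applies those offsets to the
original string will mark the wrong characters.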
Yes, this is exactly the problem. I don't know enough about com4J or
your stemmer, but some things come to mind:
1. Are you having to restart/initialize the stemmer every time for
   your "slow" approach? Does that really need to happen?
2. Can the stemmer return something other than a String? Say a String
   array of all the stemmed words? Or maybe even some type of object
   that tells you the original word and the stemmed word?
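
A rough sketch of the idea in point 2, in plain Java with no Lucene or
Solr classes: tokenize the original text first, then stem each token
while keeping the offsets measured against the original. Everything
here is hypothetical (StemWithOffsets, StemmedToken, analyze), and
stem() is only a stand-in that strips a trailing xxx; the real stemmer
call would go there.

```java
import java.util.ArrayList;
import java.util.List;

public class StemWithOffsets {
    /** Pairs the original word with its stemmed form and its offsets in the source text. */
    public static final class StemmedToken {
        public final String original;
        public final String stemmed;
        public final int start; // inclusive, in the original text
        public final int end;   // exclusive, in the original text

        StemmedToken(String original, String stemmed, int start, int end) {
            this.original = original;
            this.stemmed = stemmed;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return stemmed + " (from " + original + ") [" + start + "," + end + ")";
        }
    }

    /** Stand-in stemmer (an assumption for this sketch): strips a trailing "xxx". */
    static String stem(String word) {
        return word.endsWith("xxx") ? word.substring(0, word.length() - 3) : word;
    }

    /** Whitespace-tokenizes the ORIGINAL text, then stems each token individually. */
    public static List<StemmedToken> analyze(String text) {
        List<StemmedToken> out = new ArrayList<>();
        int i = 0;
        int n = text.length();
        while (i < n) {
            while (i < n && Character.isWhitespace(text.charAt(i))) i++;
            int start = i;
            while (i < n && !Character.isWhitespace(text.charAt(i))) i++;
            if (i > start) {
                String word = text.substring(start, i);
                // The term text changes, but the offsets still describe the original word.
                out.add(new StemmedToken(word, stem(word), start, i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (StemmedToken t : analyze("abcdxxx defgxxx")) {
            System.out.println(t);
        }
    }
}
```

With offsets kept this way, the indexed term is the stem, while the
start and end values still point at the full original word, which is
what a highlighter would need.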
-Grant