Re: HTML decoder is splitting tokens

2009-08-27 Thread Anders Melchiorsen
Hello. Thanks for the hints. Still some trouble, though. I added just the HTMLStripCharFilterFactory because, according to the documentation, it should also replace HTML entities. It did, but it still left a space after the entity, so I got two tokens from "Günther". That seems like a bug? Adding Mappi

Re: HTML decoder is splitting tokens

2009-08-26 Thread Koji Sekiguchi
Hi Anders, Sorry, I don't know whether this is a bug or a feature, but I'd like to show an alternative way if you'd like. In Solr trunk, HTMLStripWhitespaceTokenizerFactory is marked as deprecated. Instead, using HTMLStripCharFilterFactory together with an arbitrary TokenizerFactory is encouraged. And I'd recomme

HTML decoder is splitting tokens

2009-08-26 Thread Anders Melchiorsen
Hi. When indexing the string "Günther" with HTMLStripWhitespaceTokenizerFactory (in analysis.jsp), I get two tokens, "Gü" and "nther". Is this a bug, or am I doing something wrong? (Using a Solr nightly from 2009-05-29) Anders.
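
For readers trying to picture the symptom, here is a minimal Python sketch of one way such a split can arise. It is not Solr's code. It assumes the indexed input contained the entity form "G&uuml;nther" (the archive shows only the decoded string) and that the decoder pads the replaced entity with spaces to keep character offsets stable. The names decode_padded, decode_plain, and whitespace_tokens are hypothetical helpers made up for this illustration.

```python
import html
import re

# Matches named (&uuml;), decimal (&#252;) and hex (&#xFC;) entities.
ENTITY = re.compile(r"&(#\d+|#[xX][0-9a-fA-F]+|\w+);")


def decode_padded(text):
    """Decode entities, padding with spaces so the text length is unchanged.

    Assumption: this imitates an offset-preserving decoder. It is only an
    illustration, not the actual Solr implementation.
    """
    def repl(match):
        decoded = html.unescape(match.group(0))
        padding = len(match.group(0)) - len(decoded)
        return decoded + " " * padding

    return ENTITY.sub(repl, text)


def decode_plain(text):
    """Decode entities without any padding."""
    return html.unescape(text)


def whitespace_tokens(text):
    """Split on runs of whitespace, like a simple whitespace tokenizer."""
    return text.split()


raw = "G&uuml;nther"  # assumed original input
print(whitespace_tokens(decode_padded(raw)))  # ['Gü', 'nther']
print(whitespace_tokens(decode_plain(raw)))   # ['Günther']
```

If the padding assumption holds, any tokenizer that breaks on whitespace will see the padded gap as a token boundary, which would match the two tokens reported above.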