Re: How does HTMLStripWhitespaceTokenizerFactory work?

Mike Klaas Mon, 11 Jun 2007 11:57:14 -0700

On 11-Jun-07, at 3:54 AM, Thierry Collogne wrote:

Ok. Is it possible to get back the content without the html tags?

Well, it isn't stored anywhere in Solr. It's best to think of lucene/solr as two systems: the indexer applies a tokenization transformation to the data and creates an inverted index; the storage system keeps track of the data you give it _before_ analysis/tokenization. If there is analysis you'd like to do that also applies to the stored status of the doc, it's probably easier to apply it before passing the data to Solr.
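
As a rough illustration of applying the analysis before the data reaches Solr, here is a minimal client-side sketch using only Python's standard library. The names TextExtractor and strip_html and the field name "content" are made up for this example and are not part of Solr. This is not the same code path as HTMLStripWhitespaceTokenizerFactory, so its output may differ from what the indexer produces.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and drops the tags around them (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(raw):
    """Return the text of an HTML fragment with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(raw)
    parser.close()  # flush any text still buffered by the parser
    # Assumption: a removed tag is treated as a word break, so pieces are joined with a space.
    return " ".join(" ".join(parser.parts).split())


# The cleaned value is what you would send to Solr, so the stored field
# comes back without markup. "content" is an assumed field name.
doc = {"id": "1", "content": strip_html("<p>Hello <b>Solr</b> world</p>")}
print(doc["content"])
```

Note that this sketch also keeps the text inside script and style elements, so real pages would likely need extra filtering before indexing.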


-Mike

On 08/06/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:

On 6/8/07, Thierry Collogne <[EMAIL PROTECTED]> wrote:
> I am trying to use the solr.HTMLStripWhitespaceTokenizerFactory analyzer
> with no luck.
[...]
> Is this normal? Shouldn't the html code and the white spaces be removed
> from the field?

For indexing purposes, yes.  The stored field you get back will be
unchanged though.
If you want to see what will be indexed, try the analysis debugger in
the admin pages.

-Yonik
