Re: Solr Newbie question: doubts about how to handle html content

Marcio Pinto Motta Thu, 05 Oct 2006 08:27:28 -0700

On 10/5/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:

On 10/5/06, Erik Hatcher <[EMAIL PROTECTED]> wrote:
> On Oct 5, 2006, at 7:17 AM, Marcio Pinto Motta wrote:
> > &lt;br&gt;&lt;p&gt;  A Brasil Telecom ... </str>
> >
> > the html code was "changed".
>
> It wasn't "changed" per se... but rather it was encoded.  If you use
> an XML API to read the response you would not see these encoded
> characters.
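
To illustrate Erik's point, here is a minimal sketch using Python's standard
XML parser. The response fragment and the field name "content" are
assumptions made up for illustration, not taken from a real Solr response.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of an XML response; the field name "content"
# is an assumption, the text is the sample from this thread.
raw = '<doc><str name="content">&lt;br&gt;&lt;p&gt;  A Brasil Telecom ...</str></doc>'


def read_field(xml_text, name):
    """Return the text of the first <str> element with the given name."""
    root = ET.fromstring(xml_text)
    for node in root.iter("str"):
        if node.get("name") == name:
            # The parser has already turned &lt; and &gt; back into < and >.
            return node.text
    return None


decoded = read_field(raw, "content")
print(decoded)
```

The entities only exist on the wire; any XML-aware client hands back the
original markup.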

You can also use a different output syntax to verify that the internal
form is unchanged...
for example, add a wt=json to the HTTP parameters to see the results
in JSON format.
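
A rough sketch of that check in Python, without any network call. The host,
port, path, query, and field name "content" are assumptions, and the sample
body is hand-written to resemble a JSON response rather than captured from
a server.

```python
import json
from urllib.parse import urlencode

# Assumed base URL of a local Solr instance; adjust to your setup.
base_url = "http://localhost:8983/solr/select"

# wt=json asks for JSON output instead of the default XML.
params = {"q": "content:telecom", "wt": "json"}
url = base_url + "?" + urlencode(params)
print(url)

# Hand-written sample body (an assumption, not real server output).
sample_body = (
    '{"response": {"numFound": 1, "start": 0, '
    '"docs": [{"content": "<br><p>  A Brasil Telecom ..."}]}}'
)

docs = json.loads(sample_body)["response"]["docs"]
# JSON has no need to escape < and >, so the markup appears as stored.
print(docs[0]["content"])
```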

See HTMLStripWhitespaceTokenizerFactory if you don't want XML/HTML
tags indexed.  As Erik said, regardless of how you analyze a field,
you can always get an un-analyzed version back when you mark the field
as "stored".
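
The split between what gets indexed and what gets stored can be sketched in
Python. This only mimics the idea (strip tags, then split on whitespace); it
is not how Solr implements HTMLStripWhitespaceTokenizerFactory, and the
helper names are hypothetical.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects character data and ignores tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def index_tokens(html_text):
    """Strip markup, then split on whitespace: the searchable terms."""
    parser = TextOnly()
    parser.feed(html_text)
    parser.close()
    return " ".join(parser.parts).split()


# The stored value is kept verbatim and returned as-is in results.
stored = "<br><p>  A Brasil Telecom ..."
# The indexed terms contain no tags, so "<>" searches cannot match.
tokens = index_tokens(stored)
print(tokens)
```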

-Yonik



Hi folks,



What I want is to avoid the database server as much as possible. I don't
want to allow "<>" searches, but it is vital to be able to retrieve the
text in the HTML content. I also need the content ready to be shown as soon
as possible. Approaches like solr.HTMLStripWhitespaceTokenizerFactory and
JSON output in Solr are amazing and very productive (they save a lot of
code that would otherwise have to be written). The more I test, the more
amazed I become, and I haven't tested replication yet (which is my main
goal) :)



Thanks a lot for all the responses (very quick, too) :)



BR,



Marcio
