Re: XmlUpdateRequestHandler with HTMLStripCharFilterFactory

Erick Erickson Mon, 11 Jan 2010 12:35:27 -0800

Ah, I read your post too fast and ignored the title. Sorry 'bout that.

Erick


On Mon, Jan 11, 2010 at 2:55 PM, darniz <rnizamud...@edmunds.com> wrote:

>
> Well, that's the whole discussion we are having.
> I had the impression that the HTML tags are filtered and then the field is
> stored without tags. But it looks like the HTML tags are removed only from
> the indexed terms, and the actual text is stored in raw format.
>
> Let's say, for example, I enter a field like
> <field name="body"><p>honda car road review</field>
> When I run analysis on the body field, the HTML filter removes the <p> tag
> and indexes the words honda, car, road, review. But when I fetch the body
> field to display in my document, it returns <p>honda car road review
>
> I hope that makes sense.
> thanks
> darniz
>
>
>
> Erick Erickson wrote:
> >
> > This page: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
> > shows you many of the Solr analyzers and filters. Would one of
> > the various *HTMLStrip* filters work?
> >
> > HTH
> > Erick
> >
> > On Mon, Jan 11, 2010 at 2:44 PM, darniz <rnizamud...@edmunds.com> wrote:
> >
> >>
> >> Thanks, we were having the same issue.
> >> We are trying to store article content, and we are storing a field like
> >> <p>This article is for blah </p>.
> >> When I look at the analysis.jsp page, it does strip out the <p> tags
> >> before indexing, but when we fetch the document it returns the field
> >> with the <p> tags.
> >> From Solr's point of view this is correct, but our issue is that these
> >> HTML tags are breaking the display of our page. Is there an easy way to
> >> strip out the HTML tags, or do we have to take care of it manually?
> >>
> >> Thanks
> >> Rashid
> >>
> >>
> >> aseem cheema wrote:
> >> >
> >> > Alright. It turns out that escapedTags is not for what I thought it
> >> > was for.
> >> > The problem that I am having with HTMLStripCharFilterFactory is that
> >> > it strips the HTML while indexing the field, but not while storing the
> >> > field. That is why what I see in analysis.jsp, which is index
> >> > analysis, does not match what gets stored... because, well, HTML is
> >> > stripped only for indexing. Makes so much sense.
> >> >
> >> > Thanks to Ryan McKinley for clarifying this.
> >> > Aseem
> >> >
> >> > On Wed, Nov 11, 2009 at 9:50 AM, aseem cheema <aseemche...@gmail.com>
> >> > wrote:
> >> >> I am trying to post a document with the following content using
> >> >> SolrJ:
> >> >> <center>content</center>
> >> >> I need the XML/HTML tags to be ignored. Even though this works fine
> >> >> in analysis.jsp, this does not work with SolrJ, as the client escapes
> >> >> the < and > as &lt; and &gt; and HTMLStripCharFilterFactory does not
> >> >> strip those escaped tags. How can I achieve this? Any ideas will be
> >> >> highly appreciated.
> >> >>
> >> >> There is escapedTags in HTMLStripCharFilterFactory constructor. Is
> >> >> there a way to get that to work?
> >> >> Thanks
> >> >> --
> >> >> Aseem
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > Aseem
> >> >
> >> >
> >>
> >> --
> >> View this message in context:
> >> http://old.nabble.com/XmlUpdateRequestHandler-with-HTMLStripCharFilterFactory-tp26305561p27116434.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >>
> >>
> >
> >
>
> --
> View this message in context:
> http://old.nabble.com/XmlUpdateRequestHandler-with-HTMLStripCharFilterFactory-tp26305561p27116601.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>
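
For anyone landing on this thread with the same problem: as discussed above, the char filter only affects the indexed terms, so the stored value comes back with its tags. One possible workaround is to strip the markup on the client before the document is sent to Solr. The following is only a rough sketch in Python using the standard library; the helper name strip_html and the extra field name body_display are made up for illustration and are not part of Solr or SolrJ.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and discards every tag."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        # Join the text nodes and collapse runs of whitespace.
        return " ".join("".join(self._chunks).split())


def strip_html(value: str) -> str:
    """Hypothetical helper: return `value` with HTML tags removed."""
    parser = _TextExtractor()
    parser.feed(value)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


# Assumed usage: keep the raw body for indexing and add a cleaned copy
# (hypothetical field name) that is safe to display.
doc = {"id": "1", "body": "<p>honda car road review"}
doc["body_display"] = strip_html(doc["body"])
print(doc["body_display"])
```

Note that text on either side of a tag is joined without a space (so "a<br>b" becomes "ab"); whether that is acceptable depends on the content being stored.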
