Re: XmlUpdateRequestHandler with HTMLStripCharFilterFactory

2010-01-13 Thread Shalin Shekhar Mangar
On Wed, Jan 13, 2010 at 7:48 AM, Lance Norskog wrote: > You can do this stripping in the DataImportHandler. You would have to > write your own stripping code using regular expressions. Note that DIH has a HTMLStripTransformer which wraps Solr's HTMLStripReader. -- Regards, Shalin Shekhar Man

Re: XmlUpdateRequestHandler with HTMLStripCharFilterFactory

2010-01-12 Thread Lance Norskog
You can do this stripping in the DataImportHandler. You would have to write your own stripping code using regular expressions. Also, the ExtractingRequestHandler strips out the HTML markup when you use it to index an HTML file: http://wiki.apache.org/solr/ExtractingRequestHandler On Mon, Jan 11,
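
As a rough illustration of the "write your own stripping code using regular expressions" suggestion, here is a minimal sketch in Python. The helper name `strip_html` is hypothetical and is not part of Solr or the DataImportHandler. A regex like this is only adequate for simple, well-formed markup; it will mishandle comments, script blocks, and attribute values that contain `>`.

```python
import html
import re

# Hypothetical helper for illustration only; not a Solr or DIH API.
_TAG_RE = re.compile(r"<[^>]+>")   # anything that looks like a tag
_WS_RE = re.compile(r"\s+")        # runs of whitespace


def strip_html(text: str) -> str:
    """Remove tag-like spans, decode entities, and collapse whitespace."""
    no_tags = _TAG_RE.sub(" ", text)      # replace each tag with a space
    decoded = html.unescape(no_tags)      # turn &amp; into &, and so on
    return _WS_RE.sub(" ", decoded).strip()


if __name__ == "__main__":
    print(strip_html("<p>Hello <b>world</b></p>"))
```

Replacing each tag with a space, rather than with nothing, keeps words on either side of a tag from being glued together; the whitespace is collapsed afterwards.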

Re: XmlUpdateRequestHandler with HTMLStripCharFilterFactory

2010-01-11 Thread darniz
No problem. Erick Erickson wrote: > > Ah, I read your post too fast and ignored the title. Sorry 'bout that. > > Erick > > On Mon, Jan 11, 2010 at 2:55 PM, darniz wrote: > >> >> Well, that's the whole discussion we are talking about. >> I had the impression that the html tags are filtered and t

Re: XmlUpdateRequestHandler with HTMLStripCharFilterFactory

2010-01-11 Thread Erick Erickson
Ah, I read your post too fast and ignored the title. Sorry 'bout that. Erick On Mon, Jan 11, 2010 at 2:55 PM, darniz wrote: > > Well, that's the whole discussion we are talking about. > I had the impression that the html tags are filtered and then the field is > stored without tags. But looks lik

Re: XmlUpdateRequestHandler with HTMLStripCharFilterFactory

2010-01-11 Thread Chris Hostetter
: stored without tags. But looks like the html tags are removed and terms are : indexed purely for indexing, and the actual text is stored in raw format. Correct. Analysis is all about "indexing"; it has nothing to do with "stored" content. You can write UpdateProcessors that modify the content
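
Because analysis only affects the indexed terms, one way to get tag-free stored content is to strip the markup before the document reaches Solr. The sketch below is a client-side assumption rather than anything proposed in the thread: it uses Python's standard `html.parser` to extract text and then builds an `<add><doc>` message for the XML update handler. The field names `id` and `content` and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_markup(raw: str) -> str:
    """Return the text content of raw HTML with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()  # flush any buffered text
    return " ".join("".join(parser.parts).split())


def build_add_message(doc_id: str, raw_html: str) -> str:
    """Build an <add><doc> message whose content field is already stripped.

    The field names "id" and "content" are assumptions; use whatever
    your schema defines.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in (("id", doc_id), ("content", strip_markup(raw_html))):
        field = ET.SubElement(doc, "field", name=name)
        field.text = value  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    print(build_add_message("1", "<p>This article is for <b>blah</b>.</p>"))
```

With this approach the stored value and the indexed value both start from the same stripped text, so what a query returns matches what analysis.jsp shows for the index side.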

Re: XmlUpdateRequestHandler with HTMLStripCharFilterFactory

2010-01-11 Thread darniz
Well, that's the whole discussion we are talking about. I had the impression that the html tags are filtered and then the field is stored without tags. But it looks like the html tags are removed and terms are indexed purely for indexing, and the actual text is stored in raw format. Let's say for examp

Re: XmlUpdateRequestHandler with HTMLStripCharFilterFactory

2010-01-11 Thread Erick Erickson
This page: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters shows you many of the Solr analyzers and filters. Would one of the various *HTMLStrip* classes work? HTH Erick On Mon, Jan 11, 2010 at 2:44 PM, darniz wrote: > > Tha

Re: XmlUpdateRequestHandler with HTMLStripCharFilterFactory

2010-01-11 Thread darniz
Thanks, we were having the same issue. We are trying to store article content and we are storing a field like This article is for blah . When I see the analysis.jsp page it does strip out the tags and is indexed, but when we fetch the document it returns the field with the tags. From solr point of

Re: XmlUpdateRequestHandler with HTMLStripCharFilterFactory

2009-11-11 Thread aseem cheema
Alright. It turns out that escapedTags is not for what I thought it was for. The problem that I am having with HTMLStripCharFilterFactory is that it strips the html while indexing the field, but not while storing the field. That is why what I see in analysis.jsp, which is index analysis, does not m