Re: strip html from data

2011-08-15 Thread Merlin Morgenstern
2011/8/11 Ahmet Arslan
> > Is there a way to strip the HTML tags completely and not index them? If not,
> > how do I retrieve the results without HTML tags?
>
> How do you push documents to Solr? You need to strip HTML tags before the
> analysis chain. For example, if you are using Data Import

Re: strip html from data

2011-08-13 Thread Erick Erickson
Right, this is expected behavior; it trips a lot of people up. When you specify indexed="true" in your field definitions, the contents of the input stream are put into the inverted index *after* all the transformations you specify via tokenizers, filters, charFilters, etc. are applied. In
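
To make the distinction concrete, here is a minimal schema.xml sketch. The field name `body` and the type name `text_html` are hypothetical and do not come from the thread. Only the indexed terms pass through the analysis chain of the field type; the stored copy is what search results return, and it is kept verbatim.

```xml
<!-- schema.xml fragment; "body" and "text_html" are illustrative names only -->
<!-- indexed="true": terms come out of the analyzer of text_html and go into the inverted index -->
<!-- stored="true": the original input is saved unmodified and returned in results -->
<field name="body" type="text_html" indexed="true" stored="true"/>
```

With a definition like this, an HTML-stripping charFilter on the type would keep tags out of the index, but the returned `body` value would still contain them.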

Re: strip html from data

2011-08-11 Thread Alexei Martchenko
You can use like here in this example. Check the docs for your specific Solr version, because something changed in the htmlstrip syntax in 1.4 and 3.x.

2011/8/11 Merlin Morgenstern
> I am sorry, but I do not really understand the difference between the indexed and
> the returned result set.
>
> I l

Re: strip html from data

2011-08-11 Thread Ahmet Arslan
> Is there a way to strip the HTML tags completely and not index them? If not,
> how do I retrieve the results without HTML tags?

How do you push documents to Solr? You need to strip HTML tags before the analysis chain. For example, if you are using Data Import Handler, you can use HTMLStripTra
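
A rough data-config.xml sketch of that suggestion, assuming the truncated name refers to the Data Import Handler's HTMLStripTransformer. The entity name, query, and column names are made up for illustration, and the data source attributes are placeholders.

```xml
<dataConfig>
  <!-- dataSource attributes depend on your database; these are placeholders -->
  <dataSource driver="your.jdbc.Driver" url="jdbc:yourdb://host/db"/>
  <document>
    <!-- transformer is applied to every row this entity produces -->
    <entity name="article" transformer="HTMLStripTransformer"
            query="SELECT id, body FROM article">
      <field column="id" name="id"/>
      <!-- stripHTML="true" asks the transformer to remove markup from this column -->
      <field column="body" name="body" stripHTML="true"/>
    </entity>
  </document>
</dataConfig>
```

Because the markup is removed before the document reaches the analysis chain, both the indexed terms and the stored value should be free of tags.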

Re: strip html from data

2011-08-11 Thread Merlin Morgenstern
I am sorry, but I do not really understand the difference between the indexed and the returned result set. I look at the "returned" dataset via this command: solr/select/?q=id:533563&terms=true which gives me HTML tags like these: I also tried to turn on TermsComponent, but it did not change anything: s

Re: strip html from data

2011-08-09 Thread Erick Erickson
OK, what does "not working" mean? You never answered Markus' question: "Are you looking at the returned result set or what you've actually indexed? Analyzers are not run on the stored data, only on indexed data." If "not working" means that your returned results contain the markup, then you're co

Re: strip html from data

2011-08-08 Thread Merlin Morgenstern
Unfortunately I still can't get it running. The code I am using is the following:

Re: strip html from data

2011-07-25 Thread Mike Sokolov
Hmm, that looks like it's working fine. I stand corrected.

On 07/25/2011 12:24 PM, Markus Jelsma wrote:
> I've seen that issue too and read comments on the list, yet I've never had trouble with the order; I don't know what's going on. Check this analyzer, I've moved the charFilter to the bottom:

Re: strip html from data

2011-07-25 Thread Markus Jelsma
I've seen that issue too and read comments on the list, yet I've never had trouble with the order; I don't know what's going on. Check this analyzer, I've moved the charFilter to the bottom: The analysis chain still does its job as I expect for the input: bla bla Index Analyzer org.apa

Re: strip html from data

2011-07-25 Thread Mike Sokolov
Hmm - I'm not sure about that; see https://issues.apache.org/jira/browse/SOLR-2119

On 07/25/2011 12:01 PM, Markus Jelsma wrote:
> charFilters are executed first regardless of their position in the analyzer.
>
> On Monday 25 July 2011 17:53:59 Mike Sokolov wrote:
> > I think you need to list the cha

Re: strip html from data

2011-07-25 Thread Markus Jelsma
charFilters are executed first regardless of their position in the analyzer.

On Monday 25 July 2011 17:53:59 Mike Sokolov wrote:
> I think you need to list the charFilter earlier in the analysis chain,
> before the tokenizer. Probably Solr should tell you this...
>
> -Mike
>
> On 07/25/2011 09:

Re: strip html from data

2011-07-25 Thread Mike Sokolov
I think you need to list the charFilter earlier in the analysis chain, before the tokenizer. Probably Solr should tell you this...

-Mike

On 07/25/2011 09:03 AM, Merlin Morgenstern wrote:
> Sounds logical. I just changed it to the following, restarted and reindexed with commit:

Re: strip html from data

2011-07-25 Thread Markus Jelsma
Are you looking at the returned result set or what you've actually indexed? Analyzers are not run on the stored data, only on indexed data.

On Monday 25 July 2011 15:03:18 Merlin Morgenstern wrote:
> Sounds logical. I just changed it to the following, restarted and reindexed
> with commit:

Re: strip html from data

2011-07-25 Thread Merlin Morgenstern
Sounds logical. I just changed it to the following, restarted and reindexed with commit:

Re: strip html from data

2011-07-25 Thread Markus Jelsma
You have three analyzer elements; I wonder what that would do. You need to add the char filter to the index-time analyzer.

On Monday 25 July 2011 13:09:14 Merlin Morgenstern wrote:
> Hi there,
>
> I am trying to strip html tags from the data before adding the documents to
> the index. To do that I
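
For reference, a field type with exactly two analyzer elements, one for index time and one for query time, might look like the sketch below. The original configuration is not preserved in this archive, so the type name `text_html` and the tokenizer and filter choices are assumptions. `solr.HTMLStripCharFilterFactory` is the stock Solr char filter for removing markup.

```xml
<!-- schema.xml fragment; "text_html" is an illustrative name, not the poster's -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <!-- index-time analyzer: strip markup before tokenizing -->
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- query-time analyzer: user queries normally carry no markup -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Listing the charFilter before the tokenizer matches the order in which the chain runs and sidesteps any question about whether position matters. This only affects the indexed terms; a stored copy of the field still holds the original HTML.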