Hmm that looks like it's working fine. I stand corrected.
On 07/25/2011 12:24 PM, Markus Jelsma wrote:
I've seen that issue too and read the comments on the list, yet I've never had
trouble with the order; I don't know what's going on. Check this analyzer, where
I've moved the charFilter to the bottom:
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"
splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="false" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="false"
words="stopwords.txt"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" protected="protwords.txt"
language="Dutch"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
</analyzer>
The analysis chain still does its job as I expect for the input:
<span>bla bla</span>
Index Analyzer
org.apache.solr.analysis.HTMLStripCharFilterFactory
{luceneMatchVersion=LUCENE_34}
text bla bla
org.apache.solr.analysis.WhitespaceTokenizerFactory
{luceneMatchVersion=LUCENE_34}
position 1 2
term text bla bla
startOffset 6 10
endOffset 9 13
org.apache.solr.analysis.WordDelimiterFilterFactory {splitOnCaseChange=1,
generateNumberParts=1, catenateWords=1, luceneMatchVersion=LUCENE_34,
generateWordParts=1, catenateAll=0, catenateNumbers=1}
position 1 2
term text bla bla
startOffset 6 10
endOffset 9 13
type word word
org.apache.solr.analysis.LowerCaseFilterFactory {luceneMatchVersion=LUCENE_34}
position 1 2
term text bla bla
startOffset 6 10
endOffset 9 13
type word word
org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt,
expand=true, ignoreCase=false, luceneMatchVersion=LUCENE_34}
position 1 2
term text bla bla
type word word
startOffset 6 10
endOffset 9 13
org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt,
ignoreCase=false, luceneMatchVersion=LUCENE_34}
position 1 2
term text bla bla
type word word
startOffset 6 10
endOffset 9 13
org.apache.solr.analysis.ASCIIFoldingFilterFactory
{luceneMatchVersion=LUCENE_34}
position 1 2
term text bla bla
type word word
startOffset 6 10
endOffset 9 13
org.apache.solr.analysis.SnowballPorterFilterFactory {protected=protwords.txt,
language=Dutch, luceneMatchVersion=LUCENE_34}
position 1 2
term text bla bla
keyword false false
type word word
startOffset 6 10
endOffset 9 13
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
{luceneMatchVersion=LUCENE_34}
position 1 2
term text bla bla
keyword false false
type word word
startOffset 6 10
endOffset 9 13
On Monday 25 July 2011 18:07:29 Mike Sokolov wrote:
Hmm - I'm not sure about that; see
https://issues.apache.org/jira/browse/SOLR-2119
On 07/25/2011 12:01 PM, Markus Jelsma wrote:
charFilters are executed first regardless of their position in the
analyzer.
On Monday 25 July 2011 17:53:59 Mike Sokolov wrote:
I think you need to list the charFilter earlier in the analysis chain,
before the tokenizer. Probably Solr should tell you this...
-Mike
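For reference, the ordering Mike describes can be sketched as below. This is only an illustration assembled from the factories already shown in this thread: the charFilter is declared first, then the tokenizer, then the token filters, which is also the order in which the analysis output elsewhere in this thread reports them running. Whether a given Solr version tolerates other declaration orders is not settled here (see SOLR-2119).

```xml
<!-- Sketch only: same index-time chain as in this thread, with the
     charFilter declared before the tokenizer. -->
<analyzer type="index">
  <!-- Operates on the raw character stream, before any tokens exist -->
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <!-- Splits the stripped text into tokens -->
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- Token filters run after the tokenizer, in the order listed -->
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1" catenateWords="1"
          catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.KeywordMarkerFilterFactory"/>
  <filter class="solr.PorterStemFilterFactory"/>
</analyzer>
```

Declaring the elements in execution order also makes the schema easier to read, since the file then mirrors what the analysis page displays.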
On 07/25/2011 09:03 AM, Merlin Morgenstern wrote:
Sounds logical. I just changed it to the following, restarted, and
reindexed with a commit:
<fieldType name="text" class="solr.TextField"
positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer
class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<charFilter
class="solr.HTMLStripCharFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer
class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<charFilter
class="solr.HTMLStripCharFilterFactory"/>
</analyzer>
</fieldType>
Unfortunately that did not fix the error. There are still <h3> tags
inside the data. I believe there are fewer than before, but I
cannot prove that. The fact is, there are still HTML tags inside the data.
Any other ideas what the problem could be?
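One possible explanation, offered as an assumption rather than something confirmed in this thread: analysis (including HTMLStripCharFilterFactory) only shapes the terms written to the index, while a field declared with stored="true" returns the original input verbatim in search results. If the <h3> tags are being seen in returned documents rather than in the indexed terms, they would survive wherever the charFilter sits. Below is a minimal sketch of stripping markup before the value is stored. It assumes Solr 4.0 or later (not the 3.x release used in this thread), where solr.HTMLStripFieldUpdateProcessorFactory is available; the chain name "strip-html" is hypothetical.

```xml
<!-- solrconfig.xml sketch; assumes Solr 4.0+ and a field named "text".
     The chain name "strip-html" is made up for illustration. -->
<updateRequestProcessorChain name="strip-html">
  <!-- Removes HTML markup from the field value before it is indexed or stored -->
  <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
    <str name="fieldName">text</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <!-- Must come last: this processor actually executes the update -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain would then need to be selected on update requests, for example with the update.chain request parameter. On older releases, stripping the markup in the client before sending documents to Solr should have the same effect.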
2011/7/25 Markus Jelsma <markus.jel...@openindex.io>
You have three analyzer elements; I wonder what that would do. You need
to add the charFilter to the index-time analyzer.
On Monday 25 July 2011 13:09:14 Merlin Morgenstern wrote:
Hi there,
I am trying to strip HTML tags from the data before adding the
documents to the index. To do that I altered schema.xml like this:
<fieldType name="text" class="solr.TextField"
positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer
class="solr.WhitespaceTokenizerFactory"/>
<filter
class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter
class="solr.KeywordMarkerFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer
class="solr.WhitespaceTokenizerFactory"/>
<filter
class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter
class="solr.KeywordMarkerFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer>
<charFilter
class="solr.HTMLStripCharFilterFactory"/>
<tokenizer
class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>
<fields>
<field name="text" type="text" indexed="true" stored="true"
required="false"/>
</fields>
Unfortunately this does not work; HTML tags like <h3> are still
present after restarting and reindexing. I also tried
HTMLStripTransformer, but this did not work either.
Does anybody have an idea how to get this done? Thank you in advance for
any hint.
Merlin
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350