Hello,

Our schema in Solr 1.3 looked like:

<tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>

It takes 30s to index 1500 docs. When we run the same in Solr 1.4, it takes 70s.

I noticed that HTMLStripStandardTokenizerFactory was deprecated, so I
changed the schema to:
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
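
For context, these elements sit inside the <analyzer> block of a field
type in schema.xml. A minimal sketch of the surrounding structure (the
field type name "text_html" is only illustrative, not our actual name,
and other attributes are left out):

<fieldType name="text_html" class="solr.TextField">
  <analyzer>
    <!-- char filters run on the raw character stream, before the tokenizer -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
  </analyzer>
</fieldType>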

It still takes 70s.

Instead, if I use the schema:
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>

It takes 30s in both 1.3 and 1.4.

I am not sure whether HTMLStrip has become slower in 1.4 or whether HTML
stripping impacts performance downstream in 1.4. Before I start writing
a unit test with a TokenizerChain, I wanted to check whether I am doing
something fundamentally wrong.

Robin
