Why do you need spaces in the replacement? Try pattern="\+" replacement="plus" - it will cause the transformed charstream to contain as many tokens as the original and avoid the highlighting crash.
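
For reference, the suggestion amounts to changing one attribute on the first charFilter of the index analyzer. This is only a sketch of that single element; the rest of the fieldType is assumed to stay as posted, and you would most likely need to reindex, since the change affects index-time analysis.

```xml
<!-- Sketch only: same charFilter as in the original config, but the
     replacement has no surrounding spaces, so "c++" is rewritten to
     "cplusplus" instead of "c plus  plus ". -->
<charFilter class="solr.PatternReplaceCharFilterFactory"
            pattern="\+" replacement="plus"/>
```

Note that without the spaces "c++" should reach the tokenizer as the single word "cplusplus", so any synonym entries written against "c plus plus" would presumably need to be adjusted to match.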
--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 22. nov. 2011, at 05:40, Tomasz Wegrzanowski wrote:

> Hi,
>
> I've been trying to match some phrases with + and & (like c++,
> google+, r&d etc.),
> but tokenized gets rid of them before I can do anything with synonym filters.
>
> So I tried using CharFilters like this:
>
> <fieldType name="text" class="solr.TextField"
>         positionIncrementGap="100" autoGeneratePhraseQueries="true">
>   <analyzer type="index">
>     <charFilter class="solr.PatternReplaceCharFilterFactory"
>         pattern="\+" replacement=" plus "/>
>     <charFilter class="solr.PatternReplaceCharFilterFactory"
>         pattern="&" replacement=" and "/>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory"
>         synonyms="synonyms_case_sensitive.txt" ignoreCase="false"
>         expand="true"/>
>     <filter class="solr.SynonymFilterFactory"
>         synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>         words="stopwords.txt" enablePositionIncrements="true" />
>     <filter class="solr.WordDelimiterFilterFactory"
>         generateWordParts="1" generateNumberParts="1" catenateWords="1"
>         catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.KeywordMarkerFilterFactory"
>         protected="protwords.txt"/>
>     <filter class="solr.SnowballPorterFilterFactory" language="English"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory"
>         synonyms="query_synonyms.txt" ignoreCase="true" expand="false" />
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>         words="stopwords.txt" enablePositionIncrements="true" />
>     <filter class="solr.WordDelimiterFilterFactory"
>         generateWordParts="1" generateNumberParts="1" catenateWords="0"
>         catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.KeywordMarkerFilterFactory"
>         protected="protwords.txt"/>
>     <filter class="solr.SnowballPorterFilterFactory" language="English"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> This mostly works, but for a very small number of documents, mostly
> those with large number of pluses in them,
> highlighter just crashes (and it's highlighter since turning it off
> and reissuing the query works just fine, if I replace
> pluses with spaces and reindex, the same query reruns just fine) with
> exception like this:
>
> Nov 21, 2011 11:35:11 PM org.apache.solr.common.SolrException log
> SEVERE: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
>   at java.lang.String.substring(String.java:1938)
>   at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:237)
>   at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:462)
>   at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:378)
>   at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:116)
>   at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
>   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
>   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:343)
>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
>   at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>   at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>   at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>   at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
>   at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
>   at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>   at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>   at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
>   at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:845)
>   at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
>   at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
>   at java.lang.Thread.run(Thread.java:619)
>
> Is this a known issue?
>
> Are CharFilters even the right way to approach it?
>
> Or should I perhaps change or subclass StandardTokenizerFactory to
> treat + and & as words?
> I haven't looked at StandardTokenizerFactory code yet, so I don't know
> how feasible would that be.
>
> Thanks,
> Tomasz