I must be missing something obvious. I have a simple regex that removes a `<space><hyphen><space>` pattern.
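For reference, applied directly to the raw string, the pattern does what I expect (a minimal stdlib sketch; note the commas inside the character class are matched as literal characters alongside the hyphen/dash code points):

```java
import java.util.regex.Pattern;

public class HyphenRegexDemo {
    public static void main(String[] args) {
        // Whitespace, one hyphen-like char (various Unicode dashes), whitespace.
        Pattern dash = Pattern.compile("\\s+[\u002d,\u2011,\u2012,\u2013,\u2014,\u2212]\\s+");
        // Collapses " - " to a single space.
        System.out.println(dash.matcher("a - b").replaceAll(" ")); // prints "a b"
    }
}
```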
The unit test below works fine, but when I plug the same thing into my schema and query, the regex does not match, because the input has already been split on whitespace (see further below). My understanding is that a charFilter operates on the raw input string and then hands it to the whitespace tokenizer, which is exactly what happens in the unit test, so I am not sure why at query time I get an already-split token stream.

```java
Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer tokenizer = new MockTokenizer(reader, MockTokenizer.WHITESPACE, false);
        return new TokenStreamComponents(tokenizer, tokenizer);
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        return new PatternReplaceCharFilter(
                Pattern.compile("\\s+[\u002d,\u2011,\u2012,\u2013,\u2014,\u2212]\\s+"), " ", reader);
    }
};

final TokenStream tokens = analyzer.tokenStream("", new StringReader("a - b"));
tokens.reset();
final CharTermAttribute termAtt = tokens.addAttribute(CharTermAttribute.class);
while (tokens.incrementToken()) {
    System.out.println("===> " + new String(Arrays.copyOf(termAtt.buffer(), termAtt.length())));
}
```

I end up with:

```
===> a
===> b
```

Now I define the same in my schema:

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
           multiValued="true" autoGeneratePhraseQueries="false">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\s+[\u002d,\u2011,\u2012,\u2013,\u2014,\u2212]\s+"
                replacement=" ; " />
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
  </analyzer>
</fieldType>

<field name="myfield" type="text" indexed="true" stored="false" multiValued="true"/>
```

When I query, the input arrives at PatternReplaceCharFilter's processPattern method already split (e.g. a, -, b), so the regex cannot match:

```java
CharSequence processPattern(CharSequence input) ...
```

even though the charFilter is defined before the tokenizer.
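For comparison, this is what I would expect the query-time chain to produce if the char filter saw the whole query string (a minimal stdlib sketch of the schema's pattern and replacement, outside Solr; the token list is my own illustration of the whitespace-tokenizer step):

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class ExpectedQueryAnalysisDemo {
    public static void main(String[] args) {
        // charFilter step: replace <space><dash><space> with " ; " as in the schema.
        String replaced = Pattern
                .compile("\\s+[\u002d,\u2011,\u2012,\u2013,\u2014,\u2212]\\s+")
                .matcher("a - b")
                .replaceAll(" ; ");
        System.out.println(replaced);                                // a ; b
        // whitespace-tokenizer step.
        System.out.println(Arrays.toString(replaced.split("\\s+"))); // [a, ;, b]
    }
}
```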
Here is the query:

```java
SolrQuery solrQuery = new SolrQuery("a - b");
solrQuery.setRequestHandler("/select");
solrQuery.set("defType", "edismax");
solrQuery.set("qf", "myfield");
solrQuery.set(CommonParams.ROWS, "0");
solrQuery.set(CommonParams.DEBUG, "true");
solrQuery.set(CommonParams.DEBUG_QUERY, "true");
QueryResponse response = solrSvr.query(solrQuery);
System.out.println("parsedQtoString " + response.getDebugMap().get("parsedquery_toString"));
System.out.println("parsedQ " + response.getDebugMap().get("parsedquery"));
```

The output is:

```
parsedQtoString +((myfield:a) (myfield:-) (myfield:b))
parsedQ (+(DisjunctionMaxQuery((myfield:a)) DisjunctionMaxQuery((myfield:-)) DisjunctionMaxQuery((myfield:b))))/no_coord
```
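In case it helps anyone reproduce the symptom: once the query string is broken on whitespace before the char filter runs, the pattern can no longer match any individual piece, because it requires whitespace on both sides of the dash. A plain-Java sketch of that effect (not Solr code, just simulating the split-then-filter order I appear to be seeing):

```java
import java.util.regex.Pattern;

public class SplitBeforeFilterDemo {
    public static void main(String[] args) {
        Pattern dash = Pattern.compile("\\s+[\u002d,\u2011,\u2012,\u2013,\u2014,\u2212]\\s+");
        // Whole string: the char filter would see " - " and collapse it.
        System.out.println(dash.matcher("a - b").replaceAll(" "));       // a b
        // Per-piece (what processPattern appears to receive): nothing matches.
        for (String piece : "a - b".split("\\s+")) {
            System.out.println(piece + " -> " + dash.matcher(piece).replaceAll(" "));
        }
        // a -> a, - -> -, b -> b : the bare "-" survives and becomes myfield:-
    }
}
```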