I think I am making some progress - the key suggestion was to look at analysis.jsp, which I had foolishly forgotten =(.
I think it is actually a bug in the ShingleFilterFactory when it is used after another filter that removes tokens, e.g. StopFilterFactory or WordDelimiterFilterFactory. The analysis page clearly shows that any time a token is dropped, the ShingleFilterFactory picks up a mysterious '_'.

For example, I enter "w'w oa". The WordDelimiterFilterFactory removes the "w'w" token, but then the ShingleFilterFactory shows "_ oa". Drop the apostrophe to create "ww oa" and the ShingleFilterFactory shows "oa". The same occurs if I have the StopFilterFactory remove tokens.

I would be grateful if anyone else can replicate this behavior.

Christopher

-----Original Message-----
From: Ahmet Arslan [mailto:iori...@yahoo.com]
Sent: Thursday, February 11, 2010 12:40 PM
To: solr-user@lucene.apache.org
Subject: RE: The Riddle of the Underscore and the Dollar Sign . . .

> Unfortunately, the underscore is being quite resilient =(
>
> I tried the solr.MappingCharFilterFactory and know the mapping is
> working, as I am changing "c" => "q" just fine. But the underscore
> refuses to go!
>
> I am baffled . . .

I just activated name="textCharNorm" in the example schema.xml and added "_" => "xxx" to mapping-ISOLatin1Accent.txt. I verified from http://localhost:8983/solr/admin/analysis.jsp that the replacement is done without problems. Can you also test analysis.jsp?

Maybe your documents have underscores with different Unicode values. I know three different Unicode characters that all look like "-". If that's the case, you need to find their Unicode values and write them into mappings.txt.
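
For the last point, finding the Unicode values of look-alike characters, a minimal Python sketch may help. It is only an illustration: the function name describe_chars and the sample string are made up here, and the idea is to paste in text copied from one of your own documents instead.

```python
import unicodedata

def describe_chars(text):
    """Return (character, code point, Unicode name) for each character in text."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

# Hypothetical sample: ASCII underscore, fullwidth low line,
# ASCII hyphen-minus, and en dash. Replace with text from a real document.
sample = "_\uff3f-\u2013"

for ch, code_point, name in describe_chars(sample):
    print(code_point, name)
```

Any code point that is not the plain ASCII one (U+005F for the underscore, U+002D for the hyphen) is a candidate for its own mapping rule. The stock mapping-ISOLatin1Accent.txt already uses escaped entries in the style "\u00C0" => "A", so the values printed here can presumably be written the same way.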