I think I am making some progress - the key suggestion was to look at the
analysis.jsp which I foolishly had forgotten =(.

I think it is actually a bug in the ShingleFilterFactory when it is used
after another Filter which removes tokens, e.g. StopFilterFactory or
WordDelimiterFilterFactory. The Analyzer clearly shows that any time a token
is dropped, the ShingleFilterFactory picks up a mysterious '_'.

For example, I enter "w'w oa". The WordDelimiterFilterFactory removes the
"w'w" token, but then the ShingleFilterFactory shows "_ oa". Drop the
apostrophe to create "ww oa" and the ShingleFilterFactory shows "oa". The
same occurs if I have the StopFilterFactory remove tokens.
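
A minimal sketch of what may be happening, written as a plain Python
simulation rather than actual Lucene code. It assumes that a filter which
drops a token leaves a position gap on the next token, and that the shingle
step fills each gap with "_" before joining tokens. The function names
drop_tokens and shingles are hypothetical and only illustrate the idea.

```python
FILLER = "_"  # assumed placeholder text for a position gap


def drop_tokens(tokens, removed):
    """Mimic a token-removing filter.

    Returns (term, position_increment) pairs. Each dropped token adds one
    to the increment of the next token that is kept.
    """
    out = []
    increment = 1
    for term in tokens:
        if term in removed:
            increment += 1
        else:
            out.append((term, increment))
            increment = 1
    return out


def shingles(stream, size=2):
    """Fill position gaps with FILLER, then join sliding windows of `size`."""
    expanded = []
    for term, increment in stream:
        expanded.extend([FILLER] * (increment - 1))
        expanded.append(term)
    return [" ".join(expanded[i:i + size])
            for i in range(len(expanded) - size + 1)]


if __name__ == "__main__":
    # "w'w" is dropped upstream, so "oa" arrives with a gap in front of it.
    print(shingles(drop_tokens(["w'w", "oa"], {"w'w"})))
    # Nothing is dropped, so no filler appears.
    print(shingles(drop_tokens(["ww", "oa"], set())))
```

If this assumption holds, the '_' is a stand-in for the removed token's
position rather than a character taken from the input, which would also
explain why a character mapping on "_" has no effect on it.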

I would be grateful if anyone else can replicate this behavior.

Christopher

-----Original Message-----
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: Thursday, February 11, 2010 12:40 PM
To: solr-user@lucene.apache.org
Subject: RE: The Riddle of the Underscore and the Dollar Sign . . .


> Unfortunately, the underscore is
> being quite resilient =(
> 
> I tried the solr.MappingCharFilterFactory and know the
> mapping is working as
> I am changing "c" => "q" just fine. But the underscore
> refuses to go!
> 
> I am baffled . . .

I just activated name="textCharNorm" in example schema.xml and added 
"_" => "xxx" to mapping-ISOLatin1Accent.txt
I verified from http://localhost:8983/solr/admin/analysis.jsp that
replacement is done without problems. Can you also test analysis.jsp?

Maybe your documents have underscores with different Unicode values. I
know three different Unicode characters that all look like "-".
If that's the case, you need to find their Unicode values and write them
into mappings.txt.
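
One way to find those values is a small script that prints the code point
and Unicode name of every character in a sample string. This is only a
sketch: describe and mapping_line are hypothetical helper names, and the
"\uXXXX" => "target" line it prints assumes the mapping file accepts \u
escapes in the same "source" => "target" form used above.

```python
import unicodedata


def describe(text):
    """Return (char, 'U+XXXX', Unicode name) for each character in text."""
    return [(ch, "U+%04X" % ord(ch), unicodedata.name(ch, "UNKNOWN"))
            for ch in text]


def mapping_line(ch, target):
    """Build a candidate mapping-file line for one character (assumed syntax)."""
    return '"\\u%04X" => "%s"' % (ord(ch), target)


if __name__ == "__main__":
    # Sample containing an ASCII underscore and a fullwidth look-alike.
    sample = "_\uff3f"
    for ch, code_point, name in describe(sample):
        print(code_point, name)
        if ch != "_":
            print(mapping_line(ch, "_"))
```

Paste a suspicious term copied from a document in place of sample; any
character whose code point is not U+005F is a look-alike that would need
its own mapping entry.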



      

