StandardTokenizer will have stripped the punctuation, I think. You might
try searching for all of the entity names, though:
(agrave | egrave | omacron | etc... )
The names are pretty distinctive, although you might have problems with
Greek letters.
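Something like the following could generate that query. This is only a rough sketch: the field name "text" and the short entity list are placeholders, and you would extend the list with whatever entity names you care about.

```python
# Sketch: build one big OR query over HTML entity names that may have
# survived indexing. Field name "text" and ENTITY_NAMES are examples,
# not values from any real schema.
from urllib.parse import urlencode

ENTITY_NAMES = ["agrave", "egrave", "omacron", "alpha", "beta"]

def entity_query(field="text"):
    # StandardTokenizer will have dropped the '&' and ';', so the bare
    # names are what should remain in the index.
    clause = " OR ".join(f"{field}:{name}" for name in ENTITY_NAMES)
    return urlencode({"q": clause})

print(entity_query())
```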
-Mike
On 04/28/2011 12:10 PM, Paul wrote:
I'm trying to create a test to make sure that escaped character
sequences like "&egrave;" are successfully converted to their
equivalent UTF character (that is, in this case, "è").
So, I'd like to search my Solr index using the equivalent of the
following regular expression:
&\w{1,6};
to find any escaped sequences that might have slipped through.
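(One possible workaround, in case the query parser can't run a regex directly: pull back the stored field values and apply the regex client-side. The Solr URL and field name below are assumptions for illustration, not values from the actual setup.)

```python
# Sketch: fetch stored docs from Solr and check each field value for
# surviving HTML escapes client-side. URL and field name are examples.
import json
import re
import urllib.request

ESCAPE_RE = re.compile(r"&\w{1,6};")

def docs_with_escapes(solr_url="http://localhost:8983/solr/select",
                      field="text"):
    url = f"{solr_url}?q=*:*&fl=id,{field}&rows=1000&wt=json"
    with urllib.request.urlopen(url) as resp:
        docs = json.load(resp)["response"]["docs"]
    return [d["id"] for d in docs
            if ESCAPE_RE.search(str(d.get(field, "")))]
```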
Is this possible? I have indexed these fields with text_lu, which
looks like this:
<fieldtype name="text_lu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
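(For what it's worth, a chain like this is why the regex won't match against the index: the tokenizer splits on punctuation such as '&' and ';', so only the bare entity name gets indexed. The snippet below is a rough approximation of that behavior, not Lucene's actual tokenizer grammar.)

```python
# Rough approximation of what text_lu's chain (StandardTokenizer +
# LowerCaseFilter) does to an escaped entity: punctuation is dropped
# and tokens are lowercased. Simplified for illustration only.
import re

def approx_tokens(text):
    return [t.lower() for t in re.findall(r"\w+", text)]

print(approx_tokens("caf&egrave; is French"))
# -> ['caf', 'egrave', 'is', 'french']
```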
Thanks,
Paul