StandardTokenizer will have stripped the punctuation, I think. You might
try searching for all of the entity names, though:
(agrave | egrave | omacron | etc... )
The names are pretty distinctive, although you might have problems with
Greek letters.
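Something like the following could generate that query. This is only a rough sketch: the field name "text" and the short entity list are placeholders, and you would extend the list with whatever entity names you care about.

```python
# Sketch: build one big OR query over HTML entity names that may have
# survived indexing. Field name "text" and ENTITY_NAMES are examples,
# not values from any real schema.
from urllib.parse import urlencode

ENTITY_NAMES = ["agrave", "egrave", "omacron", "alpha", "beta"]

def entity_query(field="text"):
    # StandardTokenizer will have dropped the '&' and ';', so the bare
    # names are what should remain in the index.
    clause = " OR ".join(f"{field}:{name}" for name in ENTITY_NAMES)
    return urlencode({"q": clause})

print(entity_query())
```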
-Mike
On 04/28/2011 12:10 PM, Paul wrote:
I'm trying to create a test to make sure that escaped character
sequences like "&egrave;" are successfully converted to their
equivalent UTF character (that is, in this case, "è").
So, I'd like to search my Solr index using the equivalent of the
following regular expression:
&\w{1,6};
to find any escaped sequences that might have slipped through.
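(One possible workaround, in case the query parser can't run a regex directly: pull back the stored field values and apply the regex client-side. The Solr URL and field name below are assumptions for illustration, not values from the actual setup.)

```python
# Sketch: fetch stored docs from Solr and check each field value for
# surviving HTML escapes client-side. URL and field name are examples.
import json
import re
import urllib.request

ESCAPE_RE = re.compile(r"&\w{1,6};")

def docs_with_escapes(solr_url="http://localhost:8983/solr/select",
                      field="text"):
    url = f"{solr_url}?q=*:*&fl=id,{field}&rows=1000&wt=json"
    with urllib.request.urlopen(url) as resp:
        docs = json.load(resp)["response"]["docs"]
    return [d["id"] for d in docs
            if ESCAPE_RE.search(str(d.get(field, "")))]
```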
Is this possible? I have indexed these fields with text_lu, which
looks like this:
<fieldtype name="text_lu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
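(For what it's worth, a chain like this is why the regex won't match against the index: the tokenizer splits on punctuation such as '&' and ';', so only the bare entity name gets indexed. The snippet below is a rough approximation of that behavior, not Lucene's actual tokenizer grammar.)

```python
# Rough approximation of what text_lu's chain (StandardTokenizer +
# LowerCaseFilter) does to an escaped entity: punctuation is dropped
# and tokens are lowercased. Simplified for illustration only.
import re

def approx_tokens(text):
    return [t.lower() for t in re.findall(r"\w+", text)]

print(approx_tokens("caf&egrave; is French"))
# -> ['caf', 'egrave', 'is', 'french']
```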
Thanks,
Paul