Re: Solr - Remove specific punctuation marks

Jonathan Rochkind Mon, 24 Sep 2012 10:27:31 -0700

When I do things like this and want to avoid empty tokens even though previous analysis might result in some, I just throw one of these at the end of my analysis chain:


        <!-- get rid of empty string tokens. max is required, although
             we don't really care. -->
        <filter class="solr.LengthFilterFactory" min="1" max="9999"/>

A charfilter to filter raw characters can certainly still result in an empty token, if an initial token was composed solely of chars you wanted to filter out! In which case you probably want the token to be deleted entirely, not still there as an empty token. The above length filter is one way to do that, although it unfortunately requires specifying a 'max' even though I didn't actually want to filter out on the high end, oh well.
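
As a rough sketch of where that length filter sits in a complete chain: the field type name, the tokenizer, and the pattern below are placeholders, not taken from the original poster's schema. The token-level pattern replace is the step that can leave an empty string behind, and the length filter at the end drops it.

```xml
<!-- Sketch only: "text_nopunct", the tokenizer, and the pattern are assumptions. -->
<fieldType name="text_nopunct" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- strips ASCII punctuation from each token; a token such as "{" becomes "" -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="\p{Punct}" replacement="" replace="all"/>
    <!-- must come last: removes tokens that are now empty strings -->
    <filter class="solr.LengthFilterFactory" min="1" max="9999"/>
  </analyzer>
</fieldType>
```

Because no index/query type is given on the analyzer, the same chain applies at both index and query time, so a query for "{" should analyze to zero tokens as well.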



On 9/24/2012 1:07 PM, Jack Krupansky wrote:

I tried it and PRFF (PatternReplaceFilterFactory) is indeed generating an empty token. I don't know
how Lucene will index or query an empty term. I mean, what it "should"
do. In any case, it is best to avoid them.

You should be using a "charFilter" to simply filter raw characters
before tokenizing. So, try:

<charFilter class="solr.PatternReplaceCharFilterFactory"/>

It has the same pattern and replacement attributes.
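
For illustration, with the attributes filled in it might look like the following. The pattern shown is only an assumed example; use whatever pattern was already on the token filter. Note that charFilter elements go before the tokenizer inside the analyzer.

```xml
<!-- Sketch only: the pattern and the tokenizer choice are assumptions. -->
<analyzer>
  <!-- runs on the raw character stream, before any tokens exist -->
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="\p{Punct}" replacement=""/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
```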

-- Jack Krupansky

-----Original Message----- From: Jack Krupansky
Sent: Monday, September 24, 2012 12:43 PM
To: [email protected]
Subject: Re: Solr - Remove specific punctuation marks

1. Which query parser are you using?
2. I see the following comment in the Java 6 doc for regex "\p{Punct}":
"POSIX character classes (US-ASCII only)", so if any of the punctuation is
some higher Unicode character code, it won't be matched/removed.
3. It seems very odd that the parsed query has empty terms - normally the
query parsers will ignore terms that analyze to zero tokens. Maybe your "{"
is not an ASCII left brace code and is (apparently) unprintable in the
parsed query. Or, maybe there is some encoding problem in the analyzer.
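
If point 2 turns out to be the cause, one option is a pattern built on Unicode general categories instead of the POSIX class. This is a sketch under that assumption, not something tested against the schema in this thread. \p{P} (punctuation) plus \p{S} (symbols) covers all of the ASCII characters in \p{Punct}, but it also removes non-ASCII currency and math symbols, so the exact class is a choice to verify on real data.

```xml
<!-- Sketch only: Unicode-aware alternative to \p{Punct}; the category choice is an assumption. -->
<charFilter class="solr.PatternReplaceCharFilterFactory"
            pattern="[\p{P}\p{S}]" replacement=""/>
```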

-- Jack Krupansky

-----Original Message----- From: Daisy
Sent: Monday, September 24, 2012 9:26 AM
To: [email protected]
Subject: RE: Solr - Remove specific punctuation marks

I tried &amp; and it fixed the 500 error, but searches still find
punctuation marks, even though the parsed query doesn't contain the
punctuation mark:

<str name="rawquerystring">"{"</str>
<str name="querystring">"{"</str>
<str name="parsedquery">text:</str>
<str name="parsedquery_toString">text:</str>

but numFound is still 1:

<result name="response" numFound="1" start="0">

and the highlighting shows the punctuation mark as the match:
<em>{</em>
The steps I did:
1- editing the schema
2- restart the server
3- delete the file
4- index the file




--
View this message in context:
http://lucene.472066.n3.nabble.com/Solr-Remove-specific-punctuation-marks-tp4009795p4009835.html

Sent from the Solr - User mailing list archive at Nabble.com.
