Re: Special character and wildcard matching

Jack Krupansky Tue, 24 Feb 2015 11:05:00 -0800

Please post the info I requested - the exact query, and the Solr response.

-- Jack Krupansky


On Tue, Feb 24, 2015 at 12:45 PM, Arun Rangarajan <arunrangara...@gmail.com>
wrote:

> In our case, the lower-casing happens in custom Java indexer code, via
> Java's String.toLowerCase() method.
>
> I used the analysis tool in Solr admin (with Jetty). I believe the raw
> bytes explain this.
>
> Attached are the results for beyonce in file beyonce_no_spl_chars.JPG and
> beyoncé in file beyonce_with_spl_chars.JPG.
>
> Raw bytes for beyonce: [62 65 79 6f 6e 63 65]
> Raw bytes for beyoncé: [62 65 79 6f 6e 63 65 cc 81]
>
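> So the bytes seem to explain why beyonce* matches beyoncé: the indexed value
> is the plain 'e' (65) followed by a combining acute accent (cc 81), so it
> still starts with the bytes of "beyonce".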
>
> I tried your approach with a KeywordTokenizer followed by a
> LowerCaseFilter, but I see the same behavior.
>
>
>
> On Mon, Feb 23, 2015 at 5:16 PM, Jack Krupansky <jack.krupan...@gmail.com>
> wrote:
>
>> But how is that lowercasing occurring? I mean, solr.StrField doesn't do
>> that.
>>
>> Some containers default to automatically mapping accented characters, so
>> that the accented "e" would then get indexed as a normal "e", and then your
>> wildcard would match it, and an accented "e" in a query would get mapped as
>> well and then match the normal "e" in the index. What does your query
>> response look like?
>>
>> This blog post explains that problem:
>> http://bensch.be/tomcat-solr-and-special-characters
>>
>> Note that you could make your string field a text field with the keyword
>> tokenizer and then filter it for lower case, such as when the user query
>> might have a capital "B". String field is most appropriate when the field
>> really is 100% raw.
>>
>>
>> -- Jack Krupansky
>>
>> On Mon, Feb 23, 2015 at 7:37 PM, Arun Rangarajan <
>> arunrangara...@gmail.com>
>> wrote:
>>
>> > Yes, it is a string field and not a text field.
>> >
>> > <fieldType name="string" class="solr.StrField" sortMissingLast="true"
>> > omitNorms="true"/>
>> > <field name="raw_name" type="string" indexed="true" stored="true" />
>> >
>> > Lower-casing is done to allow case-insensitive matching.
>> >
>> > On Mon, Feb 23, 2015 at 4:01 PM, Jack Krupansky <jack.krupan...@gmail.com>
>> > wrote:
>> >
>> > > Is it really a string field - as opposed to a text field? Show us the
>> > > field and field type.
>> > >
>> > > Besides, if it really were a "raw" name, wouldn't that be a capital "B"?
>> > >
>> > > -- Jack Krupansky
>> > >
>> > > On Mon, Feb 23, 2015 at 6:52 PM, Arun Rangarajan <arunrangara...@gmail.com>
>> > > wrote:
>> > >
>> > > > I have a string field raw_name like this in my document:
>> > > >
>> > > > {raw_name: beyoncé}
>> > > >
>> > > > (Notice that the last character is a special character.)
>> > > >
>> > > > When I issue this wildcard query:
>> > > >
>> > > > q=raw_name:beyonce*
>> > > >
>> > > > i.e. with the last character simply being the ASCII 'e', Solr returns
>> > > > me the above document.
>> > > >
>> > > > How do I prevent this?
>> > > >
>> > >
>> >
>>
>
>
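The raw bytes quoted earlier in the thread ([62 65 79 6f 6e 63 65 cc 81]) are what you get when the name is stored in decomposed form: a plain 'e' followed by U+0301 COMBINING ACUTE ACCENT. The sketch below reproduces those bytes and shows one possible fix, composing to NFC in the indexer before the value is sent to Solr. The class name NormalizationDemo and the helper normalizeForIndex are hypothetical, the use of Locale.ROOT is an assumption (the thread only mentions String.toLowerCase()), and String.startsWith is only a stand-in for how a prefix query compares terms.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.Locale;

public class NormalizationDemo {

    // Render the UTF-8 bytes of a string as space-separated hex,
    // in the same style as the raw bytes quoted in the thread.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Hypothetical indexer helper (not from the thread): lower-case the
    // value, then compose it to NFC so that 'e' + U+0301 becomes U+00E9.
    static String normalizeForIndex(String raw) {
        String lower = raw.toLowerCase(Locale.ROOT);
        return Normalizer.normalize(lower, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Decomposed form: plain 'e' followed by a combining acute accent.
        String decomposed = "beyonce\u0301";
        String composed = normalizeForIndex(decomposed);

        System.out.println(hex(decomposed)); // 62 65 79 6f 6e 63 65 cc 81
        System.out.println(hex(composed));   // 62 65 79 6f 6e 63 c3 a9

        // The decomposed value still begins with the ASCII prefix,
        // the composed value does not.
        System.out.println(decomposed.startsWith("beyonce")); // true
        System.out.println(composed.startsWith("beyonce"));   // false
    }
}
```

If the indexed values are composed this way, query strings would presumably need the same normalization, otherwise a decomposed "beyoncé" typed by a user would no longer match the composed term in the index.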
