Please post the info I requested - the exact query, and the Solr response.

-- Jack Krupansky

On Tue, Feb 24, 2015 at 12:45 PM, Arun Rangarajan <arunrangara...@gmail.com> wrote:

> In our case, the lower-casing is happening in a custom Java indexer code,
> via Java's String.toLowerCase() method.
>
> I used the analysis tool in Solr admin (with Jetty). I believe the raw
> bytes explain this.
>
> Attached are the results for beyonce in file beyonce_no_spl_chars.JPG and
> beyoncé in file beyonce_with_spl_chars.JPG.
>
> Raw bytes for beyonce: [62 65 79 6f 6e 63 65]
> Raw bytes for beyoncé: [62 65 79 6f 6e 63 65 cc 81]
>
> So when you look at the bytes, it seems to explain why beyonce* matches
> beyoncé.
>
> I tried your approach with a KeywordTokenizer followed by a
> LowerCaseFilter, but I see the same behavior.
>
> On Mon, Feb 23, 2015 at 5:16 PM, Jack Krupansky <jack.krupan...@gmail.com>
> wrote:
>
>> But how is that lowercasing occurring? I mean, solr.StrField doesn't do
>> that.
>>
>> Some containers default to automatically mapping accented characters, so
>> that the accented "e" would then get indexed as a normal "e", and then
>> your wildcard would match it, and an accented "e" in a query would get
>> mapped as well and then match the normal "e" in the index. What does your
>> query response look like?
>>
>> This blog post explains that problem:
>> http://bensch.be/tomcat-solr-and-special-characters
>>
>> Note that you could make your string field a text field with the keyword
>> tokenizer and then filter it for lower case, such as when the user query
>> might have a capital "B". String field is most appropriate when the field
>> really is 100% raw.
>>
>> -- Jack Krupansky
>>
>> On Mon, Feb 23, 2015 at 7:37 PM, Arun Rangarajan <arunrangara...@gmail.com>
>> wrote:
>>
>> > Yes, it is a string field and not a text field.
>> >
>> > <fieldType name="string" class="solr.StrField" sortMissingLast="true"
>> > omitNorms="true"/>
>> > <field name="raw_name" type="string" indexed="true" stored="true" />
>> >
>> > Lower-casing done to do case-insensitive matching.
>> >
>> > On Mon, Feb 23, 2015 at 4:01 PM, Jack Krupansky <jack.krupan...@gmail.com>
>> > wrote:
>> >
>> > > Is it really a string field - as opposed to a text field? Show us the
>> > > field and field type.
>> > >
>> > > Besides, if it really were a "raw" name, wouldn't that be a capital "B"?
>> > >
>> > > -- Jack Krupansky
>> > >
>> > > On Mon, Feb 23, 2015 at 6:52 PM, Arun Rangarajan <arunrangara...@gmail.com>
>> > > wrote:
>> > >
>> > > > I have a string field raw_name like this in my document:
>> > > >
>> > > > {raw_name: beyoncé}
>> > > >
>> > > > (Notice that the last character is a special character.)
>> > > >
>> > > > When I issue this wildcard query:
>> > > >
>> > > > q=raw_name:beyonce*
>> > > >
>> > > > i.e. with the last character simply being the ASCII 'e', Solr returns
>> > > > me the above document.
>> > > >
>> > > > How do I prevent this?
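
A note on the raw bytes quoted in this thread: the trailing `65 cc 81` is a plain ASCII "e" followed by U+0301 (combining acute accent), i.e. the value appears to be stored in decomposed (NFD) form rather than as the precomposed U+00E9 (`c3 a9` in UTF-8). That would explain why the prefix beyonce* matches it byte for byte. One possible fix, assuming the custom Java indexer is where the value gets prepared, is to compose to NFC before lower-casing. The following is only a sketch under that assumption; the class `NormalizeName` and the helpers `normalizeForIndex` and `hex` are hypothetical names, not something from the thread.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical illustration class; not part of the indexer discussed in the thread.
class NormalizeName {

    // Compose to NFC first, then lower-case with a fixed locale.
    public static String normalizeForIndex(String raw) {
        String composed = Normalizer.normalize(raw, Normalizer.Form.NFC);
        return composed.toLowerCase(Locale.ROOT);
    }

    // Render the UTF-8 bytes of a string as space-separated hex, for inspection.
    public static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Decomposed input: ASCII 'e' followed by the combining acute accent.
        String decomposed = "beyonce\u0301";
        String composed = normalizeForIndex(decomposed);

        System.out.println(hex(decomposed)); // 62 65 79 6f 6e 63 65 cc 81
        System.out.println(hex(composed));   // 62 65 79 6f 6e 63 c3 a9

        // The decomposed form literally starts with "beyonce"; the NFC form does not.
        System.out.println(decomposed.startsWith("beyonce")); // true
        System.out.println(composed.startsWith("beyonce"));   // false
    }
}
```

If this approach is used, query terms would presumably need the same normalization, so that a query containing é is composed the same way as the indexed value.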