What happens if the query does not have wildcard expansion (*)? If the behavior is correct, then the issue is somehow with the MultitermQueryAnalysis (a hidden automatically generated analyzer chain): http://wiki.apache.org/solr/MultitermQueryAnalysis
Which would still make it a bug, but at least the cause could be narrowed down.

Regards,
   Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/

On 24 February 2015 at 14:56, Arun Rangarajan <arunrangara...@gmail.com> wrote:
> Thanks, Jack.
> I have filed a tkt: https://issues.apache.org/jira/browse/SOLR-7154
>
> On Tue, Feb 24, 2015 at 11:43 AM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>
>> Thanks. That at least verifies that the accented e is stored in the field.
>> I don't see anything wrong here, so it is as if the Lucene prefix query was
>> mapping the accented characters. It's not supposed to do that, but...
>>
>> Go ahead and file a Jira bug. Include all of the details that you provided
>> in this thread.
>>
>> -- Jack Krupansky
>>
>> On Tue, Feb 24, 2015 at 2:35 PM, Arun Rangarajan <arunrangara...@gmail.com> wrote:
>>
>> > Exact query:
>> > /select?q=raw_name:beyonce*&wt=json&fl=raw_name
>> >
>> > Response:
>> >
>> > {
>> >   "responseHeader": {
>> >     "status": 0,
>> >     "QTime": 0,
>> >     "params": {
>> >       "fl": "raw_name",
>> >       "q": "raw_name:beyonce*",
>> >       "wt": "json"
>> >     }
>> >   },
>> >   "response": {
>> >     "numFound": 2,
>> >     "start": 0,
>> >     "docs": [
>> >       { "raw_name": "beyoncé" },
>> >       { "raw_name": "beyoncé" }
>> >     ]
>> >   }
>> > }
>> >
>> > On Tue, Feb 24, 2015 at 11:01 AM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>> >
>> > > Please post the info I requested - the exact query, and the Solr response.
>> > >
>> > > -- Jack Krupansky
>> > >
>> > > On Tue, Feb 24, 2015 at 12:45 PM, Arun Rangarajan <arunrangara...@gmail.com> wrote:
>> > >
>> > > > In our case, the lower-casing is happening in a custom Java indexer code,
>> > > > via Java's String.toLowerCase() method.
>> > > >
>> > > > I used the analysis tool in Solr admin (with Jetty). I believe the raw
>> > > > bytes explain this.
>> > > >
>> > > > Attached are the results for beyonce in file beyonce_no_spl_chars.JPG
>> > > > and beyoncé in file beyonce_with_spl_chars.JPG.
>> > > >
>> > > > Raw bytes for beyonce: [62 65 79 6f 6e 63 65]
>> > > > Raw bytes for beyoncé: [62 65 79 6f 6e 63 65 cc 81]
>> > > >
>> > > > So when you look at the bytes, it seems to explain why beyonce*
>> > > > matches beyoncé.
>> > > >
>> > > > I tried your approach with a KeywordTokenizer followed by a
>> > > > LowerCaseFilter, but I see the same behavior.
>> > > >
>> > > > On Mon, Feb 23, 2015 at 5:16 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>> > > >
>> > > >> But how is that lowercasing occurring? I mean, solr.StrField doesn't do
>> > > >> that.
>> > > >>
>> > > >> Some containers default to automatically mapping accented characters, so
>> > > >> that the accented "e" would then get indexed as a normal "e", and then
>> > > >> your wildcard would match it, and an accented "e" in a query would get
>> > > >> mapped as well and then match the normal "e" in the index. What does your
>> > > >> query response look like?
>> > > >>
>> > > >> This blog post explains that problem:
>> > > >> http://bensch.be/tomcat-solr-and-special-characters
>> > > >>
>> > > >> Note that you could make your string field a text field with the keyword
>> > > >> tokenizer and then filter it for lower case, such as when the user query
>> > > >> might have a capital "B". String field is most appropriate when the field
>> > > >> really is 100% raw.
>> > > >>
>> > > >> -- Jack Krupansky
>> > > >>
>> > > >> On Mon, Feb 23, 2015 at 7:37 PM, Arun Rangarajan <arunrangara...@gmail.com> wrote:
>> > > >>
>> > > >> > Yes, it is a string field and not a text field.
>> > > >> >
>> > > >> > <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
>> > > >> > <field name="raw_name" type="string" indexed="true" stored="true" />
>> > > >> >
>> > > >> > Lower-casing done to do case-insensitive matching.
>> > > >> >
>> > > >> > On Mon, Feb 23, 2015 at 4:01 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>> > > >> >
>> > > >> > > Is it really a string field - as opposed to a text field? Show us
>> > > >> > > the field and field type.
>> > > >> > >
>> > > >> > > Besides, if it really were a "raw" name, wouldn't that be a capital "B"?
>> > > >> > >
>> > > >> > > -- Jack Krupansky
>> > > >> > >
>> > > >> > > On Mon, Feb 23, 2015 at 6:52 PM, Arun Rangarajan <arunrangara...@gmail.com> wrote:
>> > > >> > >
>> > > >> > > > I have a string field raw_name like this in my document:
>> > > >> > > >
>> > > >> > > > {raw_name: beyoncé}
>> > > >> > > >
>> > > >> > > > (Notice that the last character is a special character.)
>> > > >> > > >
>> > > >> > > > When I issue this wildcard query:
>> > > >> > > >
>> > > >> > > > q=raw_name:beyonce*
>> > > >> > > >
>> > > >> > > > i.e. with the last character simply being the ASCII 'e', Solr
>> > > >> > > > returns me the above document.
>> > > >> > > >
>> > > >> > > > How do I prevent this?
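
A note on the raw bytes quoted in this thread: cc 81 is the UTF-8 encoding of U+0301 (COMBINING ACUTE ACCENT). The indexed value therefore appears to be stored in decomposed form, a plain 'e' followed by a combining accent, which would make beyonce a genuine byte prefix of it. The Java sketch below only illustrates that observation. It assumes the custom indexer could compose the value to NFC before lower-casing; the class and method names are hypothetical and do not come from the thread, and String.startsWith merely stands in for the prefix match.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.Locale;

public class AccentPrefixDemo {

    // Hex dump of the UTF-8 encoding, in the style of the bytes quoted in the thread.
    public static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Hypothetical indexing helper (an assumption, not the poster's code):
    // compose to NFC first, then lower-case.
    public static String normalizeForIndex(String raw) {
        return Normalizer.normalize(raw, Normalizer.Form.NFC).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Decomposed form: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
        String decomposed = "beyonce\u0301";
        String composed = normalizeForIndex(decomposed);

        System.out.println(utf8Hex(decomposed)); // 62 65 79 6f 6e 63 65 cc 81
        System.out.println(utf8Hex(composed));   // 62 65 79 6f 6e 63 c3 a9

        // The plain-ASCII prefix matches the decomposed value but not the composed one.
        System.out.println(decomposed.startsWith("beyonce")); // true
        System.out.println(composed.startsWith("beyonce"));   // false
    }
}
```

If the stored values really are decomposed, composing them at index time (and composing the query text the same way) would keep beyonce* from matching, at the cost of a reindex. Whether that is the right fix for SOLR-7154 is not settled in this thread.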