Re: Special character and wildcard matching

Alexandre Rafalovitch Tue, 24 Feb 2015 12:21:55 -0800

What happens if the query does not use wildcard expansion (*)? If the
behavior is correct in that case, then the issue is somehow with the
MultitermQueryAnalysis (a hidden, automatically generated analyzer
chain): http://wiki.apache.org/solr/MultitermQueryAnalysis

That would still make it a bug, but at least the cause would be narrowed down.
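
It may also be worth checking the stored value outside Solr. The raw bytes
quoted below for beyoncé end in "65 cc 81", which is a plain ASCII 'e'
followed by U+0301 (combining acute accent), i.e. the decomposed form. If
that is what is indexed, the literal prefix "beyonce" really is a prefix of
the term. Below is a minimal plain-Java sketch of that check. It is only an
analogy using String.startsWith, not Lucene's prefix query, and the class
name AccentPrefixCheck and the helper normalizeForIndex are made up for
illustration. Composing to NFC in the indexer is an assumption about a
possible workaround, not something confirmed in this thread.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.Locale;

public class AccentPrefixCheck {

    // Hex dump of the UTF-8 bytes, formatted like the bytes quoted in the thread.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Hypothetical indexer helper (an assumption, not the poster's code):
    // lower-case the value, then compose it to Unicode NFC.
    static String normalizeForIndex(String raw) {
        return Normalizer.normalize(raw.toLowerCase(Locale.ROOT), Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // 'e' followed by U+0301 COMBINING ACUTE ACCENT (decomposed form).
        String decomposed = "beyonce\u0301";
        // After NFC composition the last character is the single code point U+00E9.
        String composed = normalizeForIndex(decomposed);

        System.out.println(utf8Hex(decomposed)); // 62 65 79 6f 6e 63 65 cc 81
        System.out.println(utf8Hex(composed));   // 62 65 79 6f 6e 63 c3 a9

        // The ASCII prefix matches the decomposed form but not the composed one.
        System.out.println(decomposed.startsWith("beyonce")); // true
        System.out.println(composed.startsWith("beyonce"));   // false
    }
}
```

If the composed form no longer matches beyonce*, the behavior comes from
the data being decomposed rather than from any accent mapping at query time.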

Regards,
   Alex.


----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 24 February 2015 at 14:56, Arun Rangarajan <arunrangara...@gmail.com> wrote:
> Thanks, Jack.
> I have filed a ticket: https://issues.apache.org/jira/browse/SOLR-7154
>
>
> On Tue, Feb 24, 2015 at 11:43 AM, Jack Krupansky <jack.krupan...@gmail.com>
> wrote:
>
>> Thanks. That at least verifies that the accented e is stored in the field.
>> I don't see anything wrong here, so it is as if the Lucene prefix query was
>> mapping the accented characters. It's not supposed to do that, but...
>>
>> Go ahead and file a Jira bug. Include all of the details that you provided
>> in this thread.
>>
>> -- Jack Krupansky
>>
>> On Tue, Feb 24, 2015 at 2:35 PM, Arun Rangarajan <arunrangara...@gmail.com>
>> wrote:
>>
>> > Exact query:
>> > /select?q=raw_name:beyonce*&wt=json&fl=raw_name
>> >
>> > Response:
>> >
>> > {
>> >   "responseHeader": {
>> >     "status": 0,
>> >     "QTime": 0,
>> >     "params": {
>> >       "fl": "raw_name",
>> >       "q": "raw_name:beyonce*",
>> >       "wt": "json"
>> >     }
>> >   },
>> >   "response": {
>> >     "numFound": 2,
>> >     "start": 0,
>> >     "docs": [
>> >       { "raw_name": "beyoncé" },
>> >       { "raw_name": "beyoncé" }
>> >     ]
>> >   }
>> > }
>> >
>> >
>> >
>> > On Tue, Feb 24, 2015 at 11:01 AM, Jack Krupansky <jack.krupan...@gmail.com>
>> > wrote:
>> >
>> > > Please post the info I requested - the exact query, and the Solr response.
>> > >
>> > > -- Jack Krupansky
>> > >
>> > > On Tue, Feb 24, 2015 at 12:45 PM, Arun Rangarajan <arunrangara...@gmail.com>
>> > > wrote:
>> > >
>> > > > In our case, the lower-casing happens in custom Java indexer code,
>> > > > via Java's String.toLowerCase() method.
>> > > >
>> > > > I used the analysis tool in Solr admin (with Jetty). I believe the raw
>> > > > bytes explain this.
>> > > >
>> > > > Attached are the results for beyonce in file beyonce_no_spl_chars.JPG and
>> > > > beyoncé in file beyonce_with_spl_chars.JPG.
>> > > >
>> > > > Raw bytes for beyonce: [62 65 79 6f 6e 63 65]
>> > > > Raw bytes for beyoncé: [62 65 79 6f 6e 63 65 cc 81]
>> > > >
>> > > > So when you look at the bytes, it seems to explain why beyonce* matches
>> > > > beyoncé.
>> > > >
>> > > > I tried your approach with a KeywordTokenizer followed by a
>> > > > LowerCaseFilter, but I see the same behavior.
>> > > >
>> > > >
>> > > >
>> > > > On Mon, Feb 23, 2015 at 5:16 PM, Jack Krupansky <jack.krupan...@gmail.com>
>> > > > wrote:
>> > > >
>> > > >> But how is that lowercasing occurring? I mean, solr.StrField doesn't do
>> > > >> that.
>> > > >>
>> > > >> Some containers default to automatically mapping accented characters, so
>> > > >> that the accented "e" would then get indexed as a normal "e", and then your
>> > > >> wildcard would match it, and an accented "e" in a query would get mapped as
>> > > >> well and then match the normal "e" in the index. What does your query
>> > > >> response look like?
>> > > >>
>> > > >> This blog post explains that problem:
>> > > >> http://bensch.be/tomcat-solr-and-special-characters
>> > > >>
>> > > >> Note that you could make your string field a text field with the keyword
>> > > >> tokenizer and then filter it for lower case, such as when the user query
>> > > >> might have a capital "B". String field is most appropriate when the field
>> > > >> really is 100% raw.
>> > > >>
>> > > >>
>> > > >> -- Jack Krupansky
>> > > >>
>> > > >> On Mon, Feb 23, 2015 at 7:37 PM, Arun Rangarajan <arunrangara...@gmail.com>
>> > > >> wrote:
>> > > >>
>> > > >> > Yes, it is a string field and not a text field.
>> > > >> >
>> > > >> > <fieldType name="string" class="solr.StrField" sortMissingLast="true"
>> > > >> >            omitNorms="true"/>
>> > > >> > <field name="raw_name" type="string" indexed="true" stored="true"/>
>> > > >> >
>> > > >> > Lower-casing is done for case-insensitive matching.
>> > > >> >
>> > > >> > On Mon, Feb 23, 2015 at 4:01 PM, Jack Krupansky <jack.krupan...@gmail.com>
>> > > >> > wrote:
>> > > >> >
>> > > >> > > Is it really a string field - as opposed to a text field? Show us the
>> > > >> > > field and field type.
>> > > >> > >
>> > > >> > > Besides, if it really were a "raw" name, wouldn't that be a capital "B"?
>> > > >> > >
>> > > >> > > -- Jack Krupansky
>> > > >> > >
>> > > >> > > On Mon, Feb 23, 2015 at 6:52 PM, Arun Rangarajan <arunrangara...@gmail.com>
>> > > >> > > wrote:
>> > > >> > >
>> > > >> > > > I have a string field raw_name like this in my document:
>> > > >> > > >
>> > > >> > > > {raw_name: beyoncé}
>> > > >> > > >
>> > > >> > > > (Notice that the last character is a special character.)
>> > > >> > > >
>> > > >> > > > When I issue this wildcard query:
>> > > >> > > >
>> > > >> > > > q=raw_name:beyonce*
>> > > >> > > >
>> > > >> > > > i.e. with the last character simply being the ASCII 'e', Solr returns me
>> > > >> > > > the above document.
>> > > >> > > >
>> > > >> > > > How do I prevent this?
>> > > >> > > >
>> > > >> > >
>> > > >> >
>> > > >>
>> > > >
>> > > >
>> > >
>> >
>>
