Re: recent utf8 problems

Chris Hostetter Tue, 07 Nov 2017 09:37:17 -0800

: 1) When looking for Tübingen in the title, I am expecting the 3092484 

Just to be clear -- I'm reading that as an 8 character word, where the 2nd 
character is U+00FC and the other characters are plain ascii: T_bingen
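
To make that reading concrete, here is a small sketch (python 3, standard library only; the variable name is just for illustration) of what "8 characters, the 2nd being U+00FC" looks like, and where the C3 and BC bytes that show up later in this thread come from:

```python
# The word as a unicode string: 8 characters, the 2nd is U+00FC.
word = "T\u00fcbingen"

print(len(word))                  # 8
print(hex(ord(word[1])))          # 0xfc

# Encoded as UTF-8, U+00FC becomes the two bytes 0xC3 0xBC,
# so the byte string is one longer than the character string.
utf8_bytes = word.encode("utf-8")
print(utf8_bytes)                 # b'T\xc3\xbcbingen'
print(len(utf8_bytes))            # 9
```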


Also to be clear: I'm attempting to reproduce the steps you describe using 
Solr 7.1, via "bin/solr -e techproducts"

I've indexed one additional document like so...

curl -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/techproducts/update?commit=true' \
  --data-binary '[{"id":"HOSS","title":"Tübingen"}]'


: I should explain my test more clearly. We use a webbrowser (Firefox or 
: Chrome) to open the admin console of the search engine, which is at 
: http://localhost:8983/solr/#/mmc_search3/query 
: <http://localhost:8983/solr/#/mmc_search3/query> on my local device. 
: This is the default behavior. In this webbrowser, I use the query 
: "title:T%C3%BCbingen” in the field “g” with /select as the 

If you type "title:T%C3%BCbingen" into the "q" param of the 
/solr/#/mmc_search3/query UI then you are *NOT* searching for an 8 
character word where the second character is U+00FC.

You are in fact searching for a 13 character word where the 2nd and 5th 
characters are the plain old ascii '%' -- the UI expects the *raw* string 
you wish to search for, and handles the URL encoding for you.
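
As a rough illustration of that double encoding, here is a sketch using python 3's urllib.parse. This is not the code the admin UI actually runs, and the variable names and the safe=":" choice are only assumptions made for the example. It shows what goes over the wire when something URL encodes each string on your behalf:

```python
from urllib.parse import quote, unquote

raw = "title:Tübingen"               # the raw string the UI expects
pre_encoded = "title:T%C3%BCbingen"  # what was typed into the UI instead

# safe=":" only keeps the colon readable; it is an illustrative choice.
sent_raw = quote(raw, safe=":")
sent_pre_encoded = quote(pre_encoded, safe=":")

print(sent_raw)          # title:T%C3%BCbingen
print(sent_pre_encoded)  # title:T%25C3%25BCbingen

# The receiving side decodes exactly once, so it gets back whatever was
# typed: either the real word, or the 13 character percent-laden one.
print(unquote(sent_raw) == raw)                  # True
print(unquote(sent_pre_encoded) == pre_encoded)  # True
print(len("T%C3%BCbingen"))                      # 13
```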

If you look at the solr logs when you hit the "Query" button after typing 
"title:T%C3%BCbingen" into the search box, you should see this...

... webapp=/solr path=/select 
params={q=title:T%25C3%25BCbingen&_=1510074136657} ...

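those are the params as solr logs them -- the '%25' is the URL encoded 
form of a literal '%', so the *URL decoded* value of q is 
title:T%C3%BCbingen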

you should also see in the "response" portion of the UI, that the "params" 
contains...

    "params":{
      "q":"title:T%C3%BCbingen",
      "_":"1510074136657"}},

That is, again, the URL decoded params.

Likewise, if i use the UI to change the "wt" to "python" the response now 
shows me...

    'params':{
      'q':'title:T%C3%BCbingen',
      'wt':'python',
      '_':'1510074872875'}},

...there is no python unicode escaping here because there is none needed 
-- all of the characters in my 'q' param are plain old ascii characters

Solr doesn't know that you want to search for "LATIN SMALL LETTER U WITH 
DIAERESIS" -- it thinks you want to search for "percent followed by C3 
followed by percent followed by BC"

Following the steps you describe, in all of the above queries, i got 
numFound=0 ... but if i change the query i type in the UI to 
"title:Tübingen" (ie: type the plain unicode characters w/o attempting any 
special URL encoding myself) then everything works -- with the python 
output, note the unicode escape sequences...

    'params':{
      'q':u'title:T\u00fcbingen',
      'wt':'python',
      '_':'1510074872875'}},
  'response':{'numFound':1,'start':0,'docs':[
      {
        'id':'HOSS',
        'title':[u'T\u00fcbingen'],
        '_version_':1583428186725679104}]
  }}


And now what i get in the solr.log...

... webapp=/solr path=/select 
params={q=title:Tübingen&wt=python&_=1510074872875} ...


Part of your confusion may be that some versions of some browsers try to 
be helpful by making urls "human readable" and hiding the fact that 
certain characters are actually being URL encoded.

for example -- your email contains the following verbatim text...

: 4) 
: 
http://localhost:8983/solr/mmc_search3/select?echoParams=all&q=title:Tübingen&wt=python
 
: 
<http://localhost:8983/solr/mmc_search3/select?echoParams=all&q=title:T%C3%BCbingen&wt=python>
 
: is displayed but it is 

Note that these 2 "urls" -- one of which is in theory just the 
"linkable" version of the other -- are not equivalent.  Most likely this 
is because your email client tried to be helpful when you pasted a URL 
from your browser.

From what you've described, it's not actually 100% clear what actual bytes 
you are seeing in the browser URL -- let alone what bytes your browser is 
actually sending to solr.



Based on your further comments, it appears that the reason you are getting 
results when you send the *literal* query for the word "T%C3%BCbingen" is 
because that's literally what you indexed for these documents.

Note the example document you showed when you get the "good" results 
(numFound=3092484) with wt=python...

:     'params':{
:       'q':'title:T%C3%BCbingen',
:       'echoParams':'all',
:       'wt':'python',
:       '_':'1510024595963'}},
:   'response':{'numFound':3092484,'start':0,'docs':[
:       {
:         'photoid':'6182384834',
:         'hash':'7b201435fc5126accbfee6453b7fb181',
:         'userid':'48992104@N00',
:         'datetaken':'2011-09-04T13:19:16Z',
:         'dateuploaded':'2011-09-25T11:54:41Z',
:         'capturedevice':'NIKON COOLPIX S2500',
:         'title':'T%C3%BCbingen',

....

You are searching for the *literal* string 'T%C3%BCbingen' (containing 2 
percent symbols and no non-ascii characters) and you are finding it!  
Because that document also has the literal title of 'T%C3%BCbingen' 
(containing 2 percent symbols and no non-ascii characters)

Solr's JSON/XML/Python response will *NEVER* show you a string that is URL 
encoded (unless that's the *literal* string you gave it)
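
If you want to check your own data for this, something along these lines might help. It is only a heuristic sketch: looks_url_encoded is a hypothetical helper (not part of solr or any library), and it will also flag legitimate titles that happen to contain something like "%20".

```python
from urllib.parse import unquote

def looks_url_encoded(value):
    """Heuristic: True if URL decoding the value would change it."""
    return unquote(value) != value

# The literal title from the quoted response above.
stored_title = "T%C3%BCbingen"

print(stored_title.isascii())           # True -- no non-ascii characters
print(stored_title.count("%"))          # 2
print(looks_url_encoded(stored_title))  # True
print(unquote(stored_title))            # Tübingen

# A title that was indexed as real unicode is left alone.
print(looks_url_encoded("Tübingen"))    # False
```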


I humbly suggest you:

1) ensure the python library that you are actually using to power 
your website handles all URL encoding for you, so that all you have to do 
is give it literal unicode strings
2) ensure you pass literal unicode characters to your indexing code as 
well
3) until you are more comfortable with when/where exactly URL encoding 
should be happening -- avoid doing any testing with your web browser.   
Either use curl (where you *must* do your own URL encoding) or use some 
custom python code (where you should *never* do your own URL encoding) for 
all your test requests.
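
A minimal sketch of the "let the library do the encoding" approach, using only python 3's standard urllib.parse. The host and core name are just the ones from the URLs quoted earlier in this thread, the parameter values are illustrative, and nothing here actually contacts solr:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Host and core name copied from the URLs quoted in this thread.
base = "http://localhost:8983/solr/mmc_search3/select"

# Plain unicode strings go in; no hand written %XX sequences anywhere.
params = {"q": "title:Tübingen", "wt": "python", "echoParams": "all"}

# urlencode() does the URL encoding (exactly once).
url = base + "?" + urlencode(params)
print(url)

# Decoding the query string once recovers the original unicode value,
# which is what the server side ends up searching for.
decoded = parse_qs(urlsplit(url).query)
print(decoded["q"])   # ['title:Tübingen']
```

The same idea applies on the indexing side: hand your indexing code the unicode title itself, not a string that has already been URL encoded.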






-Hoss
http://www.lucidworks.com/
