Re: recent utf8 problems

Dr. Mario Michael Krell Mon, 06 Nov 2017 21:43:56 -0800

Hi,

Thank you for your time and for trying to narrow down my problem.

1) When looking for Tübingen in the title, I am expecting the 3092484 results. 
That sounds like a reasonable result. Furthermore, when looking at some of the 
results, they are exactly what I am looking for.

2) I am testing them against the same Solr server. This is a very simple 
testing setup that reduces our problem to its core. Originally, we used a 
urllib.request.urlopen query to get the data in Python and then sent it to our 
webpage (http://search.mmcommons.org/) as a JSON object. I think I should 
explain my test more clearly. We use a web browser (Firefox or Chrome) to open 
the admin console of the search engine, which is at 
http://localhost:8983/solr/#/mmc_search3/query on my local device. This is 
the default behavior. In this web browser, I use the query 
"title:T%C3%BCbingen" in the field "q" with /select as the "Request-Handler 
(qt)". This approach works like a charm (result with echoParams attached). 
Also, as asked by Rick, the request URL displayed in the upper left is just 
perfect:
http://localhost:8983/solr/mmc_search3/select?echoParams=all&q=title:T%C3%BCbingen&wt=python

The problems start to occur when I click on this URL:
{
  'responseHeader':{
    'status':0,
    'QTime':0,
    'params':{
      'q':u'title:T\u00fcbingen',
      'echoParams':'all',
      'wt':'python'}},
  'response':{'numFound':0,'start':0,'docs':[]
  }}
So it seems that Solr (or a library it uses?) is changing the request 
internally. I just don’t have any idea why. But I would like to get the more 
than 3 million results. I could as well just enter the above URL into my 
browser, and the URL will be changed to
http://localhost:8983/solr/mmc_search3/select?echoParams=all&q=title:Tübingen&wt=python

and I get the same result (no documents found). So this is the problem. 
However, when I copy and paste the URL, it still shows the UTF-8 
percent-encoding. I think the “ü” in the URL is just a friendlier rendering by 
the browser.

The confusion about the different Solr versions comes from the fact that I am 
continuously trying to improve my search index and make it more efficient. 
Hence I reindexed it several times, always with the latest version. The last 
reindexing was done with Solr 7.0.1, so the index is in the Lucene 7.0.1 
format. However, I also performed the test with other versions, without any 
success.

3) As Rick said: "With the Yahoo Flickr Creative Commons 100 Million (YFCC100m) 
dataset, a great novel dataset was introduced to the computer vision and 
multimedia research community." — cool

My objective is to make it more usable, especially by providing different 
search modalities. The dataset consists of 99 million images and 800k videos, 
but I am only working on the Flickr metadata and on generated metadata, and I 
try to add more and more metadata. The next big challenge is similarity search.

4) 
http://localhost:8983/solr/mmc_search3/select?echoParams=all&q=title:Tübingen&wt=python
is displayed, but the actual URL is 
http://localhost:8983/solr/mmc_search3/select?echoParams=all&q=title:T%C3%BCbingen&wt=python

5) I am searching for Tübingen. It is u-umlaut (LATIN SMALL LETTER U WITH 
DIAERESIS) as Rick said.

6) I am just clicking on it in the standard Solr admin interface. I could as 
well copy it into my web browser and open it. The result would be the same.

7) As you can see in the result, the document seems to be indexed correctly, 
doesn’t it? If we can’t figure anything out, I will try to reindex again, but 
this will take a while because of the large amount of data and my limited 
compute power.

8) Thanks for the hint about echoParams. The result is displayed above.

9) As shown in the attached search result, there are actually correctly 
indexed results.

10) The above example now uses the Python response writer (wt=python).

11) @Rick: Shall I change the /select handler? I do not quite understand the 
problem with it. But maybe as an explanation: my original config was probably 
based on Solr 4.x. I basically just updated the Lucene version and had to 
replace or remove some parts because they were no longer supported.

12) For playing the "what changed previous to it being broken" game, I am 
wondering whether Solr (6.5 or 7.0.1) has any dependencies other than Java. 
However, playing this game is quite difficult, because the human mind is not 
that good at it. We only tested once in a while whether requests with special 
symbols work, and we mainly tested only in the GUI without actually clicking 
on the resulting link that is displayed. Later we tested with the webpage 
once, and it was working. To figure out why it is not working anymore, we 
reduced the factors as much as possible and eventually arrived at the 
aforementioned test.
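
To make the difference between the two requests concrete, here is a minimal 
Python sketch using only the standard library. It shows what a server receives 
after percent-decoding q=title:T%C3%BCbingen once, and how the literal text 
T%C3%BCbingen (the form the title has in the attached result) would have to be 
escaped to survive that decoding. That the admin form escapes the percent sign 
in this way is an assumption and has not been verified here; the base URL is 
the local test setup described above, and the variable names are illustrative.

```python
from urllib.parse import unquote, urlencode

# Base URL of the local test core described above (assumed to be reachable
# only on the author's machine; nothing is requested in this sketch).
BASE = "http://localhost:8983/solr/mmc_search3/select"

# The title value exactly as it is printed in the attached result.
stored_title = "T%C3%BCbingen"

# A URL's query string is percent-decoded once by the receiving side, so a
# request containing q=title:T%C3%BCbingen arrives as the decoded text.
decoded_q = unquote("title:T%C3%BCbingen")
print(decoded_q)  # title:Tübingen

# To deliver the literal characters "%C3%BC", the percent sign itself must be
# escaped as %25. urlencode() does this for every parameter value.
params = {"q": "title:" + stored_title, "echoParams": "all", "wt": "python"}
url = BASE + "?" + urlencode(params)
print(url)
# http://localhost:8983/solr/mmc_search3/select?q=title%3AT%25C3%25BCbingen&echoParams=all&wt=python
```

If a request built this way returned the expected documents while the plain 
%C3%BC form did not, that would point to a mismatch between the decoded query 
and percent-encoded text stored in the title field.
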
{
  'responseHeader':{
    'status':0,
    'QTime':131,
    'params':{
      'q':'title:T%C3%BCbingen',
      'echoParams':'all',
      'wt':'python',
      '_':'1510024595963'}},
  'response':{'numFound':3092484,'start':0,'docs':[
      {
        'photoid':'6182384834',
        'hash':'7b201435fc5126accbfee6453b7fb181',
        'userid':'48992104@N00',
        'datetaken':'2011-09-04T13:19:16Z',
        'dateuploaded':'2011-09-25T11:54:41Z',
        'capturedevice':'NIKON COOLPIX S2500',
        'title':'T%C3%BCbingen',
        'longitude':9.055888,
        'latitude':48.520157,
        'accuracy':16,
        'licensename':'Attribution-NonCommercial-ShareAlike License',
        'marker':0,
        'year':2011,
        'yearmonth':201109,
        'month':9,
        'a_autotags':['city',
          'nature',
          'outdoor',
          'cityscape',
          'valley',
          'landscape',
          'architecture',
          'canyon'],
        'p_town':'\'Tuebingen\'',
        'p_state':'\'Baden-Wurttemberg\'',
        'p_country':'\'Germany\'',
        'p_places':['\'Neckargasse\'',
          '\'Tuebingen\'',
          '\'Tubingen\'',
          '\'Baden-Wurttemberg\'',
          '\'72070\'',
          '\'Germany\'',
          '\'Europe%2FBerlin\''],
        'usertags':['not_provided'],
        'facet_usertags':['not_provided'],
        'description':'not_provided',
        'a_architecture':656,
        'a_canyon':504,
        'a_city':656,
        'a_cityscape':575,
        'a_landscape':542,
        'a_nature':542,
        'a_outdoor':924,
        'a_valley':504,
        '_version_':1581268421041979393},
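
The wt=python output can also be inspected programmatically. The following 
sketch, which assumes Python 3, parses the failing response shown earlier with 
ast.literal_eval and compares the echoed q parameter with the title value from 
the result above. The variable names are illustrative only.

```python
import ast
from urllib.parse import unquote

# Response text of the failing request, copied from above. The raw string
# keeps the \u00fc escape intact so that literal_eval interprets it.
raw = r"""{
  'responseHeader':{
    'status':0,
    'QTime':0,
    'params':{
      'q':u'title:T\u00fcbingen',
      'echoParams':'all',
      'wt':'python'}},
  'response':{'numFound':0,'start':0,'docs':[]
  }}"""

# The python response writer emits a Python literal, so it can be parsed
# safely without eval().
response = ast.literal_eval(raw)
echoed_q = response["responseHeader"]["params"]["q"]
num_found = response["response"]["numFound"]

# Title value exactly as printed in the result with 3092484 hits.
stored_title = "T%C3%BCbingen"

print(num_found)                                      # 0
print(echoed_q == "title:" + stored_title)            # False
print(echoed_q == "title:" + unquote(stored_title))   # True
```

The last two lines show that the echoed query equals the stored title only 
after the stored value is percent-decoded.
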

> On Nov 6, 2017, at 16:03, Chris Hostetter <hossman_luc...@fucit.org> wrote:
> 
> 
> : We recently discovered issues with solr with converting utf8 code in the 
> search. One or two month ago everything was still working.
> : 
> : - What might have caused it is a Java update (Java 8 Update 151). 
> : - We are using firefox as well as chrome for displaying results.
> : - We tested it with Solr 6.5, Solr 7.0.0, 7.0.1, and 7.1.
> 
> Just to be clear: in the 2 examples you provde below...
> 
> 1) which situation do you consider "correct" ? 
>     ("match lots of docs" or "match no docs")
> 2) are you testing those against the same live solr server?
> 
> I ask Q #2 because you mentioned "One or two month ago everything was 
> still working" ... but it's not clear what part of the "results" where 
> different one of two months ago.
> 
> other things tha are unclear/confusing about your question...
> 
> : We created a search engine base on the yfcc100m and in the normal 
> : browser (http://localhost:8983/solr/#/mmc_search3/query 
> : <http://localhost:8983/solr/#/mmc_search3/query>), we can search for 
> : "title:T%C3%BCbingen” in the query field and get more than 3 million 
> : results:
> 
> 3) what is "yfcc100m" ?
> 4) what is the actual URL you see in your browser?
> 5) what is the underlying byte sequence / character sequence you are 
> trying to search for?
> 
> ie: can you please explicitly name the UNICODE codepoints you are 
> intendeing to search for?
> 
> : However, when we use the respective web-address, 
> : http://localhost:8983/solr/mmc_search3/select?q=title:T%C3%BCbingen&wt=json 
> <http://localhost:8983/solr/mmc_search3/select?q=title:T%C3%BCbingen&wt=json>
> 
> 6) define "use the respective web-address" ?
>    (how are you using it? what http client is hitting that url?)
> 
> 
> Some general advice about debugging possible charst related issues:
> 
> * the problem may be related to how the query is executed -- or it may 
> have been realted to how the data was originally indexed, if at that type 
> the wrong byte sequences were sent.
> 
> * you can use things like "echoParams=all" in a query to see exactly what 
> unicode characters solr is recieving in the q param
> * assuming the field you are searching is stored=true, you can also send 
> requests to search for one of the documents you expect by id, and verify 
> what unicode characters were indexed.
> * in both types of requests, you can use "wt=python" to help see the 
> underlying bytes being returned for each character (the python response 
> writer escapes all characters outside of the ascii range)
> 
> 
> 
> -Hoss
> http://www.lucidworks.com/
