Hi, thank you for your time and for trying to narrow down my problem.

1) When searching for Tübingen in the title, I expect the 3092484 results. That sounds like a reasonable number. Furthermore, the results I looked at are exactly what I am looking for.

2) I am testing them against the same Solr server. This is a very simple test setup that reduces our problem to its core. Originally, we used a urllib.request.urlopen query to get the data in Python and then sent it to our webpage (http://search.mmcommons.org/) as a JSON object.

I think I should explain my test more clearly. We use a web browser (Firefox or Chrome) to open the admin console of the search engine, which is at http://localhost:8983/solr/#/mmc_search3/query on my local machine. This is the default behavior. In this browser, I enter the query "title:T%C3%BCbingen" in the field "q", with /select as the "Request-Handler (qt)". This approach works like a charm (result with echoParams attached). Also, as asked by Rick, the request URL displayed in the upper left is just perfect:

http://localhost:8983/solr/mmc_search3/select?echoParams=all&q=title:T%C3%BCbingen&wt=python

The problems start to occur when I click on this URL:

{
  'responseHeader':{
    'status':0,
    'QTime':0,
    'params':{
      'q':u'title:T\u00fcbingen',
      'echoParams':'all',
      'wt':'python'}},
  'response':{'numFound':0,'start':0,'docs':[]
  }}

So it seems that internally Solr (or a library it uses?) is changing the request. I just don't have any idea why, but I would like to get the more than 3 million results. I could just as well enter the above URL into my browser; the URL is then changed to

http://localhost:8983/solr/mmc_search3/select?echoParams=all&q=title:Tübingen&wt=python

and I get the same result (no documents found). So this is the problem.
However, when I copy and paste the URL, it still shows the percent-encoded UTF-8 form. I think the "ü" in the URL is just a nicer display by the browser.

The confusion about the different Solr versions comes from the fact that I am continuously trying to improve my search index and make it more efficient. Hence I reindexed it several times, always with the latest version. The last reindexing was done with Solr 7.0.1, with the index built for Lucene 7.0.1. However, I also performed the test with other versions, without any success.

3) As Rick said: "With the Yahoo Flickr Creative Commons 100 Million (YFCC100m) dataset, a great novel dataset was introduced to the computer vision and multimedia research community." - cool. My objective is to make it more usable, especially by providing different search modalities. The dataset consists of 99 million images and 800k videos, but I am only working with the Flickr metadata and the generated metadata, and I am trying to add more and more metadata. The next big challenge is similarity search.

4) The browser displays
http://localhost:8983/solr/mmc_search3/select?echoParams=all&q=title:Tübingen&wt=python
but the actual URL is
http://localhost:8983/solr/mmc_search3/select?echoParams=all&q=title:T%C3%BCbingen&wt=python

5) I am searching for Tübingen. It is u-umlaut (LATIN SMALL LETTER U WITH DIAERESIS), as Rick said.

6) I am just clicking on it in the standard Solr admin interface (http://localhost:8983/solr/#/). I could just as well copy it into my web browser and open it. The result would be the same.

7) As you can see in the result, the document seems to be indexed correctly, doesn't it? If we can't figure anything out, I will try to reindex again, but this will take a while because of the large amount of data and my limited compute power.

8) Thanks for the hint with echoParams.
The result is displayed above.

9) As shown in the attached search result, there are actually correctly indexed results.

10) The above example now uses wt=python.

11) @Rick: Shall I change the /select handler? I do not quite understand the problem with it. But maybe as an explanation: my original config was probably based on Solr 4.x. I basically just updated the Lucene version, and I had to replace or remove some parts because they were not supported anymore.

12) For playing the "what changed previous to it being broken" game, I am wondering whether Solr (6.5 or 7.0.1) has any dependencies other than Java. However, playing this game is quite difficult, because the human mind is not that good at it. We only tested once in a while whether requests with special characters work, and we mainly tested only in the GUI, without actually clicking on the resulting link that is displayed. Later we tested with the webpage once, and it was working. To figure out why it is not working anymore, we reduced the factors as much as possible and eventually arrived at the aforementioned test.

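To make the difference between the two requests concrete, here is a minimal Python sketch. It assumes (I have not verified this) that the admin console percent-encodes the text typed into the "q" field once more before sending it, which would explain why the echoed parameter is the literal 'title:T%C3%BCbingen' in the working case and u'title:T\u00fcbingen' after clicking the link. The helper name build_url is only illustrative, and nothing here talks to Solr.

```python
from urllib.parse import unquote, urlencode

# Base URL taken from the requests shown above.
SOLR_SELECT = "http://localhost:8983/solr/mmc_search3/select"


def build_url(q):
    """Hypothetical helper: build a /select URL, percent-encoding q exactly once."""
    params = {"echoParams": "all", "q": q, "wt": "python"}
    return SOLR_SELECT + "?" + urlencode(params)


# Case A: the literal text typed into the "q" field of the admin console.
typed = "title:T%C3%BCbingen"

# Case B: the same text after one round of percent-decoding,
# i.e. what a server sees when the displayed link is requested directly.
decoded = unquote(typed)

url_literal = build_url(typed)    # the '%' itself is escaped as %25
url_decoded = build_url(decoded)  # the 'ü' is escaped as %C3%BC

print(decoded)
print(url_literal)
print(url_decoded)
```

If the assumption holds, a Python client would have to pass the string from case A to urlencode in order to reproduce the request that the admin console sends.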
{
  'responseHeader':{
    'status':0,
    'QTime':131,
    'params':{
      'q':'title:T%C3%BCbingen',
      'echoParams':'all',
      'wt':'python',
      '_':'1510024595963'}},
  'response':{'numFound':3092484,'start':0,'docs':[
      {
        'photoid':'6182384834',
        'hash':'7b201435fc5126accbfee6453b7fb181',
        'userid':'48992104@N00',
        'datetaken':'2011-09-04T13:19:16Z',
        'dateuploaded':'2011-09-25T11:54:41Z',
        'capturedevice':'NIKON COOLPIX S2500',
        'title':'T%C3%BCbingen',
        'longitude':9.055888,
        'latitude':48.520157,
        'accuracy':16,
        'licensename':'Attribution-NonCommercial-ShareAlike License',
        'marker':0,
        'year':2011,
        'yearmonth':201109,
        'month':9,
        'a_autotags':['city', 'nature', 'outdoor', 'cityscape', 'valley', 'landscape', 'architecture', 'canyon'],
        'p_town':'\'Tuebingen\'',
        'p_state':'\'Baden-Wurttemberg\'',
        'p_country':'\'Germany\'',
        'p_places':['\'Neckargasse\'', '\'Tuebingen\'', '\'Tubingen\'', '\'Baden-Wurttemberg\'', '\'72070\'', '\'Germany\'', '\'Europe%2FBerlin\''],
        'usertags':['not_provided'],
        'facet_usertags':['not_provided'],
        'description':'not_provided',
        'a_architecture':656,
        'a_canyon':504,
        'a_city':656,
        'a_cityscape':575,
        'a_landscape':542,
        'a_nature':542,
        'a_outdoor':924,
        'a_valley':504,
        '_version_':1581268421041979393},

> On Nov 6, 2017, at 16:03, Chris Hostetter <hossman_luc...@fucit.org> wrote:
>
>
> : We recently discovered issues with solr with converting utf8 code in the
> : search. One or two month ago everything was still working.
> :
> : - What might have caused it is a Java update (Java 8 Update 151).
> : - We are using firefox as well as chrome for displaying results.
> : - We tested it with Solr 6.5, Solr 7.0.0, 7.0.1, and 7.1.
>
> Just to be clear: in the 2 examples you provide below...
>
> 1) which situation do you consider "correct" ?
>    ("match lots of docs" or "match no docs")
> 2) are you testing those against the same live solr server?
>
> I ask Q #2 because you mentioned "One or two month ago everything was
> still working" ...
> but it's not clear what part of the "results" were
> different one or two months ago.
>
> other things that are unclear/confusing about your question...
>
> : We created a search engine based on the yfcc100m and in the normal
> : browser (http://localhost:8983/solr/#/mmc_search3/query), we can search for
> : "title:T%C3%BCbingen" in the query field and get more than 3 million
> : results:
>
> 3) what is "yfcc100m" ?
> 4) what is the actual URL you see in your browser?
> 5) what is the underlying byte sequence / character sequence you are
>    trying to search for?
>
>    ie: can you please explicitly name the UNICODE codepoints you are
>    intending to search for?
>
> : However, when we use the respective web-address,
> : http://localhost:8983/solr/mmc_search3/select?q=title:T%C3%BCbingen&wt=json
>
> 6) define "use the respective web-address" ?
>    (how are you using it? what http client is hitting that url?)
>
>
> Some general advice about debugging possible charset related issues:
>
> * the problem may be related to how the query is executed -- or it may
>   have been related to how the data was originally indexed, if at that time
>   the wrong byte sequences were sent.
>
> * you can use things like "echoParams=all" in a query to see exactly what
>   unicode characters solr is receiving in the q param
> * assuming the field you are searching is stored=true, you can also send
>   requests to search for one of the documents you expect by id, and verify
>   what unicode characters were indexed.
> * in both types of requests, you can use "wt=python" to help see the
>   underlying bytes being returned for each character (the python response
>   writer escapes all characters outside of the ascii range)
>
>
>
> -Hoss
> http://www.lucidworks.com/
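
Regarding question 5 and the wt=python hint in the quoted message: a small Python sketch for naming the code points of a string explicitly. The function name describe is only illustrative, and the two sample strings are assumptions about what a client might send (a precomposed ü versus a plain u followed by a combining diaeresis), not something observed in the index.

```python
import unicodedata


def describe(text):
    """Return one 'U+XXXX NAME' entry per code point in text (illustrative helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}" for ch in text]


# Two strings that look the same on screen but differ in code points.
precomposed = "\u00fc"  # ü as a single code point
decomposed = "u\u0308"  # u followed by a combining diaeresis

for label, sample in (("precomposed", precomposed), ("decomposed", decomposed)):
    # Also show the UTF-8 bytes that would end up percent-encoded in a URL.
    print(label, describe(sample), sample.encode("utf-8"))
```

Running describe on the title value returned by Solr, or on the echoed q parameter, would show which of these forms (if any) is actually involved.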