Dr. Krell,

Item 11): It is best to start from the solrconfig.xml provided with the new version of Solr and change it to suit your needs. Do not try to work from the old version's solrconfig.xml.
I did not have time to read the other items. Look in solr.log, and compare the successful query with the unsuccessful one for clues, then look at the config for /select again.

Cheers -- Rick

On November 7, 2017 12:43:00 AM EST, "Dr. Mario Michael Krell" <kr...@uni-bremen.de> wrote:
>Hi,
>
>thank you for your time and trying to narrow down my problem.
>
>1) When looking for Tübingen in the title, I am expecting the 3092484
>results. That sounds like a reasonable result. Furthermore, when
>looking at some of the results, they are exactly what I am looking for.
>
>2) I am testing them against the same Solr server. This is a very
>simple testing setup that brings our problem to the core. Originally,
>we used a urllib.request.urlopen query to get the data in Python and
>then send it to our webpage (http://search.mmcommons.org/) as a JSON
>object. I think I should explain my test more clearly. We use a
>web browser (Firefox or Chrome) to open the admin console of the search
>engine, which is at http://localhost:8983/solr/#/mmc_search3/query
>on my local device. This is the default behavior. In this web browser,
>I use the query "title:T%C3%BCbingen" in the field "q" with /select as
>the "Request-Handler (qt)". This approach works like a charm (result
>with echoParams attached). Also, as asked by Rick, the request URL
>displayed in the upper left is just perfect:
>http://localhost:8983/solr/mmc_search3/select?echoParams=all&q=title:T%C3%BCbingen&wt=python
>The problems start to occur when I click on this URL:
>{
>  'responseHeader':{
>    'status':0,
>    'QTime':0,
>    'params':{
>      'q':u'title:T\u00fcbingen',
>      'echoParams':'all',
>      'wt':'python'}},
>  'response':{'numFound':0,'start':0,'docs':[]
>  }}
>So it seems internally, Solr is changing the request (or a used
>library?). I just don’t have any idea why.
>But I would like to get the more than 3 million results. I could as
>well just enter the above URL into my browser and the URL will be
>changed to
>http://localhost:8983/solr/mmc_search3/select?echoParams=all&q=title:Tübingen&wt=python
><http://localhost:8983/solr/mmc_search3/select?echoParams=all&q=title:T%C3%BCbingen&wt=python>
>and I get the same result (no documents found). So this is the problem.
>However, when I copy and paste the URL, it still displays the UTF-8
>encoding. I think the "ü" in the URL is just an improved layout by the
>browser.
>
>The confusion with the different Solr versions comes from the fact that
>I am continuously trying to improve my search index and make it more
>efficient. Hence I reindexed it several times, always to the latest
>version. The last reindexing occurred for Solr 7.0.1, with the index
>for Lucene 7.0.1. However, I also performed the test for other
>versions without any success.
>
>3) As Rick said: "With the Yahoo Flickr Creative Commons 100 Million
>(YFCC100m) dataset, a great novel dataset was introduced to the
>computer vision and multimedia research community." -- cool
>
>My objective is to make it more usable, especially by providing
>different search modalities. The dataset consists of 99 million images
>and 800k videos, but I am only working on the Flickr as well as
>generated metadata and try to add more and more metadata. The next big
>challenge is similarity search.
>
>4)
>http://localhost:8983/solr/mmc_search3/select?echoParams=all&q=title:Tübingen&wt=python
><http://localhost:8983/solr/mmc_search3/select?echoParams=all&q=title:T%C3%BCbingen&wt=python>
>is displayed but it is
>http://localhost:8983/solr/mmc_search3/select?echoParams=all&q=title:T%C3%BCbingen&wt=python.
>
>5) I am searching for Tübingen. It is u-umlaut (LATIN SMALL LETTER U
>WITH DIAERESIS) as Rick said.
>
>6) I am just clicking on it in the standard Solr admin interface. I
>could as well copy it into my web browser and open it. The result would
>be the same.
> <http://localhost:8983/solr/#/>
>
>7) As you can see in the result, the document seems to be indexed
>correctly, doesn’t it? If we can’t figure anything out, I will try to
>reindex again but this will take a while because of the large amount of
>data and my limited compute power.
>
>8) Thanks for the hint with echoParams. The result is displayed above.
>
>9) As shown in the attached search result, there are actually results
>correctly indexed.
>
>10) The above example is now with Python.
>
>11) @Rick: Shall I change the /select handler? I do not quite
>understand the problem with it. But maybe as an explanation, my
>original config was probably based on Solr 4.x. I basically just updated
>the Lucene version and I had to replace/remove some parts because they
>were not supported anymore.
>
>12) For playing the "what changed previous to it being broken" game, I
>am wondering if Solr (6.5 or 7.0.1) has any dependencies other
>than Java. However, playing this game is quite difficult, because the
>human mind is not that good at it. We only tested once in a while
>whether requests with special symbols work, and we mainly tested it
>only in the GUI without actually clicking on the resulting link that is
>displayed. Later we tested with the webpage once, and it was working.
>To figure out why it is not working anymore, we reduced the factors as
>much as possible and eventually arrived at the aforementioned test.
>{
>  'responseHeader':{
>    'status':0,
>    'QTime':131,
>    'params':{
>      'q':'title:T%C3%BCbingen',
>      'echoParams':'all',
>      'wt':'python',
>      '_':'1510024595963'}},
>  'response':{'numFound':3092484,'start':0,'docs':[
>      {
>        'photoid':'6182384834',
>        'hash':'7b201435fc5126accbfee6453b7fb181',
>        'userid':'48992104@N00',
>        'datetaken':'2011-09-04T13:19:16Z',
>        'dateuploaded':'2011-09-25T11:54:41Z',
>        'capturedevice':'NIKON COOLPIX S2500',
>        'title':'T%C3%BCbingen',
>        'longitude':9.055888,
>        'latitude':48.520157,
>        'accuracy':16,
>        'licensename':'Attribution-NonCommercial-ShareAlike License',
>        'marker':0,
>        'year':2011,
>        'yearmonth':201109,
>        'month':9,
>        'a_autotags':['city',
>          'nature',
>          'outdoor',
>          'cityscape',
>          'valley',
>          'landscape',
>          'architecture',
>          'canyon'],
>        'p_town':'\'Tuebingen\'',
>        'p_state':'\'Baden-Wurttemberg\'',
>        'p_country':'\'Germany\'',
>        'p_places':['\'Neckargasse\'',
>          '\'Tuebingen\'',
>          '\'Tubingen\'',
>          '\'Baden-Wurttemberg\'',
>          '\'72070\'',
>          '\'Germany\'',
>          '\'Europe%2FBerlin\''],
>        'usertags':['not_provided'],
>        'facet_usertags':['not_provided'],
>        'description':'not_provided',
>        'a_architecture':656,
>        'a_canyon':504,
>        'a_city':656,
>        'a_cityscape':575,
>        'a_landscape':542,
>        'a_nature':542,
>        'a_outdoor':924,
>        'a_valley':504,
>        '_version_':1581268421041979393},
>
>
>> On Nov 6, 2017, at 16:03, Chris Hostetter <hossman_luc...@fucit.org> wrote:
>>
>>
>> : We recently discovered issues with solr with converting utf8 code
>> : in the search. One or two months ago everything was still working.
>> :
>> : - What might have caused it is a Java update (Java 8 Update 151).
>> : - We are using firefox as well as chrome for displaying results.
>> : - We tested it with Solr 6.5, Solr 7.0.0, 7.0.1, and 7.1.
>>
>> Just to be clear: in the 2 examples you provide below...
>>
>> 1) which situation do you consider "correct" ?
>> ("match lots of docs" or "match no docs")
>> 2) are you testing those against the same live solr server?
>>
>> I ask Q #2 because you mentioned "One or two months ago everything was
>> still working" ... but it's not clear what part of the "results" were
>> different one or two months ago.
>>
>> other things that are unclear/confusing about your question...
>>
>> : We created a search engine based on the yfcc100m and in the normal
>> : browser (http://localhost:8983/solr/#/mmc_search3/query), we can
>> : search for "title:T%C3%BCbingen" in the query field and get more
>> : than 3 million results:
>>
>> 3) what is "yfcc100m" ?
>> 4) what is the actual URL you see in your browser?
>> 5) what is the underlying byte sequence / character sequence you are
>> trying to search for?
>>
>> ie: can you please explicitly name the UNICODE codepoints you are
>> intending to search for?
>>
>> : However, when we use the respective web-address,
>> : http://localhost:8983/solr/mmc_search3/select?q=title:T%C3%BCbingen&wt=json
>>
>> 6) define "use the respective web-address" ?
>> (how are you using it? what http client is hitting that url?)
>>
>>
>> Some general advice about debugging possible charset related issues:
>>
>> * the problem may be related to how the query is executed -- or it may
>> have been related to how the data was originally indexed, if at that
>> time the wrong byte sequences were sent.
>>
>> * you can use things like "echoParams=all" in a query to see exactly
>> what unicode characters solr is receiving in the q param
>> * assuming the field you are searching is stored=true, you can also
>> send requests to search for one of the documents you expect by id, and
>> verify what unicode characters were indexed.
>> * in both types of requests, you can use "wt=python" to help see the
>> underlying bytes being returned for each character (the python
>> response writer escapes all characters outside of the ascii range)
>>
>>
>>
>> -Hoss
>> http://www.lucidworks.com/

--
Sorry for being brief. Alternate email is rickleir at yahoo dot com
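
As a companion to the echoParams advice quoted above, here is a minimal Python sketch of which string each kind of request would carry in the q parameter. It uses only the standard library and sends nothing over the network. The names `BASE` and `select_url` are hypothetical, chosen for illustration; the URL and core name are the ones quoted in the thread. It is not a diagnosis, only a way to see the two candidate strings side by side.

```python
from urllib.parse import unquote, urlencode

# URL and core name as quoted in the thread; BASE and select_url are
# hypothetical names used only for this illustration.
BASE = "http://localhost:8983/solr/mmc_search3/select"


def select_url(q, wt="python"):
    """Build a /select URL, percent-encoding each parameter exactly once."""
    return BASE + "?" + urlencode({"echoParams": "all", "q": q, "wt": wt})


# Decoding the percent-encoded form once yields the real character.
decoded = unquote("title:T%C3%BCbingen")  # 'title:Tübingen'

# Query text containing the real character: the URL carries %C3%BC,
# which a server decodes back to "ü".
url_plain = select_url("title:Tübingen")

# Query text that is already percent-encoded: the "%" itself is escaped
# as %25, so a server decoding once sees the literal text "T%C3%BCbingen".
url_literal = select_url("title:T%C3%BCbingen")

print(decoded)
print(url_plain)
print(url_literal)
```

Comparing these two strings with the q value echoed by echoParams=all ('title:T%C3%BCbingen' in the response with 3092484 hits, u'title:T\u00fcbingen' in the response with 0 hits) and with the stored title value may help show which form matches what was actually indexed.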