: We recently discovered issues with solr with converting utf8 code in
: the search. One or two month ago everything was still working.
:
: - What might have caused it is a Java update (Java 8 Update 151).
: - We are using firefox as well as chrome for displaying results.
: - We tested it with Solr 6.5, Solr 7.0.0, 7.0.1, and 7.1.

Just to be clear: in the 2 examples you provide below...

1) which situation do you consider "correct"? ("match lots of docs" or
   "match no docs")
2) are you testing those against the same live solr server?

I ask Q #2 because you mentioned "One or two month ago everything was
still working" ... but it's not clear what part of the "results" were
different one or two months ago.

Other things that are unclear/confusing about your question...

: We created a search engine base on the yfcc100m and in the normal
: browser (http://localhost:8983/solr/#/mmc_search3/query), we can
: search for "title:T%C3%BCbingen” in the query field and get more than
: 3 million results:

3) what is "yfcc100m"?
4) what is the actual URL you see in your browser?
5) what is the underlying byte sequence / character sequence you are
   trying to search for? ie: can you please explicitly name the UNICODE
   codepoints you are intending to search for?

: However, when we use the respective web-address,
: http://localhost:8983/solr/mmc_search3/select?q=title:T%C3%BCbingen&wt=json

6) define "use the respective web-address"? (how are you using it? what
   http client is hitting that url?)

Some general advice about debugging possible charset related issues:

* the problem may be related to how the query is executed -- or it may
  have been related to how the data was originally indexed, if at that
  time the wrong byte sequences were sent.
* you can use things like "echoParams=all" in a query to see exactly
  what unicode characters solr is receiving in the q param
* assuming the field you are searching is stored=true, you can also send
  requests to search for one of the documents you expect by id, and
  verify what unicode characters were indexed.
* in both types of requests, you can use "wt=python" to help see the
  underlying bytes being returned for each character (the python
  response writer escapes all characters outside of the ascii range)

-Hoss
http://www.lucidworks.com/
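
P.S. A rough sketch of the codepoint check asked about in Q #5, in case it helps. It only decodes the percent-escaped term from your URL on the client side and builds a debug URL with "echoParams=all" and "wt=python". It does not contact solr, and the ISO-8859-1 case is just an assumed example of a wrong decoding, not a diagnosis of your setup. The helper name codepoints() is made up for illustration.

```python
from urllib.parse import unquote, urlencode

# The percent-encoded term exactly as it appears in the URL from the question.
raw = "T%C3%BCbingen"

# Decoded as UTF-8: the bytes C3 BC become the single character U+00FC.
as_utf8 = unquote(raw, encoding="utf-8")

# Decoded as ISO-8859-1 (assumed misconfiguration): C3 and BC become
# two separate characters, which would not match a correctly indexed term.
as_latin1 = unquote(raw, encoding="latin-1")


def codepoints(s):
    """Return the Unicode codepoints of s as 'U+XXXX' strings."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)


print(codepoints(as_utf8))
print(codepoints(as_latin1))

# Build a debug URL; urlencode() percent-encodes the query as UTF-8.
params = urlencode({"q": "title:" + as_utf8, "wt": "python", "echoParams": "all"})
debug_url = "http://localhost:8983/solr/mmc_search3/select?" + params
print(debug_url)
```

If the codepoints echoed back by solr for the q param look like the second line printed here rather than the first, the request is probably being decoded with the wrong charset somewhere between the client and solr.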