Re: recent utf8 problems

Chris Hostetter Mon, 06 Nov 2017 16:03:58 -0800

: We recently discovered issues with solr with converting utf8 code in the 
: search. One or two month ago everything was still working.
: 
: - What might have caused it is a Java update (Java 8 Update 151). 
: - We are using firefox as well as chrome for displaying results.
: - We tested it with Solr 6.5, Solr 7.0.0, 7.0.1, and 7.1.


Just to be clear: in the 2 examples you provide below...

 1) which situation do you consider "correct" ? 
     ("match lots of docs" or "match no docs")
 2) are you testing those against the same live solr server?

I ask Q #2 because you mentioned "One or two month ago everything was 
still working" ... but it's not clear which part of the "results" was 
different one or two months ago.

Other things that are unclear/confusing about your question...

: We created a search engine base on the yfcc100m and in the normal 
: browser (http://localhost:8983/solr/#/mmc_search3/query 
: <http://localhost:8983/solr/#/mmc_search3/query>), we can search for 
: "title:T%C3%BCbingen” in the query field and get more than 3 million 
: results:

 3) what is "yfcc100m" ?
 4) what is the actual URL you see in your browser?
 5) what is the underlying byte sequence / character sequence you are 
trying to search for?

ie: can you please explicitly name the Unicode codepoints you are 
intending to search for?
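
To illustrate what I mean by naming codepoints, here is a minimal Python 
sketch. It assumes the term you intend is "Tübingen" with U+00FC (my guess 
from the %C3%BC in your URL, please confirm). It lists the codepoint, its 
Unicode name and its UTF-8 bytes, then shows what the term looks like when 
percent-encoded once and when percent-encoded a second time, which is one 
way an already-encoded string can end up as a different query. The 
describe() helper is made up for this example.

```python
import unicodedata
from urllib.parse import quote, unquote

# Assumed search term: "Tübingen", with the u-umlaut written as an escape
term = "T\u00fcbingen"


def describe(text):
    """Return (codepoint, Unicode name, UTF-8 bytes as hex) for each non-ASCII char."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8").hex())
        for ch in text
        if ord(ch) > 127
    ]


# Percent-encode the raw term once (what a URL should normally contain)
encoded_once = quote(term)
# Percent-encode the already-encoded string again (the "%" itself becomes "%25")
encoded_twice = quote(encoded_once)

print(describe(term))
print(encoded_once)
print(encoded_twice)
# Decoding the singly-encoded form gives back the original term
print(unquote(encoded_once) == term)
```

If the literal text "T%C3%BCbingen" is typed into a form that does its own 
encoding, the server may see the second form rather than the first; whether 
that happens in your setup is something the answers to the questions here 
would clarify.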

: However, when we use the respective web-address, 
: http://localhost:8983/solr/mmc_search3/select?q=title:T%C3%BCbingen&wt=json 
: <http://localhost:8983/solr/mmc_search3/select?q=title:T%C3%BCbingen&wt=json>

 6) define "use the respective web-address" ?
    (how are you using it? what http client is hitting that url?)


Some general advice about debugging possible charset related issues:

 * the problem may be related to how the query is executed -- or it may 
be related to how the data was originally indexed, if at that time 
the wrong byte sequences were sent.

 * you can use things like "echoParams=all" in a query to see exactly what 
unicode characters solr is receiving in the q param
 * assuming the field you are searching is stored=true, you can also send 
requests to search for one of the documents you expect by id, and verify 
what unicode characters were indexed.
 * in both types of requests, you can use "wt=python" to help see the 
underlying bytes being returned for each character (the python response 
writer escapes all characters outside of the ascii range)
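
As a rough sketch of the two kinds of debugging requests described above, 
the Python below only builds the URLs (it does not contact Solr), so you 
can paste them into whatever client you are using. The host and collection 
come from your mail; the uniqueKey field name "id", the placeholder 
SOME_DOC_ID, and the debug_url() helper are assumptions for illustration.

```python
from urllib.parse import urlencode

# Host and collection taken from the URLs quoted in this thread
BASE = "http://localhost:8983/solr/mmc_search3/select"


def debug_url(params):
    """Build a select URL with echoParams=all and wt=python added."""
    merged = {**params, "echoParams": "all", "wt": "python"}
    # urlencode percent-encodes each value exactly once, as UTF-8
    return BASE + "?" + urlencode(merged)


# 1) the query itself: check what Solr echoes back for the q param
query_url = debug_url({"q": "title:T\u00fcbingen"})

# 2) a lookup of one document you expect to match, to inspect the stored title
#    ("id" and SOME_DOC_ID are placeholders, substitute your own)
doc_url = debug_url({"q": "id:SOME_DOC_ID", "fl": "id,title"})

print(query_url)
print(doc_url)
```

Building the query string from the raw, unencoded term means the encoding 
step happens in exactly one place, which makes it easier to compare against 
what echoParams reports.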



-Hoss
http://www.lucidworks.com/
