On Feb 18, 2009, at 7:34 AM, revathy arun wrote:
> I am trying to index documents in various languages (foroyo, Chinese,
> Japanese). These have been converted from PDF to text using xpdf. I am
> using the standard analyzer for content analysis, but I am not able to
> search anything in some of the files.
Please provide an example of how you are indexing. What requests are
you sending to Solr? What client API are you using to interface with
Solr? What container are you using: Jetty? Tomcat?
> My guess is that these documents are not in UTF-8 encoding, and hence
> Solr does not return results.
Certainly whatever reads in the text from your data source needs to
know the encoding and use it appropriately.
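To make that concrete, here's a minimal sketch (assuming Python-side preprocessing; `to_utf8` is a hypothetical helper, not part of Solr) of re-encoding extracted text before it reaches Solr:

```python
def to_utf8(raw_bytes, source_encoding):
    """Decode bytes using their known source encoding, re-encode as UTF-8."""
    text = raw_bytes.decode(source_encoding)
    return text.encode("utf-8")

# e.g. a Shift_JIS file produced by xpdf:
# utf8_bytes = to_utf8(open("doc.txt", "rb").read(), "shift_jis")
```

The key point is that the source encoding must be known (or detected) up front; decoding with the wrong one either raises an error or silently produces mojibake.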
> Is there any way to check the encoding of a text/PDF document, or
> convert them to UTF-8 encoding?
I would imagine the conversion could be made to go to UTF-8.
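One rough way to check an unknown file: try decoding it with a list of candidate encodings and keep the first that succeeds. A heuristic sketch (the candidate list is an assumption for CJK-heavy input; a dedicated detector such as chardet or ICU is far more robust):

```python
def guess_encoding(raw_bytes, candidates=("utf-8", "shift_jis", "gb18030", "euc-jp")):
    """Return the first candidate encoding that decodes the bytes cleanly."""
    for enc in candidates:
        try:
            raw_bytes.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None  # nothing matched; encoding unknown
```

Note that "decodes cleanly" is not proof of correctness: many byte sequences are valid in several legacy encodings, so ordering the candidates by likelihood matters.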
> While indexing, I am sending the charset header as UTF-8.
How are you doing this?
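For example, when posting to Solr's XML update handler the charset is declared in the Content-Type header; a sketch of one way to do it (the URL and file name are assumptions for a default local install):

```shell
# Post UTF-8 XML documents to Solr, declaring the charset explicitly
curl 'http://localhost:8983/solr/update?commit=true' \
     -H 'Content-Type: text/xml; charset=utf-8' \
     --data-binary @docs.xml
```

If the bytes in the file are not actually UTF-8, the header alone won't fix anything; Solr will misinterpret them.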
> Any pointers?
If you're using Tomcat, you'll need to set the URIEncoding, as
described here:
<http://wiki.apache.org/solr/SolrTomcat#head-20147ee4d9dd5ca83ed264898280ab60457847c4>
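For reference, the URIEncoding attribute goes on the HTTP Connector in Tomcat's conf/server.xml; a sketch with the stock defaults (adjust the port and other attributes to your install):

```xml
<!-- conf/server.xml: decode request URIs as UTF-8 -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8" />
```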
Erik