I am finding that search results from an index built on Tika-extracted text are very different from results from an index built on text extracted by other means. This shows up, for example, with a Chinese web site that I am trying to index.
I created the documents (for posting to Solr) in two ways. The source text of the web pages is full of numeric HTML character references (for example &#12345;, which renders as 〹), with some English characters mixed in.

(a) Simple text extraction from the page source by a Perl script. The resulting content field looks like

<field name="content_china">Who We Are 公司历史 您的成功案例 领导团队 业务部门 Innovation 创 etc... </field>

I posted these documents to one Solr instance.

(b) Tika (command line). The resulting content field looks like

<field name="content_china">Who We Are Ã¥ ŒÂ¸åކå² 您的æˆÂ功æ¡ ˆä¾‹ 领导团队 业务部门  Innovation å etc... </field>

I posted these documents to a different instance.

When I search the first instance for a string (copied and pasted from the web site), I get a number of hits, including the page I copied the string from. But when I run the same search on the instance with the Tika-extracted text, I get nothing.

Has anyone seen this? I believe it may have to do with encoding. In both cases the posted documents were UTF-8 compliant.

Thanks for your insights.

- ashok
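
P.S. To make the encoding suspicion concrete, here is a minimal Python sketch of the kind of damage I have in mind. It assumes the UTF-8 bytes of the page were decoded with a single-byte codec (Latin-1 here) somewhere along the way. That is only a guess, not something I have confirmed about Tika or my setup, and the function names and the sample string are made up for illustration.

```python
def misread_utf8(text: str, wrong_codec: str = "latin-1") -> str:
    """Encode text as UTF-8, then decode those bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_codec)


def repair(garbled: str, wrong_codec: str = "latin-1") -> str:
    """Reverse misread_utf8: recover the original bytes and decode as UTF-8."""
    return garbled.encode(wrong_codec).decode("utf-8")


# Hypothetical sample, loosely modelled on the content field above.
original = "Who We Are 公司历史"
garbled = misread_utf8(original)

# The ASCII prefix survives unchanged; each Chinese character turns into
# three unrelated Latin-1 characters, one per UTF-8 byte.
print(repr(garbled))

# The damage is reversible as long as it happened exactly once.
print(repair(garbled) == original)
```

If something like this happened, the garbled text would still be valid UTF-8 when posted, which would fit with the documents being accepted and yet a pasted Chinese query string finding nothing in that index.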