I look forward to the answers to this one.

Dennis Gearon
Signature Warning
----------------
It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'

EARTH has a Right To Life, otherwise we all die.

----- Original Message ----
From: Tod <listac...@gmail.com>
To: solr-user@lucene.apache.org
Sent: Thu, November 11, 2010 11:35:23 AM
Subject: Retrieving indexed content containing multiple languages

My Solr corpus is currently created by indexing metadata from a relational database as well as content pointed to by URLs from the database. I'm using a pretty generic out-of-the-box Solr schema. The search results are presented via an AJAX-enabled HTML page.

When I perform a search, the document title (for example) has a mix of English and Chinese characters. Everything there is fine - I can see the English and Chinese returned from a facet query on title. I can search against the title using English words it contains and I get back an expected result. I asked a Chinese friend to perform the same search using Chinese and nothing is returned.

How should I go about getting this search to work? Chinese is just one language; I'll probably need to support more in the future.

My thought is that the Chinese characters are indexed as their Unicode equivalent, so all I'll need to do is make sure the query is encoded appropriately and just perform a regular search as I would if the terms were in English. For some reason that sounds too easy.

I see there is a CJK tokenizer that would help here. Do I need that for my situation? Is there a fairly detailed tutorial on how to handle these types of language challenges?

Thanks in advance - Tod
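
For reference, the "make sure the query is encoded appropriately" idea could be sketched roughly as below. This is only an illustration under assumptions: the host, port and path in SOLR_SELECT_URL, the wt=json parameter, and the helper name build_title_query_url are all made up for the example, and the only name taken from the message is the title field. It shows the query term being percent-encoded as UTF-8 before it goes into the request URL.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Assumed endpoint, for illustration only; adjust to the actual Solr instance.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_title_query_url(term, base_url=SOLR_SELECT_URL):
    """Return a select URL whose q parameter is percent-encoded as UTF-8.

    Hypothetical helper: the field name "title" comes from the message,
    everything else here is an assumption.
    """
    params = {"q": 'title:"{}"'.format(term), "wt": "json"}
    # urlencode() encodes str values as UTF-8 before percent-escaping them.
    return base_url + "?" + urlencode(params)


url = build_title_query_url("中文")
print(url)

# Round trip: decoding the query string gives back the original characters.
decoded_q = parse_qs(urlsplit(url).query)["q"][0]
print(decoded_q)
```

Getting the encoding right only ensures the characters reach Solr intact. Whether they then match anything still depends on how the title field's analyzer tokenizes Chinese text at index and query time, which is where the CJK tokenizer question comes in.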