Retrieving indexed content containing multiple languages

Tod Thu, 11 Nov 2010 11:38:11 -0800

My Solr corpus is currently created by indexing metadata from a relational database as well as content pointed to by URLs from the database. I'm using a pretty generic out-of-the-box Solr schema. The search results are presented via an AJAX-enabled HTML page.

When I perform a search, the document title (for example) has a mix of English and Chinese characters. Everything there is fine - I can see the English and Chinese returned from a facet query on title. I can search against the title using English words it contains and I get back the expected result. I asked a Chinese friend to perform the same search using Chinese and nothing is returned.

How should I go about getting this search to work? Chinese is just one language; I'll probably need to support more in the future.

My thought is that the Chinese characters are indexed as their Unicode equivalents, so all I'll need to do is make sure the query is encoded appropriately and then perform a regular search, as I would if the terms were in English. For some reason that sounds too easy.
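
To make "encoded appropriately" concrete, here is a minimal sketch in Python of building a query URL whose Chinese term is percent-encoded as UTF-8. The host, port, path, and the helper name build_query_url are assumptions for illustration only, not details of my actual setup.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; host, port, and path are assumptions.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_query_url(field, term):
    """Return a select URL with the query percent-encoded as UTF-8."""
    params = {"q": f"{field}:{term}", "wt": "json"}
    # urlencode() encodes str values as UTF-8 by default before escaping them.
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_query_url("title", "中文")
print(url)
# http://localhost:8983/solr/select?q=title%3A%E4%B8%AD%E6%96%87&wt=json
```

Encoding the request this way only covers the client side. The servlet container in front of Solr also has to decode the URL as UTF-8, and the field's analyzer still decides how the Chinese text is split into searchable terms.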

I see there is a CJK tokenizer that would help here. Do I need that for my situation? Is there a fairly detailed tutorial on how to handle these types of language challenges?
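
For context, CJK tokenizers generally index overlapping pairs of characters (bigrams), because Chinese text has no spaces to split on. The sketch below is a rough Python illustration of that idea only, not Solr's implementation. The function names and the single Unicode range it checks are simplifying assumptions.

```python
from itertools import groupby


def char_class(ch):
    """Classify a character; only the main CJK ideograph block is checked (an assumption)."""
    if "\u4e00" <= ch <= "\u9fff":
        return "cjk"
    if ch.isalnum():
        return "word"
    return "other"


def tokenize(text):
    """Lowercase non-CJK words and split CJK runs into overlapping bigrams."""
    tokens = []
    for kind, group in groupby(text, key=char_class):
        run = "".join(group)
        if kind == "word":
            tokens.append(run.lower())
        elif kind == "cjk":
            if len(run) == 1:
                tokens.append(run)  # a lone character is kept as-is
            else:
                tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
        # whitespace and punctuation ("other") are dropped
    return tokens


print(tokenize("Solr 中文搜索 guide"))
# ['solr', '中文', '文搜', '搜索', 'guide']
```

If the title field is analyzed by a tokenizer that does not split Chinese text in a way the query side also uses, the indexed terms and the query terms will not line up, which would explain an English search matching while a Chinese one returns nothing.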



Thanks in advance - Tod
