On 11/11/2010 3:24 PM, Dennis Gearon wrote:
I look forward to the answers to this one.

Well, it seems it was as easy as adding the CJKTokenizerFactory:

<fieldtype name="text_cjk" class="solr.TextField" positionIncrementGap="100">
 <analyzer>
  <tokenizer class="solr.CJKTokenizerFactory"/>
 </analyzer>
</fieldtype>
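
For the new type to actually take effect, the fields you query have to use it (or be copied into a field that does). A minimal sketch of what that might look like in schema.xml - the 'title' field name here is just an assumption, use whatever your schema defines:

<!-- hypothetical field using the CJK-aware type defined above -->
<field name="title" type="text_cjk" indexed="true" stored="true"/>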


Once I did that and reindexed, I could search for both English and Chinese using the default 'text' field. The next hurdle was getting the JavaScript to cooperate: the Chinese characters were getting corrupted on the way to the AJAX call against the Solr server.

As it turned out, I was performing a POST to Solr using the jQuery .ajax API call. Apparently, when executing a POST you need to make sure the characters entered into the form's input field are converted to Unicode escapes (\u7968, for example) prior to the AJAX call to Solr. Conversely, if executing a GET you need to convert the characters to their UTF-8 percent-encoded form (%E7%A5%A8).
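
For anyone who wants to see it in code, here's a minimal sketch of the two encodings with jQuery's .ajax() call. The '/solr/select' URL, the '#q' input selector, and the toUnicodeEscapes() helper are illustrative placeholders, not from my actual app:

// Query text entered by the user (the '#q' selector is hypothetical).
var term = jQuery('#q').val();

// GET: encodeURIComponent() percent-encodes the UTF-8 bytes,
// so U+7968 becomes %E7%A5%A8.
jQuery.ajax({
  url: '/solr/select?q=title:' + encodeURIComponent(term) + '&wt=json',
  type: 'GET',
  dataType: 'json',
  success: function (data) { /* render data.response.docs here */ }
});

// POST: escape non-ASCII characters to \uXXXX first,
// so U+7968 becomes the literal string \u7968.
function toUnicodeEscapes(s) {
  return s.replace(/[\u0080-\uffff]/g, function (c) {
    return '\\u' + ('0000' + c.charCodeAt(0).toString(16)).slice(-4);
  });
}

jQuery.ajax({
  url: '/solr/select',
  type: 'POST',
  data: { q: 'title:' + toUnicodeEscapes(term), wt: 'json' },
  dataType: 'json',
  success: function (data) { /* render data.response.docs here */ }
});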

So now my customers are happily finding the appropriate documents using English and Chinese.

If someone could check my math, I'd appreciate it. If it looks reasonable and there's nothing else written about it on the wiki, I'll create a tutorial to give everybody else a leg up.


- Tod



----- Original Message ----
From: Tod <listac...@gmail.com>
To: solr-user@lucene.apache.org
Sent: Thu, November 11, 2010 11:35:23 AM
Subject: Retrieving indexed content containing multiple languages

My Solr corpus is currently created by indexing metadata from a relational
database as well as content pointed to by URLs from the database. I'm using a
pretty generic, out-of-the-box Solr schema. The search results are presented via
an AJAX-enabled HTML page.

When I perform a search, the document title (for example) has a mix of English
and Chinese characters. Everything there is fine - I can see the English and
Chinese returned from a facet query on title. I can search against the title
using English words it contains and I get back the expected result. I asked a
Chinese friend to perform the same search using Chinese, and nothing is returned.

How should I go about getting this search to work? Chinese is just one
language; I'll probably need to support more in the future.

My thought is that the Chinese characters are indexed as their Unicode
equivalents, so all I'll need to do is make sure the query is encoded
appropriately and then perform a regular search as I would if the terms were in
English. For some reason that sounds too easy.

I see there is a CJK tokenizer that would help here.  Do I need that for my
situation?  Is there a fairly detailed tutorial on how to handle these types of
language challenges?


Thanks in advance - Tod


