Yes, I've seen this bit.  Near as I can tell, it's what I want, so that our
Japanese users can search on a double-byte character and get back results
(since they don't use spaces to delineate words, it's impossible in the
default solr configuration to find a single double-byte character somewhere
"in the middle" of a sentence).

What I need are directions for how to compile the four-line factory Java
code, where in the tree to put it, and how to get Solr to recognize it.  I
don't know the Java commands for doing any of that; that's where I need
the help.  How do I compile the factory code fragment?  How do I get it into
the solr.war file?
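
[Editor's note: for anyone finding this in the archives, the compile-and-repack
steps being asked about generally look like the following.  The jar filenames,
versions, and paths are assumptions for a Solr 1.2-era install, not exact
commands; adjust them to your own tree.]

```shell
# Assumed jar names and locations; adjust to your Solr and Lucene versions.
javac -cp apache-solr-1.2.0.jar:lucene-core-2.2.0.jar:lucene-analyzers-2.2.0.jar \
      -d . CJKTokenizerFactory.java

# Place the compiled class under WEB-INF/classes (keeping any package
# directory structure) and update the war in place.
mkdir -p WEB-INF/classes
cp -r com WEB-INF/classes/    # "com" = the package dir javac -d produced
jar uf solr.war WEB-INF/classes
```

Tomcat will pick up the modified solr.war on its next restart (or redeploy).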

I found this in the archives, but don't know what to do with it, primarily
because I don't know how the java and tomcat stuff works:

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200706.mbox/%3c200
[EMAIL PROTECTED]

...Paul

Paul Clegg, Principal Software Engineer
My Digital Life, Inc. (www.mydl.com)
NetService Ventures Group (www.nsv.com)
2108 Sand Hill Road, Menlo Park, CA 94025
Email:  [EMAIL PROTECTED]
Cell: 650-619-1220

-----Original Message-----
From: Lance Norskog [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 07, 2008 11:05 AM
To: solr-user@lucene.apache.org
Subject: RE: Indexing Japanese & English

Here are the comments for CJKTokenizer.  First, is this what you want?
Remember, there are three Japanese writing systems.

/**
 * CJKTokenizer was modified from StopTokenizer, which does a decent job for
 * most European languages. It handles double-byte characters differently:
 * tokens are returned as overlapping two-character pairs.<br>
 * Example: "java C1C2C3C4" is segmented into: "java" "C1C2" "C2C3" "C3C4".
 * Zero-length tokens "" also need to be filtered out.<br>
 * Digits, '+' and '#' are tokenized as letters.<br>
 * For more info on Asian-language (Chinese, Japanese, Korean) text
 * segmentation, search <a
 * href="http://www.google.com/search?q=word+chinese+segment">google</a>.
 *
 * @author Che, Dong
 */
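
[Editor's note: the overlap-bigram behavior the comment describes can be
sketched in plain Java.  This is a simplified illustration, not the actual
CJKTokenizer; the double-byte test (`>= 0x2E80`) and the lowercasing of
single-byte runs are rough approximations of what the real tokenizer does.]

```java
import java.util.ArrayList;
import java.util.List;

public class BigramDemo {
    // Runs of single-byte letters/digits become one lowercased token;
    // runs of double-byte (CJK) characters become overlapping bigrams.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder ascii = new StringBuilder();
        StringBuilder cjk = new StringBuilder();
        for (char c : (text + "\u0000").toCharArray()) { // sentinel flushes runs
            boolean isAscii = c < 128 && Character.isLetterOrDigit(c);
            boolean isCjk = c >= 0x2E80; // crude "double-byte" test for this sketch
            if (!isAscii && ascii.length() > 0) {       // flush single-byte run
                out.add(ascii.toString());
                ascii.setLength(0);
            }
            if (!isCjk && cjk.length() > 0) {           // flush double-byte run
                if (cjk.length() == 1) {
                    out.add(cjk.toString());
                } else {
                    for (int i = 0; i + 1 < cjk.length(); i++) {
                        out.add(cjk.substring(i, i + 2)); // overlapping bigrams
                    }
                }
                cjk.setLength(0);
            }
            if (isAscii) ascii.append(Character.toLowerCase(c));
            if (isCjk) cjk.append(c);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [java, 東京, 京タ, タワ, ワー]
        System.out.println(tokenize("java 東京タワー"));
    }
}
```

Because "東京" appears as a token whenever those two characters are adjacent,
a query for that bigram matches even in the middle of an unspaced sentence.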

-----Original Message-----
From: Paul Clegg [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 07, 2008 10:36 AM
To: solr-user@lucene.apache.org
Subject: Indexing Japanese & English

I hate asking stupid questions immediately after joining a mailing list, but
I'm in a bit of a pinch here.


I'm using Solr/Tomcat for a Ruby on Rails project (acts_as_solr) and I've
had a lot of success getting it working -- for English.  The problem I'm
running into is that our primary customers are actually Japanese.


I've done the searching around, and found the thread back in June about
using Lucene's CJKAnalyzer and CJKTokenizer, but apparently I need to write
my own factory or something.  It looks like it's only three lines of Java
code, and I can cut & paste with the best of them.
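
[Editor's note: the factory being discussed is indeed tiny.  A sketch
against the Solr 1.2-era TokenizerFactory API follows; the package name is
made up, and the exact base class may differ between versions:]

```java
package com.example.analysis; // assumed package name

import java.io.Reader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKTokenizer;
import org.apache.solr.analysis.BaseTokenizerFactory;

// Wraps Lucene's CJKTokenizer so it can be referenced from schema.xml.
public class CJKTokenizerFactory extends BaseTokenizerFactory {
    public TokenStream create(Reader input) {
        return new CJKTokenizer(input);
    }
}
```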


Here's the problem:  I know zip, zilch, zero about Java.  I just hate the
language with an absolute passion.  The reason I went with Solr (besides the
fact it's pretty much the only real game going) is that I could avoid the
Java parts by directly dealing with its XML, JSON and Ruby interfaces.


So I'm wondering if there are any "Adding CJKTokenizer to Solr for Dummies"
guides out there someone can point me to, to tell me, pretty much
step-by-step, what I need to do to get this configured.  I saw something
about unpacking the solr.war and repacking it, but, since I know dinkus
about Java, that really didn't mean a whole lot to me, even though I'm
guessing it's probably a grand total of four commands at the unix prompt. :)
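
[Editor's note: once such a factory class is on Solr's classpath, the
remaining wiring is a schema.xml fragment along these lines; the field-type
name and the factory's package are assumptions:]

```xml
<fieldType name="text_cjk" class="solr.TextField">
  <analyzer>
    <tokenizer class="com.example.analysis.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>
```

Fields declared with this type then tokenize Japanese text into overlapping
bigrams at both index and query time, so no Java is needed beyond the
factory itself.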


.Paul


Paul Clegg, Principal Software Engineer
My Digital Life, Inc. (www.mydl.com)
NetService Ventures Group (www.nsv.com)
2108 Sand Hill Road, Menlo Park, CA 94025
Email:  [EMAIL PROTECTED]
Cell: 650-619-1220

