Yes, I've seen this bit. Near as I can tell, it's what I want, so that our Japanese users can search on a double-byte character and get back results (since they don't use spaces to delimit words, it's impossible in the default Solr configuration to find a single double-byte character somewhere "in the middle" of a sentence).
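To make that concrete: the tokenizer being discussed indexes overlapping two-character tokens, so any single character (or adjacent pair) inside an unspaced sentence lines up with something in the index. Here is a plain-Java sketch of just that segmentation idea -- an illustration only, not Lucene's actual CJKTokenizer code:

```java
// Illustration of the overlapping-bigram segmentation idea behind
// CJKTokenizer, in plain Java. This is NOT Lucene code; it is a
// self-contained sketch of the scheme.
import java.util.ArrayList;
import java.util.List;

public class BigramSketch {
    // Split a run of CJK characters into overlapping two-character
    // tokens. A lone character becomes its own token, so a query for
    // a single character somewhere mid-sentence can still match.
    static List<String> bigrams(String cjkRun) {
        List<String> tokens = new ArrayList<>();
        if (cjkRun.length() == 1) {
            tokens.add(cjkRun);
            return tokens;
        }
        for (int i = 0; i + 1 < cjkRun.length(); i++) {
            tokens.add(cjkRun.substring(i, i + 2)); // overlap by one char
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Five Japanese characters in, four overlapping bigrams out.
        System.out.println(bigrams("日本語検索")); // prints [日本, 本語, 語検, 検索]
    }
}
```

Because both the indexed text and the query get segmented the same way, a two-character query term always matches one of the stored bigrams.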
What I need are the directions for how I compile the four-line factory Java code, where in the tree I put it, and how I get Solr to recognize it. I don't know the Java commands for doing any of that. That's where I need the help. How do I compile the factory code fragment? How do I get it into the solr.war file? I found this in the archives, but don't know what to do with it, primarily because I don't know how the Java and Tomcat stuff works:

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200706.mbox/%3c200 [EMAIL PROTECTED]

...Paul

Paul Clegg, Principal Software Engineer
My Digital Life, Inc. (www.mydl.com)
NetService Ventures Group (www.nsv.com)
2108 Sand Hill Road, Menlo Park, CA 94025
Email: [EMAIL PROTECTED]
Cell: 650-619-1220

-----Original Message-----
From: Lance Norskog [mailto:[EMAIL PROTECTED]
Sent: Thursday, February 07, 2008 11:05 AM
To: solr-user@lucene.apache.org
Subject: RE: Indexing Japanese & English

Here are the comments for CJKTokenizer. First, is this what you want? Remember, there are three Japanese writing systems.

/**
 * CJKTokenizer was modified from StopTokenizer which does a decent job for
 * most European languages. It performs other token methods for double-byte
 * Characters: the token will return at each two charactors with overlap match.<br>
 * Example: "java C1C2C3C4" will be segment to: "java" "C1C2" "C2C3" "C3C4" it
 * also need filter filter zero length token ""<br>
 * for Digit: digit, '+', '#' will token as letter<br>
 * for more info on Asia language(Chinese Japanese Korean) text segmentation:
 * please search <a
 * href="http://www.google.com/search?q=word+chinese+segment">google</a>
 *
 * @author Che, Dong
 */

-----Original Message-----
From: Paul Clegg [mailto:[EMAIL PROTECTED]
Sent: Thursday, February 07, 2008 10:36 AM
To: solr-user@lucene.apache.org
Subject: Indexing Japanese & English

I hate asking stupid questions immediately after joining a mailing list, but I'm in a bit of a pinch here.
I'm using Solr/Tomcat for a Ruby on Rails project (acts_as_solr) and I've had a lot of success getting it working -- for English. The problem I'm running into is that our primary customers are actually Japanese. I've done the searching around and found the thread back in June about using Lucene's CJKAnalyzer and CJKTokenizer, but apparently I need to write my own factory or something. It looks like it's only three lines of Java code, and I can cut & paste with the best of them.

Here's the problem: I know zip, zilch, zero about Java. I just hate the language with an absolute passion. The reason I went with Solr (besides the fact it's pretty much the only real game going) is that I could avoid the Java parts by dealing directly with its XML, JSON and Ruby interfaces.

So I'm wondering if there are any "Adding CJKTokenizer to Solr for Dummies" guides out there someone can point me to -- something that tells me, pretty much step by step, what I need to do to get this configured. I saw something about unpacking the solr.war and repacking it, but since I know dinkus about Java, that really didn't mean a whole lot to me, even though I'm guessing it's probably a grand total of four commands at the unix prompt. :)

...Paul

Paul Clegg, Principal Software Engineer
My Digital Life, Inc. (www.mydl.com)
NetService Ventures Group (www.nsv.com)
2108 Sand Hill Road, Menlo Park, CA 94025
Email: [EMAIL PROTECTED]
Cell: 650-619-1220
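For anyone finding this thread in the archives later: the "factory" in question is on the order of a few lines. Below is a sketch against the Solr 1.2-era plugin API -- the package name, base class, and constructor used here are assumptions, so verify them against the source of the Solr version you actually run:

```java
// Sketch of a custom tokenizer factory for Solr (1.2-era API assumed).
// Package, base class, and CJKTokenizer constructor are assumptions --
// check your Solr/Lucene versions before relying on this.
package org.apache.solr.analysis;

import java.io.Reader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKTokenizer;

public class CJKTokenizerFactory extends BaseTokenizerFactory {
    // Solr calls this for each field value being analyzed; we hand back
    // Lucene's CJK bigram tokenizer instead of the default tokenizer.
    public TokenStream create(Reader input) {
        return new CJKTokenizer(input);
    }
}
```

The deployment steps, roughly and with assumptions about paths: unzip solr.war into a scratch directory; compile the file with javac, putting the Solr and Lucene jars from WEB-INF/lib on the classpath; drop the resulting .class file under WEB-INF/classes/org/apache/solr/analysis/; re-create the archive with jar cf solr.war from the scratch directory; then reference the class from a fieldType's analyzer in schema.xml via `<tokenizer class="org.apache.solr.analysis.CJKTokenizerFactory"/>`. Note that CJKTokenizer lives in Lucene's lucene-analyzers jar, which may need to be copied into WEB-INF/lib if your war doesn't already ship it.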