Not sure how up to date this is: http://www.basistech.com/customers/

I've only used their C++ products, which generally worked well for web search, with a few exceptions. According to http://www.basistech.com/knowledge-center/chinese/chinese-language-analysis.pdf, they provide Java APIs as well. Their CJK language analyzers are all morphological, AFAIK.

To process mixed languages properly, you'll also need a Unicode- and language-aware container analyzer that automatically picks the right analyzer for each language.
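
For illustration, a minimal sketch of that container idea in Java, assuming a recent Lucene (package paths differ from the 2007 contrib layout); the detectLanguage() helper here is a crude stand-in for a real language identifier:

import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class LanguageRoutingAnalyzers {
    private final Map<String, Analyzer> byLanguage = new HashMap<>();
    private final Analyzer fallback = new StandardAnalyzer();

    public LanguageRoutingAnalyzers() {
        // CJKAnalyzer emits overlapping character bigrams, so it needs
        // no dictionary for Chinese, Japanese, or Korean text.
        Analyzer cjk = new CJKAnalyzer();
        byLanguage.put("zh", cjk);
        byLanguage.put("ja", cjk);
        byLanguage.put("ko", cjk);
    }

    public Analyzer analyzerFor(String text) {
        return byLanguage.getOrDefault(detectLanguage(text), fallback);
    }

    // Crude stub: route anything containing Han ideographs to the CJK
    // analyzer; a real system would plug in a language-ID library here.
    private String detectLanguage(String text) {
        boolean hasHan = text.codePoints().anyMatch(
                cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
        return hasHan ? "zh" : "other";
    }
}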

__Luke

On Nov 27, 2007, at 10:29 PM, Otis Gospodnetic wrote:

Eswar - I'm interested in the answer to John's question, too! :)

As for why n-grams - probably because they are free and simple, while dictionary-based approaches would likely not be free (are there free dictionaries for Chinese, Japanese, or Korean?), and a morphological analyzer would be more work. That said, if you need a morphological analyzer for non-CJK languages, let me know - see my sig.
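
The "free and simple" point is easy to see: 2-gram tokenization is just a sliding two-character window, with no dictionary required. A minimal illustration in plain Java (the example string and comments are mine):

import java.util.ArrayList;
import java.util.List;

public class BigramDemo {
    // Slide a two-character window over the text; that is all a
    // 2-gram tokenizer does.
    static List<String> bigrams(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            out.add(text.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        // "北京大学" (Peking University) yields 北京, 京大, 大学.
        // A dictionary-based segmenter would instead produce 北京 / 大学
        // (or the whole compound), never the spurious token 京大.
        System.out.println(bigrams("北京大学"));
    }
}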

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: John Stewart <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Tuesday, November 27, 2007 12:12:40 PM
Subject: Re: CJK Analyzers for Solr

Eswar,

What type of morphological analysis do you suspect (or know) that Google does on East Asian text? I don't think you can treat the three languages in the same way here. Japanese has multi-morphemic words, but Chinese doesn't really.

jds

On Nov 27, 2007 11:54 AM, Eswar K <[EMAIL PROTECTED]> wrote:
Is there any specific reason why the CJK analyzers in Solr were chosen to be n-gram based rather than morphological, as reportedly implemented at Google, since morphological analysis is considered more effective than the n-gram approach?

Regards,
Eswar




On Nov 27, 2007 7:57 AM, Eswar K <[EMAIL PROTECTED]> wrote:

Thanks, James...

How much time does it take to index 18m docs?

- Eswar


On Nov 27, 2007 7:43 AM, James liu <[EMAIL PROTECTED]> wrote:

I don't use the HYLANDA analyzer.

I use je-analyzer and have indexed at least 18M docs.

I'm sorry, I have only used a Chinese analyzer.


On Nov 27, 2007 10:01 AM, Eswar K <[EMAIL PROTECTED]> wrote:

What is the performance of these CJK analyzers (the one in Lucene and hylanda)? We would potentially be indexing millions of documents.

James,

We will have a look at hylanda too. What about Japanese and Korean analyzers - any recommendations?

- Eswar

On Nov 27, 2007 7:21 AM, James liu <[EMAIL PROTECTED]> wrote:

I don't think n-gram is a good method for Chinese.

Lucene's CJKAnalyzer is 2-gram.

Eswar K: if you need a Chinese analyzer, I recommend hylanda (www.hylanda.com); it is the best Chinese analyzer, but it is not free. If you want a free Chinese analyzer, maybe you can try je-analyzer; it has some problems in use.
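
A quick way to see the 2-gram behavior James mentions is to run Lucene's CJKAnalyzer over a short string and print the tokens it emits. A sketch assuming a recent Lucene release (the 2007-era contrib layout used different paths):

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CjkTokenDemo {
    public static void main(String[] args) throws IOException {
        try (Analyzer analyzer = new CJKAnalyzer()) {
            // Prints overlapping bigrams: 中华, 华人, 人民, 民共, 共和, 和国
            TokenStream ts = analyzer.tokenStream("body", "中华人民共和国");
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString());
            }
            ts.end();
            ts.close();
        }
    }
}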



On Nov 27, 2007 5:56 AM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:

Eswar,

We've used the n-gram stuff that exists in Lucene's contrib/analyzers instead of CJK. Doesn't that allow you to do everything that the Chinese and CJK analyzers do? It's been a few months since I've looked at the Chinese and CJK analyzers, so I could be off.
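
A minimal sketch of that generic n-gram tokenizer, assuming a recent Lucene where it lives in the analyzers-common module (in 2007 it sat under contrib/analyzers). Unlike CJKAnalyzer it is script-agnostic, so the min/max gram sizes are the main tuning knobs:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class NGramDemo {
    public static void main(String[] args) throws IOException {
        // Emit 1- and 2-character grams.
        NGramTokenizer tok = new NGramTokenizer(1, 2);
        tok.setReader(new StringReader("検索"));
        CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
        tok.reset();
        while (tok.incrementToken()) {
            System.out.println(term.toString()); // 検, 検索, 索
        }
        tok.end();
        tok.close();
    }
}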

Otis

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Eswar K <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Monday, November 26, 2007 8:30:52 AM
Subject: CJK Analyzers for Solr

Hi,

Does Solr come with language analyzers for CJK? If not, can you please direct me to some good CJK analyzers?

Regards,
Eswar






--
regards
jl