Not sure how up to date this is: http://www.basistech.com/customers/
I've only used their C++ products, which generally worked well for
web search with a few exceptions. According to http://www.basistech.com/knowledge-center/chinese/chinese-language-analysis.pdf, they provide Java APIs as well. Their CJK language analyzers are all morphological, AFAIK.
To process mixed-language content properly, you'll also need a Unicode- and language-aware container analyzer that automatically picks the right analyzer for each language.
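A rough sketch of that routing, using Lucene's stock PerFieldAnalyzerWrapper (the per-language field names like "title_zh" are an assumed convention, not something Solr ships with):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.cjk.CJKAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    // Route each field to the analyzer for its language; non-CJK
    // fields fall through to the default analyzer.
    public class MixedLanguageAnalyzer {
        public static Analyzer build() {
            PerFieldAnalyzerWrapper wrapper =
                new PerFieldAnalyzerWrapper(new StandardAnalyzer());
            wrapper.addAnalyzer("title_zh", new CJKAnalyzer());
            wrapper.addAnalyzer("title_ja", new CJKAnalyzer());
            wrapper.addAnalyzer("title_ko", new CJKAnalyzer());
            return wrapper;
        }
    }

Note this only routes by field name; detecting the language of mixed content automatically (e.g. from Unicode script ranges) still has to happen upstream of the field mapping.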
__Luke
On Nov 27, 2007, at 10:29 PM, Otis Gospodnetic wrote:
Eswar - I'm interested in the answer to John's question, too! :)
As for why n-grams - probably because they are free and simple, while dictionary-based segmentation would likely not be free (are there free dictionaries for Chinese, Japanese, or Korean?), and a morphological analyzer would be more work to build. That said, if you need a morphological analyzer for non-CJK languages, let me know - see my sig.
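To make the contrast concrete, here's a minimal sketch of forward maximum matching, the simplest dictionary-based segmentation scheme (class and method names are just illustrative, not from any library mentioned in this thread):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;

    // Greedy longest-match segmentation against a word list; falls
    // back to single characters when nothing in the dictionary matches.
    public class MaxMatchSegmenter {
        public static List<String> segment(String text, Set<String> dict, int maxLen) {
            List<String> words = new ArrayList<String>();
            int i = 0;
            while (i < text.length()) {
                String match = text.substring(i, i + 1); // single-char fallback
                for (int j = Math.min(i + maxLen, text.length()); j > i + 1; j--) {
                    String cand = text.substring(i, j);
                    if (dict.contains(cand)) { match = cand; break; }
                }
                words.add(match);
                i += match.length();
            }
            return words;
        }
    }

An n-gram analyzer skips the dictionary entirely and just emits overlapping character grams, which is why it's free and simple, at the cost of noisier tokens.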
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
From: John Stewart <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Tuesday, November 27, 2007 12:12:40 PM
Subject: Re: CJK Analyzers for Solr
Eswar,
What type of morphological analysis do you suspect (or know) that Google does on East Asian text? I don't think you can treat the three languages the same way here. Japanese has multi-morphemic words, but Chinese doesn't really.
jds
On Nov 27, 2007 11:54 AM, Eswar K <[EMAIL PROTECTED]> wrote:
Is there any specific reason why the CJK analyzers in Solr were chosen to be n-gram based rather than morphological, like the analysis Google reportedly does, which is considered more effective than the n-gram approach?
Regards,
Eswar
On Nov 27, 2007 7:57 AM, Eswar K <[EMAIL PROTECTED]> wrote:
Thanks, James...
How much time does it take to index 18M docs?
- Eswar
On Nov 27, 2007 7:43 AM, James liu <[EMAIL PROTECTED]> wrote:
I don't use the HYLANDA analyzer.
I use je-analyzer and have indexed at least 18M docs.
Sorry - I have only used Chinese analyzers.
On Nov 27, 2007 10:01 AM, Eswar K <[EMAIL PROTECTED]> wrote:
What is the performance of these CJK analyzers (the one in Lucene and hylanda)? We would potentially be indexing millions of documents.
James,
We will have a look at hylanda too. What about Japanese and Korean analyzers - any recommendations?
- Eswar
On Nov 27, 2007 7:21 AM, James liu <[EMAIL PROTECTED]> wrote:
I don't think n-gram is a good method for Chinese.
Lucene's CJKAnalyzer is 2-gram.
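For illustration, a minimal sketch against the Lucene 2.x token API that prints the overlapping bigrams CJKAnalyzer emits (the class name and sample text are mine):

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.cjk.CJKAnalyzer;

    // Prints the 2-gram tokens CJKAnalyzer produces for a Chinese string.
    public class CjkBigramDemo {
        public static void main(String[] args) throws Exception {
            TokenStream ts = new CJKAnalyzer()
                .tokenStream("text", new StringReader("中华人民共和国"));
            for (Token t = ts.next(); t != null; t = ts.next()) {
                System.out.println(t.termText()); // 中华, 华人, 人民, 民共, 共和, 和国
            }
        }
    }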
Eswar K:
If you need a Chinese analyzer, I recommend hylanda (www.hylanda.com) - it is the best Chinese analyzer, but it is not free.
If you want a free Chinese analyzer, maybe you can try je-analyzer; it has some problems in use.
On Nov 27, 2007 5:56 AM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
Eswar,
We've used the NGram stuff that exists in Lucene's contrib/analyzers instead of CJK. Doesn't that allow you to do everything that the Chinese and CJK analyzers do? It's been a few months since I've looked at the Chinese and CJK Analyzers, so I could be off.
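For reference, a minimal sketch of that contrib n-gram tokenizer (Lucene 2.x API; the gram sizes and sample string are just for illustration):

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.ngram.NGramTokenizer;

    // Emits every 1- and 2-character gram of the input, which for Han
    // text is roughly what the CJK analyzers produce.
    public class NGramDemo {
        public static void main(String[] args) throws Exception {
            NGramTokenizer tok =
                new NGramTokenizer(new StringReader("中华人民"), 1, 2);
            for (Token t = tok.next(); t != null; t = tok.next()) {
                System.out.println(t.termText());
            }
        }
    }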
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
From: Eswar K <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Monday, November 26, 2007 8:30:52 AM
Subject: CJK Analyzers for Solr
Hi,
Does Solr come with language analyzers for CJK? If not, can you please direct me to some good CJK analyzers?
Regards,
Eswar
--
regards
jl