Not sure how up to date this is: http://www.basistech.com/customers/
I've only used their C++ products, which generally worked well for
web search with a few exceptions. According to http://www.basistech.com/knowledge-center/chinese/chinese-language-analysis.pdf, they provide Java APIs as well. Their CJK language analyzers are all morphological, AFAIK.
To process mixed-language content properly, you'll also need a Unicode- and language-aware container analyzer that automatically picks the right analyzer for each language.
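A rough sketch of that routing, using Lucene's stock PerFieldAnalyzerWrapper (the per-language field names like "title_zh" are an assumed convention, not something Solr ships with):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.cjk.CJKAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    // Route each field to the analyzer for its language; non-CJK
    // fields fall through to the default analyzer.
    public class MixedLanguageAnalyzer {
        public static Analyzer build() {
            PerFieldAnalyzerWrapper wrapper =
                new PerFieldAnalyzerWrapper(new StandardAnalyzer());
            wrapper.addAnalyzer("title_zh", new CJKAnalyzer());
            wrapper.addAnalyzer("title_ja", new CJKAnalyzer());
            wrapper.addAnalyzer("title_ko", new CJKAnalyzer());
            return wrapper;
        }
    }

Note this only routes by field name; detecting the language of mixed content automatically (e.g. from Unicode script ranges) still has to happen upstream of the field mapping.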
__Luke
On Nov 27, 2007, at 10:29 PM, Otis Gospodnetic wrote:
Eswar - I'm interested in the answer to John's question, too! :)
As for why n-grams - probably because they are free and simple, while dictionary-based segmentation would likely not be free (are there free dictionaries for Chinese, Japanese, or Korean?), and a morphological analyzer would be more work to build. That said, if you need a morphological analyzer for non-CJK languages, let me know - see my sig.
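To make the contrast concrete, here's a minimal sketch of forward maximum matching, the simplest dictionary-based segmentation scheme (class and method names are just illustrative, not from any library mentioned in this thread):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;

    // Greedy longest-match segmentation against a word list; falls
    // back to single characters when nothing in the dictionary matches.
    public class MaxMatchSegmenter {
        public static List<String> segment(String text, Set<String> dict, int maxLen) {
            List<String> words = new ArrayList<String>();
            int i = 0;
            while (i < text.length()) {
                String match = text.substring(i, i + 1); // single-char fallback
                for (int j = Math.min(i + maxLen, text.length()); j > i + 1; j--) {
                    String cand = text.substring(i, j);
                    if (dict.contains(cand)) { match = cand; break; }
                }
                words.add(match);
                i += match.length();
            }
            return words;
        }
    }

An n-gram analyzer skips the dictionary entirely and just emits overlapping character grams, which is why it's free and simple, at the cost of noisier tokens.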
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
From: John Stewart <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Tuesday, November 27, 2007 12:12:40 PM
Subject: Re: CJK Analyzers for Solr
Eswar,
What type of morphological analysis do you suspect (or know) that Google does on East Asian text? I don't think you can treat the three languages the same way here. Japanese has multi-morphemic words, but Chinese doesn't really.
jds
On Nov 27, 2007 11:54 AM, Eswar K <[EMAIL PROTECTED]> wrote:
Is there any specific reason why the CJK analyzers in Solr were chosen to be n-gram based rather than morphological, like the analysis Google reportedly does, which is considered more effective than the n-gram approach?
Regards,
Eswar
On Nov 27, 2007 7:57 AM, Eswar K <[EMAIL PROTECTED]> wrote:
Thanks, James...
How much time does it take to index 18M docs?
- Eswar
On Nov 27, 2007 7:43 AM, James liu <[EMAIL PROTECTED]> wrote:
I don't use the HYLANDA analyzer.
I use je-analyzer and have indexed at least 18M docs.
Sorry - I have only used Chinese analyzers.
On Nov 27, 2007 10:01 AM, Eswar K <[EMAIL PROTECTED]> wrote:
What is the performance of these CJK analyzers (the one in Lucene and hylanda)? We would potentially be indexing millions of documents.
James,
We will have a look at hylanda too. What about Japanese and Korean analyzers - any recommendations?
- Eswar
On Nov 27, 2007 7:21 AM, James liu <[EMAIL PROTECTED]> wrote:
I don't think n-gram is a good method for Chinese.
Lucene's CJKAnalyzer is 2-gram.
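For illustration, a minimal sketch against the Lucene 2.x token API that prints the overlapping bigrams CJKAnalyzer emits (the class name and sample text are mine):

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.cjk.CJKAnalyzer;

    // Prints the 2-gram tokens CJKAnalyzer produces for a Chinese string.
    public class CjkBigramDemo {
        public static void main(String[] args) throws Exception {
            TokenStream ts = new CJKAnalyzer()
                .tokenStream("text", new StringReader("中华人民共和国"));
            for (Token t = ts.next(); t != null; t = ts.next()) {
                System.out.println(t.termText()); // 中华, 华人, 人民, 民共, 共和, 和国
            }
        }
    }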
Eswar K:
If you need a Chinese analyzer, I recommend hylanda (www.hylanda.com) - it is the best Chinese analyzer, but it is not free.
If you want a free Chinese analyzer, maybe you can try je-analyzer; it has some problems in use.
On Nov 27, 2007 5:56 AM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
Eswar,
We've used the NGram stuff that exists in Lucene's contrib/analyzers instead of CJK. Doesn't that allow you to do everything that the Chinese and CJK analyzers do? It's been a few months since I've looked at the Chinese and CJK Analyzers, so I could be off.
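For reference, a minimal sketch of that contrib n-gram tokenizer (Lucene 2.x API; the gram sizes and sample string are just for illustration):

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.ngram.NGramTokenizer;

    // Emits every 1- and 2-character gram of the input, which for Han
    // text is roughly what the CJK analyzers produce.
    public class NGramDemo {
        public static void main(String[] args) throws Exception {
            NGramTokenizer tok =
                new NGramTokenizer(new StringReader("中华人民"), 1, 2);
            for (Token t = tok.next(); t != null; t = tok.next()) {
                System.out.println(t.termText());
            }
        }
    }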
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
From: Eswar K <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Monday, November 26, 2007 8:30:52 AM
Subject: CJK Analyzers for Solr
Hi,
Does Solr come with language analyzers for CJK? If not, can you please direct me to some good CJK analyzers?
Regards,
Eswar
--
regards
jl