Re: Memory use with sorting problem

2007-11-27 Thread Chris Laux
Hi again,

in the meantime I discovered the use of jmap (I'm not a Java programmer)
and found that all the memory was being used up by String and char[]
objects.

The Lucene docs have the following to say on sorting memory use:

> For String fields, the cache is larger: in addition to the above
array, the value of every term in the field is kept in memory. If there
are many unique terms in the field, this could be quite large.

(http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/Sort.html)

I am sorting on the "slong" schema type, which is of course stored as a
string. The above quote seems to indicate that it is possible for a
field not to be a string for the purposes of the sort, while I took it
from LiA that everything is a string to Lucene.

What can I do to make sure the additional memory is not used for every
unique term? I.e., how can I keep the slong from being a "String field"
for the purposes of sorting?
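If I read the docs right, the fix would be to declare the sort field with a
plain numeric type instead of slong, so the FieldCache holds native longs
rather than one String per unique term. Something like this sketch is what I
have in mind (type class and attributes guessed, not tested against my schema):

```xml
<!-- sketch only: guessed, not tested -->
<fieldType name="long" class="solr.LongField" omitNorms="true"/>

<field name="created" type="long" indexed="true" stored="true"
       multiValued="false"/>
```

The trade-off, as I understand it, would be losing the string-ordered range
queries that the sortable slong type exists to provide.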

Cheers,
Chris


Chris Laux wrote:
> Hi all,
> 
> I've been struggling with this problem for over a month now, and
> although memory issues have been discussed often, I don't seem to be
> able to find a fitting solution.
> 
> The index is merely 1.5 GB large, but memory use quickly fills out the
> heap max of 1 GB on a 2 GB machine. This then works fine until
> auto-warming starts. Switching the latter off altogether is unattractive
> as it leads to response times of up to 30 s. When auto-warming starts, I
> get this error:
> 
>> SEVERE: Error during auto-warming of
>> key:org.apache.solr.search.QueryResultKey@e0b93139:
>> java.lang.OutOfMemoryError: Java heap space
> 
> Now when I reduce the size of caches (to a fraction of the default
> settings) and number of warming Searchers (to 2), memory use is not
> reduced and the problem stays. Only deactivating auto-warming will help.
> When I set the heap size limit higher (and go into swap space), all the
> extra memory seems to be used up right away, independently from
> auto-warming.
> 
> This all seems to be closely connected to sorting by a numerical field,
> as switching this off does make memory use a lot more friendly.
> 
> Is it normal to need that much memory for such a small index?
> 
> I suspect the problem is in Lucene, would it be better to post on their
> list?
> 
> Does anyone know a better way of getting the sorting done?
> 
> Thanks in advance for your help,
> 
> Chris
> 
> 
> This is the field setup in schema.xml:
> 
>  multiValued="false" />
>  multiValued="false" />
> 
> 
> 
> And this is a sample query:
> 
> select/?q=solr&start=0&rows=20&sort=created+desc
> 
> 



Re: Inconsistent results in Solr Search with Lucene Index

2007-11-27 Thread Grant Ingersoll
Have you set up your Analyzers, etc. so they correspond to the exact
ones you were using in Lucene? Under the Solr Admin you can try the
analysis tool to see how your index and queries are treated. What
happens if you do a *:* query from the Admin query screen?

If your index is reasonably sized, I would just reindex, but you
shouldn't have to do this.


-Grant

On Nov 27, 2007, at 8:18 AM, trysteps wrote:


Hi All,
I am trying to use Solr Search with Lucene Index so just set all  
schema.xml configs like tokenize and field necessaries.

But I can not get results like Lucene.
For example ,
search for 'dog' returns lots of results with lucene but in Solr, I  
can't get any result. But search with 'dog*' returns same result  
with Lucene.
What is the best way to integrate Lucene index to Solr, are there  
any well-documented sources?

Thanks for your Attention,
Trysteps



--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ





Re: CJK Analyzers for Solr

2007-11-27 Thread Eswar K
Is there any specific reason why the CJK analyzers in Solr were chosen to be
n-gram based rather than morphological analyzers, of the kind reportedly used
by Google, which are considered more effective than n-gram ones?

Regards,
Eswar



On Nov 27, 2007 7:57 AM, Eswar K <[EMAIL PROTECTED]> wrote:

> thanks james...
>
> How much time does it take to index 18m docs?
>
> - Eswar
>
>
> On Nov 27, 2007 7:43 AM, James liu <[EMAIL PROTECTED] > wrote:
>
> > I don't use the HYLANDA analyzer.
> >
> > I use je-analyzer and am indexing at least 18M docs.
> >
> > I'm sorry, I have only used Chinese analyzers.
> >
> >
> > On Nov 27, 2007 10:01 AM, Eswar K <[EMAIL PROTECTED]> wrote:
> >
> > > What is the performance of these CJK analyzers (the one in Lucene and
> > > hylanda)?
> > > We would potentially be indexing millions of documents.
> > >
> > > James,
> > >
> > > We would have a look at hylanda too. What about Japanese and Korean
> > > analyzers, any recommendations?
> > >
> > > - Eswar
> > >
> > > On Nov 27, 2007 7:21 AM, James liu <[EMAIL PROTECTED]> wrote:
> > >
> > > > I don't think n-gram is a good method for Chinese.
> > > >
> > > > CJKAnalyzer in Lucene is 2-gram.
> > > >
> > > > Eswar K:
> > > >  If it is a Chinese analyzer you need, I recommend hylanda
> > > > (www.hylanda.com); it is the best Chinese analyzer, but it is not free.
> > > >  If you want a free Chinese analyzer, maybe you can try je-analyzer;
> > > > it has some problems in use.
> > > >
> > > >
> > > >
> > > > On Nov 27, 2007 5:56 AM, Otis Gospodnetic <
> > [EMAIL PROTECTED]>
> > > > wrote:
> > > >
> > > > > Eswar,
> > > > >
> > > > > We've used the NGram stuff that exists in Lucene's contrib/analyzers
> > > > > instead of CJK.  Doesn't that allow you to do everything that the
> > > > > Chinese and CJK analyzers do?  It's been a few months since I've
> > > > > looked at the Chinese and CJK Analyzers, so I could be off.
> > > > >
> > > > > Otis
> > > > >
> > > > > --
> > > > > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > > > >
> > > > > - Original Message 
> > > > > From: Eswar K <[EMAIL PROTECTED]>
> > > > > To: solr-user@lucene.apache.org
> > > > > Sent: Monday, November 26, 2007 8:30:52 AM
> > > > > Subject: CJK Analyzers for Solr
> > > > >
> > > > > Hi,
> > > > >
> > > > > Does Solr come with Language analyzers for CJK? If not, can you
> > please
> > > > > direct me to some good CJK analyzers?
> > > > >
> > > > > Regards,
> > > > > Eswar
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > regards
> > > > jl
> > > >
> > >
> >
> >
> >
> > --
> > regards
> > jl
> >
>
>


Re: CJK Analyzers for Solr

2007-11-27 Thread John Stewart
Eswar,

What type of morphological analysis do you suspect (or know) that
Google does on east asian text?  I don't think you can treat the three
languages in the same way here.  Japanese has multi-morphemic words,
but Chinese doesn't really.

jds

On Nov 27, 2007 11:54 AM, Eswar K <[EMAIL PROTECTED]> wrote:
> Is there any specific reason why the CJK analyzers in Solr were chosen to be
> n-gram based instead of it being a morphological analyzer which is kind of
> implemented in Google as it considered to be more effective than the n-gram
> ones?
>
> Regards,
> Eswar


Combining SOLR and JAMon to monitor query execution times from a browser

2007-11-27 Thread Siegfried Goeschl

Hi folks,

working on a closed source project for an IP-concerned company is not
always fun ... we combined SOLR with JAMon
(http://jamonapi.sourceforge.net/) to keep an eye on the query times, and
this might be of general interest:

+) JAMon comes with a ready-to-use ServletFilter
+) we extended this implementation to keep track of queries issued by a
customer and the requested domain objects, e.g. "artist", "album", "track"
+) this allows us to track the execution times and their distribution,
to quickly find long-running queries from a web browser without having
access to the access.log
+) a small presentation can be found at
http://people.apache.org/~sgoeschl/presentations/jamon-20070717.pdf

+) if it is of general interest, I can rewrite the code as a contribution
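For the curious: wiring the filter in is plain web.xml plumbing, roughly as
below. The filter class name is from memory, so check the class that actually
ships in your JAMon jar before copying this.

```xml
<!-- web.xml sketch; verify the filter class against your JAMon version -->
<filter>
  <filter-name>jamonFilter</filter-name>
  <filter-class>com.jamonapi.JAMonFilter</filter-class>
</filter>
<filter-mapping>
  <filter-name>jamonFilter</filter-name>
  <url-pattern>/select/*</url-pattern>
</filter-mapping>
```

Our extended filter is registered the same way, just with our own class in
filter-class.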

Cheers,

Siegfried Goeschl


Re: Combining SOLR and JAMon to monitor query execution times from a browser

2007-11-27 Thread Matthew Runo
I'd be interested in seeing more logging in the admin section! I saw
that there is QPS in 1.3, which is great, but it'd be wonderful to see
more.


--Matthew Runo

On Nov 27, 2007, at 9:18 AM, Siegfried Goeschl wrote:







Re: CJK Analyzers for Solr

2007-11-27 Thread Mike Klaas

On 27-Nov-07, at 8:54 AM, Eswar K wrote:

> Is there any specific reason why the CJK analyzers in Solr were chosen to be
> n-gram based instead of it being a morphological analyzer which is kind of
> implemented in Google as it considered to be more effective than the n-gram
> ones?


The CJK analyzers are just wrappers of the analyzers already available
in Lucene.  I suspect (but am not sure) that the core devs aren't fluent
in the issues surrounding the analysis of Asian text (I certainly am
not).  Any improvements in this regard would be greatly appreciated.


-Mike


two solr instances?

2007-11-27 Thread Jörg Kiegeland
Is it possible to deploy solr.war once to Tomcat (which sits behind an
Apache HTTP Server in my configuration) and have that one deployment
manage two Solr indexes?

I have to make two different Solr indexes (with different schema.xml
files) accessible over the web. If the above architecture is not
possible, is there any other solution?


Re: CJK Analyzers for Solr

2007-11-27 Thread Walter Underwood
Dictionaries are surprisingly expensive to build and maintain and
bi-gram is surprisingly effective for Chinese. See this paper:

   http://citeseer.ist.psu.edu/kwok97comparing.html

I expect that n-gram indexing would be less effective for Japanese
because it is an inflected language. Korean is even harder. It might
work to break Korean into the phonetic subparts and use n-gram on
those.
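The bigram technique itself is tiny. An illustrative sketch in Python, showing
the general idea rather than the actual CJKAnalyzer code:

```python
def cjk_bigrams(text):
    """Emit overlapping 2-grams from a run of CJK characters.

    Illustrative sketch of the bigram indexing idea, not the actual
    Lucene CJKAnalyzer implementation.
    """
    # A run of n contiguous characters yields n-1 overlapping bigrams,
    # so any 2-character query term matches wherever those characters
    # occur adjacently in a document.
    return [text[i:i + 2] for i in range(len(text) - 1)]

print(cjk_bigrams("中华人民"))  # ['中华', '华人', '人民']
```

Because every adjacent pair is indexed, no dictionary is needed, which is
exactly why the relevance holds up while the highlighted fragments look odd.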

You should not do term highlighting with any of the n-gram methods.
The relevance can be very good, but the highlighting just looks dumb.

wunder

On 11/27/07 8:54 AM, "Eswar K" <[EMAIL PROTECTED]> wrote:

> Is there any specific reason why the CJK analyzers in Solr were chosen to be
> n-gram based instead of it being a morphological analyzer which is kind of
> implemented in Google as it considered to be more effective than the n-gram
> ones?
> 
> Regards,
> Eswar



Re: two solr instances?

2007-11-27 Thread Chris Laux
Have you looked at this page on the wiki:
http://wiki.apache.org/solr/SolrTomcat#head-024d7e11209030f1dbcac9974e55106abae837ac

That should get you started.
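The short version from that page: deploy the same solr.war under two Tomcat
context fragments, each with its own solr/home. Roughly (paths are examples):

```xml
<!-- $CATALINA_HOME/conf/Catalina/localhost/solr1.xml; paths are examples -->
<Context docBase="/opt/solr/solr.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String"
               value="/opt/solr1/home" override="true"/>
</Context>
```

A second file (solr2.xml) would be identical apart from pointing solr/home at
the directory holding the other schema.xml.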

-Chris


Jörg Kiegeland wrote:
> Is it possible to deploy solr.war once to Tomcat (which is on top of an
> Apache HTTP Server in my configuration) which then can manage two Solr
> indexes?
> 
> I have to make accessible two different Solr indexes (both have
> different schema.xml files) over the web. If the above architecture is
> not possible: is there any other solution?
> 



RE: LSA Implementation

2007-11-27 Thread Norskog, Lance
WordNet itself is English-only. There are various ontology projects for
it.

http://www.globalwordnet.org/ is a separate world language database
project. I found it at the bottom of the WordNet wikipedia page. Thanks
for starting me on the search!

Lance 

-Original Message-
From: Eswar K [mailto:[EMAIL PROTECTED] 
Sent: Monday, November 26, 2007 6:50 PM
To: solr-user@lucene.apache.org
Subject: Re: LSA Implementation

The languages also include CJK :) among others.

- Eswar

On Nov 27, 2007 8:16 AM, Norskog, Lance <[EMAIL PROTECTED]> wrote:

> The WordNet project at Princeton (USA) is a large database of
synonyms.
> If you're only working in English this might be useful instead of 
> running your own analyses.
>
> http://en.wikipedia.org/wiki/WordNet
> http://wordnet.princeton.edu/
>
> Lance
>
> -Original Message-
> From: Eswar K [mailto:[EMAIL PROTECTED]
> Sent: Monday, November 26, 2007 6:34 PM
> To: solr-user@lucene.apache.org
> Subject: Re: LSA Implementation
>
> In addition to recording which keywords a document contains, the
> method examines the document collection as a whole, to see which other
> documents contain some of those same words. The algorithm should consider
> documents that have many words in common to be semantically close, and
> ones with few words in common to be semantically distant. This simple
> method correlates surprisingly well with how a human being, looking at
> content, might classify a document collection. Although the algorithm
> doesn't understand anything about what the words *mean*, the patterns
> it notices can make it seem astonishingly intelligent.
>
> When you search such an index, the search engine looks at the
> similarity values it has calculated for every content word and
> returns the documents that it thinks best fit the query, because two
> documents may be semantically very close even if they do not share a
> particular keyword.
>
> Where a plain keyword search will fail if there is no exact match,
> this algorithm will often return relevant documents that don't contain
> the keyword at all.
> - Eswar
>
> On Nov 27, 2007 7:51 AM, Marvin Humphrey <[EMAIL PROTECTED]>
wrote:
>
> >
> > On Nov 26, 2007, at 6:06 PM, Eswar K wrote:
> >
> > > We essentially are looking at having an implementation for doing 
> > > search which can return documents having conceptually similar 
> > > words without necessarily having the original word searched for.
> >
> > Very challenging.  Say someone searches for "LSA" and hits an 
> > archived
>
> > version of the mail you sent to this list.  "LSA" is a reasonably 
> > discriminating term.  But so is "Eswar".
> >
> > If you knew that the original term was "LSA", then you might look 
> > for documents near it in term vector space.  But if you don't know 
> > the original term, only the content of the document, how do you know

> > whether you should look for docs near "lsa" or "eswar"?
> >
> > Marvin Humphrey
> > Rectangular Research
> > http://www.rectangular.com/
> >
> >
> >
>


Related Search

2007-11-27 Thread William Silva
Hi,
What is the best way to implement a related search like CNET with SOLR ?
Ex.: Searching for "tv" the related searches are: lcd tv, lcd, hdtv,
vizio, plasma tv, panasonic, gps, plasma
Thanks,
William.


Re: Related Search

2007-11-27 Thread Cool Coder
Take a look at this thread:
http://www.gossamer-threads.com/lists/lucene/java-user/54996

There was a need to get all related topics for any selected topic. I
took help from the Lucene sandbox WordNet project to get all synonyms of
user-selected topics. I am not sure whether the WordNet project would
help you, as you are looking for product synonyms. In your case, you
might need to maintain a vector of product synonyms, e.g. if a user
searches for TV, internally you would search for lcd tv, lcd, hdtv, etc.

Take a look at www.ajaxtrend.com to see how all related topics are
displayed; I keep refining the related query search as the site evolves.
This is just a prototype.

- BR
William Silva <[EMAIL PROTECTED]> wrote:
  Hi,
What is the best way to implement a related search like CNET with SOLR ?
Ex.: Searching for "tv" the related searches are: lcd tv, lcd, hdtv,
vizio, plasma tv, panasonic, gps, plasma
Thanks,
William.
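The "vector of product synonyms" could be as simple as a lookup table that
expands the query before it is sent to Solr. A toy sketch, where the table
entries are just the terms from William's example:

```python
# Toy query expansion: map a head term to related searches and OR them
# into the query sent to Solr. The table would be hand-maintained or
# mined from query logs; these entries are just William's example.
RELATED = {
    "tv": ["lcd tv", "lcd", "hdtv", "plasma tv"],
}

def expand_query(q):
    related = RELATED.get(q.lower(), [])
    if not related:
        return q
    # Quote multi-word entries so Solr treats them as phrase queries.
    terms = [q] + ['"%s"' % r if " " in r else r for r in related]
    return " OR ".join(terms)

print(expand_query("tv"))
# tv OR "lcd tv" OR lcd OR hdtv OR "plasma tv"
```

Mining the table from query logs (queries issued in the same session) is how
the CNET-style "related searches" are usually built.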


   

Solr and nutch, for reading a nutch index

2007-11-27 Thread bbrown
I couldn't tell if this was asked before, but I want to perform a Nutch
crawl without any Solr plugin, simply writing to some index directory,
and then ideally use Solr for searching. I am assuming this is possible?

--
Berlin Brown
[berlin dot brown at gmail dot com]
http://botspiritcompany.com/botlist/?



Re: Solr and nutch, for reading a nutch index

2007-11-27 Thread Brian Whitman


On Nov 27, 2007, at 6:08 PM, bbrown wrote:

> I couldn't tell if this was asked before.  But I want to perform a nutch crawl
> without any solr plugin which will simply write to some index directory.  And
> then ideally I would like to use solr for searching?  I am assuming this is
> possible?

Yes, this is quite possible. You need to have a Solr schema that mimics
the Nutch schema; see Sami's solrindexer for an example. Once you've got
that schema, simply set the data dir in your solrconfig to the Nutch
index location and you'll be set.
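i.e. something roughly like this in solrconfig.xml. The path is an example,
and dataDir normally expects the directory that *contains* the index/
directory, so the exact value depends on your crawl layout:

```xml
<!-- solrconfig.xml sketch: point Solr's data dir at the Nutch crawl
     output. Path is an example only. -->
<dataDir>/opt/nutch/crawl</dataDir>
```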





Re: LSA Implementation

2007-11-27 Thread Grant Ingersoll
Using WordNet may require having some type of disambiguation approach,
otherwise you can end up with a lot of "synonyms". I would also look
into how much coverage there is for non-English languages.

If you have the resources, you may be better off developing/finding
your own synonym/concept list based on your genres. You may also look
into other approaches for assigning concepts offline and adding them
to the document.
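For the offline synonym/concept-list route, Solr's stock synonym filter can
inject such a list at index time. A rough sketch, where the file name, the
mappings and the surrounding analyzer chain are all made up for illustration:

```xml
<!-- schema.xml sketch; concepts.txt and the chain are illustrative -->
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="concepts.txt"
          ignoreCase="true" expand="true"/>
</analyzer>
```

where concepts.txt maps your genre terms to concept labels, e.g.
`lcd, plasma => flat-panel`.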


-Grant

On Nov 27, 2007, at 3:21 PM, Norskog, Lance wrote:





--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ





Re: Combining SOLR and JAMon to monitor query execution times from a browser

2007-11-27 Thread Norberto Meijome
On Tue, 27 Nov 2007 18:18:16 +0100
Siegfried Goeschl <[EMAIL PROTECTED]> wrote:


Thanks Siegfried,

I am further interested in plugging this information into something
like Nagios, Cacti, Zenoss, bigsister, OpenView or your monitoring
system of choice, but I haven't had much time to look into this yet.
How does JAMon compare to JMX
(http://java.sun.com/javase/technologies/core/mntr-mgmt/javamanagement/)?

cheers,
B

_
{Beto|Norberto|Numard} Meijome

There are no stupid questions, but there are a LOT of inquisitive idiots.

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.


Re: Solr and nutch, for reading a nutch index

2007-11-27 Thread Norberto Meijome
On Tue, 27 Nov 2007 18:12:13 -0500
Brian Whitman <[EMAIL PROTECTED]> wrote:

> 
> yes, this is quite possible. You need to have a solr schema that  
> mimics the nutch schema, see sami's solrindexer for an example. Once  
> you've got that schema, simply set the data dir in your solrconfig to  
> the nutch index location and you'll be set.

I think you should keep an eye on the versions of Lucene library used by both 
Nutch + Solr - differences at this layer *could* make them incompatible - but I 
am not an expert...
B

_
{Beto|Norberto|Numard} Meijome

"Against logic there is no armor like ignorance."
  Laurence J. Peter

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.


Re: Solr and nutch, for reading a nutch index

2007-11-27 Thread Otis Gospodnetic
I only glanced at Sami's post recently, and what I think I saw there is
something different. In other words, what Sami described is not a Solr
instance pointing to a Nutch-built Lucene index, but rather an app that
reads the appropriate Nutch/Hadoop files with fetched content and posts
that content to a Solr instance using a Solr Java client like solrj.
No?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Norberto Meijome <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Cc: [EMAIL PROTECTED]
Sent: Tuesday, November 27, 2007 8:33:18 PM
Subject: Re: Solr and nutch, for reading a nutch index






Re: CJK Analyzers for Solr

2007-11-27 Thread Otis Gospodnetic
Eswar - I'm interested in the answer to John's question, too! :)

As for why n-grams - probably because they are free and simple, while
dictionary-based stuff would likely not be free (are there free
dictionaries for C or J or K?), and a morphological analyzer would be a
bit more work.  That said, if you need a morphological analyzer for
non-CJK languages, let me know - see my sig.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: John Stewart <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Tuesday, November 27, 2007 12:12:40 PM
Subject: Re: CJK Analyzers for Solr

Eswar,

What type of morphological analysis do you suspect (or know) that
Google does on east asian text?  I don't think you can treat the three
languages in the same way here.  Japanese has multi-morphemic words,
but Chinese doesn't really.

jds

On Nov 27, 2007 11:54 AM, Eswar K <[EMAIL PROTECTED]> wrote:
> Is there any specific reason why the CJK analyzers in Solr were chosen to be
> n-gram based instead of it being a morphological analyzer which is kind of
> implemented in Google as it considered to be more effective than the n-gram
> ones?
>
> Regards,
> Eswar
>
>
> On Nov 27, 2007 7:57 AM, Eswar K <[EMAIL PROTECTED]> wrote:
>
> > thanks james...
> >
> > How much time does it take to index 18m docs?
> >
> > - Eswar
> >
> >
> > On Nov 27, 2007 7:43 AM, James liu <[EMAIL PROTECTED]> wrote:
> >
> > > i not use HYLANDA analyzer.
> > >
> > > i use je-analyzer and indexing at least 18m docs.
> > >
> > > i m sorry i only use chinese analyzer.
> > >
> > >
> > > On Nov 27, 2007 10:01 AM, Eswar K <[EMAIL PROTECTED]> wrote:
> > >
> > > > What is the performance of these CJK analyzers (one in lucene and
> > > > hylanda)?
> > > > We would potentially be indexing millions of documents.
> > > >
> > > > James,
> > > >
> > > > We would have a look at hylanda too. What abt japanese and korean
> > > > analyzers, any recommendations?
> > > >
> > > > - Eswar
> > > >
> > > > On Nov 27, 2007 7:21 AM, James liu <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > I don't think NGram is good method for Chinese.
> > > > >
> > > > > CJKAnalyzer of Lucene is 2-Gram.
> > > > >
> > > > > Eswar K:
> > > > >  if it is chinese analyzer,,i recommend hylanda(www.hylanda.com),,,it is
> > > > > the best chinese analyzer and it not free.
> > > > >  if u wanna free chinese analyzer, maybe u can try je-analyzer. it have
> > > > > some problem when using it.
> > > > >
> > > > >
> > > > > On Nov 27, 2007 5:56 AM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > > Eswar,
> > > > > >
> > > > > > We've uses the NGram stuff that exists in Lucene's contrib/analyzers
> > > > > > instead of CJK.  Doesn't that allow you to do everything that the
> > > > > > Chinese and CJK analyzers do?  It's been a few months since I've
> > > > > > looked at Chinese and CJK Analzyers, so I could be off.
> > > > > >
> > > > > > Otis
> > > > > >
> > > > > > --
> > > > > > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > > > > >
> > > > > > - Original Message 
> > > > > > From: Eswar K <[EMAIL PROTECTED]>
> > > > > > To: solr-user@lucene.apache.org
> > > > > > Sent: Monday, November 26, 2007 8:30:52 AM
> > > > > > Subject: CJK Analyzers for Solr
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Does Solr come with Language analyzers for CJK? If not, can you
> > > > > > please direct me to some good CJK analyzers?
> > > > > >
> > > > > > Regards,
> > > > > > Eswar
> > > > >
> > > > > --
> > > > > regards
> > > > > jl





Re: Solr and nutch, for reading a nutch index

2007-11-27 Thread Brian Whitman


On Nov 28, 2007, at 1:24 AM, Otis Gospodnetic wrote:

> I only glanced at Sami's post recently and what I think I saw there
> is something different.  In other words, what Sami described is not
> a Solr instance pointing to a Nutch-built Lucene index, but rather
> an app that reads the appropriate Nutch/Hadoop files with fetched
> content and posts the read content to a Solr instance using a Solr
> java client like solrj.
>
> No?



Yes, to be clear, all you need from Sami's thing is the schema file.  
Ignore everything else. Then point solr at the nutch index directory  
(it's just a lucene index.)
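Concretely, "point solr at the nutch index directory" comes down to the data dir setting in solrconfig.xml. A minimal sketch; the path is a placeholder for wherever your Nutch crawl wrote its Lucene index:

```xml
<!-- solrconfig.xml: have Solr search the Nutch-built Lucene index in place.
     The path below is an example placeholder, not a required location. -->
<dataDir>/data/nutch/crawl/index</dataDir>
```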


Sami's entire thing is for indexing with solr instead of nutch,  
separate issue...





Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Norberto Meijome <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Cc: [EMAIL PROTECTED]
Sent: Tuesday, November 27, 2007 8:33:18 PM
Subject: Re: Solr and nutch, for reading a nutch index

On Tue, 27 Nov 2007 18:12:13 -0500
Brian Whitman <[EMAIL PROTECTED]> wrote:

> On Nov 27, 2007, at 6:08 PM, bbrown wrote:
>
> > I couldn't tell if this was asked before.  But I want to perform a
> > nutch crawl without any solr plugin which will simply write to some
> > index directory.  And then ideally I would like to use solr for
> > searching?  I am assuming this is possible?
>
> yes, this is quite possible. You need to have a solr schema that
> mimics the nutch schema, see sami's solrindexer for an example. Once
> you've got that schema, simply set the data dir in your solrconfig to
> the nutch index location and you'll be set.

I think you should keep an eye on the versions of Lucene library used
by both Nutch + Solr - differences at this layer *could* make them
incompatible - but I am not an expert...
B

_
{Beto|Norberto|Numard} Meijome

"Against logic there is no armor like ignorance."
 Laurence J. Peter






--
http://variogr.am/





Re: CJK Analyzers for Solr

2007-11-27 Thread Otis Gospodnetic
For what it's worth, I worked on indexing and searching a *massive* pile of
data, a good portion of which was in C and J, and some K.  The n-gram approach was
used for all 3 languages, and the quality of search results, including
highlighting, was evaluated and okay-ed by native speakers of these languages.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Walter Underwood <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Tuesday, November 27, 2007 2:41:38 PM
Subject: Re: CJK Analyzers for Solr

Dictionaries are surprisingly expensive to build and maintain and
bi-gram is surprisingly effective for Chinese. See this paper:

   http://citeseer.ist.psu.edu/kwok97comparing.html
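To make the bi-gram approach concrete, here is a minimal sketch (my own illustration, not Lucene's actual CJKAnalyzer code, which also handles Latin runs and token boundaries) of how a 2-gram tokenizer splits a run of CJK characters:

```python
def cjk_bigrams(text):
    """Emit overlapping 2-grams from a run of CJK characters --
    the core idea behind a bigram tokenizer such as CJKAnalyzer
    (greatly simplified)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

# A 4-character Chinese run yields three overlapping bigrams.
print(cjk_bigrams("中文分析"))  # -> ['中文', '文分', '分析']
```

Every query bigram can then be matched against index bigrams without any dictionary, which is why the approach is cheap to build, at the cost of odd-looking highlighted fragments.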

I expect that n-gram indexing would be less effective for Japanese
because it is an inflected language. Korean is even harder. It might
work to break Korean into the phonetic subparts and use n-gram on
those.

You should not do term highlighting with any of the n-gram methods.
The relevance can be very good, but the highlighting just looks dumb.

wunder







Re: CJK Analyzers for Solr

2007-11-27 Thread Otis Gospodnetic
James - can you elaborate on why you think the n-gram approach is not good for 
Chinese?

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: James liu <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Monday, November 26, 2007 8:51:23 PM
Subject: Re: CJK Analyzers for Solr

I don't think NGram is good method for Chinese.

CJKAnalyzer of Lucene is 2-Gram.

Eswar K:
  if it is chinese analyzer,,i recommend hylanda(www.hylanda.com),,,it is
the best chinese analyzer and it not free.
  if u wanna free chinese analyzer, maybe u can try je-analyzer. it have
some problem when using it.




-- 
regards
jl





Re: CJK Analyzers for Solr

2007-11-27 Thread Otis Gospodnetic
Eswar,

I wouldn't worry about the performance of those CJK analyzers too much - they
are fairly trivial.  The StandardAnalyzer is slower, for example.  I recently
indexed circa 20MM large docs on an 8-core, 8 GB RAM box in 10 hours - 550
docs/second.  No CJK, just English.
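As a quick sanity check on the quoted numbers (assuming "20MM" means 20 million documents over a 10-hour run):

```python
# 20 million docs in 10 hours works out to roughly the quoted ~550 docs/second.
docs = 20_000_000
seconds = 10 * 60 * 60
rate = docs / seconds
print(f"{rate:.0f} docs/second")  # -> 556 docs/second
```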

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Eswar K <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Monday, November 26, 2007 9:27:15 PM
Subject: Re: CJK Analyzers for Solr

thanks james...

How much time does it take to index 18m docs?

- Eswar

On Nov 27, 2007 7:43 AM, James liu <[EMAIL PROTECTED]> wrote:

> i not use HYLANDA analyzer.
>
> i use je-analyzer and indexing at least 18m docs.
>
> i m sorry i only use chinese analyzer.
>





Re: CJK Analyzers for Solr

2007-11-27 Thread Eswar K
John,

There were two parts to my question,

1) n-gram vs morphological analyzer - This was based on what I read at a few
places which rate morphological analysis higher than n-gram. An example
being (
http://www.basistech.com/knowledge-center/products/N-Gram-vs-morphological-analysis.pdf).
My intention in asking this was not to question the effectiveness of the
existing implementation, but to understand the thought process behind
the decision. I was and am curious to know if there are any downsides to
using a morphological analyzer over the CJK analyzer, which prompted me to
ask this.

2) Morphological Analyzer used by Google - I don't know which morphological
analyzer Google uses, but I have read in different places that they do use one.

- Eswar



Re: CJK Analyzers for Solr

2007-11-27 Thread Eswar K
Otis,

Thanks for the information, we will check this out.

Regards,
Eswar



Re: CJK Analyzers for Solr

2007-11-27 Thread Otis Gospodnetic
Eswar - I can answer the Google question.  Actually, you are pointing to it in 
1) :)

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch






Re: CJK Analyzers for Solr

2007-11-27 Thread Luke Lu

Not sure how up to date this is: http://www.basistech.com/customers/

I've only used their C++ products, which generally worked well for
web search with a few exceptions. According to
http://www.basistech.com/knowledge-center/chinese/chinese-language-analysis.pdf,
they provide Java APIs as well. Their CJK language analyzers are all
morphological, AFAIK.

To process mixed languages properly, you'll also need a unicode/language
aware container analyzer that automatically picks the right analyzer for
the right language.
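A toy illustration (my own sketch, not Basis Technology's API or any real container analyzer) of what such dispatching might look like, choosing a tokenization strategy from the dominant script of the input:

```python
import unicodedata

def pick_analyzer(text):
    """Choose a tokenization strategy from the dominant script.
    Real language-aware containers work per character run, not
    per whole string; this just illustrates the dispatch idea."""
    # Count characters whose Unicode name marks them as CJK ideographs.
    cjk = sum(1 for ch in text if "CJK" in unicodedata.name(ch, ""))
    return "bigram" if cjk > len(text) // 2 else "whitespace"

print(pick_analyzer("中文分析器"))    # -> bigram
print(pick_analyzer("hello world"))  # -> whitespace
```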


__Luke
