Re: Memory use with sorting problem
Hi again, in the meantime I discovered the use of jmap (I'm not a Java programmer) and found that all the memory was being used up by String and char[] objects. The Lucene docs have the following to say on sorting memory use:

> For String fields, the cache is larger: in addition to the above array, the
> value of every term in the field is kept in memory. If there are many
> unique terms in the field, this could be quite large.

(http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/Sort.html)

I am sorting on the "slong" schema type, which is of course stored as a string. The above quote seems to indicate that it is possible for a field not to be a string for the purposes of the sort, while I took it from LiA (Lucene in Action) that everything is a string to Lucene.

What can I do to make sure the additional memory is not used for every unique term? I.e., how do I keep the slong field from being a "String field" for sorting purposes?

Cheers,

Chris

Chris Laux wrote:
> Hi all,
>
> I've been struggling with this problem for over a month now, and although
> memory issues have been discussed often, I don't seem to be able to find a
> fitting solution.
>
> The index is merely 1.5 GB large, but memory use quickly fills out the
> heap max of 1 GB on a 2 GB machine. This then works fine until
> auto-warming starts. Switching the latter off altogether is unattractive,
> as it leads to response times of up to 30 s. When auto-warming starts, I
> get this error:
>
>> SEVERE: Error during auto-warming of
>> key:org.apache.solr.search.QueryResultKey@e0b93139:
>> java.lang.OutOfMemoryError: Java heap space
>
> Now when I reduce the size of the caches (to a fraction of the default
> settings) and the number of warming searchers (to 2), memory use is not
> reduced and the problem stays. Only deactivating auto-warming helps. When
> I set the heap size limit higher (and go into swap space), all the extra
> memory seems to be used up right away, independently of auto-warming.
>
> This all seems to be closely connected to sorting by a numerical field,
> as switching this off makes memory use a lot friendlier.
>
> Is it normal to need that much memory for such a small index?
>
> I suspect the problem is in Lucene; would it be better to post on their
> list?
>
> Does anyone know a better way of getting the sorting done?
>
> Thanks in advance for your help,
>
> Chris
>
> This is the field setup in schema.xml (the opening tags were garbled in
> the archive):
>
>   <field ... multiValued="false" />
>   <field ... multiValued="false" />
>
> And this is a sample query:
>
>   select/?q=solr&start=0&rows=20&sort=created+desc
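A common remedy in Solr of that era was to sort on a plain numeric field type instead of a sortable-string type, so that Lucene's FieldCache holds a primitive array rather than every unique term value. A minimal schema.xml sketch, assuming the sort field is the "created" field from the query above (the "plong" type name is illustrative, and if memory serves solr.LongField sorts via the numeric FieldCache, i.e. one long[] per searcher):

    <!-- plain long: sorts as numbers, not by caching every term string -->
    <fieldtype name="plong" class="solr.LongField" omitNorms="true"/>

    <field name="created" type="plong" indexed="true" stored="true"
           multiValued="false" />

The trade-off is that the sortable types exist mainly to make range queries order correctly; if the field is only ever used for sorting, the plain type is the cheaper choice.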
Re: Inconsistent results in Solr Search with Lucene Index
Have you set up your analyzers, etc. so they correspond to the exact ones you were using in Lucene? Under the Solr Admin you can try the analysis tool to see how your index and queries are treated. What happens if you do a *:* query from the Admin query screen? If your index is reasonably sized, I would just reindex, but you shouldn't have to.

-Grant

On Nov 27, 2007, at 8:18 AM, trysteps wrote:

> Hi All,
>
> I am trying to use Solr Search with a Lucene index, so I just set up all
> the schema.xml configuration (tokenizers, required fields, etc.). But I
> can't get the same results as Lucene. For example, a search for 'dog'
> returns lots of results with Lucene, but in Solr I can't get any result.
> A search for 'dog*', however, returns the same results as Lucene.
>
> What is the best way to integrate a Lucene index into Solr? Are there any
> well-documented sources?
>
> Thanks for your attention,
> Trysteps

--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
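If the index was built with a stock Lucene analyzer, the simplest way to match it on the Solr side is to point the field type directly at that analyzer class. A minimal sketch, assuming StandardAnalyzer was used at Lucene index time (the type name is illustrative):

    <!-- schema.xml: reuse the exact analyzer class the Lucene indexer used -->
    <fieldtype name="text_lucene" class="solr.TextField">
      <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
    </fieldtype>

The 'dog' vs. 'dog*' symptom usually points to an analysis mismatch: wildcard queries skip analysis, so the prefix query finds the raw indexed terms while the analyzed query term never matches them exactly.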
Re: CJK Analyzers for Solr
Is there any specific reason why the CJK analyzers in Solr were chosen to be n-gram based instead of morphological analyzers, of the kind Google is said to use, which are considered more effective than the n-gram ones?

Regards,
Eswar

On Nov 27, 2007 7:57 AM, Eswar K <[EMAIL PROTECTED]> wrote:
> thanks james...
>
> How much time does it take to index 18m docs?
>
> - Eswar
>
> On Nov 27, 2007 7:43 AM, James liu <[EMAIL PROTECTED]> wrote:
>> i not use HYLANDA analyzer.
>>
>> i use je-analyzer and indexing at least 18m docs.
>>
>> i m sorry i only use chinese analyzer.
>>
>> On Nov 27, 2007 10:01 AM, Eswar K <[EMAIL PROTECTED]> wrote:
>>> What is the performance of these CJK analyzers (the one in Lucene and
>>> hylanda)? We would potentially be indexing millions of documents.
>>>
>>> James,
>>>
>>> We would have a look at hylanda too. What abt japanese and korean
>>> analyzers, any recommendations?
>>>
>>> - Eswar
>>>
>>> On Nov 27, 2007 7:21 AM, James liu <[EMAIL PROTECTED]> wrote:
>>>> I don't think NGram is a good method for Chinese.
>>>>
>>>> CJKAnalyzer of Lucene is 2-Gram.
>>>>
>>>> Eswar K:
>>>> if it is a chinese analyzer, i recommend hylanda (www.hylanda.com), it
>>>> is the best chinese analyzer and it is not free.
>>>> if u wanna a free chinese analyzer, maybe u can try je-analyzer. it
>>>> has some problems when using it.
>>>>
>>>> On Nov 27, 2007 5:56 AM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
>>>>> Eswar,
>>>>>
>>>>> We've used the NGram stuff that exists in Lucene's contrib/analyzers
>>>>> instead of CJK. Doesn't that allow you to do everything that the
>>>>> Chinese and CJK analyzers do? It's been a few months since I've
>>>>> looked at the Chinese and CJK Analyzers, so I could be off.
>>>>>
>>>>> Otis
>>>>>
>>>>> --
>>>>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>>>>
>>>>> ----- Original Message ----
>>>>> From: Eswar K <[EMAIL PROTECTED]>
>>>>> To: solr-user@lucene.apache.org
>>>>> Sent: Monday, November 26, 2007 8:30:52 AM
>>>>> Subject: CJK Analyzers for Solr
>>>>>
>>>>> Hi,
>>>>>
>>>>> Does Solr come with language analyzers for CJK? If not, can you
>>>>> please direct me to some good CJK analyzers?
>>>>>
>>>>> Regards,
>>>>> Eswar
>>>>
>>>> --
>>>> regards
>>>> jl
Re: CJK Analyzers for Solr
Eswar,

What type of morphological analysis do you suspect (or know) that Google does on East Asian text? I don't think you can treat the three languages in the same way here. Japanese has multi-morphemic words, but Chinese doesn't really.

jds

On Nov 27, 2007 11:54 AM, Eswar K <[EMAIL PROTECTED]> wrote:
> Is there any specific reason why the CJK analyzers in Solr were chosen to
> be n-gram based instead of a morphological analyzer, of the kind Google is
> said to use, which is considered more effective than the n-gram ones?
>
> Regards,
> Eswar
>
> [...]
Combining SOLR and JAMon to monitor query execution times from a browser
Hi folks,

working on a closed source project for an IP-concerned company is not always fun ... we combined SOLR with JAMon (http://jamonapi.sourceforge.net/) to keep an eye on the query times, and this might be of general interest:

+) JAMon comes with a ready-to-use ServletFilter
+) we extended this implementation to keep track of queries issued by a customer and the requested domain objects, e.g. "artist", "album", "track"
+) this allows us to keep track of the execution times and their distribution, to quickly find long-running queries from a web browser without having access to the access.log
+) a small presentation can be found at http://people.apache.org/~sgoeschl/presentations/jamon-20070717.pdf
+) if it is of general interest I can rewrite the code as a contribution

Cheers,

Siegfried Goeschl
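A minimal sketch of the general pattern described above, not Siegfried's actual code (the class name and monitor label are illustrative):

    import java.io.IOException;
    import javax.servlet.*;
    import javax.servlet.http.HttpServletRequest;
    import com.jamonapi.Monitor;
    import com.jamonapi.MonitorFactory;

    public class QueryTimingFilter implements Filter {
        public void init(FilterConfig config) {}

        public void doFilter(ServletRequest req, ServletResponse res,
                             FilterChain chain)
                throws IOException, ServletException {
            // one JAMon monitor per request path, e.g. "/select"
            String label = ((HttpServletRequest) req).getServletPath();
            Monitor mon = MonitorFactory.start(label);
            try {
                chain.doFilter(req, res);
            } finally {
                mon.stop(); // JAMon aggregates hits, avg/min/max times, etc.
            }
        }

        public void destroy() {}
    }

The extension Siegfried describes would derive the label from the query parameters (customer, domain object) instead of just the path.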
Re: Combining SOLR and JAMon to monitor query execution times from a browser
I'd be interested in seeing more logging in the admin section! I saw that there is QPS in 1.3, which is great, but it'd be wonderful to see more.

--Matthew Runo

On Nov 27, 2007, at 9:18 AM, Siegfried Goeschl wrote:

> Hi folks,
>
> working on a closed source project for an IP concerned company is not
> always fun ... we combined SOLR with JAMon
> (http://jamonapi.sourceforge.net/) to keep an eye on the query times and
> this might be of general interest
>
> [...]
Re: CJK Analyzers for Solr
On 27-Nov-07, at 8:54 AM, Eswar K wrote:

> Is there any specific reason why the CJK analyzers in Solr were chosen to
> be n-gram based instead of a morphological analyzer [...]?

The CJK analyzers are just wrappers of the already-available analyzers in Lucene. I suspect (but am not sure) that the core devs aren't fluent in the issues surrounding the analysis of Asian text (I certainly am not). Any improvements in this regard would be greatly appreciated.

-Mike
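For reference, a small sketch of what the wrapped Lucene CJKAnalyzer (contrib/analyzers) produces: overlapping 2-grams over runs of CJK characters. This uses the Lucene 2.x-era TokenStream API current at the time:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.cjk.CJKAnalyzer;

    public class CJKDemo {
        public static void main(String[] args) throws Exception {
            TokenStream ts = new CJKAnalyzer()
                    .tokenStream("text", new StringReader("中文分词"));
            // prints the overlapping bigrams: 中文, 文分, 分词
            for (Token t = ts.next(); t != null; t = ts.next()) {
                System.out.println(t.termText());
            }
        }
    }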
two solr instances?
Is it possible to deploy solr.war once to Tomcat (which sits behind an Apache HTTP Server in my configuration) and have it manage two Solr indexes?

I have to make two different Solr indexes (with different schema.xml files) accessible over the web. If the above architecture is not possible: is there any other solution?
Re: CJK Analyzers for Solr
Dictionaries are surprisingly expensive to build and maintain and bi-gram is surprisingly effective for Chinese. See this paper: http://citeseer.ist.psu.edu/kwok97comparing.html I expect that n-gram indexing would be less effective for Japanese because it is an inflected language. Korean is even harder. It might work to break Korean into the phonetic subparts and use n-gram on those. You should not do term highlighting with any of the n-gram methods. The relevance can be very good, but the highlighting just looks dumb. wunder On 11/27/07 8:54 AM, "Eswar K" <[EMAIL PROTECTED]> wrote: > Is there any specific reason why the CJK analyzers in Solr were chosen to be > n-gram based instead of it being a morphological analyzer which is kind of > implemented in Google as it considered to be more effective than the n-gram > ones? > > Regards, > Eswar > > > > On Nov 27, 2007 7:57 AM, Eswar K <[EMAIL PROTECTED]> wrote: > >> thanks james... >> >> How much time does it take to index 18m docs? >> >> - Eswar >> >> >> On Nov 27, 2007 7:43 AM, James liu <[EMAIL PROTECTED] > wrote: >> >>> i not use HYLANDA analyzer. >>> >>> i use je-analyzer and indexing at least 18m docs. >>> >>> i m sorry i only use chinese analyzer. >>> >>> >>> On Nov 27, 2007 10:01 AM, Eswar K <[EMAIL PROTECTED]> wrote: >>> What is the performance of these CJK analyzers (one in lucene and >>> hylanda )? We would potentially be indexing millions of documents. James, We would have a look at hylanda too. What abt japanese and korean analyzers, any recommendations? - Eswar On Nov 27, 2007 7:21 AM, James liu <[EMAIL PROTECTED]> wrote: > I don't think NGram is good method for Chinese. > > CJKAnalyzer of Lucene is 2-Gram. > > Eswar K: > if it is chinese analyzer,,i recommend hylanda(www.hylanda.com),,,it >>> is > the best chinese analyzer and it not free. > if u wanna free chinese analyzer, maybe u can try je-analyzer. it >>> have > some problem when using it. > > > > On Nov 27, 2007 5:56 AM, Otis Gospodnetic < >>> [EMAIL PROTECTED]> > wrote: > >> Eswar, >> >> We've uses the NGram stuff that exists in Lucene's >>> contrib/analyzers >> instead of CJK. Doesn't that allow you to do everything that the > Chinese >> and CJK analyzers do? It's been a few months since I've looked at > Chinese >> and CJK Analzyers, so I could be off. >> >> Otis >> >> -- >> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch >> >> - Original Message >> From: Eswar K <[EMAIL PROTECTED]> >> To: solr-user@lucene.apache.org >> Sent: Monday, November 26, 2007 8:30:52 AM >> Subject: CJK Analyzers for Solr >> >> Hi, >> >> Does Solr come with Language analyzers for CJK? If not, can you >>> please >> direct me to some good CJK analyzers? >> >> Regards, >> Eswar >> >> >> >> > > > -- > regards > jl > >>> >>> >>> >>> -- >>> regards >>> jl >>> >> >>
Re: two solr instances?
Have you looked at this page on the wiki?

http://wiki.apache.org/solr/SolrTomcat#head-024d7e11209030f1dbcac9974e55106abae837ac

That should get you started.

-Chris

Jörg Kiegeland wrote:
> Is it possible to deploy solr.war once to Tomcat (which is on top of an
> Apache HTTP Server in my configuration) which then can manage two Solr
> indexes?
>
> I have to make accessible two different Solr indexes (both have different
> schema.xml files) over the web. If the above architecture is not
> possible: is there any other solution?
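The recipe on that wiki page deploys the one solr.war through multiple Tomcat context fragments, each with its own solr home (and therefore its own schema.xml). A sketch, with illustrative paths:

    <!-- $CATALINA_HOME/conf/Catalina/localhost/solr1.xml -->
    <Context docBase="/opt/solr/solr.war" debug="0" crossContext="true">
      <Environment name="solr/home" type="java.lang.String"
                   value="/data/solr1" override="true"/>
    </Context>

A second fragment, solr2.xml, pointing at /data/solr2 then serves the second index under http://host:8080/solr2/.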
RE: LSA Implementation
WordNet itself is English-only. There are various ontology projects for it. http://www.globalwordnet.org/ is a separate world-language database project. I found it at the bottom of the WordNet Wikipedia page. Thanks for starting me on the search!

Lance

-----Original Message-----
From: Eswar K [mailto:[EMAIL PROTECTED]]
Sent: Monday, November 26, 2007 6:50 PM
To: solr-user@lucene.apache.org
Subject: Re: LSA Implementation

The languages also include CJK :) among others.

- Eswar

On Nov 27, 2007 8:16 AM, Norskog, Lance <[EMAIL PROTECTED]> wrote:
> The WordNet project at Princeton (USA) is a large database of synonyms.
> If you're only working in English this might be useful instead of running
> your own analyses.
>
> http://en.wikipedia.org/wiki/WordNet
> http://wordnet.princeton.edu/
>
> Lance
>
> -----Original Message-----
> From: Eswar K
> Subject: Re: LSA Implementation
>
> In addition to recording which keywords a document contains, the method
> examines the document collection as a whole, to see which other documents
> contain some of those same words. This algo should consider documents
> that have many words in common to be semantically close, and ones with
> few words in common to be semantically distant. This simple method
> correlates surprisingly well with how a human being, looking at content,
> might classify a document collection. Although the algorithm doesn't
> understand anything about what the words *mean*, the patterns it notices
> can make it seem astonishingly intelligent.
>
> When you search such an index, the search engine looks at the similarity
> values it has calculated for every content word, and returns the
> documents that it thinks best fit the query. Because two documents may be
> semantically very close even if they do not share a particular keyword,
> this algo will often return relevant documents that don't contain the
> keyword at all, where a plain keyword search would fail for lack of an
> exact match.
>
> - Eswar
>
> On Nov 27, 2007 7:51 AM, Marvin Humphrey <[EMAIL PROTECTED]> wrote:
>> On Nov 26, 2007, at 6:06 PM, Eswar K wrote:
>>> We essentially are looking at having an implementation for doing
>>> search which can return documents having conceptually similar words
>>> without necessarily having the original word searched for.
>>
>> Very challenging. Say someone searches for "LSA" and hits an archived
>> version of the mail you sent to this list. "LSA" is a reasonably
>> discriminating term. But so is "Eswar".
>>
>> If you knew that the original term was "LSA", then you might look for
>> documents near it in term vector space. But if you don't know the
>> original term, only the content of the document, how do you know
>> whether you should look for docs near "lsa" or "eswar"?
>>
>> Marvin Humphrey
>> Rectangular Research
>> http://www.rectangular.com/
Related Search
Hi,

What is the best way to implement a related search like CNET's with SOLR?

E.g.: searching for "tv", the related searches are: lcd tv, lcd, hdtv, vizio, plasma tv, panasonic, gps, plasma.

Thanks,
William.
Re: Related Search
Take a look at this thread: http://www.gossamer-threads.com/lists/lucene/java-user/54996

There was a need to get all related topics for any selected topic. I took the help of the Lucene sandbox WordNet project to get all synonyms of user-selected topics. I am not sure whether the WordNet project would help you, as you are looking for product synonyms. In your case, you might need to maintain a vector of product synonyms, e.g. if a user searches for TV, internally you would search for lcd tv, lcd, hdtv, etc.

Take a look at www.ajaxtrend.com to see how all related topics are displayed; I keep refining the related-query search as the site evolves. This is just a prototype.

- BR

William Silva <[EMAIL PROTECTED]> wrote:
> Hi,
> What is the best way to implement a related search like CNET's with SOLR?
> E.g.: searching for "tv", the related searches are: lcd tv, lcd, hdtv,
> vizio, plasma tv, panasonic, gps, plasma.
> Thanks,
> William.
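Inside Solr itself, a hand-maintained list of product synonyms can be wired in with the synonym filter, which is one way to realize the "vector of product synonyms" idea above. A minimal sketch (the synonym entries are just the example terms from this thread):

    <!-- schema.xml: in the query-time analyzer of the search field -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>

    # synonyms.txt
    tv, lcd tv, hdtv, plasma tv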
Solr and nutch, for reading a nutch index
I couldn't tell if this was asked before, but I want to perform a Nutch crawl without any Solr plugin, which will simply write to some index directory. And then ideally I would like to use Solr for searching. I am assuming this is possible?

--
Berlin Brown
[berlin dot brown at gmail dot com]
http://botspiritcompany.com/botlist/?
Re: Solr and nutch, for reading a nutch index
On Nov 27, 2007, at 6:08 PM, bbrown wrote:

> I couldn't tell if this was asked before. But I want to perform a nutch
> crawl without any solr plugin which will simply write to some index
> directory. And then ideally I would like to use solr for searching? I am
> assuming this is possible?

Yes, this is quite possible. You need to have a Solr schema that mimics the Nutch schema; see Sami's solrindexer for an example. Once you've got that schema, simply set the data dir in your solrconfig to the Nutch index location and you'll be set.
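The data-dir pointer lives in solrconfig.xml. A sketch with an illustrative path; if memory serves, Solr opens the Lucene index in an index/ subdirectory under dataDir, which lines up with the merged index a Nutch crawl leaves at crawl/index:

    <!-- solrconfig.xml: point Solr at the directory containing the
         Lucene index Nutch built (Solr opens <dataDir>/index) -->
    <dataDir>/data/nutch/crawl</dataDir>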
Re: LSA Implementation
Using WordNet may require some type of disambiguation approach, otherwise you can end up w/ a lot of "synonyms". I also would look into how much coverage there is for non-English languages. If you have the resources, you may be better off developing/finding your own synonym/concept list based on your genres. You may also look into other approaches for assigning concepts offline and adding them to the document.

-Grant

On Nov 27, 2007, at 3:21 PM, Norskog, Lance wrote:

> WordNet itself is English-only. There are various ontology projects for
> it. http://www.globalwordnet.org/ is a separate world language database
> project. I found it at the bottom of the WordNet wikipedia page. Thanks
> for starting me on the search!
>
> Lance
>
> [...]
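One concrete way to do the offline concept assignment Grant describes is to post the precomputed concepts alongside each document in the update message. A sketch (the field names are illustrative and assume a matching multi-valued field in schema.xml):

    <add>
      <doc>
        <field name="id">doc-42</field>
        <field name="text">... original document body ...</field>
        <!-- concepts assigned offline, e.g. by an LSA or clustering pass -->
        <field name="concept">latent semantic analysis</field>
        <field name="concept">information retrieval</field>
      </doc>
    </add>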
Re: Combining SOLR and JAMon to monitor query execution times from a browser
On Tue, 27 Nov 2007 18:18:16 +0100
Siegfried Goeschl <[EMAIL PROTECTED]> wrote:

> Hi folks,
>
> working on a closed source project for an IP concerned company is not
> always fun ... we combined SOLR with JAMon
> (http://jamonapi.sourceforge.net/) to keep an eye on the query times and
> this might be of general interest
>
> [...]

Thanks Siegfried,

I am further interested in plugging this information into something like Nagios, Cacti, Zenoss, bigsister, Openview or your monitoring system of choice, but I haven't had much time to look into this yet.

How does JAMon compare to JMX (http://java.sun.com/javase/technologies/core/mntr-mgmt/javamanagement/)?

cheers,
B

_
{Beto|Norberto|Numard} Meijome

There are no stupid questions, but there are a LOT of inquisitive idiots.

I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
Re: Solr and nutch, for reading a nutch index
On Tue, 27 Nov 2007 18:12:13 -0500
Brian Whitman <[EMAIL PROTECTED]> wrote:

> yes, this is quite possible. You need to have a solr schema that mimics
> the nutch schema, see sami's solrindexer for an example. Once you've got
> that schema, simply set the data dir in your solrconfig to the nutch
> index location and you'll be set.

I think you should keep an eye on the versions of the Lucene library used by both Nutch and Solr - differences at this layer *could* make them incompatible - but I am not an expert...

B

_
{Beto|Norberto|Numard} Meijome

"Against logic there is no armor like ignorance." Laurence J. Peter

I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
Re: Solr and nutch, for reading a nutch index
I only glanced at Sami's post recently, and what I think I saw there is something different. In other words, what Sami described is not a Solr instance pointing to a Nutch-built Lucene index, but rather an app that reads the appropriate Nutch/Hadoop files with fetched content and posts the read content to a Solr instance using a Solr java client like solrj. No?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Norberto Meijome <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Tuesday, November 27, 2007 8:33:18 PM
Subject: Re: Solr and nutch, for reading a nutch index

[...]
Re: CJK Analyzers for Solr
Eswar - I'm interested in the answer to John's question, too! :)

As for why n-grams - probably because they are free and simple, while dictionary-based stuff would likely not be free (are there free dictionaries for C or J or K??), and a morphological analyzer would be a bit more work. That said, if you need a morphological analyzer for non-CJK languages, let me know - see my sig.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: John Stewart <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Tuesday, November 27, 2007 12:12:40 PM
Subject: Re: CJK Analyzers for Solr

Eswar,

What type of morphological analysis do you suspect (or know) that Google does on East Asian text? I don't think you can treat the three languages in the same way here. Japanese has multi-morphemic words, but Chinese doesn't really.

jds

[...]
Re: Solr and nutch, for reading a nutch index
On Nov 28, 2007, at 1:24 AM, Otis Gospodnetic wrote:

> I only glanced at Sami's post recently and what I think I saw there is
> something different. In other words, what Sami described is not a Solr
> instance pointing to a Nutch-built Lucene index, but rather an app that
> reads the appropriate Nutch/Hadoop files with fetched content and posts
> the read content to a Solr instance using a Solr java client like solrj.
> No?

Yes, to be clear, all you need from Sami's thing is the schema file. Ignore everything else. Then point solr at the nutch index directory (it's just a lucene index.) Sami's entire thing is for indexing with solr instead of nutch, separate issue...

--
http://variogr.am/
Re: CJK Analyzers for Solr
For what it's worth, I worked on indexing and searching a *massive* pile of data, a good portion of which was in C and J, and some K. The n-gram approach was used for all 3 languages, and the quality of search results, including highlighting, was evaluated and okay-ed by native speakers of these languages.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Walter Underwood <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Tuesday, November 27, 2007 2:41:38 PM
Subject: Re: CJK Analyzers for Solr

Dictionaries are surprisingly expensive to build and maintain, and bi-grams are surprisingly effective for Chinese. See this paper: http://citeseer.ist.psu.edu/kwok97comparing.html

[...]
Re: CJK Analyzers for Solr
James - can you elaborate on why you think the n-gram approach is not good for Chinese?

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: James liu <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Monday, November 26, 2007 8:51:23 PM
Subject: Re: CJK Analyzers for Solr

I don't think NGram is good method for Chinese.

CJKAnalyzer of Lucene is 2-Gram.

Eswar K:
if it is chinese analyzer, i recommend hylanda (www.hylanda.com), it is the best chinese analyzer and it not free.
if u wanna free chinese analyzer, maybe u can try je-analyzer. it have some problem when using it.

[...]

--
regards
jl
Re: CJK Analyzers for Solr
Eswar,

I wouldn't worry about the performance of those CJK analyzers too much - they are fairly trivial. The StandardAnalyzer is slower, for example. I recently indexed cca 20MM large docs on an 8-core, 8 GB RAM box in 10 hours - 550 docs/second. No CJK, just English.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Eswar K <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Monday, November 26, 2007 9:27:15 PM
Subject: Re: CJK Analyzers for Solr

thanks james...

How much time does it take to index 18m docs?

- Eswar

[...]
Re: CJK Analyzers for Solr
John,

There were two parts to my question:

1) n-gram vs. morphological analyzer - This was based on what I read in a few places that rate morphological analysis higher than n-grams, one example being http://www.basistech.com/knowledge-center/products/N-Gram-vs-morphological-analysis.pdf. My intention in asking was not to question the effectiveness of the existing implementation, but to understand the thought process behind the decision. I was and am curious to know if there are any downsides of using a morphological analyzer over the CJK analyzer, which prompted me to ask this.

2) Morphological analyzer used by Google - I don't know which morphological analyzer Google uses, but I have read in different places that they do use one.

- Eswar

On Nov 27, 2007 10:42 PM, John Stewart <[EMAIL PROTECTED]> wrote:
> Eswar,
>
> What type of morphological analysis do you suspect (or know) that Google
> does on East Asian text? I don't think you can treat the three languages
> in the same way here. Japanese has multi-morphemic words, but Chinese
> doesn't really.
>
> jds
>
> [...]
Re: CJK Analyzers for Solr
Otis,

Thanks for the information, we will check this out.

Regards,
Eswar

On Nov 28, 2007 12:20 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> Eswar,
>
> I wouldn't worry about the performance of those CJK analyzers too much -
> they are fairly trivial. The StandardAnalyzer is slower, for example. I
> recently indexed cca 20MM large docs on an 8-core, 8 GB RAM box in 10
> hours - 550 docs/second. No CJK, just English.
>
> Otis
>
> [...]
Re: CJK Analyzers for Solr
Eswar - I can answer the Google question. Actually, you are pointing to it in 1) :)

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Eswar K <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Wednesday, November 28, 2007 2:21:40 AM
Subject: Re: CJK Analyzers for Solr

John,

There were two parts to my question:

1) n-gram vs. morphological analyzer - This was based on what I read in a few places that rate morphological analysis higher than n-grams [...]

2) Morphological analyzer used by Google - I don't know which morphological analyzer Google uses, but I have read in different places that they do use one.

- Eswar

[...]
Re: CJK Analyzers for Solr
Not sure how up to date this is: http://www.basistech.com/customers/

I've only used their C++ products, which generally worked well for web search, with a few exceptions. According to http://www.basistech.com/knowledge-center/chinese/chinese-language-analysis.pdf they provide Java APIs as well. Their CJK language analyzers are all morphological, AFAIK.

To process mixed languages properly, you'll also need a unicode/language-aware container analyzer that automatically picks the right analyzer for the right language.

__Luke

On Nov 27, 2007, at 10:29 PM, Otis Gospodnetic wrote:

> Eswar - I'm interested in the answer to John's question, too! :)
>
> As for why n-grams - probably because they are free and simple, while
> dictionary-based stuff would likely not be free (are there free
> dictionaries for C or J or K??), and a morphological analyzer would be a
> bit more work. That said, if you need a morphological analyzer for
> non-CJK languages, let me know - see my sig.
>
> Otis
>
> [...]
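A common approximation of such a container in plain Lucene, assuming documents have already been routed to language-specific fields by an upstream language detector, is a per-field analyzer mapping. A sketch using Lucene's PerFieldAnalyzerWrapper (the field names are illustrative):

    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.cjk.CJKAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    // default analyzer for any field not mapped explicitly
    PerFieldAnalyzerWrapper analyzer =
            new PerFieldAnalyzerWrapper(new StandardAnalyzer());
    // route text detected as CJK (and stored in its own field) to the
    // bigram-producing CJK analyzer
    analyzer.addAnalyzer("body_cjk", new CJKAnalyzer());

In Solr the same effect falls out naturally, since each field type in schema.xml carries its own analyzer; the missing piece either way is the language detection step that decides which field a document's text goes into.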