Re: Multilanguage

2009-03-04 Thread Karl Wettin
On 17 Feb 2009 at 21:26, Grant Ingersoll wrote: I believe Karl Wettin submitted a Lucene patch for a Language guesser: http://issues.apache.org/jira/browse/LUCENE-826 but it is marked as won't fix. The test case of LUCENE-1039 is a language classifier. I've used the patch to detect languages of

Re: Multilanguage

2009-02-17 Thread Walter Underwood
On 2/17/09 12:26 PM, "Grant Ingersoll" wrote: > If purchasing, several companies offer solutions, but I don't know > that their quality is any better than what you can get through open > source, as generally speaking, the problem is solved with a high > degree of accuracy through n-gram analysis.
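
The n-gram analysis mentioned in this post can be sketched roughly as follows. This is a minimal illustration, not the approach of any particular tool named in the thread: the two training sentences, the `SAMPLES` table, and the function names are all invented for the example, and a usable guesser would need far larger profiles per language.

```python
from collections import Counter


def ngram_profile(text, n=3):
    """Count overlapping character n-grams in lowercased, space-padded text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def similarity(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sum(c * c for c in a.values()) ** 0.5
    norm_b = sum(c * c for c in b.values()) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy training samples, invented for illustration only (an assumption,
# not data from the thread).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann schläft der hund in der sonne",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}


def guess_language(text):
    """Return the language whose trigram profile is closest to the text."""
    doc = ngram_profile(text)
    return max(PROFILES, key=lambda lang: similarity(doc, PROFILES[lang]))
```

Calling `guess_language` on the text extracted from a document would then give a language code that can be used to pick an analyzer. Very short inputs are unreliable with this kind of scoring.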

Re: Multilanguage

2009-02-17 Thread Grant Ingersoll
Tuesday, February 17, 2009 6:39:40 PM Subject: Re: Multilanguage Does Apache Tika help find the language of the given document? On 2/17/09, Till Kinstler wrote: Paul Libbrecht wrote: Clearly, then, something that matches words in a dictionary and decides on the language based on the langu

Re: Multilanguage

2009-02-17 Thread revathy arun
ementation is at the URL below my name. > > Otis -- > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > > > > > From: revathy arun > To: solr-user@lucene.apache.org > Sent: Tuesday, February 17, 2009 6:39:40 PM > Subj

Re: Multilanguage

2009-02-17 Thread Otis Gospodnetic
- Solr - Nutch From: revathy arun To: solr-user@lucene.apache.org Sent: Tuesday, February 17, 2009 6:39:40 PM Subject: Re: Multilanguage Does Apache Tika help find the language of the given document? On 2/17/09, Till Kinstler wrote: > > Paul Libbrecht wrote: >

Re: Multilanguage

2009-02-17 Thread revathy arun
Does Apache Tika help find the language of the given document? On 2/17/09, Till Kinstler wrote: > > Paul Libbrecht wrote: > > Clearly, then, something that matches words in a dictionary and decides on >> the language based on the language of the majority could do a decent job to >> decide the

Re: Multilanguage

2009-02-17 Thread Till Kinstler
Paul Libbrecht wrote: Clearly, then, something that matches words in a dictionary and decides on the language based on the language of the majority could do a decent job to decide the analyzer. Does such a tool exist? I once played around with http://ngramj.sourceforge.net/ for language

Re: Multilanguage

2009-02-17 Thread Paul Libbrecht
I was looking for such a tool and haven't found it yet. Using StandardAnalyzer one can obtain some form of token-stream which can be used for "agnostic analysis". Clearly, then, something that matches words in a dictionary and decides on the language based on the language of the majority could
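
The dictionary idea floated in this post (match tokens against per-language word lists and take the majority) could look something like the sketch below. The word lists in `DICTIONARIES` and the function name are hypothetical and exist only for illustration; the tokens are assumed to come from some language-agnostic tokenizer such as the StandardAnalyzer output mentioned above.

```python
from collections import Counter

# Tiny hypothetical word lists; a real dictionary would be far larger.
DICTIONARIES = {
    "en": {"the", "and", "is", "of", "with", "house"},
    "de": {"der", "die", "und", "ist", "mit", "haus"},
    "fr": {"le", "la", "et", "est", "avec", "maison"},
}


def guess_by_dictionary(tokens):
    """Vote once per token per matching dictionary; return the majority language."""
    votes = Counter()
    for token in tokens:
        word = token.lower()
        for lang, words in DICTIONARIES.items():
            if word in words:
                votes[lang] += 1
    # None signals that no token matched any dictionary.
    return votes.most_common(1)[0][0] if votes else None
```

Words shared between languages count for each of them, so the vote only works when the text is long enough for the unshared words to dominate.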

Re: Multilanguage

2009-02-16 Thread Otis Gospodnetic
Hi, The best option would be to identify the language after parsing the PDF and then index it using an appropriate analyzer defined in schema.xml. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch From: revathy arun To: solr-user@lucene.apac

Re: Multilanguage

2009-02-16 Thread Erick Erickson
I recommend that you search both this and the Lucene list. You'll find that this topic has been discussed many times, and several approaches have been outlined. The searchable archives are linked to from here: http://lucene.apache.org/java/docs/mailinglists.html. Best Erick On Mon, Feb 16, 2009

Re: multilanguage + howto search in all languages?

2009-01-29 Thread Julian Davchev
Thank you both for the points. For now I am handling it with fuzzy search. Let's hope this will do for some time :) Walter Underwood wrote: > I've done this. There are five cases for the tokens in the search > index: > > 1. Tokens that are unique after stemming (this is good). > 2. Tokens that are common

Re: multilanguage + howto search in all languages?

2009-01-28 Thread Walter Underwood
Duh. Four cases. For extra credit, what language is "wunder" in? wunder On 1/28/09 5:12 PM, "Walter Underwood" wrote: > I've done this. There are five cases for the tokens in the search > index: > > 1. Tokens that are unique after stemming (this is good). > 2. Tokens that are common after stem

Re: multilanguage + howto search in all languages?

2009-01-28 Thread Walter Underwood
I've done this. There are five cases for the tokens in the search index: 1. Tokens that are unique after stemming (this is good). 2. Tokens that are common after stemming (usually trademarks, like LaserJet). 3. Tokens with collisions after stemming: German "mit", "MIT" the university Germ

Re: multilanguage + howto search in all languages?

2009-01-28 Thread Erick Erickson
I'm not entirely sure about the fine points, but consider the filters that are available that fold all the diacritics into their low-ascii equivalents. Perhaps using that filter at *both* index and search time on the English index would do the trick. In your example, both would be 'munchen'. Strai
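
As a rough illustration of what folding diacritics into their low-ASCII equivalents means, here is a sketch using only the Python standard library. It is not the Solr or Lucene filter itself, and the function name `fold_diacritics` is made up for the example; the point is only that the same folding applied at index and query time makes both sides agree.

```python
import unicodedata


def fold_diacritics(text):
    """Lowercase the text and drop combining marks after Unicode decomposition."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()
```

With this folding, `fold_diacritics("München")` and `fold_diacritics("Munchen")` both give `munchen`. Characters that have no decomposition, such as "ß" or "ø", pass through unchanged, so a production filter needs an explicit mapping table for them.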

Re: multilanguage prototype

2009-01-28 Thread Jerven Bolleman
Hi, Your problem seems to be lower level than the SOLR code. You are sending an XML request that contains a character that is illegal under the XML spec. You should strip these characters out of the data that you send. Or turn off the XML validation (not recommended because of all kinds of risks). See http://www
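
Stripping such characters before building the update request might look like the sketch below. The allowed ranges follow the `Char` production of the XML 1.0 specification; the function name `strip_illegal_xml_chars` is an assumption for the example, and the text would still need normal XML escaping (`&`, `<`, `>`) afterwards.

```python
import re

# Anything outside the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_illegal_xml_chars(text):
    """Remove characters that an XML 1.0 parser would reject."""
    return _ILLEGAL_XML_CHARS.sub("", text)
```

Control characters of this kind often come from PDF or binary extraction, so it is usually simplest to run every field value through the function just before the document is serialized.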

Re: multilanguage prototype

2009-01-27 Thread revathy arun
Hi, I am getting this error in the Tomcat log file on passing Chinese text to the content field. The content field uses the CJK tokenizer and is defined as INFO: [] webapp=/lang_prototype path=/update params={} status=0 QTime=69 Jan 28, 2009 12:17:03 PM org.apache.solr.common.

Re: multilanguage prototype

2009-01-27 Thread revathy arun
Hi, This is the only info in the Tomcat log at indexing: Jan 27, 2009 3:46:15 PM org.apache.solr.core.SolrCore execute INFO: [] webapp=/lang_prototype path=/update params={} status=0 QTime=191 I don't see any other errors in the logs. When I use curl to update I get a success message. And commit

Re: multilanguage prototype

2009-01-27 Thread Erik Hatcher
errors: 11 What were those? My hunch is your indexer had issues. What did Solr output into the console or log during indexing? Erik On Jan 27, 2009, at 6:56 AM, revathy arun wrote: Hi Shalin, The admin page stats are as follows searcherName : searc...@1d4c3d5 main caching : true

Re: multilanguage prototype

2009-01-27 Thread revathy arun
Hi Shalin, The admin page stats are as follows searcherName : searc...@1d4c3d5 main caching : true numDocs : 0 maxDoc : 0 *name: * /update *class: * org.apache.solr.handler.XmlUpdateRequestHandler *version: * $Revision: 690026 $ *description: * Add documents with XML * stats: *handlerStart :

Re: multilanguage prototype

2009-01-27 Thread Shalin Shekhar Mangar
Are you looking for it in the right place? It is very unlikely that a commit happens and the index is not created. The index is usually created inside the data directory as configured in your solrconfig.xml. Can you search for *:* from the Solr admin page and see if documents are returned? On Tue, Jan

Re: multilanguage prototype

2009-01-27 Thread revathy arun
These are the stats of my update handler, but I still don't see any index created. *stats: *commits : 7 autocommits : 0 optimizes : 2 docsPending : 0 adds : 0 deletesById : 0 deletesByQuery : 0 errors : 0 cumulative_adds : 0 cumulative_deletesById : 0 cumulative_deletesByQuery : 0 cumulative_errors : 0

Re: multilanguage prototype

2009-01-27 Thread revathy arun
Hi, I have committed. The admin page does not show any docs pending or committed or any errors. Regards Sujatha On 1/27/09, Shalin Shekhar Mangar wrote: > > Did you commit after the updates? > > 2009/1/27 revathy arun > > > Hi, > > > > I have downloaded Solr 1.3.0. > > > > I need to index chines

Re: multilanguage prototype

2009-01-27 Thread Shalin Shekhar Mangar
Did you commit after the updates? 2009/1/27 revathy arun > Hi, > > I have downloaded Solr 1.3.0. > > I need to index Chinese content; for this I have defined a new field in the > schema > > as > > > positionIncrementGap="100"> > > > > > > > > > > > > > > > > > > I believe Solr 1.3 already