On 17 Feb 2009, at 21:26, Grant Ingersoll wrote:
I believe Karl Wettin submitted a Lucene patch for a Language
guesser: http://issues.apache.org/jira/browse/LUCENE-826 but it is
marked as won't fix.
The test case of LUCENE-1039 is a language classifier. I've used the patch
to detect languages of
On 2/17/09 12:26 PM, "Grant Ingersoll" wrote:
> If purchasing, several companies offer solutions, but I don't know
> that their quality is any better than what you can get through open
> source, as generally speaking, the problem is solved with a high
> degree of accuracy through n-gram analysis.
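The n-gram analysis Grant mentions can be sketched in a few lines. This is a toy, Cavnar-Trenkle-style ranked-trigram comparison; the tiny inline sample texts here are illustrative stand-ins for real trained language profiles, not anything from an actual library:

```java
import java.util.*;
import java.util.stream.*;

// Toy character-trigram language guesser (Cavnar & Trenkle style):
// rank trigrams by frequency, compare rankings by "out-of-place" distance.
public class NGramLanguageGuesser {

    // Frequency-ranked list of character trigrams for a text.
    static List<String> profile(String text, int topK) {
        Map<String, Integer> counts = new HashMap<>();
        String s = " " + text.toLowerCase() + " ";
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(topK)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    // "Out-of-place" distance between the document profile and a language profile.
    static int distance(List<String> doc, List<String> lang) {
        int d = 0;
        for (int i = 0; i < doc.size(); i++) {
            int j = lang.indexOf(doc.get(i));
            d += (j < 0) ? lang.size() : Math.abs(i - j);
        }
        return d;
    }

    static String guess(String text, Map<String, List<String>> langProfiles) {
        List<String> doc = profile(text, 300);
        return langProfiles.entrySet().stream()
                .min(Comparator.comparingInt(
                        (Map.Entry<String, List<String>> e) -> distance(doc, e.getValue())))
                .map(Map.Entry::getKey)
                .orElse("unknown");
    }

    public static void main(String[] args) {
        Map<String, List<String>> profiles = new HashMap<>();
        // Toy profiles from tiny samples; real systems train on large corpora.
        profiles.put("en", profile("the quick brown fox jumps over the lazy dog and the cat", 300));
        profiles.put("de", profile("der schnelle braune fuchs springt ueber den faulen hund und die katze", 300));
        System.out.println(guess("the dog and the fox", profiles));
    }
}
```

Real tools (ngramj, Tika's language identifier) work the same way, just with profiles trained offline per language.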
Sent: Tuesday, February 17, 2009 6:39:40 PM
Subject: Re: Multilanguage
Does Apache Tika help find the language of the given document?
On 2/17/09, Till Kinstler wrote:
Paul Libbrecht wrote:
Clearly, then, something that matches words in a dictionary and decides
on the language based on the language of the majority could do a decent
job to decide the analyzer.
ementation is at the URL below my name.
>
> Otis --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> From: revathy arun
> To: solr-user@lucene.apache.org
> Sent: Tuesday, February 17, 2009 6:39:40 PM
> Subject: Re: Multilanguage
Paul Libbrecht wrote:
Clearly, then, something that matches words in a dictionary and decides
on the language based on the language of the majority could do a decent
job to decide the analyzer.
Does such a tool exist?
I once played around with http://ngramj.sourceforge.net/ for language detection.
I was looking for such a tool and haven't found it yet.
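The dictionary-majority idea Paul and Till are discussing could be sketched like this. The word sets below are tiny illustrative stand-ins for real per-language dictionaries:

```java
import java.util.*;

// Sketch: look each token up in per-language word lists and pick the
// language that matches the most tokens.
public class MajorityVoteGuesser {

    static String guess(String text, Map<String, Set<String>> dictionaries) {
        Map<String, Integer> votes = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            for (Map.Entry<String, Set<String>> e : dictionaries.entrySet()) {
                if (e.getValue().contains(token)) {
                    votes.merge(e.getKey(), 1, Integer::sum);
                }
            }
        }
        return votes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse("unknown");
    }

    public static void main(String[] args) {
        Map<String, Set<String>> dicts = new HashMap<>();
        dicts.put("en", Set.of("the", "and", "house", "language"));
        dicts.put("de", Set.of("der", "und", "haus", "sprache"));
        // "der" and "und" match the German list, nothing matches the English one.
        System.out.println(guess("der hund und die katze", dicts)); // de
    }
}
```

In practice the n-gram approach tends to beat dictionary voting on short or noisy text, since it needs no full word matches.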
Using StandardAnalyzer one can obtain some form of token-stream which
can be used for "agnostic analysis".
Clearly, then, something that matches words in a dictionary and
decides on the language based on the language of the majority could
do a decent job to decide the analyzer.
Hi,
The best option would be to identify the language after parsing the PDF and
then index it using an appropriate analyzer defined in schema.xml.
Otis --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
From: revathy arun
To: solr-user@lucene.apache.org
I recommend that you search both this and the
Lucene list. You'll find that this topic has been
discussed many times, and several approaches
have been outlined.
The searchable archives are linked to from here:
http://lucene.apache.org/java/docs/mailinglists.html.
Best
Erick
On Mon, Feb 16, 2009
Thank you both for your points. For now I am handling it with fuzzy search.
Let's hope this will do for some time :)
Walter Underwood wrote:
> I've done this. There are five cases for the tokens in the search
> index:
>
> 1. Tokens that are unique after stemming (this is good).
> 2. Tokens that are common
Duh. Four cases. For extra credit, what language is "wunder" in?
wunder
On 1/28/09 5:12 PM, "Walter Underwood" wrote:
I've done this. There are five cases for the tokens in the search
index:
1. Tokens that are unique after stemming (this is good).
2. Tokens that are common after stemming (usually trademarks,
like LaserJet).
3. Tokens with collisions after stemming:
German "mit", "MIT" the university
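Case 3 above, cross-language collisions after stemming, can be checked mechanically. A sketch, with plain lowercasing standing in for the real per-language stemmers a Lucene analysis chain would apply:

```java
import java.util.*;

// Sketch: find normalized terms that appear in more than one language's
// token list, i.e. candidates for collisions like German "mit" vs. "MIT".
public class CollisionFinder {

    static Set<String> collisions(Map<String, List<String>> tokensByLang) {
        Map<String, Set<String>> sources = new HashMap<>();
        for (Map.Entry<String, List<String>> e : tokensByLang.entrySet()) {
            for (String token : e.getValue()) {
                // Lowercasing is a stand-in for real stemming/normalization.
                sources.computeIfAbsent(token.toLowerCase(), k -> new HashSet<>())
                       .add(e.getKey());
            }
        }
        Set<String> result = new TreeSet<>();
        sources.forEach((term, langs) -> { if (langs.size() > 1) result.add(term); });
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> tokens = new HashMap<>();
        tokens.put("de", List.of("mit", "haus"));
        tokens.put("en", List.of("MIT", "house"));
        System.out.println(collisions(tokens)); // [mit]
    }
}
```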
I'm not entirely sure about the fine points, but consider the
filters that are available that fold all the diacritics into their
low-ascii equivalents. Perhaps using that filter at *both* index
and search time on the English index would do the trick.
In your example, both would be 'munchen'.
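The folding Erick describes can be approximated in plain Java with Unicode NFD decomposition. Solr exposes the same idea as its accent/ASCII folding filters; this stand-alone sketch just shows the effect on the example terms:

```java
import java.text.Normalizer;

// Sketch: decompose to NFD and strip combining marks, so accented and
// unaccented spellings normalize to the same indexed term.
public class Fold {
    static String fold(String term) {
        return Normalizer.normalize(term, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "")   // drop combining diacritical marks
                .toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(fold("M\u00fcnchen")); // munchen
        System.out.println(fold("Munchen"));      // munchen
    }
}
```

Applying the same filter at both index and query time is what makes the two spellings match.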
Hi,
Your problem seems to be lower level than the Solr code. You are sending
an XML request that contains a character that is illegal per the XML spec. You
should strip these characters out of the data that you send. Or turn off the
XML validation (not recommended because of all kinds of risks).
See
http://www
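A sketch of the stripping step, assuming you control the client code that builds the XML: drop every code point outside the ranges XML 1.0 allows before posting the document to Solr.

```java
// Sketch: remove code points not legal in XML 1.0 (control characters
// below 0x20 except tab, newline, and carriage return; surrogate range;
// 0xFFFE/0xFFFF) before sending the update request.
public class XmlClean {
    static String stripInvalidXmlChars(String in) {
        StringBuilder out = new StringBuilder(in.length());
        in.codePoints().forEach(cp -> {
            boolean legal = cp == 0x9 || cp == 0xA || cp == 0xD
                    || (cp >= 0x20 && cp <= 0xD7FF)
                    || (cp >= 0xE000 && cp <= 0xFFFD)
                    || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (legal) out.appendCodePoint(cp);
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripInvalidXmlChars("bad\u0001\u0007chars")); // badchars
    }
}
```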
Hi,
I am getting this error in the Tomcat log file on passing Chinese text to
the content field.
The content field uses the CJK tokenizer
and is defined as
INFO: [] webapp=/lang_prototype path=/update params={} status=0 QTime=69
Jan 28, 2009 12:17:03 PM org.apache.solr.common.
Hi,
This is the only info in the tomcat log at indexing
Jan 27, 2009 3:46:15 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/lang_prototype path=/update params={} status=0 QTime=191
I don't see any other errors in the logs.
When I use curl to update, I get a success message, and on commit:
errors: 11
What were those?
My hunch is your indexer had issues. What did Solr output into the
console or log during indexing?
Erik
On Jan 27, 2009, at 6:56 AM, revathy arun wrote:
Hi Shalin,
The admin page stats are as follows
searcherName : searc...@1d4c3d5 main
caching : true
numDocs : 0
maxDoc : 0
*name: * /update *class: * org.apache.solr.handler.XmlUpdateRequestHandler
*version: * $Revision: 690026 $ *description: * Add documents with XML *
stats: *handlerStart :
Are you looking for it in the right place? It is very unlikely that a commit
happens and the index is not created.
The index is usually created inside the data directory as configured in your
solrconfig.xml.
Can you search for *:* from the solr admin page and see if documents are
returned?
On Tue, Jan
These are the stats of my update handler,
but I still don't see any index created:
*stats: *commits : 7
autocommits : 0
optimizes : 2
docsPending : 0
adds : 0
deletesById : 0
deletesByQuery : 0
errors : 0
cumulative_adds : 0
cumulative_deletesById : 0
cumulative_deletesByQuery : 0
cumulative_errors : 0
Hi
I have committed. The admin page does not show any docs pending or committed,
or any errors.
Regards
Sujatha
On 1/27/09, Shalin Shekhar Mangar wrote:
Did you commit after the updates?
2009/1/27 revathy arun
> Hi,
>
> I have downloaded Solr 1.3.0.
>
> I need to index Chinese content; for this I have defined a new field in the
> schema
>
> as
>
>
> positionIncrementGap="100">
> I believe Solr 1.3 already
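The schema snippet quoted above lost its XML tags in the mail archive; only the `positionIncrementGap="100"` fragment survived. A field type of the kind being discussed, wired to the CJK analyzer from Lucene's contrib analyzers, would look roughly like this in schema.xml (the field and type names here are illustrative, not recovered from the original mail):

```xml
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer class="org.apache.lucene.analysis.cjk.CJKAnalyzer"/>
</fieldType>

<field name="content" type="text_cjk" indexed="true" stored="true"/>
```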