Language support

2008-03-19 Thread David King
This has probably been asked before, but I'm having trouble finding  
it. Basically, we want to be able to search for content across several  
languages, given that we know what language a datum and a query are  
in. Is there an obvious way to do this?


Here's the longer version: I am trying to index content that occurs in  
multiple languages, including Asian languages. I'm in the process of  
moving from PyLucene to Solr. In PyLucene, I would have a list of  
analysers:


analyzers = dict(en = pyluc.SnowballAnalyzer("English"),
                 cs = pyluc.CzechAnalyzer(),
                 pt = pyluc.SnowballAnalyzer("Portuguese"),
                 ...)

Then when I want to index something, I do

   analyzer = analyzers[lang]
   writer = pyluc.IndexWriter(store, analyzer, create)
   writer.addDocument(d.doc)

That is, I tell Lucene the language of every datum, and the analyser  
to use when writing out the field. Then when I want to search against  
it, I do


   analyzer = LanguageAnalyzer.getanal(lang)
   q = pyluc.QueryParser(field, analyzer).parse(value)

And use that QueryParser to parse the query in the given language
before sending it off to PyLucene. (Off-topic: getanal() is perhaps my
favourite function name ever.) So the language of a given datum is
attached to the datum itself. In Solr, however, the analyser appears
to be attached to the field, not to the individual data in it.
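Concretely, a field type in schema.xml declares its analyser once for
all documents of that field, something like this illustrative sketch
(not my actual schema; the tokeniser and filter factories are just the
stock ones):

```xml
<fieldType name="text_en" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
```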

Does this mean that there's no way to have a single "contents" field
that has content in multiple languages, and still have the queries be
parsed and stemmed correctly? How are other people handling this? Does
it make sense to write a tokeniser factory and a query factory that
look at, say, the 'lang' field and return the correct tokenisers? Does
this already exist?


The other alternative is to have a text_zh field, a text_en field,
etc., and to modify the query to search on that field depending on the
language of the query, but that seems kind of hacky to me, especially
if a query may be against more than one language. Is this the accepted
way to go about it? Is there a benefit to this method over writing a
detecting tokeniser factory?
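For concreteness, the per-field workaround I have in mind would expand
a query across one field per language, roughly like this sketch (the
text_en/text_pt field names are assumptions, not an existing schema):

```python
# Sketch: expand a query across per-language fields (text_en, text_zh,
# ...) so a single request can cover every language the query might be
# in. The field names here are hypothetical.

def expand_query(value, langs):
    """Build a Lucene query string OR-ing `value` over one
    per-language field for each language code in `langs`."""
    clauses = ["text_%s:(%s)" % (lang, value) for lang in langs]
    return " OR ".join(clauses)

print(expand_query("stemming", ["en"]))
print(expand_query("stemming", ["en", "pt"]))
```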




Re: Language support

2008-03-20 Thread David King
Nicolas wrote:

You may be interested in a recent discussion that took place on a
similar subject:
http://www.mail-archive.com/solr-user@lucene.apache.org/msg09332.html


Interesting, yes. But since it doesn't actually exist, it's not much  
help.


I guess what I'm asking is: if my approach seems convoluted, I'm
probably doing it wrong, so how *are* people solving the problem of
searching over multiple languages? What is the canonical way to do this?










Re: Language support

2008-03-20 Thread David King
Unless you can come up with language-neutral tokenization and
stemming, you need to:

a) know the language of each document.
b) run a different analyzer depending on the language.
c) force the user to tell you the language of the query.
d) run the query through the same analyzer.


I can do all of those. This implies storing all of the different
languages in different fields, right? Then changing the default search
field to the language of the query for every query?







