First of all, I am not really concerned with the "per field"
(or per-column, in DB terms) portion of the original request.
Most documents are monolingual.

How languages are identified depends on your application,
and database support for language tagging is not necessary.

The database schema designer may have created a field that 
stores the language information, for example.

If you are indexing documents that live in a file system,
the directory hierarchy or the name of the documents might
tell the language, assuming you have set up some standard
naming convention.
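
As a rough sketch of the naming-convention idea, the function below guesses a
language code from a path. The two conventions (a directory named after the
language, or a `_xx` suffix on the file name), the `KNOWN_LANGS` set, and the
function name `lang_from_path` are all assumptions made up for illustration;
your own convention may well differ.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical set of ISO 639-1 codes the application supports.
KNOWN_LANGS = {"en", "fr", "de", "ja"}


def lang_from_path(path: str) -> Optional[str]:
    """Guess a language code from a directory name or a file-name suffix."""
    p = PurePosixPath(path)
    # Assumed convention 1: a directory named after the language,
    # e.g. docs/fr/intro.txt
    for part in p.parts[:-1]:
        if part.lower() in KNOWN_LANGS:
            return part.lower()
    # Assumed convention 2: a suffix on the file stem, e.g. intro_de.txt
    stem = p.stem
    if "_" in stem:
        suffix = stem.rsplit("_", 1)[1].lower()
        if suffix in KNOWN_LANGS:
            return suffix
    # No convention matched; the caller has to find the language another way.
    return None
```

The returned code could then be used to pick the target field (title_en,
title_fr, ...) at indexing time.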

HTML documents may have a META tag for Content-Language.
If the document comes from an HTTP feed, there may be a Content-Language header.
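
A minimal sketch of reading those two declarations, using only Python's
standard library. The function name `declared_language`, the choice to let the
HTTP header win over the META tag, and taking only the first language when
several are listed are my assumptions, not anything Solr prescribes.

```python
from html.parser import HTMLParser
from typing import Mapping, Optional


def _first_tag(value: str) -> Optional[str]:
    """Return the first language tag of a comma-separated list, or None."""
    first = value.split(",")[0].strip()
    return first or None


class _MetaLangParser(HTMLParser):
    """Records the first <meta http-equiv="Content-Language"> value seen."""

    def __init__(self) -> None:
        super().__init__()
        self.lang: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.lang is not None:
            return
        # Attribute values may be None for valueless attributes.
        a = {k.lower(): (v or "") for k, v in attrs}
        if a.get("http-equiv", "").lower() == "content-language":
            self.lang = _first_tag(a.get("content", ""))


def declared_language(html: str,
                      headers: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the declared language; the header takes precedence (assumed)."""
    for name, value in (headers or {}).items():
        if name.lower() == "content-language":
            lang = _first_tag(value)
            if lang:
                return lang
    parser = _MetaLangParser()
    parser.feed(html)
    return parser.lang
```

Either source can be wrong or missing in practice, so the result is best
treated as a hint rather than as ground truth.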

And if all else fails, or the information is not reliable, the language
can be determined by statistical analysis of the document, using software
such as Nutch's Language Identifier or a commercial language identifier
like the one my employer, Basis Technology, sells.
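
To give a feel for what "statistical" means here, below is a deliberately
naive toy that counts common function words per language. It is NOT how
Nutch's Language Identifier or any commercial product works; real identifiers
are trained on large corpora and are far more robust. The word lists and the
name `guess_language` are invented for this sketch.

```python
import re
from typing import Optional

# Tiny hand-picked word lists, for illustration only.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "that"},
    "fr": {"le", "la", "les", "et", "de", "est", "que"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein"},
}


def guess_language(text: str) -> Optional[str]:
    """Return the language whose stopwords occur most often, or None."""
    # Runs of letters only (no digits, underscores, or punctuation).
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(1 for w in words if w in stop)
        for lang, stop in STOPWORDS.items()
    }
    # On a tie, max() keeps the first language in dictionary order.
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

A toy like this breaks down on short texts and on languages without
whitespace-separated words, which is one reason to use a real identifier.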

> Most databases only RECENTLY have set up languages per column. Languages per 
> ENTRY in a column? I don't think any support that yet. How would you get that 
> information from a database with the corresponding language attribute?
> 
> 
> Dennis Gearon
> 
> Signature Warning
> ----------------
> EARTH has a Right To Life,
>  otherwise we all die.
> 
> Read 'Hot, Flat, and Crowded'
> Laugh at http://www.yert.com/film.php
> 
> 
> --- On Wed, 3/24/10, Teruhiko Kurosaka <k...@basistech.com> wrote:
> 
>> From: Teruhiko Kurosaka <k...@basistech.com>
>> Subject: Re: If you could have one feature in Solr...
>> To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
>> Date: Wednesday, March 24, 2010, 11:36 AM
>> (Sorry for very late response on this
>> topic.)
>> 
>> On Feb 28, 2010, at 5:47 AM, Adrien Specq wrote:
>> 
>>> - language attribute for each field
>> 
>> I was thinking about it and it was one of my wishes.
>> Currently, Solr practically requires that we have
>> a field for each natural language that an application
>> supports.  If the app needs to support English, French and
>> German, we would have to have title_en, title_fr, and title_de
>> (suffixes are ISO 2-letter lang codes) instead of just
>> a title field.  This isn't pretty.
>> 
>> What if we want to support 15 languages?  It would be much
>> better if we can have just one title field and language
>> information associated with the value.
>> 
>> But after I thought about it a bit deeper, I think the
>> current ugly solution is actually practical.  This is because
>> most users want to find documents of the languages they
>> understand.  So if a user indicates they understand English and
>> German only, we just need to search title_en and title_de.
>> 
>> Maybe I'm missing something...
>> 
>> ----
>> Teruhiko "Kuro" Kurosaka, 415-227-9600 x122
>> RLP + Lucene & Solr = powerful search for global
>> contents
>> 
>> 

----
Teruhiko "Kuro" Kurosaka, 415-227-9600 x122
RLP + Lucene & Solr = powerful search for global contents
