Re: Preparing the ground for a real multilang index

Benson Margulies Tue, 07 Jul 2009 16:14:32 -0700

There is an alternative to knowing the language at query time:
process the query several times, producing stems or lemmas for each
of the possible languages.
This may well be a cure much worse than the disease.
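
To make the trade-off concrete, here is a minimal sketch of that
alternative, not a real implementation. The two stemmers are toy suffix
strippers standing in for real per-language analyzers, and the field
names text_en / text_de are assumed from the text_<lang> convention
mentioned further down the thread.

```python
# Toy suffix strippers (hypothetical stand-ins for real per-language stemmers).
def stem_en(word):
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_de(word):
    for suffix in ("en", "er", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Assumed language set; a real index would list every indexed language.
STEMMERS = {"en": stem_en, "de": stem_de}


def expand_query(terms, stemmers=STEMMERS):
    """Build one Lucene-style clause per language and OR them together."""
    clauses = []
    for lang, stem in sorted(stemmers.items()):
        stems = " ".join(stem(t.lower()) for t in terms)
        clauses.append(f"text_{lang}:({stems})")
    return " OR ".join(clauses)


print(expand_query(["surfing", "waves"]))
```

Every additional language adds another clause to every query, and a
stemmer applied to words from the wrong language can produce stems that
match unrelated terms, which is where the cure may get worse than the
disease.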


Yes, LI can sell you our lemma-production capability.

--benson margulies
basis technology




On Tue, Jul 7, 2009 at 6:50 PM, Jan Høydahl<j...@cominvent.com> wrote:
> When using stemming, you have to know the query language.
> For your project, perhaps you should look into switching to a lemmatizer
> instead. I believe Lucid can provide integration with a commercial
> lemmatizer. This way you can expand the document field itself and do not
> need to know the query language. You may then want to do a copyField from
> all your text_<lang> -> text for convenient one-field-to-rule-them-all
> search.
>
> --
> Jan Høydahl
> Gründer & senior architect
> Cominvent AS, Stabekk, Norway
> www.cominvent.com
> +20 100930908
>
> On 3. juli. 2009, at 08.43, Michael Lackhoff wrote:
>
>> On 03.07.2009 00:49 Paul Libbrecht wrote:
>>
>> [I'll try to address the other responses as well]
>>
>>> I believe the proper way is for the server to compute a list of
>>> accepted languages in order of preferences.
>>> The web-platform language (e.g. the user-setting), and the values in
>>> the Accept-Language http header (which are from the browser or
>>> platform).
>>
>> All this is not going to help much because the main application is a
>> scientific search portal for books and articles with many users
>> searching cross-language. The most typical use case is a German user
>> searching across languages, so the query itself might even be
>> multilingual, e.g. TITLE:cancer OR TITLE:krebs. No way here to watch
>> out for Accept-headers or a language select field (would be left on
>> "any" in most cases). Other popular use cases are citations (in
>> whatever language) cut and pasted into the search field.
>>
>>> Then you expand your query for surfing waves (say) to:
>>> - phrase query: surfing waves exactly (^2.0)
>>> - two terms, no stemming: surfing waves (^1.5)
>>> - iterate through the languages and query for stemmed variants:
>>>  - english: surf wav ^1.0
>>>  - german: surfing wave ^0.9
>>>  - ....
>>> - then maybe even try the phonetic analyzer (matched in a separate
>>> field probably)
>>
>> This is an even more sophisticated variant of the multiple "OR" I came
>> up with. Oh well...
>>
>>> I think this is a common pattern on the web where the users, browsers,
>>> and servers are all somewhat multilingual.
>>
>> Indeed, and often users are not even aware of it. Especially in a
>> scientific context they use their native tongue and English almost
>> interchangeably -- and they expect the search engine to cope with it.
>>
>> I think the best approach would be to process the data according to its
>> language without making any assumptions about the query language, but I
>> am totally lost as to how to get a clever schema.xml out of all this.
>>
>> Thanks everyone for listening and I am still open for good suggestions
>> to deal with this problem!
>>
>> -Michael
>
>
