There is an alternative to knowing the language at query time: run the query through the stemmer or lemmatizer of every possible language. This may well be a cure much worse than the disease.
Yes, LI can sell you our lemma-production capability.

--benson margulies
basis technology

On Tue, Jul 7, 2009 at 6:50 PM, Jan Høydahl<j...@cominvent.com> wrote:
> When using stemming, you have to know the query language.
> For your project, perhaps you should look into switching to a lemmatizer
> instead. I believe Lucid can provide integration with a commercial
> lemmatizer. This way you can expand the document field itself and do not
> need to know the query language. You may then want to do a copyfield from
> all your text_<lang> -> text for convenient one-field-to-rule-them-all
> search.
>
> --
> Jan Høydahl
> Gründer & senior architect
> Cominvent AS, Stabekk, Norway
> www.cominvent.com
> +20 100930908
>
> On 3. juli. 2009, at 08.43, Michael Lackhoff wrote:
>
>> On 03.07.2009 00:49 Paul Libbrecht wrote:
>>
>> [I'll try to address the other responses as well]
>>
>>> I believe the proper way is for the server to compute a list of
>>> accepted languages in order of preferences.
>>> The web-platform language (e.g. the user-setting), and the values in
>>> the Accept-Language http header (which are from the browser or
>>> platform).
>>
>> All this is not going to help much because the main application is a
>> scientific search portal for books and articles with many users
>> searching cross-language. The most typical use case is a German user
>> searching multilingual. So we might even get the search multilingual,
>> e.g. TITLE:cancer OR TITLE:krebs. No way here to watch out for
>> Accept-headers or a language select field (would be left on "any" in
>> most cases). Other popular use cases are citations (in whatever
>> language) cut and pasted into the search field.
>>
>>> Then you expand your query for surfing waves (say) to:
>>> - phrase query: surfing waves exactly (^2.0)
>>> - two terms, no stemming: surfing waves (^1.5)
>>> - iterate through the languages and query for stemmed variants:
>>>   - english: surf wav ^1.0
>>>   - german surfing wave ^0.9
>>>   - ....
>>> - then maybe even try the phonetic analyzer (matched in a separate
>>> field probably)
>>
>> This is an even more sophisticated variant of the multiple "OR" I came
>> up with. Oh well...
>>
>>> I think this is a common pattern on the web where the users, browsers,
>>> and servers are all somewhat multilingual.
>>
>> indeed and often users are not even aware of it, especially in a
>> scientific context they use their native tongue and English almost
>> interchangeably -- and they expect the search engine to cope with it.
>>
>> I think the best would be to process the data according to its language
>> but don't make any assumptions about the query language and I am totally
>> lost how to get a clever schema.xml out of all this.
>>
>> Thanks everyone for listening and I am still open for good suggestions
>> to deal with this problem!
>>
>> -Michael
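
For reference, here is a rough sketch in Python of the boosted query expansion Paul outlines in the quoted message. It only builds a query string. The two stemmers are toy suffix-strippers made up for illustration (they are not Solr or Lucene analyzers), and the field names text, text_en and text_de are assumptions based on the text_<lang> naming mentioned in the thread.

```python
from typing import Callable, List, Tuple


def toy_english_stem(term: str) -> str:
    """Hypothetical stand-in for an English stemmer: strips a few suffixes."""
    for suffix in ("ing", "es", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def toy_german_stem(term: str) -> str:
    """Hypothetical stand-in for a German stemmer: strips a few suffixes."""
    for suffix in ("en", "n", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def expand_query(
    text: str,
    field: str,
    stemmers: List[Tuple[str, Callable[[str], str], float]],
) -> str:
    """Build one OR query: exact phrase, unstemmed terms, then one stemmed
    clause per language, each with its own boost."""
    terms = text.lower().split()
    phrase = " ".join(terms)
    clauses = [
        f'{field}:"{phrase}"^2.0',  # exact phrase, highest boost
        f"{field}:({phrase})^1.5",  # both terms, no stemming
    ]
    for lang, stem, boost in stemmers:
        stemmed = " ".join(stem(t) for t in terms)
        # assumed per-language field, e.g. text_en, text_de
        clauses.append(f"{field}_{lang}:({stemmed})^{boost}")
    return " OR ".join(clauses)


if __name__ == "__main__":
    print(
        expand_query(
            "surfing waves",
            "text",
            [("en", toy_english_stem, 1.0), ("de", toy_german_stem, 0.9)],
        )
    )
```

Every additional language adds one more clause, so the query grows linearly with the number of languages searched, which is part of the cost of this approach.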