When using stemming, you have to know the query language, because the
query must be stemmed the same way as the indexed field.
For your project, perhaps you should look into switching to a
lemmatizer instead. I believe Lucid can provide integration with a
commercial lemmatizer. This way you can expand the document field
itself and do not need to know the query language. You may then want
to add a copyField from all your text_<lang> fields to a single text
field for convenient one-field-to-rule-them-all search.
--
Jan Høydahl
Gründer & senior architect
Cominvent AS, Stabekk, Norway
www.cominvent.com
+20 100930908
On 3 July 2009, at 08:43, Michael Lackhoff wrote:
On 03.07.2009 00:49 Paul Libbrecht wrote:
[I'll try to address the other responses as well]
I believe the proper way is for the server to compute a list of
accepted languages in order of preference, from the web-platform
language (e.g. the user setting) and the values in the
Accept-Language HTTP header (which come from the browser or
platform).
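
A minimal sketch of how such a preference list could be computed, assuming the user setting is a plain language code and the header follows the usual "tag;q=value" form. The function name and the choice to put the user setting first are assumptions for illustration, not something the thread specifies.

```python
def preferred_languages(user_setting, accept_language_header):
    """Return primary language codes in order of preference.

    Assumption: an explicit user setting outranks the header; header
    entries are ordered by descending q-value, ties keep header order.
    """
    weighted = []
    for position, part in enumerate(accept_language_header.split(",")):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        quality = 1.0  # a missing q parameter means q=1
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat a malformed weight as "not acceptable"
        primary = tag.strip().lower().split("-")[0]  # "en-US" -> "en"
        if primary == "*" or quality <= 0:
            continue
        weighted.append((-quality, position, primary))

    ordered = [user_setting.lower()] if user_setting else []
    for _, _, lang in sorted(weighted):
        if lang not in ordered:
            ordered.append(lang)
    return ordered


print(preferred_languages("de", "en-US,en;q=0.9,de;q=0.8,fr;q=0.5"))
```

The result is only a ranking of candidate languages; it says nothing about the language the query text is actually written in.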
All this is not going to help much because the main application is a
scientific search portal for books and articles with many users
searching cross-language. The most typical use case is a German user
searching across languages. So even a single query may be
multilingual, e.g. TITLE:cancer OR TITLE:krebs. There is no way here
to rely on Accept-Language headers or a language select field (it
would be left on "any" in most cases). Other popular use cases are
citations (in whatever language) cut and pasted into the search field.
Then you expand your query for surfing waves (say) to:
- phrase query: surfing waves exactly (^2.0)
- two terms, no stemming: surfing waves (^1.5)
- iterate through the languages and query for stemmed variants:
  - english: surf wav ^1.0
  - german: surfing wave ^0.9
  - ....
- then maybe even try the phonetic analyzer (matched in a separate
  field probably)
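
The expansion above could be sketched roughly as follows. This only builds a Lucene-style query string; the field names ("text" for the unstemmed field, "text_<lang>" for the stemmed ones), the helper names, and the two toy stemmers are all hypothetical stand-ins, and the input is assumed to be plain words with no query-syntax characters to escape.

```python
def toy_english_stem(word):
    # Crude stand-in for a real English stemmer (illustration only).
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_german_stem(word):
    # Crude stand-in for a real German stemmer (illustration only).
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def expand_query(terms, stemmers, unstemmed_field="text"):
    """OR together an exact phrase, the raw terms, and one stemmed
    variant per language, each clause carrying its own boost."""
    raw = " ".join(terms)
    clauses = [
        f'{unstemmed_field}:"{raw}"^2.0',  # exact phrase
        f"{unstemmed_field}:({raw})^1.5",  # both terms, no stemming
    ]
    for lang, (stem, boost) in stemmers.items():
        stemmed = " ".join(stem(term) for term in terms)
        clauses.append(f"text_{lang}:({stemmed})^{boost}")
    return " OR ".join(clauses)


# Language code -> (stemmer, boost); boosts taken from the list above.
stemmers = {
    "en": (toy_english_stem, 1.0),
    "de": (toy_german_stem, 0.9),
}
print(expand_query(["surfing", "waves"], stemmers))
```

Each stemmed clause targets its own per-language field, so the index side can still be analyzed per document language while the query side makes no assumption about which language the user typed.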
This is an even more sophisticated variant of the multiple "OR" I came
up with. Oh well...
I think this is a common pattern on the web where the users, browsers,
and servers are all somewhat multilingual.
Indeed, and often users are not even aware of it. Especially in a
scientific context, they use their native tongue and English almost
interchangeably, and they expect the search engine to cope with it.
I think the best would be to process the data according to its
language but not make any assumptions about the query language, and I
am totally lost as to how to get a clever schema.xml out of all this.
Thanks everyone for listening and I am still open for good suggestions
to deal with this problem!
-Michael