Then you expand your query for surfing waves (say) to:
- phrase query: surfing waves exactly (^2.0)
- two terms, no stemming: surfing waves (^1.5)
- iterate through the languages and query for stemmed variants:
  - english: surf wav ^1.0
  - german: surfing wave ^0.9
  - ....
- then maybe even try the phonetic analyzer (matched in a separate field probably)
I think this is a common pattern on the web where the users, browsers, and servers are all somewhat multilingual.
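A rough sketch of that expansion, in case it helps. It only builds a Lucene-syntax query string; the field names (text_exact, text_en, text_de) and the helper name are made up for illustration, and the stemmed variants are passed in by hand rather than produced by a real stemmer. The phonetic step is left out.

```python
def expand_query(terms, stems_by_lang, field="text"):
    """Build a boosted query string: phrase, unstemmed terms, then per-language stems.

    terms         -- the raw query words, e.g. ["surfing", "waves"]
    stems_by_lang -- {lang: (stemmed_words, boost)}; stems are supplied by the
                     caller (hypothetical input, no stemmer is invoked here)
    field         -- hypothetical field-name prefix
    """
    raw = " ".join(terms)
    clauses = [
        '%s_exact:"%s"^2.0' % (field, raw),   # exact phrase, highest boost
        "%s_exact:(%s)^1.5" % (field, raw),   # both terms, no stemming
    ]
    # One clause per language, each against its own stemmed field.
    for lang, (stems, boost) in stems_by_lang.items():
        clauses.append("%s_%s:(%s)^%s" % (field, lang, " ".join(stems), boost))
    return " OR ".join(clauses)


query = expand_query(
    ["surfing", "waves"],
    {"en": (["surf", "wav"], 1.0), "de": (["surfing", "wave"], 0.9)},
)
print(query)
```

The boosts are the ones from the list above; in practice they would need tuning against real queries.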
paul

On 2 Jul 2009, at 22:15, Otis Gospodnetic wrote:
Michael,

I think you really ought to know the language of the query (from a pulldown, from the browser, from user settings, somewhere) and pass that to the backend... unless your queries are sufficiently long that their language can be identified.

Here is a handy tool for playing with language identification: http://www.sematext.com/demo/lid/

You'll see how hard it is to guess the language of very short texts. :)

You really want to avoid that huge OR. Often it makes no sense to OR in a multilingual context. Think about the word "die" (English and German, as you know) and what happens when you include that in an OR. And does it make sense to include a "very language specific word", say "wunderbar", in an OR that goes across multiple/all languages? Funny, they have it listed at http://www.merriam-webster.com/dictionary/wunderbar

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Michael Lackhoff <mich...@lackhoff.de>
To: solr-user@lucene.apache.org
Sent: Thursday, July 2, 2009 2:58:41 PM
Subject: Preparing the ground for a real multilang index

As pointed out in the recent thread about stemmers and other language specifics, I should handle them all in their own right. But how?

The first problem is how to know the language. Sometimes I have a language identifier within the record, sometimes I have more than one, sometimes I have none. How should I handle the non-obvious cases?

Say I somehow know record1 is English and record2 is German. Then I need all my (relevant) fields for every language, e.g. I will have TITLE_ENG and TITLE_GER, and both will have their respective stemmer. But what about exotic languages? Use a catch-all "language" without a stemmer?

Now a user searches for TITLE:term and I don't know beforehand the language of "term". Do I have to expand the query to something like "TITLE_ENG:term OR TITLE_GER:term OR TITLE_XY:term OR ...", or is there some sort of copyfield for analyzed fields? Then I could just copy all the TITLE_* fields to TITLE and not bother with the language of the query.

Are there any solutions that prevent an index with thousands of fields and dozens of ORed query terms?

I know I will have to implement some better multilanguage support, but I would also like to keep it as simple as possible.

-Michael
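
For reference, the per-language OR expansion asked about above could be generated along these lines. This is only a sketch: the helper names are made up, the language suffixes are the ones from the message, and the term is assumed to need no escaping. The second helper builds the space-separated field list that Solr's dismax handler takes in its qf parameter, which is one possible way to avoid writing the OR by hand; whether it fits this index is an open question.

```python
def per_language_or(field, term, languages):
    """Expand FIELD:term into FIELD_<LANG>:term clauses joined by OR.

    Hypothetical helper; assumes `term` contains no characters that
    need escaping in the query syntax.
    """
    return " OR ".join("%s_%s:%s" % (field, lang, term) for lang in languages)


def dismax_qf(field, languages):
    """Build a space-separated field list, as used for dismax's qf parameter."""
    return " ".join("%s_%s" % (field, lang) for lang in languages)


or_query = per_language_or("TITLE", "term", ["ENG", "GER"])
qf_value = dismax_qf("TITLE", ["ENG", "GER"])
print(or_query)
print(qf_value)
```

Either way the number of clauses still grows with the number of languages, so passing the query language to the backend, as suggested in the reply, keeps the query smaller.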