Hi Jan, I totally agree with what you said.
In a), you talked about boosting. I guess you meant boosting on the client side, right?

I still have a question:

>> does Solr choose the appropriate analysis for the query, i.e., if a query is
>> compared to a document having English free text (text_en is populated), does
>> Solr analyze it as if it were in English?

Thanks,
-Saïd

On Jul 1, 2010, at 1:26 PM, Jan Høydahl / Cominvent wrote:

> Hi,
>
> I have chosen the same approach as you, indexing content into text_<language>
> fields with custom analysis, and it works great. Solr does not have any
> overhead with this even if there are hundreds of languages, due to the
> schema-less nature of Lucene.
>
> And if you know which language is being searched, you can select only those
> fields in question, and you'd still be as fast as the mono-language case. But
> you'd only get documents in that language returned.
>
> Say you want to match across languages, it could be you search for "obama"
> which would be written the same in all languages. How to achieve this? I see
> two approaches:
> a) Search across all languages with proper analysis, as you suggest: qf=text_fr
> text_en^10 (you can even boost the preferred languages).
> b) Index all content in a "text_all" field with no stemming involved and
> search qf=text_all (you will match "obama" in all languages but lose stemming)
>
> My feeling is that a) would work if you have a limited set of languages, but
> b) might be necessary if you have dozens of languages to search across, due
> to reduced query performance with such a large disMax query.
>
> Of course with a) there may be ambiguities where an English word gets stemmed
> and hits the same stem as a totally different French word - I don't have any
> hands-on examples, but I'm sure the issue exists.
> Then it is probably better
> to search the other languages unstemmed, as a hybrid approach:
>
> c) Search the query language stemmed and all others unstemmed (qf=text_en^10
> text_all - giving increased recall)
>
> The downside of a text_all field is that you almost double the size of your
> index in the worst case.
>
> Then you have the issue of displaying the results in the front end.
> Which title do you pick? title_en or title_fr? Here, I also see two solutions
> and I have tried both:
> 1) Store a title_display which is stored, while the title_<language> fields
> are only indexed, not stored. Use the title_display in the front end.
> 2) Make a wrapper around the QueryResult class so when the front end asks for
> "title", you intelligently try to pull out title_XY where XY is pulled from
> the document's "language" metadata.
>
> I think which you choose depends on taste; each has its + and -
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Training in Europe - www.solrtraining.com
>
> On 1. juli 2010, at 12.26, Saïd Radhouani wrote:
>
>> Hi,
>>
>> I know this topic has been treated many times in the (distant) past, but I
>> wonder whether there are new, better practices/tendencies.
>>
>> In my application, I'm dealing with documents in different languages. Each
>> document is monolingual; it has some fields containing free text and a set
>> of fields that do not require any text analysis. For the free text, we need
>> to apply a specific analysis based on the language of the document.
>>
>> I'm for the use of a single index for all the documents instead of one index
>> per language (any objection?). Thus, in schema.xml, I need to declare a
>> separate field for each language (text_fr, text_en, etc.), each with its own
>> appropriate analysis. Then, during indexing, I need to assign the free
>> text content of each document to the appropriate field. Thus, for each
>> document, only one of the free-text fields would be populated.
>>
>> My question is, at search time, what is the best solution to search against
>> the appropriate field?
>>
>> I know that using dismax, we can define in "qf" the set of fields we want
>> to search against, e.g., <str name="qf">text_fr text_en</str>
>>
>> With this solution, does Solr choose the appropriate analysis for the query?
>> I.e., if a query is compared to a document having English free text (text_en
>> is populated), does Solr analyze the query as if it were in English?
>>
>> One problem with this approach is that each query will be compared to all
>> the available documents, i.e., a query in English would be compared to a
>> document in French. As far as I know, if we know the query language, this
>> problem can be avoided, either by searching against the appropriate field
>> (e.g., text_fr:query), or by using a filter to select only those documents
>> having English text. Am I correct? Or is there a better solution?
>>
>> Thanks,
>> -Saïd
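
For reference, the three "qf" variants from the thread (a, b, c) and the title lookup from option 2 could be sketched on the client side roughly as below. This is only an illustration: the helper names (build_qf, dismax_params, pick_title) are made up for this sketch, the field names text_<language>, text_all, title_<language> and "language" are taken from the thread, and the boost of 10 mirrors the example above. It only builds request parameters and does not talk to Solr.

```python
def build_qf(languages, query_lang=None, strategy="a", boost=10):
    """Build a dismax "qf" string for approach a), b) or c) from the thread.

    All names here are hypothetical; field naming follows the thread.
    """
    if strategy == "a":
        # a) search every per-language field, boosting the query language if known
        return " ".join(
            f"text_{lang}^{boost}" if lang == query_lang else f"text_{lang}"
            for lang in languages
        )
    if strategy == "b":
        # b) search only the unstemmed catch-all field
        return "text_all"
    if strategy == "c":
        # c) stemmed field for the query language plus the unstemmed catch-all
        if query_lang is None:
            raise ValueError("strategy 'c' needs the query language")
        return f"text_{query_lang}^{boost} text_all"
    raise ValueError(f"unknown strategy: {strategy!r}")


def dismax_params(user_query, qf):
    """Assemble request parameters for a dismax search."""
    return {"defType": "dismax", "q": user_query, "qf": qf}


def pick_title(doc):
    """Option 2: return title_XY, where XY is the document's "language" value."""
    lang = doc.get("language")
    return doc.get(f"title_{lang}") if lang else None


if __name__ == "__main__":
    qf = build_qf(["fr", "en"], query_lang="en", strategy="a")
    print(dismax_params("obama", qf))
    print(pick_title({"language": "fr", "title_fr": "Bonjour"}))
```

With approach a) the list of languages has to be known by the client, which is one reason b) or c) may be easier to maintain when there are dozens of languages.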