Re: Multilingual - Search against the appropriate field

Saïd Radhouani Thu, 01 Jul 2010 09:50:51 -0700

Hi Jan,

I totally agree with what you said.


In a), you talked about boosting. I guess you meant to boost at the client 
side, right?

I still have a question: 

>> does Solr choose the appropriate analysis for the query? I.e., if a query is 
>> compared to a document having English free text (text_en is populated), does 
>> Solr analyze it as if it were in English?


Thanks,
-Saïd
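
For reference, a minimal client-side sketch of the three qf variants (a, b, c)
from Jan's reply quoted below, including the client-side boost. The field names
text_<language> and text_all come from the thread. The "language" filter field,
the function names and the default boost of 10 are assumptions made for
illustration only, not anything Solr prescribes.

```python
from urllib.parse import urlencode


def build_qf(strategy, languages, query_lang=None, boost=10):
    """Return a dismax qf string for strategy 'a', 'b' or 'c'."""
    if strategy == "a":
        # All per-language fields; boost the preferred language if it is known.
        return " ".join(
            f"text_{lang}^{boost}" if lang == query_lang else f"text_{lang}"
            for lang in languages
        )
    if strategy == "b":
        # Single unstemmed catch-all field.
        return "text_all"
    if strategy == "c":
        # Hybrid: stemmed field of the query language plus the catch-all field.
        if query_lang is None:
            raise ValueError("strategy 'c' needs the query language")
        return f"text_{query_lang}^{boost} text_all"
    raise ValueError(f"unknown strategy: {strategy!r}")


def build_params(q, qf, filter_lang=None):
    """Assemble request parameters; 'language' is a hypothetical metadata field."""
    params = [("defType", "dismax"), ("q", q), ("qf", qf)]
    if filter_lang is not None:
        # Restrict results to documents of one language.
        params.append(("fq", f"language:{filter_lang}"))
    return params


if __name__ == "__main__":
    qf = build_qf("a", ["fr", "en"], query_lang="en")
    print(qf)
    print(urlencode(build_params("obama", qf, filter_lang="en")))
```

The boost stays on the client here, so the preferred language can change per
request without touching solrconfig.xml.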

On Jul 1, 2010, at 1:26 PM, Jan Høydahl / Cominvent wrote:

> Hi,
> 
> I have chosen the same approach as you, indexing content into text_<language> 
> fields with custom analysis, and it works great. Solr does not have any 
> overhead with this even if there are hundreds of languages, due to the 
> schema-less nature of Lucene.
> 
> And if you know which language is being searched, you can select only the 
> fields in question, and you'd still be as fast as in the monolingual case. But 
> you'd only get documents in that language returned.
> 
> Say you want to match across languages, it could be you search for "obama" 
> which would be written the same in all languages. How to achieve this? I see 
> two approaches:
> a) Search across all languages with proper analysis, as you suggest: qf=text_fr 
> text_en^10 (you can even boost the preferred languages).
> b) Index all content in a "text_all" field with no stemming involved and 
> search qf=text_all (you will match "obama" in all languages but lose stemming)
> 
> My feeling is that a) would work if you have a limited set of languages, but 
> b) might be necessary if you have dozens of languages to search across, due 
> to reduced query performance with such a large disMax query.
> 
> Of course with a) there may be ambiguities where an English word gets stemmed 
> and hits the same stem as a totally different French word - I don't have any 
> hands-on examples, but I'm sure the issue exists. In that case it is probably 
> better to search the other languages unstemmed, as a hybrid approach:
> 
> c) Search the query language stemmed and all other unstemmed (qf=text_en^10 
> text_all - giving increased recall)
> 
> The downside of a text_all field is that, in the worst case, you almost 
> double the size of your index.
> 
> Then you have the issue of displaying the results in front end.
> Which title do you pick? title_en or title_fr? Here, I also see two solutions 
> and I have tried both:
> 1) Store a title_display which is stored, while the title_<language> fields 
> are only indexed, not stored. Use the title_display in frontend
> 2) Make a wrapper around the QueryResult class so that when the frontend asks 
> for "title", you intelligently try to pull out title_XY, where XY is taken from 
> the document's "language" metadata.
> 
> I think which you choose depends on taste, each has its + and -
> 
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Training in Europe - www.solrtraining.com
> 
> On 1. juli 2010, at 12.26, Saïd Radhouani wrote:
> 
>> Hi,
>> 
>> I know this topic has been treated many times in the (distant) past, but I 
>> wonder whether there are new better practices/tendencies.
>> 
>> In my application, I'm dealing with documents in different languages. Each 
>> document is monolingual; it has some fields containing free text and a set 
>> of fields that do not require any text analysis. For the free text, we need 
>> to make a specific analysis based on the language of the document.
>> 
>> I'm for the use of a single index for all the documents instead of one index 
>> per language (any objection?). Thus, in schema.xml, I need to declare a 
>> separate field for each language (text_fr, text_en, etc.), each with its own 
>> appropriate analysis. Then, during the indexing, I need to assign the free 
>> text content of each document to the appropriate field. Thus, for each 
>> document, only one of the freetext fields would be populated.
>> 
>> My question is, at search time, what is the best solution to search against 
>> the appropriate field?
>> 
>> I know that using dismax, we can define in "qf" the set of fields we want 
>> to search against, e.g., <str name="qf"> text_fr text_en</str>
>> 
>> With this solution, does Solr choose the appropriate analysis for the query? 
>> I.e., if a query is compared to a document having English free text (text_en 
>> is populated), does Solr analyze the query as if it were in English?
>> 
>> One problem with this approach is that each query will be compared to all 
>> the available documents, i.e., a query in English would be compared to a 
>> document in French. As far as I know, if we know the query language, this 
>> problem can be avoided, either by searching against the appropriate field 
>> (e.g., text_fr:query), or by using a filter to select only those documents 
>> having English text. Am I correct? Or is there a better solution?
>> 
>> Thanks,
>> -Saïd
>> 
> 
>
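
For reference, a rough sketch of display option 2) from Jan's message above:
pick title_XY using the document's "language" metadata. The result document is
modelled as a plain dict, and display_title and its fallback language are
hypothetical names and choices for illustration; a real client would wrap
whatever result type its Solr library returns. It assumes the title_<language>
fields are stored.

```python
def display_title(doc, fallback_lang="en"):
    """Return the title to show for one result document, or None."""
    # First try the document's own language, then the fallback language.
    for lang in (doc.get("language"), fallback_lang):
        if lang:
            title = doc.get(f"title_{lang}")
            if title:
                return title
    # Last resort: any populated title_* field, in a stable (sorted) order.
    for key in sorted(doc):
        if key.startswith("title_") and doc[key]:
            return doc[key]
    return None


if __name__ == "__main__":
    print(display_title({"language": "fr", "title_fr": "Bonjour"}))
```

With option 1) none of this is needed, since the frontend reads title_display
directly.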
