RE: Proposition of a new feature: Dynamic Field Types

nicolas . dessaigne Mon, 03 Mar 2008 11:04:18 -0800

> How many languages are you dealing with?

The number of languages depends greatly on the project. We are usually
dealing with 2 or 3 languages and I've yet to see a project with more than
5.


> How are you generating your queries?

With a custom handler (based on DisMax) that takes an extra lng parameter
to select which analyzer to use.

> Are you taking your source language and translating it to each of the
> languages?  Or are you just targeting one source to destination language?

No translation involved. We actually infer semantic information independent
of the language and use it in our cross-lingual queries: one source language
for the query and many destination languages for the results.

As discussed earlier, you're right that cross-lingual search makes correct
ranking difficult. It seems there is no perfect solution for this need,
just tradeoffs...

We'll do a test with the two approaches (one field for all / one field per
language) and I'll report back when we've made a choice.

> Again, I think this depends on how you set up the indices, etc.

About the performance question, I meant the following: for fields configured
in the same manner (except for the language supported by the analyzer, of
course) and for a large number of documents, the time consumed by one query
on a field that has a value for every document would be nearly half of the
cumulative time consumed by two queries, each on one partition of the
documents. This follows from the O(log n) complexity of queries: a query on
an index of 10,000,000 docs is only slightly slower than the same query on an
index of 5,000,000 docs.
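
To make the arithmetic concrete, here is a small sketch that takes the
O(log n) cost model at face value. That model is an assumption: real query
cost also depends on how many documents match. The function name and the
choice of log base 2 are only illustrative.

```python
import math

def relative_cost(num_docs: int) -> float:
    """Cost of one query under the assumed O(log n) model (arbitrary units)."""
    return math.log2(num_docs)

# One query on a single field holding all 10,000,000 documents.
single_field = relative_cost(10_000_000)

# Two queries, each on a per-language field holding half of the documents.
two_fields = 2 * relative_cost(5_000_000)

ratio = single_field / two_fields
print(f"single field: {single_field:.2f}")
print(f"two fields:   {two_fields:.2f}")
print(f"ratio:        {ratio:.2f}")
```

Under this model the ratio comes out slightly above one half, which is the
"nearly half" mentioned above.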

Thanks for your advice,
I'll try to report back on our tests.

Nicolas

-----Original Message-----
From: Grant Ingersoll [mailto:[EMAIL PROTECTED]
Sent: Saturday, March 1, 2008 13:54
To: solr-user@lucene.apache.org
Subject: Re: Proposition of a new feature: Dynamic Field Types

How many languages are you dealing with?  How are you generating your  
queries?  Are you taking your source language and translating it to  
each of the languages?  Or are you just targeting one source to  
destination language?  Back when I was doing CLIR, I would create
separate indexes, but that is not to say it is appropriate for your
task.  The notion of combining the results of multiple languages into a
single hit list was always highly suspect in my mind.  I just doubt that
the scores are meaningful in that way, since scores should generally
only be considered relative to each other within a single query.

Some more below

On Feb 29, 2008, at 11:52 AM, [EMAIL PROTECTED] wrote:

> Thanks for your response Grant.
>
> You are right: depending on the language, we could index the text in a
> language-specific field. At query time, we would then query all of the
> fields.
>
> I see, however, a few possible problems with this approach. In order of
> decreasing importance:
>
> - Influence on relevance
>
> I assume the idf is calculated on a field-by-field basis? With one field
> per language, the documents whose language is the least represented in
> the index will receive an unusual boost for cross-lingual tokens. This
> situation can be quite frequent, as the distribution of languages in the
> index is usually heterogeneous. Even if it were homogeneous, we would
> still have the problem of rare text in one language citing words from
> another.
>
> On the other hand, you are right in the sense that the idf of
> language-specific words is also altered. With one field for all
> languages, the idf of a word could be very low if it is a common word in
> another language. For example, the word "thé" in French is quite rare,
> but its idf would be greatly altered by the word "the" in English.

It would be interesting to see a study on a real index about the
effects of this.
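
In the meantime, a toy calculation can at least show the direction of the
effect. The sketch below uses the classic Lucene idf formula,
1 + ln(numDocs / (docFreq + 1)). The document counts are invented for
illustration and do not come from a real index, and it assumes that "thé"
and "the" end up as the same token (for example after accent folding).

```python
import math

def idf(doc_freq: int, num_docs: int) -> float:
    """Classic Lucene-style idf: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Invented corpus: 1,000 French docs and 9,000 English docs.
french_docs, english_docs = 1_000, 9_000
the_fr = 10       # French docs containing "thé" (assumed folded to "the")
the_en = 8_500    # English docs containing "the"

# One field per language: "thé" is counted against French docs only.
idf_per_language = idf(the_fr, french_docs)

# One field for all languages: "thé" and "the" share a single docFreq.
idf_shared = idf(the_fr + the_en, french_docs + english_docs)

print(f"per-language field idf: {idf_per_language:.2f}")
print(f"shared field idf:       {idf_shared:.2f}")
```

With these made-up numbers the shared-field idf is far lower than the
per-language one, which is the skew described in the quoted text.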


>
>
> We have a dilemma here...
>
> - Performance
>
> Queries are in O(log n), if I'm not mistaken? Then a disjunction query on
> x language fields would be nearly x times slower, no?

Again, I think this depends on how you set up the indices, etc.

>
>
> - Verbose configuration
>
> Not an important point, but with the dynamic field type, you configure
> all the languages only once. Otherwise, you must do so for each text
> field.
>
> The query handler configuration would also be much more verbose. We
> usually use the dismax handler, and the qf could become very long.

True.
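
For a sense of how quickly the qf grows with one field per language, here is
a small sketch that builds it. The field names (title, body), the language
codes, and the <field>_<lang> naming scheme are hypothetical; only the qf
parameter itself (a space-separated list of fields with optional ^boost) is
DisMax's.

```python
def build_qf(fields: dict[str, float], languages: list[str]) -> str:
    """Build a DisMax qf string with one entry per (field, language) pair."""
    entries = []
    for name, boost in fields.items():
        for lang in languages:
            # Hypothetical naming scheme: <field>_<lang>, e.g. title_fr.
            entries.append(f"{name}_{lang}^{boost}")
    return " ".join(entries)

qf = build_qf({"title": 2.0, "body": 1.0}, ["en", "fr", "de"])
print(qf)
```

Two text fields and three languages already give six qf entries; the count
is the product of the two.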


>
>
> - Highlight
>
> Not an important point either, but a bit of work needs to be done to
> aggregate the results.
>
> In conclusion, the choice is not so clear to me. Your remark on
> relevance made me think a bit more about multilingual problems. There
> may be a way to tune the idf of some fields depending on others?
>
> Another idea would be to boost documents in the language of the request.
> This may actually be much simpler.
>
> If you have any ideas on the subject, I'm very interested!
>
> Nicolas
>
>
> -----Original Message-----
> From: Grant Ingersoll [mailto:[EMAIL PROTECTED]
> Sent: Friday, February 29, 2008 14:06
> To: solr-user@lucene.apache.org
> Subject: Re: Proposition of a new feature: Dynamic Field Types
>
> Why can't you choose the proper field in your application and keep
> separate fields per language?  Putting them all in the same field,
> regardless of language, is not a good idea in my opinion because it is
> more than likely going to skew your statistics and lower your relevance.
>
> That being said, the dynamic field type is still an interesting idea.
>
> -Grant
>
> On Feb 29, 2008, at 5:56 AM, [EMAIL PROTECTED] wrote:
>
>> Dynamic field types are field types that act as proxies to other field
>> types. The field type to use is chosen on a per-document basis and
>> depends on the values of the document's fields.
>>
>> The use case that led us to this feature is the indexing of documents in
>> different languages. We use a specific analyzer for each language but
>> want to index semantic information that is not specific to the language.
>>
>> For example, we would add to the index the semantic tag {co:Paris} for
>> the expressions "Paris", "capital city of France", "the city of lights"
>> in English and "Paris", "capitale de la France", "la ville lumière" in
>> French. This allows us to provide advanced functionality such as
>> semantic and cross-lingual search.
>>
>> To do so in SOLR, we chose to index texts written in different languages
>> in the same field, while analyzing them with different analyzers. Hence
>> the proposal of a new feature that responds to this need: Dynamic Field
>> Types.
>>
>> The idea of this new field type is to act as a proxy to other field
>> types. Depending on the values of some fields of the document to index,
>> it chooses the correct field type to use. In our situation, we use it to
>> choose the correct language-dependent field type based on the value of
>> the field named "language". It is configured with a config similar to
>> the following:
>>
>>      <fieldtype name="french_ft" ...>
>>      ...
>>      </fieldtype>
>>
>>      <fieldtype name="english_ft" ...>
>>      ...
>>      </fieldtype>
>>
>>      <dynamicFieldType name="multilanguage">
>>              <fieldtypes>
>>                      <fieldtype condition="language:fr" name="french_ft"/>
>>                      <fieldtype condition="language:en" name="english_ft"/>
>>                      <fieldtype condition="*:*" name="english_ft"/>
>>              </fieldtypes>
>>      </dynamicFieldType>
>>
>> The last condition is used as a catch-all if the preceding conditions are
>> not met.
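
The selection rule described above could be sketched as follows. This is not
Solr code and not the proposed implementation: the function name, the rule
list, and the simplification of each condition to a single field:value pair
are assumptions made for illustration.

```python
# Ordered (condition, field type) rules mirroring the config above.
RULES = [
    ("language:fr", "french_ft"),
    ("language:en", "english_ft"),
    ("*:*", "english_ft"),  # catch-all
]

def resolve_field_type(doc: dict[str, str], rules=RULES) -> str:
    """Return the field type of the first rule whose condition matches doc."""
    for condition, field_type in rules:
        if condition == "*:*":
            return field_type
        field, _, value = condition.partition(":")
        if doc.get(field) == value:
            return field_type
    raise ValueError("no rule matched and no catch-all was configured")

print(resolve_field_type({"language": "fr", "text": "la ville lumière"}))
print(resolve_field_type({"language": "de", "text": "Stadt der Lichter"}))
```

The first document resolves to french_ft; the second has no matching
language rule and falls through to the catch-all, english_ft.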
>>
>> What do you think of this feature?
>>
>> Best regards,
>> Nicolas Dessaigne
>
>
>
>
>

--------------------------
Grant Ingersoll
http://www.lucenebootcamp.com
Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
