Re: Proposition of a new feature: Dynamic Field Types

Grant Ingersoll Sat, 01 Mar 2008 04:54:51 -0800

How many languages are you dealing with? How are you generating your
queries? Are you taking your source language and translating it to
each of the languages? Or are you just targeting one source to
destination language? Back when I was doing CLIR, I would create
separate indexes, but that is not to say it is appropriate for your
task. It was always highly suspect in my mind the notion of combining
the results of multiple languages into a single hit list. I just
doubt that the scores are meaningful in that way since scores should
generally only be considered relative to each other within a single
query.


Some more below


On Feb 29, 2008, at 11:52 AM, [EMAIL PROTECTED] wrote:

Thanks for your response Grant.

You are right, depending on the language we could index the text in a
specific field. At request time, we would then query all the fields.

I see, however, a few possible problems with this approach. In order of
decreasing importance:

- Influence on relevance
I assume the idf is calculated on a field-by-field basis? In the
context of one field per language, the documents whose language is the
least represented in the index will receive an unusual boost for
cross-lingual tokens. This situation can be quite frequent, as the
distribution of languages in the index is usually heterogeneous. Even
if it were homogeneous, we would have the same problem with the rare
text in one language citing words in another.

On the other hand, you are right in the sense that the idf of
language-specific words is also altered. In the context of one field
for all languages, the idf could be very low for a word if it is a
common word in another language. For example, the word "thé" in French
is quite rare, but its idf would be greatly altered by the word "the"
in English.

It would be interesting to see a study on a real index about the
effects of this.

We have a dilemma here...
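
The two idf effects described above can be made concrete with a small
sketch. It assumes the classic Lucene-style formula
idf = 1 + ln(numDocs / (docFreq + 1)), with docFreq counted per field
and numDocs covering the whole index. All document counts and the field
names text_en / text_fr are invented for illustration, and the "thé"
case additionally assumes an analysis chain that folds accents.

```python
import math

def idf(doc_freq: int, num_docs: int) -> float:
    """Classic Lucene-style idf: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Invented corpus: 9,000 English and 1,000 French documents in one index.
NUM_DOCS = 10_000

# Case 1: one field per language (hypothetical fields text_en, text_fr).
# "paris" occurs in 5% of the documents of each language, but docFreq is
# per field while numDocs is index-wide, so the minority field wins.
idf_paris_en = idf(doc_freq=450, num_docs=NUM_DOCS)
idf_paris_fr = idf(doc_freq=50, num_docs=NUM_DOCS)

# Case 2: one shared field. Assuming accent folding, French "thé" and
# English "the" become the same term and share one docFreq.
idf_the_french_only = idf(doc_freq=20, num_docs=NUM_DOCS)
idf_the_shared_field = idf(doc_freq=20 + 8_900, num_docs=NUM_DOCS)

print(f"paris in text_en: {idf_paris_en:.2f}, in text_fr: {idf_paris_fr:.2f}")
print(f"the/thé French only: {idf_the_french_only:.2f}, "
      f"shared field: {idf_the_shared_field:.2f}")
```

With these made-up numbers the French field gives "paris" a noticeably
higher idf than the English field, while the shared field pushes the
idf of "thé" close to its minimum. Real figures depend on the corpus
and on the Similarity implementation in use.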

- Performance
Queries are in O(log n) if I'm not mistaken? Then a disjunction query
on x language fields would be nearly x times slower, no?


Again, I think this depends on how you set up the indices, etc.
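
As a rough way to reason about the performance question, here is a toy
model (not Lucene code) with one term dictionary per field. The class
ToyIndex and the field names are hypothetical. It only shows that the
number of term lookups grows linearly with the number of language
fields; the postings that are then traversed are split across those
fields, so "x times slower" is probably closer to an upper bound than
to an expected cost.

```python
from collections import defaultdict

class ToyIndex:
    """Toy inverted index: one term dictionary per field."""

    def __init__(self):
        # field name -> term -> set of document ids
        self.postings = defaultdict(lambda: defaultdict(set))
        self.lookups = 0  # number of term-dictionary lookups performed

    def add(self, doc_id, field, tokens):
        for token in tokens:
            self.postings[field][token].add(doc_id)

    def term_docs(self, field, token):
        self.lookups += 1
        return self.postings[field].get(token, set())

    def search_any_field(self, fields, token):
        """Disjunction of the same token over several fields."""
        hits = set()
        for field in fields:
            hits |= self.term_docs(field, token)
        return hits

index = ToyIndex()
index.add(1, "text_en", ["paris", "city"])
index.add(2, "text_fr", ["paris", "ville"])
index.add(3, "text_de", ["berlin"])

hits = index.search_any_field(["text_en", "text_fr", "text_de"], "paris")
print(sorted(hits), index.lookups)
```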

- Verbose configuration
Not an important point, but with the dynamic field type, you configure
all the languages only once. Otherwise, you must do so for each text
field. The query handler configuration would also be much more verbose.
We usually use the dismax handler and the qf could become very long.


True.
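
To give an idea of how quickly the qf grows, here is a minimal sketch
that builds dismax request parameters for every combination of text
field and language. The naming scheme <field>_<lang> and the helper
dismax_params are assumptions made for this example; only the q and qf
parameter names come from the dismax handler itself.

```python
from urllib.parse import urlencode

def dismax_params(query, text_fields, languages):
    """Build q/qf parameters with one query field per (field, language)."""
    qf = " ".join(
        f"{field}_{lang}" for field in text_fields for lang in languages
    )
    return {"q": query, "qf": qf}

# Two text fields in three languages already give six query fields.
params = dismax_params("paris", ["title", "body"], ["en", "fr", "de"])
query_string = urlencode(params)
print(query_string)
```

Generating the list in the application keeps the handler configuration
short, at the cost of sending a longer request each time.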



- Highlight

Not an important point either, but a bit of work needs to be done to
aggregate the results.

In conclusion, the choice is not so clear to me. Your remark on
relevance made me think a bit more about multilingual problems. There
may be a way to tune the idf of some fields depending on others?

Another idea would be to boost documents in the language of the
request. This may actually be much simpler.
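
One possible way to express such a boost with the dismax handler is its
bq (boost query) parameter. The sketch below assumes the documents
carry a "language" field, as in the original proposal; the helper name
boosted_params and the boost value are arbitrary choices for
illustration.

```python
from urllib.parse import urlencode

def boosted_params(query, request_language, boost=2.0):
    """Add a boost query favouring documents in the request language."""
    return {"q": query, "bq": f"language:{request_language}^{boost}"}

params = boosted_params("paris", "fr")
print(urlencode(params))
```

Documents in other languages still match; they are only ranked lower,
so the boost value would need tuning on real data.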

If you have any ideas on the subject, I'm very interested!

Nicolas


-----Original Message-----
From: Grant Ingersoll [mailto:[EMAIL PROTECTED]
Sent: Friday, February 29, 2008 14:06
To: [email protected]
Subject: Re: Proposition of a new feature: Dynamic Field Types

Why can't you choose the proper field in your application and keep
separate fields per language?  Putting them all in the same field,
regardless of language, is not a good idea in my opinion because it is
more than likely going to skew your statistics and lower your
relevance.


That being said, the dynamic field type is still an interesting idea.

-Grant

On Feb 29, 2008, at 5:56 AM, [EMAIL PROTECTED] wrote:

Dynamic field types are field types that act as proxies to other field
types. The choice of the field type to use is made on a per-document
basis and depends on the values of the document's fields.

The use case that led us to this feature is the indexing of documents
in different languages. We use a specific analyzer for each language
but want to index semantic information that is not specific to the
language.

For example, we would add to the index the semantic tag {co:Paris} for
the expressions "Paris", "capital city of France", "the city of lights"
in English and "Paris", "capitale de la France", "la ville lumière" in
French. This allows us to provide advanced functionalities such as
semantic and cross-lingual search.
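
As an illustration only, the tagging step could look like the sketch
below. The lookup table SEMANTIC_TAGS and the function semantic_tags
are hypothetical, and the naive substring matching stands in for
whatever annotator is really used.

```python
# Hypothetical phrase table: per language, expression -> semantic tag.
SEMANTIC_TAGS = {
    "en": {
        "paris": "{co:Paris}",
        "capital city of france": "{co:Paris}",
        "the city of lights": "{co:Paris}",
    },
    "fr": {
        "paris": "{co:Paris}",
        "capitale de la france": "{co:Paris}",
        "la ville lumière": "{co:Paris}",
    },
}

def semantic_tags(text, language):
    """Return the language-independent tags found in the text.

    Naive substring matching: a real annotator would work on tokens.
    """
    lowered = text.lower()
    phrases = SEMANTIC_TAGS.get(language, {})
    return sorted({tag for phrase, tag in phrases.items() if phrase in lowered})

english = semantic_tags("She moved to the City of Lights", "en")
french = semantic_tags("Elle vit dans la ville lumière", "fr")
print(english, french)
```

Both texts end up with the same tag, which is what makes a query on
{co:Paris} match documents in either language.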

To do so in SOLR, we chose to index texts written in different
languages in the same field, while analyzing them with different
analyzers. Hence the proposition of a new feature that responds to this
need: Dynamic Field Types.

The idea of this new field type is to act as a proxy to other field
types. Depending on the values of some fields of the document to index,
it chooses the correct field type to use. In our situation, we use it
to choose the correct language-dependent field type based on the value
of the field named "language". It is configured with a config similar
to the following:

        <fieldtype name="french_ft" ...>
        ...
        </fieldtype>

        <fieldtype name="english_ft" ...>
        ...
        </fieldtype>

        <dynamicFieldType name="multilanguage">
                <fieldtypes>
                        <fieldtype condition="language:fr" name="french_ft"/>
                        <fieldtype condition="language:en" name="english_ft"/>
                        <fieldtype condition="*:*" name="english_ft"/>
                </fieldtypes>
        </dynamicFieldType>

The last condition is used as a catch-all if the preceding conditions
are not met.
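
To illustrate the proxy behaviour outside of Solr, here is a minimal
sketch of the rule resolution. It is not an implementation of the
proposed dynamicFieldType: the analyzers are toy stopword filters, and
the names make_analyzer, RULES, resolve_field_type and analyze are
invented for this example. Only the field type names and the conditions
mirror the configuration above.

```python
def make_analyzer(stopwords):
    """Toy analyzer: lowercase, split on whitespace, drop stopwords."""
    def analyze_text(text):
        return [t for t in text.lower().split() if t not in stopwords]
    return analyze_text

# Stand-ins for the real language-specific field types.
FIELD_TYPES = {
    "french_ft": make_analyzer({"la", "le", "de"}),
    "english_ft": make_analyzer({"the", "of"}),
}

# Ordered (condition, field type) rules; ("*", "*") is the catch-all.
RULES = [
    (("language", "fr"), "french_ft"),
    (("language", "en"), "english_ft"),
    (("*", "*"), "english_ft"),
]

def resolve_field_type(doc):
    """Return the name of the first field type whose condition matches."""
    for (field, value), field_type in RULES:
        if field == "*" or doc.get(field) == value:
            return field_type
    raise LookupError("no rule matched and no catch-all configured")

def analyze(doc, text_field="text"):
    """Analyze the text field with the field type chosen for this document."""
    return FIELD_TYPES[resolve_field_type(doc)](doc[text_field])

print(analyze({"language": "fr", "text": "la ville de Paris"}))
print(analyze({"language": "de", "text": "Die Stadt Paris"}))
```

A German document falls through to the catch-all and is analyzed as
English, which is the behaviour the last condition is meant to provide.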

What do you think of this feature?

Best regards,
Nicolas Dessaigne


--------------------------
Grant Ingersoll
http://www.lucenebootcamp.com
Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
