True, synonyms can be grouped in cliques based on the strength of their "resemblance" given a specific context.

But what I'm indexing is the text content of TV programs produced by a public television broadcaster, so the context is very large and non-specific. What I want is to find "automobile" for "car", "motorcycle" for "bike", "pub" for "restaurant", "woman" for "lady", and the like.

There actually are free online resources for most European languages (English included, of course). Check these out:
http://dico.isc.cnrs.fr/dico_html/en/index.html
http://www.crisco.unicaen.fr/alexandria2.html

Would you mind commenting on the following plan for a special synonym analyzer?
1/ We would start with an empty synonyms file.
2/ For each indexing request, the analyser looks in the file for synonyms. If it finds synonyms, it proceeds normally.
3/ Otherwise, it checks an online resource for synonyms, updates the synonyms file, and proceeds.
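
To make the plan concrete, here is a rough sketch of the look-up-then-cache loop, written in Python rather than as a real Solr analyzer. Everything in it is an assumption made for illustration: the "term => syn1, syn2" line layout (loosely modelled on Solr's explicit synonym mappings), the function names, and the fetch_online callable, which stands in for whatever online resource would actually be queried.

```python
from pathlib import Path
from typing import Callable, Dict, List


def load_synonyms(path: Path) -> Dict[str, List[str]]:
    """Read lines of the (assumed) form 'term => syn1, syn2' into a dict."""
    table: Dict[str, List[str]] = {}
    if not path.exists():
        return table  # step 1: we start with an empty synonyms file
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=>" not in line:
            continue
        term, _, rhs = line.partition("=>")
        table[term.strip()] = [s.strip() for s in rhs.split(",") if s.strip()]
    return table


def synonyms_for(
    term: str,
    table: Dict[str, List[str]],
    path: Path,
    fetch_online: Callable[[str], List[str]],  # hypothetical online lookup
) -> List[str]:
    """Return synonyms for term, asking the online resource only on a miss."""
    if term in table:
        return table[term]  # step 2: already known, proceed normally
    found = fetch_online(term)  # step 3: check the online resource
    table[term] = found  # remember misses too, for this run only
    if found:
        # Append to the file so later runs skip the online lookup.
        with path.open("a", encoding="utf-8") as fh:
            fh.write(f"{term} => {', '.join(found)}\n")
    return found
```

In this sketch a term with no synonyms is only remembered in memory, so it would be looked up again after a restart; whether empty results should be persisted as well is an open design choice.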

If you think this is workable, there are two problems left: which terms to look up for online synonyms, and how to select the "synonymity" clique.

For the first issue, I would definitely only search for synonyms of nouns, verbs and adjectives, so some stemming is required initially. For the second issue, I'd have a cut-off value for the strength of "resemblance", if this information is available, and/or use the frequency of the synonyms in the SOLR index as a measure.
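
As an illustration of the second point, the selection could look roughly like the Python sketch below. The names, the thresholds and the data shapes are all assumptions: candidates is taken to be a list of (word, strength) pairs where strength may be missing, and index_freq is taken to be a mapping from term to its frequency in the index, obtained by whatever means is available.

```python
from typing import Dict, List, Optional, Tuple


def select_synonyms(
    candidates: List[Tuple[str, Optional[float]]],
    index_freq: Dict[str, int],
    min_strength: float = 0.5,  # assumed cut-off, to be tuned
    min_freq: int = 1,          # assumed: keep only words present in the index
) -> List[str]:
    """Filter candidate synonyms by strength cut-off and index frequency."""
    kept = []
    for word, strength in candidates:
        # Apply the cut-off only when the resource reports a strength.
        if strength is not None and strength < min_strength:
            continue
        # Drop words that are too rare in (or absent from) the index.
        if index_freq.get(word, 0) < min_freq:
            continue
        kept.append(word)
    # Most frequent words in the index come first.
    kept.sort(key=lambda w: index_freq.get(w, 0), reverse=True)
    return kept


candidates = [("automobile", 0.9), ("wagon", 0.2), ("auto", None), ("motorcar", 0.8)]
index_freq = {"automobile": 40, "auto": 12, "wagon": 30}
selected = select_synonyms(candidates, index_freq)
```

With these made-up numbers, "wagon" falls below the strength cut-off and "motorcar" never occurs in the index, so only "automobile" and "auto" are kept.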

Building the synonyms file that way would make the system quicker over time, and for a specific domain (chemistry, biology, sports, etc.) the process would be auto-adaptive - perhaps with some human help from time to time.

Thanks,
Pierre

Walter Underwood wrote:
Synonyms are domain-specific, so general-purpose lists are not very useful.

Ultraseek shipped a British-American synonym list as an example, but even
that wasn't very general. One of our customers was a chemical company and
was very surprised when the search "rocket fuel" suggested "arugula",
even though "rocket" is a perfectly good synonym for "arugula".

wunder

On 9/30/08 10:14 AM, "Otis Gospodnetic" <[EMAIL PROTECTED]> wrote:

Pierre,

1) I don't know, but a good place to check what previous answers to
these questions were is markmail.org
2) I don't think there is such a thing, but I also don't think there are sites
that make this data freely available (answer to 1?)

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
From: Pierre Auslaender <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Tuesday, September 30, 2008 11:28:40 AM
Subject: French synonyms & Online synonyms

Hello,

I'm sure these questions have been raised a million times, but I'll try once
more:

1/ Is there any general-purpose, free, French synonyms file out there?

2/ Is there a Solr or Lucene analyser class that could tap an on-line
resource for synonyms at index-time? And by the same token, maintain and
complete a synonyms text file?

Thanks for the great work on SOLR and for the liveliness of this list.

Pierre

