True, synonyms can be grouped in cliques based on the strength of their
"resemblance" given a specific context.
But what I'm indexing is the text content of TV programs produced by a
public television broadcaster, so the context is very broad and
non-specific. What I want is to find "automobile" for "car", "motorcycle"
for "bike", "pub" for "restaurant", "woman" for "lady", and the like.
There actually are free on-line resources for most European languages
(English included, of course). Check these out:
http://dico.isc.cnrs.fr/dico_html/en/index.html
http://www.crisco.unicaen.fr/alexandria2.html
Would you mind commenting on the following plan for a special synonym
analyzer?
1/ We would start with an empty synonyms file.
2/ For each indexing request, the analyzer looks in the file for
synonyms. If it finds synonyms, it proceeds normally.
3/ Otherwise, it checks an online resource for synonyms, updates the
synonyms file, and proceeds.
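
Steps 1/ to 3/ could be sketched roughly as follows. This is only an
illustration in Python, not a Solr or Lucene analyzer: the names
SynonymCache, load_synonyms and fetch_online are hypothetical, and
fetch_online stands in for whatever call would query an on-line
resource. The "term => syn1, syn2" line format is loosely modelled on
the explicit-mapping style of a Solr synonyms file.

```python
from typing import Callable, Dict, Iterable, List


def load_synonyms(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Parse lines of the form 'term => syn1, syn2' into a dict."""
    table: Dict[str, List[str]] = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments and anything without a mapping arrow.
        if not line or line.startswith("#") or "=>" not in line:
            continue
        term, _, rhs = line.partition("=>")
        table[term.strip()] = [s.strip() for s in rhs.split(",") if s.strip()]
    return table


class SynonymCache:
    """Hypothetical helper: local table first, on-line lookup on a miss."""

    def __init__(self, table: Dict[str, List[str]],
                 fetch_online: Callable[[str], Iterable[str]]) -> None:
        self.table = dict(table)          # step 1: may start out empty
        self.fetch_online = fetch_online  # assumed on-line lookup function
        self.new_entries: List[str] = []  # lines to append to the file

    def synonyms_for(self, term: str) -> List[str]:
        # Step 2: a known term is served from the local table.
        if term in self.table:
            return self.table[term]
        # Step 3: otherwise ask the on-line resource and remember the answer.
        found = list(self.fetch_online(term))
        self.table[term] = found  # empty results are cached in memory too
        if found:
            self.new_entries.append(f"{term} => {', '.join(found)}")
        return found
```

The caller would append new_entries to the synonyms file from time to
time. Caching empty answers in memory is a design choice of this sketch,
so that a term with no known synonyms is not looked up on-line again for
every document.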
If you think this is workable, there are two problems left: which terms
to look up for online synonyms, and how to select the "synonymity" clique.
For the first issue, I would definitely only search for synonyms of
nouns, verbs and adjectives, so some stemming is required initially.
For the second issue, I'd have a cut-off value for the strength of
"resemblance", if this information is available, and/or use the
frequency of the synonyms in the SOLR index as a measure.
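
A minimal sketch of that selection, again in Python and only under
assumptions: the on-line resource is assumed to return (word, strength)
pairs with strength between 0 and 1, or None when no strength is given,
and doc_freq is assumed to be a plain mapping from word to its document
frequency in the index. The function name and the default thresholds are
made up for illustration.

```python
from typing import Dict, List, Optional, Tuple


def select_synonyms(candidates: List[Tuple[str, Optional[float]]],
                    doc_freq: Dict[str, int],
                    min_strength: float = 0.5,
                    min_doc_freq: int = 1,
                    limit: int = 5) -> List[str]:
    """Keep strong candidates that actually occur in the index."""
    kept: List[Tuple[str, int]] = []
    for word, strength in candidates:
        # Apply the cut-off only when the resource reports a strength.
        if strength is not None and strength < min_strength:
            continue
        # Drop words that are too rare in the index to be useful.
        freq = doc_freq.get(word, 0)
        if freq < min_doc_freq:
            continue
        kept.append((word, freq))
    # Most frequent first; ties are broken alphabetically.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return [word for word, _ in kept[:limit]]
```

With candidates such as ("automobile", 0.9), ("vehicle", 0.4) and
("wagon", 0.8), and an index in which "wagon" never occurs, only
"automobile" would survive: "vehicle" falls under the cut-off and
"wagon" is dropped for lack of occurrences.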
Building the synonyms file that way would make the system quicker over
time, and for a specific domain (chemistry, biology, sports, etc.) the
process would be self-adapting, perhaps with some human help from time
to time.
Thanks,
Pierre
Walter Underwood wrote:
Synonyms are domain-specific, so general-purpose lists are not very useful.
Ultraseek shipped a British-American synonym list as an example, but even
that wasn't very general. One of our customers was a chemical company and
was very surprised when the search "rocket fuel" suggested "arugula",
even though "rocket" is a perfectly good synonym for "arugula".
wunder
On 9/30/08 10:14 AM, "Otis Gospodnetic" <[EMAIL PROTECTED]> wrote:
Pierre,
1) I don't know, but markmail.org is a good place to check for previous
answers to these questions.
2) I don't think there is such a thing, but I also don't think there are sites
that make this data freely available (answer to 1?)
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
From: Pierre Auslaender <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Tuesday, September 30, 2008 11:28:40 AM
Subject: French synonyms & Online synonyms
Hello,
I'm sure these questions have been raised a million times, but I'll try
once more:
1/ Is there any general-purpose, free, French synonyms file out there?
2/ Is there a Solr or Lucene analyzer class that could tap an on-line
resource for synonyms at index-time? And by the same token, maintain and
complete a synonyms text file?
Thanks for the great work on SOLR and for the liveliness of this list.
Pierre