Re: Dictionary changes

Amodelo Thu, 03 Jul 2014 02:36:21 -0700

On 02.07.2014 at 19:25, Steve Litt <sl...@troubleshooters.com> wrote:

> On Wed, 2 Jul 2014 18:40:18 +0200
> Bzzzz <lazyvi...@gmx.com> wrote:
> 
>> On Wed, 2 Jul 2014 12:22:02 -0400
>> Steve Litt <sl...@troubleshooters.com> wrote:
> 
>>> If worst comes to worst and I can't find a way to get grep to do
>>> this, I'll just put together a substitution table,
>>> convert /usr/share/dict/words to words.ascii, line for line, search
>>> words.ascii, get the line number, and pull that line out of words.
>>> Crude, but effective.
>> 
>> AFAIK, this is the only way to be able to perform what you want.
>> 
> 
> So then, the question becomes, where does there exist a list of common
> letters that are, for want of a better word, "ornamented ascii"?
> Umlauts, Carats, Circles, Grave accents, etc.

This is a known problem without a perfect solution. Some years ago I wrote a
Perl module for it:

https://metacpan.org/pod/Text::Undiacritic

DESCRIPTION
Changes characters with diacritics into their base characters.
Also changes into base character in cases where UNICODE does not provide a 
decomposition.
E.g. all characters '... WITH STROKE' like 'LATIN SMALL LETTER L WITH STROKE' 
do not have a decomposition. In the latter case the result will be 'LATIN SMALL 
LETTER L'.
Removing diacritics is useful for matching text independent of spelling 
variants.
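
As a rough illustration of the idea described above (in Python rather than
Perl, and not the module's actual implementation): decompose each character
with Unicode NFD, drop the combining marks, and then apply a small
hand-written table for letters that have no decomposition. The names
FALLBACK, undiacritic and find_words are made up for this sketch, and the
table is deliberately incomplete.

```python
import unicodedata

# Letters for which Unicode provides no decomposition (e.g. '... WITH STROKE').
# Illustrative only; a real table would need many more entries.
FALLBACK = str.maketrans({
    "ł": "l", "Ł": "L",
    "ø": "o", "Ø": "O",
    "đ": "d", "Đ": "D",
    "ħ": "h", "Ħ": "H",
})


def undiacritic(text: str) -> str:
    """Return text with diacritics removed, as far as this sketch can."""
    # NFD splits e.g. 'é' into 'e' plus a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Handle the letters that NFD leaves untouched.
    return stripped.translate(FALLBACK)


def find_words(query: str, words: list[str]) -> list[str]:
    """Return the original words whose folded form equals the folded query."""
    key = undiacritic(query).lower()
    return [w for w in words if undiacritic(w).lower() == key]
```

With this, find_words("resume", words) would return both "resume" and
"résumé" from a word list, which replaces the separate words.ascii file and
line-number lookup with an in-memory comparison.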

A more general approach would be approximate matching: compute a similarity
coefficient between the search string and each candidate, and display the
best-matching strings.
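
To make that concrete, here is a minimal Python sketch using the Dice
coefficient over character bigrams, which is one of several possible
similarity measures. The function names, the threshold of 0.5 and the limit
of 3 are assumptions chosen for illustration; they are not taken from any of
the libraries linked below.

```python
def bigrams(s: str) -> set[str]:
    """Return the set of lower-cased character bigrams of s."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a: str, b: str) -> float:
    """Dice coefficient of the bigram sets: 2*|X & Y| / (|X| + |Y|)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        # Strings shorter than two characters have no bigrams at all.
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def best_matches(query: str, candidates: list[str],
                 limit: int = 3, threshold: float = 0.5) -> list[str]:
    """Return up to `limit` candidates scoring at least `threshold`, best first."""
    scored = [(dice(query, c), c) for c in candidates]
    scored = [pair for pair in scored if pair[0] >= threshold]
    # Highest score first; ties are broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [c for _, c in scored[:limit]]
```

Note that on raw strings an accented spelling shares fewer bigrams with its
plain counterpart, so it can fall below the threshold; removing diacritics
from both sides before scoring, or lowering the threshold, would change that.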

See e.g. here:

https://metacpan.org/release/Set-Similarity
https://metacpan.org/pod/String::Similarity
http://www.chokkan.org/software/simstring/

HTH

Helmut Wollmersdorfer

--
To UNSUBSCRIBE, email to debian-user-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Archive: 
https://lists.debian.org/8b5f736b-8417-4717-8b98-fa81369c3...@amodelo.de
