On Tue, Nov 12, 2013 at 09:43:59PM +0100, Rene Engelhard wrote:
> On Tue, Nov 12, 2013 at 07:54:04PM +0100, Agustin Martin wrote:
> > I will have a look at this (I once wrote ispellaff2myspell). Now I think the
> > best is to change script to UTF8, but keep strings in code as escaped
> > octal.
> > Or rewrite that part.
> >
> > Let me think about this. Hope to find time tomorrow.
>
> Oops, too late. Just added the patch as I saw the patch and did it before
> starting to read mail. My bad.
>
> Feel free to come up with a patch based on -5 and I'll happily add it, though.

Hi Rene and Gregor,

Attached in two forms: a simple one, just to show the differences I added,
and the good one, with all trailing whitespace in ispellaff2myspell trimmed.
Minimally tested with the Faroese dictionary.

I also looked at myspell-tools. If I find time I will prepare a patch for
myspell-tools as well, including the changes by Gregor.

I see that ispellaff2myspell is included through a dpatch patch. Do you
think it would be interesting to change the handling to something closer to
what is used for hunspell-tools (a plain file under debian/)?

Regards,

--
Agustin

diff --git a/debian/changelog b/debian/changelog index 2ca1fbe..0572e6c 100644 --- a/debian/changelog +++ b/debian/changelog @@ -1,3 +1,13 @@ +hunspell (1.3.2-6) unstable; urgency=low + + * debian/ispellaff2myspell: New upstream version. + - Incorporate changes by Gregor Herrmann (UTF-8 and typo fixes). + - Use octal codes for unibyte strings to make them coexist + with new UTF-8 encoding. + - Other minor changes. + + -- + hunspell (1.3.2-5) unstable; urgency=low * apply patch from Gregor Hermann, thanks diff --git a/debian/ispellaff2myspell b/debian/ispellaff2myspell index 692571c..940d82b 100644 --- a/debian/ispellaff2myspell +++ b/debian/ispellaff2myspell @@ -1,8 +1,7 @@ #!/usr/bin/perl -w -# -*- coding: iso-8859-1 -*- -# $Id: ispellaff2myspell,v 1.29 2005/07/04 12:21:55 agmartin Exp $ +# -*- coding: utf-8 -*- # -# (C) 2002-2005 Agustin Martin Domingo <agustin.mar...@hispalinux.es> +# (C) 2002-2013 Agustin Martin Domingo <agustin.mar...@hispalinux.es> # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by @@ -21,7 +20,7 @@ sub usage { print "ispellaff2myspell: A program to convert ispell affix tables to myspell format -(C) 2002-2005 Agustin Martin Domingo <agustin.martin\@hispalinux.es> License: GPL +(C) 2002-2013 Agustin Martin Domingo <agustin.martin\@hispalinux.es> License: GPL2+ Usage: ispellaff2myspell [options] <affixfile> @@ -98,17 +97,17 @@ sub mylc{ } } else { if ( $charset eq "latin0" ){ - $lowercase='a-zàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ½¨¸'; - $uppercase='A-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ¼¦´'; + $lowercase='a-z\340\341\342\343\344\345\346\347\350\351\352\353\354\355\356\357\360\361\362\363\364\365\366\370\371\372\373\374\375\376\275\250\270'; + $uppercase='A-Z\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316\317\320\321\322\323\324\325\326\330\331\332\333\334\335\336\274\246\264'; } elsif ( $charset eq "latin1" ){ - 
$lowercase='a-zàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ'; - $uppercase='A-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ'; + $lowercase='a-z\340\341\342\343\344\345\346\347\350\351\352\353\354\355\356\357\360\361\362\363\364\365\366\370\371\372\373\374\375\376'; + $uppercase='A-Z\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316\317\320\321\322\323\324\325\326\330\331\332\333\334\335\336'; } elsif ( $charset eq "latin2" ){ - $lowercase='a-z±³µ¶¹º»¼¾¿àáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ'; - $uppercase='A-Z¡£¥¦©ª«¬®¯ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ'; + $lowercase='a-z\261\263\265\266\271\272\273\274\276\277\340\341\342\343\344\345\346\347\350\351\352\353\354\355\356\357\360\361\362\363\364\365\366\370\371\372\373\374\375\376'; + $uppercase='A-Z\241\243\245\246\251\252\253\254\256\257\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316\317\320\321\322\323\324\325\326\330\331\332\333\334\335\336'; } elsif ( $charset eq "latin3" ){ - $lowercase='a-z±¶¹º»¼¿àáâäåæçèéêëìíîïñòóôõö÷øùúûüýþ'; - $uppercase='A-Z¡¦©ª«¬¯ÀÁÂÄÅÆÇÈÉÊËÌÍÎÏÑÒÓÔÕÖרÙÚÛÜÝÞ'; + $lowercase='a-z\261\266\271\272\273\274\277\340\341\342\344\345\346\347\350\351\352\353\354\355\356\357\361\362\363\364\365\366\367\370\371\372\373\374\375\376'; + $uppercase='A-Z\241\246\251\252\253\254\257\300\301\302\304\305\306\307\310\311\312\313\314\315\316\317\321\322\323\324\325\326\327\330\331\332\333\334\335\336'; # } elsif ( $charset eq "other_charset" ){ # die "latin2 still unimplemented"; } else { @@ -440,13 +439,19 @@ requires B<--lowercase> having exactly that string but lowercase. =back -If your encoding is currently unsupported you can send me a file with -the two strings of lower and uppercase chars. Note that they must match -exactly but case changed. It will look something like +If your encoding is currently unsupported you can send me a separate file +with the two strings of lower and uppercase chars. Note that they must +match exactly but case changed. 
It will look something like $lowercase='a-zàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ'; $uppercase='A-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ'; +A safer alternative against accidental recoding is to use octal codes for +non 7bit chars. Above strings would then look like + + $lowercase='a-z\340\341\342\343\344\345\346\347\350\351\352\353\354\355\356\357\360\361\362\363\364\365\366\370\371\372\373\374\375\376'; + $uppercase='A-Z\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316\317\320\321\322\323\324\325\326\330\331\332\333\334\335\336'; + =head1 SEE ALSO The OpenOffice.org Lingucomponent Project home page
diff --git a/debian/changelog b/debian/changelog index 2ca1fbe..0572e6c 100644 --- a/debian/changelog +++ b/debian/changelog @@ -1,3 +1,13 @@ +hunspell (1.3.2-6) unstable; urgency=low + + * debian/ispellaff2myspell: New upstream version. + - Incorporate changes by Gregor Herrmann (UTF-8 and typo fixes). + - Use octal codes for unibyte strings to make them coexist + with new UTF-8 encoding. + - Other minor changes. + + -- + hunspell (1.3.2-5) unstable; urgency=low * apply patch from Gregor Hermann, thanks diff --git a/debian/ispellaff2myspell b/debian/ispellaff2myspell index 692571c..216ec75 100644 --- a/debian/ispellaff2myspell +++ b/debian/ispellaff2myspell @@ -1,9 +1,8 @@ #!/usr/bin/perl -w -# -*- coding: iso-8859-1 -*- -# $Id: ispellaff2myspell,v 1.29 2005/07/04 12:21:55 agmartin Exp $ -# -# (C) 2002-2005 Agustin Martin Domingo <agustin.mar...@hispalinux.es> -# +# -*- coding: utf-8 -*- +# +# (C) 2002-2013 Agustin Martin Domingo <agustin.mar...@hispalinux.es> +# # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or @@ -21,23 +20,23 @@ sub usage { print "ispellaff2myspell: A program to convert ispell affix tables to myspell format -(C) 2002-2005 Agustin Martin Domingo <agustin.martin\@hispalinux.es> License: GPL +(C) 2002-2013 Agustin Martin Domingo <agustin.martin\@hispalinux.es> License: GPL2+ Usage: ispellaff2myspell [options] <affixfile> Options: --affixfile=s Affix file - --bylocale Use current locale setup for upper/lowercase + --bylocale Use current locale setup for upper/lowercase conversion - --charset=s Use specified charset for upper/lowercase + --charset=s Use specified charset for upper/lowercase conversion (defaults to latin1) --debug Print debugging info --extraflags Allow some non alphabetic flags --lowercase=s Lowercase string --myheader=s Header file - --printcomments Print commented lines in 
output - --replacements=s Replacements file + --printcomments Print commented lines in output + --replacements=s Replacements file --split=i Split flags with more that i entries --uppercase=s Uppercase string --wordlist=s Still unused @@ -62,7 +61,7 @@ sub debugprint { sub shipoutflag{ my $flag_entries=scalar @flag_array; - + if ( $flag_entries != 0 ){ if ( $split ){ while ( @flag_array ){ @@ -92,23 +91,23 @@ sub mylc{ my $outputstring; if ( $bylocale ){ - { + { use locale; $outputstring = lc $inputstring; } } else { if ( $charset eq "latin0" ){ - $lowercase='a-zàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ½¨¸'; - $uppercase='A-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ¼¦´'; + $lowercase='a-z\340\341\342\343\344\345\346\347\350\351\352\353\354\355\356\357\360\361\362\363\364\365\366\370\371\372\373\374\375\376\275\250\270'; + $uppercase='A-Z\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316\317\320\321\322\323\324\325\326\330\331\332\333\334\335\336\274\246\264'; } elsif ( $charset eq "latin1" ){ - $lowercase='a-zàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ'; - $uppercase='A-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ'; + $lowercase='a-z\340\341\342\343\344\345\346\347\350\351\352\353\354\355\356\357\360\361\362\363\364\365\366\370\371\372\373\374\375\376'; + $uppercase='A-Z\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316\317\320\321\322\323\324\325\326\330\331\332\333\334\335\336'; } elsif ( $charset eq "latin2" ){ - $lowercase='a-z±³µ¶¹º»¼¾¿àáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ'; - $uppercase='A-Z¡£¥¦©ª«¬®¯ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ'; + $lowercase='a-z\261\263\265\266\271\272\273\274\276\277\340\341\342\343\344\345\346\347\350\351\352\353\354\355\356\357\360\361\362\363\364\365\366\370\371\372\373\374\375\376'; + $uppercase='A-Z\241\243\245\246\251\252\253\254\256\257\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316\317\320\321\322\323\324\325\326\330\331\332\333\334\335\336'; } elsif ( $charset eq "latin3" ){ - $lowercase='a-z±¶¹º»¼¿àáâäåæçèéêëìíîïñòóôõö÷øùúûüýþ'; - 
$uppercase='A-Z¡¦©ª«¬¯ÀÁÂÄÅÆÇÈÉÊËÌÍÎÏÑÒÓÔÕÖרÙÚÛÜÝÞ'; + $lowercase='a-z\261\266\271\272\273\274\277\340\341\342\344\345\346\347\350\351\352\353\354\355\356\357\361\362\363\364\365\366\367\370\371\372\373\374\375\376'; + $uppercase='A-Z\241\246\251\252\253\254\257\300\301\302\304\305\306\307\310\311\312\313\314\315\316\317\321\322\323\324\325\326\327\330\331\332\333\334\335\336'; # } elsif ( $charset eq "other_charset" ){ # die "latin2 still unimplemented"; } else { @@ -116,7 +115,7 @@ sub mylc{ die "Unsupported charset [$charset] use explicitly --lowercase=string and --uppercase=string -options. Remember that both string must match exactly, but +options. Remember that both string must match exactly, but case changed. "; } @@ -136,17 +135,17 @@ sub validate_flag (){ if ($flag =~ m/^$_/){ $flag =~ s/^$_//; return $flag; - } + } } - } + } return ''; } sub process_replacements{ my $file = shift; my @replaces = (); - - open (REPLACE,"< $file") || + + open (REPLACE,"< $file") || die "Error: Could not open replacements file: $file\n"; while (<REPLACE>){ next unless m/^REP[\s\t]*\D.*/; @@ -178,7 +177,7 @@ $debug = ''; $lowercase = ''; $myheader = ''; $printcomments = ''; -$replacements = ''; +$replacements = ''; $split = ''; $uppercase = ''; $wordlist = ''; @@ -218,7 +217,7 @@ if ( not $affixfile ){ if ( $charset and ( $lowercase or $uppercase )){ die "Error: charset and lowercase/uppercase options -are incompatible. Use either charset or lowercase/uppercase options to +are incompatible. 
Use either charset or lowercase/uppercase options to specify the patterns " } elsif ( not $lowercase and not $uppercase and not $charset ){ @@ -231,7 +230,7 @@ if ( scalar(keys %theextraflags) == 0 && $hasextraflags ){ debugprint "$affixfile $charset"; -open (AFFIXFILE,"< $affixfile") || +open (AFFIXFILE,"< $affixfile") || die "Error: Could not open affix file: $affixfile"; if ( $myheader ){ @@ -259,7 +258,7 @@ while (<AFFIXFILE>){ s/^[\s\t]*flag[\s\t]*//; s/[\s\t]*:.*$//; debugprint "Found flag $_ in line $.\n"; - + if (/\*/){ s/[\*\s]//g; $flagcombine="Y"; @@ -267,7 +266,7 @@ while (<AFFIXFILE>){ } else { $flagcombine="N"; } - + if ( $flagname = &validate_flag($_) ){ $myaffix = $affix; } else { @@ -278,11 +277,11 @@ while (<AFFIXFILE>){ } elsif ( $affix and $inflags ) { ($rootname,@comments) = split('#',$_); $comment = '# ' . join('#',@comments); - + $rootname =~ s/\s*//g; $rootname = mylc $rootname; ($rootname,$addtoroot) = split('>',$rootname); - + if ( $addtoroot =~ s/^\-//g ){ ($rootremove,$addtoroot) = split(',',$addtoroot); $addtoroot = "0" unless $addtoroot; @@ -295,15 +294,15 @@ while (<AFFIXFILE>){ if ( $rootname eq '.' 
&& $rootremove ne "0" ){ $rootname = $rootremove; } - + debugprint "$rootname, $addtoroot, $rootremove\n"; if ( $printcomments ){ $affix_line=sprintf("%s %s %-5s %-11s %-24s %s", - $myaffix, $flagname, $rootremove, + $myaffix, $flagname, $rootremove, $addtoroot, $rootname, $comment); } else { $affix_line=sprintf("%s %s %-5s %-11s %s", - $myaffix, $flagname, $rootremove, + $myaffix, $flagname, $rootremove, $addtoroot, $rootname); } $rootremove = "0"; @@ -340,23 +339,23 @@ B<ispellaff2myspell> - A program to convert ispell affix tables to myspell forma Options: --affixfile=s Affix file - --bylocale Use current locale setup for upper/lowercase + --bylocale Use current locale setup for upper/lowercase conversion - --charset=s Use specified charset for upper/lowercase + --charset=s Use specified charset for upper/lowercase conversion (defaults to latin1) --debug Print debugging info --extraflags=s Allow some non alphabetic flags --lowercase=s Lowercase string - --myheader=s Header file - --printcomments Print commented lines in output - --replacements=s Replacements file + --myheader=s Header file + --printcomments Print commented lines in output + --replacements=s Replacements file --split=i Split flags with more that i entries --uppercase=s Uppercase string =head1 DESCRIPTION -B<ispellaff2myspell> is a script that will convert ispell affix tables -to myspell format in a more or less successful way. +B<ispellaff2myspell> is a script that will convert ispell affix tables +to myspell format in a more or less successful way. This script does not create the dict file. Something like @@ -368,85 +367,91 @@ should do the work, with mydict.words+ being the munched wordlist =over 8 -=item B<--affixfile=s> +=item B<--affixfile=s> Affix file. You can put it directly in the command line. -=item B<--bylocale> +=item B<--bylocale> -Use current locale setup for upper/lowercase conversion. 
Make sure -that the selected locale match the dictionary one, or you might get +Use current locale setup for upper/lowercase conversion. Make sure +that the selected locale match the dictionary one, or you might get into trouble. -=item B<--charset=s> +=item B<--charset=s> -Use specified charset for upper/lowercase conversion (defaults to latin1). +Use specified charset for upper/lowercase conversion (defaults to latin1). Currently allowed values for charset are: latin0, latin1, latin2, latin3. -=item B<--debug> +=item B<--debug> Print some debugging info. -=item B<--extraflags:s> +=item B<--extraflags:s> -Allows some non alphabetic flags. +Allows some non alphabetic flags. -When invoked with no value the supported flags are currently those -corresponding to chars represented with the escape char B<\> as +When invoked with no value the supported flags are currently those +corresponding to chars represented with the escape char B<\> as first char. B<\> will be stripped. -When given with the flag prefix will allow that flag and strip the -given prefix. Be careful when giving the prefix to properly escape chars, -e.g. you will need B<-e "\\\\"> or B<-e '\\'> for flags like B<\[> to be stripped to -B<[>. Otherwise you might even get errors. Use B<-e "^"> to allow all +When given with the flag prefix will allow that flag and strip the +given prefix. Be careful when giving the prefix to properly escape chars, +e.g. you will need B<-e "\\\\"> or B<-e '\\'> for flags like B<\[> to be stripped to +B<[>. Otherwise you might even get errors. Use B<-e "^"> to allow all flags and pass them unmodified. -You will need a call to -e for each flag type, e.g., -B<-e "\\\\" -e "~\\\\"> (or B<-e '\\' -e '~\\'>). +You will need a call to -e for each flag type, e.g., +B<-e "\\\\" -e "~\\\\"> (or B<-e '\\' -e '~\\'>). 
-When a prefix is explicitly set, the default value (anything starting by B<\>) +When a prefix is explicitly set, the default value (anything starting by B<\>) is disabled and you need to enable it explicitly as in previous example. -=item B<--lowercase=s> +=item B<--lowercase=s> -Lowercase string. Manually set the string of lowercase chars. This +Lowercase string. Manually set the string of lowercase chars. This requires B<--uppercase> having exactly that string but uppercase. - -=item B<--myheader=s> -Header file. The myspell aff header. You need to write it +=item B<--myheader=s> + +Header file. The myspell aff header. You need to write it manually. This can contain everything you want to be before the affix table -=item B<--printcomments> +=item B<--printcomments> Print commented lines in output. -=item B<--replacements=file> +=item B<--replacements=file> Add a pre-defined replacements table taken from 'file' to the .aff file. Will skip lines not beginning with REP, and set the replacements number appropriately. -=item B<--split=i> +=item B<--split=i> -Split flags with more that i entries. This can be of interest for flags -having a lot of entries. Will split the flag in chunks containing B<i> +Split flags with more that i entries. This can be of interest for flags +having a lot of entries. Will split the flag in chunks containing B<i> entries. -=item B<--uppercase=s> +=item B<--uppercase=s> -Uppercase string. Manually set the sring of uppercase chars. This +Uppercase string. Manually set the sring of uppercase chars. This requires B<--lowercase> having exactly that string but lowercase. =back -If your encoding is currently unsupported you can send me a file with -the two strings of lower and uppercase chars. Note that they must match -exactly but case changed. It will look something like +If your encoding is currently unsupported you can send me a separate file +with the two strings of lower and uppercase chars. 
Note that they must +match exactly but case changed. It will look something like $lowercase='a-zàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ'; $uppercase='A-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ'; +A safer alternative against accidental recoding is to use octal codes for +non 7bit chars. Above strings would then look like + + $lowercase='a-z\340\341\342\343\344\345\346\347\350\351\352\353\354\355\356\357\360\361\362\363\364\365\366\370\371\372\373\374\375\376'; + $uppercase='A-Z\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316\317\320\321\322\323\324\325\326\330\331\332\333\334\335\336'; + =head1 SEE ALSO The OpenOffice.org Lingucomponent Project home page @@ -459,7 +464,7 @@ L<http://lingucomponent.openoffice.org/affix.readme> that provides information about the basics of the myspell affix file format. -You can also take a look at +You can also take a look at /usr/share/doc/libmyspell-dev/affix.readme.gz /usr/share/doc/libmyspell-dev/README.compoundwords
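
For reference, the octal-code approach described in the patches can be sketched as follows. This is only a minimal illustration, not the code of ispellaff2myspell itself: the helper name lc_bytes is hypothetical, the strings use ranges instead of the full lists from the patch, and it assumes the case strings are handed to tr/// through a string eval, where the octal escapes are interpreted.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not the script's mylc): lowercase a single-byte
# (Latin-1) string using explicit upper/lowercase character lists.
sub lc_bytes {
    my ($input, $upper, $lower) = @_;
    my $output = $input;
    # tr/// does not interpolate variables, so the operator is built as a
    # string and evaluated. Escapes such as \300 are only interpreted here,
    # so the source file itself stays plain 7-bit ASCII.
    eval "\$output =~ tr/$upper/$lower/; 1" or die $@;
    return $output;
}

# Single-quoted, so each string holds literal backslash sequences.
# Ranges skip \327 and \367 (multiplication and division signs),
# as the latin1 strings in the patch do.
my $lower = 'a-z\340-\366\370-\376';
my $upper = 'A-Z\300-\326\330-\336';

# "\301RBOL" is A-acute followed by "RBOL" as Latin-1 bytes.
my $lc = lc_bytes("\301RBOL", $upper, $lower);

# Print the result as octal byte values; the first byte maps \301 to \341.
print join(' ', map { sprintf '%03o', ord } split //, $lc), "\n";
```

Since the strings end up inside a string eval, a sketch like this should only be fed trusted input; whether user-supplied --lowercase/--uppercase values need extra checking is a separate question.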