On Sun, Dec 09, 2007 at 07:32:46PM +0100, Jordi Mallach wrote:
> On Sat, Dec 08, 2007 at 05:10:19PM +0100, Agustin Martin wrote:
> > On Fri, Dec 07, 2007 at 08:03:31PM +0100, Marc Coll wrote:
> > > The wordlist file is surprisingly big compared to the same file in
> > > English or Spanish (7.5 MB compared to less than 1 MB). The cause
> > > seems to be the fact that there are a lot of repeated words. A few
> > > examples are: abacallani, embalsameu, embali...
> > >
> > > I'm currently working on a little program which should be able to
> > > find and remove all duplicated occurrences. I'll send the corrected
> > > version of the file to the package maintainer as soon as I get it to
> > > work.
> > I do not have the sources here, but a combination of sort and uniq
> > during the build process should do the trick.
>
> From the build process in debian/rules:
>
> # This generates the wcatalan wordlist.
> debian/strip_mwl | ispell -d $(CURDIR)/catala.debian -e | \
>   tr -s ' ' '\n' | uniq > catala.words.debian
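
uniq on its own only compares each line with the one immediately before
it, so it cannot catch repeats that are not adjacent, and nothing in that
pipeline sorts the words first. A minimal sketch of the difference, using
a made-up four-word sample (the sample and the variable names are only
for illustration, they are not taken from the package):

```shell
# Hypothetical sample: "embalsameu" appears twice, but not on adjacent lines.
words=$(printf '%s\n' embalsameu embali abacallani embalsameu)

# uniq drops a line only when it equals the line right before it,
# so the non-adjacent duplicate survives.
with_uniq=$(printf '%s\n' "$words" | uniq | wc -l | tr -d ' ')

# sort -u sorts first and then keeps one copy of each line.
with_sort_u=$(printf '%s\n' "$words" | sort -u | wc -l | tr -d ' ')

echo "uniq: $with_uniq lines, sort -u: $with_sort_u lines"
```
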

What about

# This generates the wcatalan wordlist.
debian/strip_mwl | ispell -d $(CURDIR)/catala.debian -e | \
  tr -s ' ' '\n' | sort -u > catala.words.debian

using sort with the --unique (-u) option?

> Weird. It is run through uniq, however Marc is right regarding the
> examples he gave, like embalsameu.
>
> Running:
>
> uniq /usr/share/dict/catala > /tmp/catala.uniq
> results in identical files.

You can test with

$ sort -u /usr/share/dict/catala > catala.tmp

sizes:
6519080 catala.tmp
7450965 /usr/share/dict/catala

$ grep -n embalsameu catala.tmp
221517:embalsameu
$ grep -n embalsameu /usr/share/dict/catala
264507:embalsameu
264520:embalsameu

-- 
Agustin