Hi,

On Thu, Aug 11, 2011 at 06:00:55PM +1000, Jim Breen wrote:
> I would be quite happy to add some sort of frequency metric
> to given and family names in the ENAMDICT file. The trouble
> is I have no time spare to go digging out the data.

I have found family-name data in CSV format, shown below. The raw data
has a bit over 100,600 names. Given names are a bit more difficult.

> If someone else were prepared to compile it, I'd be glad to
> add it.
>
> Jim Breen

It looks like this:

"sei","rank","number"
"佐藤","1位",481980
"鈴木","2位",426804
"高橋","3位",353911
"田中","4位",334073
"渡辺","5位",276257
"伊藤","6位",270047
"山本","7位",269344
...
"天徳寺","88108位",1
"天寅","88108位",1
"天屯","88108位",1
"天秤","88108位",1
"天彦","88108位",1
"天峯","88108位",1
"天霧","88108位",1
"天野盛","88108位",1
"天雷","88108位",1
"天路","88108位",1

So the remaining tasks are to ask the copyright holder for permission
and to merge this into your dictionary. (I assume the XML one is the one
you wish to update.)

Normalizing "number" into a percentage may be a good idea, and we could
mark the names with low counts as rare. Alternatively, -10*LOG(ratio),
where ratio is a name's count divided by the total count, may provide a
better index covering a wider range. Please think about it.

I see that some manual touch-ups are needed. I can help.

I will write to the data producer about the license. I will mention our
intended use and ask him to put his database under the same terms as
yours.

Regards,

Osamu
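
A rough sketch, in Python, of the two indexing options mentioned above
(percentage and -10*LOG(ratio)). It assumes that LOG means base 10, that
ratio is a name's count divided by the sum of the "number" column, and
that the input has the same three columns as the sample. The function
name frequency_index is made up for illustration, and the three sample
rows stand in for the full file.

```python
import csv
import io
import math

# A few rows copied from the sample; the real file would be read instead.
SAMPLE = '''"sei","rank","number"
"佐藤","1位",481980
"鈴木","2位",426804
"天路","88108位",1
'''


def frequency_index(rows, total=None):
    """Map each name to (percent, index), index = -10*log10(count/total).

    If total is not given, it is taken as the sum of all counts in rows
    (an assumption; the true population total may be preferable).
    """
    counts = {row["sei"]: int(row["number"]) for row in rows}
    if total is None:
        total = sum(counts.values())
    result = {}
    for name, count in counts.items():
        ratio = count / total
        result[name] = (100.0 * ratio, -10.0 * math.log10(ratio))
    return result


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
table = frequency_index(rows)
for name, (percent, index) in table.items():
    print(f"{name}\t{percent:.4f}%\t{index:.1f}")
```

With this index, common names get small values and rare names get large
ones, so the single-occurrence names no longer collapse to nearly 0%.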