Good evening,

2011/8/11 Osamu Aoki <os...@debian.org>:
> I have found a data as below in CSV format for family name.
> Anyway raw data has a bit over 100,600 names.
> Given name is a bit difficult.
Yes, but family names are a great start.

> It looks like....
>
> "sei","rank","number"
> "佐藤","1位",481980
> "鈴木","2位",426804
> "高橋","3位",353911
> "田中","4位",334073
> "渡辺","5位",276257
> "伊藤","6位",270047
> "山本","7位",269344
> ...
> "天徳寺","88108位",1
> "天寅","88108位",1
> "天屯","88108位",1
> "天秤","88108位",1
> "天彦","88108位",1
> "天峯","88108位",1
> "天霧","88108位",1
> "天野盛","88108位",1
> "天雷","88108位",1
> "天路","88108位",1
>
> So remaining task is to ask copyright holder and merge this into your
> dictionary (I assume XML one is the one you wish to update.)

I see you have emailed about it. Thank you for doing that.

> I assume normalizing "Number" into % may be a good idea. But we may put
> low number ones as rare. Alternatively, -10*LOG(ratio) may provide better
> index covering wider range. Please think about it.

I was thinking of dividing into 10 ranks: R1 to R10, with R1 being
the most common. Something like (in Python):

    10-int(math.log10(number)/.63)

would turn those numbers into a 1-10 ranking.

Thanks for doing this.

Cheers

Jim

-- 
Jim Breen
Adjunct Snr Research Fellow, Clayton School of IT, Monash University
Webmaster: Hawthorn Rowing Club, Treasurer: Japanese Studies Centre
Graduate student: Language Technology Group, University of Melbourne
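
As an illustration of the ranking formula suggested in the message above, here is a minimal sketch applied to a few rows from the quoted CSV sample. The function name `frequency_rank`, the input check, and the clamping to the 1-10 range are assumptions added for the example; only the expression `10-int(math.log10(number)/.63)` comes from the message.

```python
import math


def frequency_rank(number: int) -> int:
    """Map a raw count to a rank from 1 (most common) to 10 (rarest)."""
    if number < 1:
        # Assumption: counts are positive integers, as in the quoted sample.
        raise ValueError("number must be a positive count")
    # The formula proposed in the message.
    rank = 10 - int(math.log10(number) / 0.63)
    # Assumption: clamp so that counts larger than those in the sample
    # cannot produce a rank below 1. Not part of the original formula.
    return max(1, min(10, rank))


# A few rows taken from the quoted CSV sample.
for sei, number in [("佐藤", 481980), ("鈴木", 426804), ("天路", 1)]:
    print(sei, f"R{frequency_rank(number)}")
# 佐藤 R1
# 鈴木 R2
# 天路 R10
```

With the divisor 0.63, a count has to be at least about 468,000 (10 to the power 5.67) to reach R1, so of the quoted rows only 佐藤 lands there.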