Bug#271397: enamdict: add frequency statistic

Jim Breen Thu, 11 Aug 2011 06:15:25 -0700

Good evening,

2011/8/11 Osamu Aoki <os...@debian.org>:
> I have found a data as below in CSV format for family name.
> Anyway raw data has a bit over 100,600 names.
> Given name is a bit difficult.


Yes, but family names are a great start.

> It looks like....
>
> "sei","rank","number"
> "佐藤","1位",481980
> "鈴木","2位",426804
> "高橋","3位",353911
> "田中","4位",334073
> "渡辺","5位",276257
> "伊藤","6位",270047
> "山本","7位",269344
> ...
> "天徳寺","88108位",1
> "天寅","88108位",1
> "天屯","88108位",1
> "天秤","88108位",1
> "天彦","88108位",1
> "天峯","88108位",1
> "天霧","88108位",1
> "天野盛","88108位",1
> "天雷","88108位",1
> "天路","88108位",1
>
> So remaining task is to ask copyright holder and merge this into your
> dictionary (I assume XML one is the one you wish to update.)

I see you have emailed about it. Thank you for doing that.

> I assume normalizing "Number" into % may be a good idea.  But we may put
> low number ones as rare.  Alternatively, -10*LOG(ratio) may provide better
> index covering wider range.  Please think about it.

I was thinking of dividing into 10 ranks: R1 to R10, with R1 being the most
common.

Something like (in Python): 10-int(math.log10(number)/.63)
would turn those numbers into a 1-10 ranking.
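
As a rough sketch only, that expression could be wrapped up as below. The
function name frequency_rank, the scale parameter, and the clamping to the
1-10 range are illustrative assumptions, not anything already in the
dictionary tooling; the counts are the ones from the CSV sample quoted above.

```python
import math


def frequency_rank(number, scale=0.63):
    """Map an occurrence count to a rank from 1 (most common) to 10 (rarest).

    Hypothetical helper: the name, the scale parameter and the clamping
    are assumptions added for illustration.
    """
    if number < 1:
        raise ValueError("count must be at least 1")
    rank = 10 - int(math.log10(number) / scale)
    # Clamp so a count larger than anything in the sample still maps to R1.
    return max(1, min(10, rank))


if __name__ == "__main__":
    # Counts copied from the quoted CSV sample.
    sample = {"佐藤": 481980, "鈴木": 426804, "田中": 334073, "天路": 1}
    for name, count in sample.items():
        print(f"{name}: R{frequency_rank(count)}")
```

With a divisor of 0.63, only counts above roughly 468,000 reach R1, so in
this sample 佐藤 is R1, 鈴木 is already R2, and a count of 1 gives R10. The
divisor may need tuning if a more even spread across the ranks is wanted.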

Thanks for doing this.

Cheers

Jim
-- 
Jim Breen
Adjunct Snr Research Fellow, Clayton School of IT, Monash University
Webmaster: Hawthorn Rowing Club, Treasurer: Japanese Studies Centre
Graduate student: Language Technology Group, University of Melbourne
