Bug#271397: enamdict: add frequency statistic

Osamu Aoki Thu, 11 Aug 2011 03:58:54 -0700

Hi,

On Thu, Aug 11, 2011 at 06:00:55PM +1000, Jim Breen wrote:
> I would be quite happy to add some sort of frequency metric
> to given and family names in the ENAMDICT file. The trouble
> is I have no time spare to go digging out the data.


I have found family-name data in CSV format, as shown below.
The raw data has a bit over 100,600 names.
Given names are more difficult.

> If someone else were prepared to compile it, I'd be glad to
> add it.
> 
> Jim Breen

It looks like this:

"sei","rank","number"
"佐藤","1位",481980
"鈴木","2位",426804
"高橋","3位",353911
"田中","4位",334073
"渡辺","5位",276257
"伊藤","6位",270047
"山本","7位",269344
...
"天徳寺","88108位",1
"天寅","88108位",1
"天屯","88108位",1
"天秤","88108位",1
"天彦","88108位",1
"天峯","88108位",1
"天霧","88108位",1
"天野盛","88108位",1
"天雷","88108位",1
"天路","88108位",1

So the remaining tasks are to ask the copyright holder for permission and
to merge this into your dictionary (I assume the XML one is the one you
wish to update).

I think normalizing "number" into a percentage may be a good idea, though
we could simply mark the low-count names as rare.  Alternatively,
-10*LOG(ratio) may provide a better index covering a wider range.  Please
think about it.
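
As a rough sketch of the two normalizations above: I am assuming here that
LOG means the base-10 logarithm and that "ratio" means a name's count
divided by the total count.  The total below is only the sum of a few
sample rows, not the real total of the data set, and the function names
are made up for illustration.

```python
import csv
import io
import math

# A few rows copied from the sample above; NOT the full data set.
SAMPLE = '''"sei","rank","number"
"佐藤","1位",481980
"鈴木","2位",426804
"天路","88108位",1
'''


def frequency_index(count, total):
    """Return (percent, index) for one name.

    percent = 100 * count / total
    index   = -10 * log10(count / total)   (assumed meaning of -10*LOG(ratio))
    """
    ratio = count / total
    return 100.0 * ratio, -10.0 * math.log10(ratio)


def score_rows(csv_text):
    """Parse the CSV text and score every family name in it."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Assumption: the total is the sum of the rows we were given.
    total = sum(int(row["number"]) for row in rows)
    return {
        row["sei"]: frequency_index(int(row["number"]), total)
        for row in rows
    }


scores = score_rows(SAMPLE)
for name, (percent, index) in scores.items():
    print(name, f"{percent:.6f}%", f"index={index:.1f}")
```

With this index, common names get small values and rare names get large
ones, so a count of 1 still lands on a usable scale instead of rounding
to 0%.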

I see that some manual touch-ups are needed.  I can help.

I will write to the data producer about the license.

I will mention our intended use and ask him to put his database under
the same terms as yours.

Regards,

Osamu




