Interestingly enough, I just posted a package to CRAN with a function that might 
be useful. It is called MiscPsycho and is intended for psychometric work. The 
updated version of the package should be available in a day or so. It has a 
function called stringMatch, which computes either the Levenshtein distance or 
a normalized version of that distance (what I call the LND). There is also a 
function called stringProbs, which gives the probability of observing a given 
LND.

In education, we merge data sets all the time using a unique ID. It turns out, 
however, that the unique ID is not so unique. It is often shared by many kids 
over time, duplicated within a year, etc. So, we need to first merge using the 
ID and then validate that we have merged properly using some other mechanism. I 
think the LND is very useful for this purpose.

So, here is an example of the function in this package:

### A perfect match gives an LND of 1
> stringMatch('William Clinton', 'William Clinton', normalize='YES')
[1] 1

### A close match gives an LND less than 1
> stringMatch('William Clinton', 'Bill Clinton', normalize='YES')
[1] 0.7333333
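
For readers without the package at hand, the normalization can be sketched in
base R. This is not the MiscPsycho source: it assumes the LND is 1 minus the
Levenshtein distance divided by the length of the longer string, and the name
lnd is made up for illustration. Under that assumption it gives the same two
values as the stringMatch calls above.

```r
# Sketch only: not the MiscPsycho source. Assumes the LND is
# 1 - (Levenshtein distance / length of the longer string).
lnd <- function(a, b) {
  d <- utils::adist(a, b)[1, 1]   # Levenshtein distance between two strings
  1 - d / max(nchar(a), nchar(b))
}

lnd("William Clinton", "William Clinton")   # identical strings, distance 0
lnd("William Clinton", "Bill Clinton")      # distance 4, longer string has 15 characters
```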

If your database is small, you can actually look at the records and see if 
values less than 1 are really the same name spelled differently, misspelled, 
etc.

But if your data set has hundreds of thousands of records, that becomes 
impossible. So, what I do is compute the probability of observing an LND of .7 
or higher. This is implemented in the stringProbs function. Let's say the 
probability of observing an LND of .7 or higher is .05, and that the 
probability is larger for lower LND cutoffs. Assuming you are willing to live 
with this much risk, you might then subset your data and retain records as 
"valid merges" only if the LND value is greater than .7.
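
The merge-then-validate step might look roughly like the sketch below. The two
rosters, their column names, and the helper lnd are all hypothetical, and lnd
again assumes the LND is 1 minus the Levenshtein distance divided by the
length of the longer name. The .7 cutoff is simply the one used in the text
above; in practice it would come from the probabilities you are willing to
accept.

```r
# Two hypothetical rosters that share a "unique" ID
roster_a <- data.frame(
  id   = c(101, 102, 103, 104),
  name = c("William Clinton", "Maria Gonzalez", "John Smith", "Katherine Jones"),
  stringsAsFactors = FALSE
)
roster_b <- data.frame(
  id   = c(101, 102, 103, 104),
  name = c("Bill Clinton", "Maria Gonzales", "Tanya Brown", "Katherine Jones"),
  stringsAsFactors = FALSE
)

# Step 1: merge on the ID; the name columns become name_a and name_b
merged <- merge(roster_a, roster_b, by = "id", suffixes = c("_a", "_b"))

# Step 2: score each merged pair of names.
# Assumed form of the LND: 1 - distance / length of the longer name.
lnd <- function(a, b) {
  d <- mapply(function(x, y) utils::adist(x, y)[1, 1], a, b)
  unname(1 - d / pmax(nchar(a), nchar(b)))
}
merged$lnd <- lnd(merged$name_a, merged$name_b)

# Step 3: keep pairs above the cutoff as "valid merges", set the rest aside
cutoff <- 0.7
valid  <- merged[merged$lnd > cutoff, ]
review <- merged[merged$lnd <= cutoff, ]
```

Here the record with ID 103 lands in review, because the two names attached to
that ID have almost nothing in common, while the other three are retained.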

The record linkage literature is very large in general, but it is extremely 
small in education. So, I have a paper in press demonstrating this application 
and comparing it to other linking methods, such as the use of Soundex codes. In 
the paper, I also discuss how you would combine other demographic information, 
such as birthdates, to further explore the probability of a correct match.

Harold



-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
Behalf Of David Winsemius
Sent: Wednesday, November 18, 2009 4:32 PM
To: Dagan A WRIGHT
Cc: r-help@r-project.org
Subject: Re: [R] Data linkage functions for probabilistic linkage using person 
identifiers


On Nov 18, 2009, at 1:21 PM, Dagan A WRIGHT wrote:

> I am somewhat new to R although using and liking already.  I am  
> curious if there are any probabilistic packages similar in function  
> to others such as Link King (http://www.the-link-king.com/).  I am  
> looking for functions in SSN, First/Last name, date of birth, and a  
> couple other indicators for matching.
>

Cannot comment on similarities to Link King but have used the  
functions found with this search in similar applications:

RSiteSearch("Levenshtein")  #yes, that is spelled correctly


> Thanks
>
> Dagan Wright, Ph.D., M.S.P.H.
> Lead Addictions Research Analyst, Analysis & Evaluation Unit
> Addictions & Mental Health Division (AMH)
> 500 Summer St. NE E86
> Salem, Oregon 97301-1118
>
> Office number: 503-945-5726
> Fax number:     503-378-8467
> dagan.a.wri...@state.or.us
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
