Hi Team,

This is what I want to do:
1. I have 2 datasets, each with a company-name field (the first also has an 
id number, the second a revenue figure).
2. I ultimately want to be able to link (by a join or any other means) the 2 
datasets based on the similarity between their company-name fields.


Example:

Dataset 1
---------
Id | Company Name
---|------------------------
1  | Aroop Inc
2  | Ganguly & Ganguly Corp


Dataset 2
---------
Yo Revenue | Company Name
-----------|--------------------
1K         | aroop and sons
2K         | Ganguly Corp
3K         | Ganguly and Ganguly
2K         | Aroop Inc.
6K         | Ganguly Corporation



I want to be able to get a join in the end, based on a smart similarity score 
between the company names in the 2 data sets.

Final Dataset
-------------
Id | Company Name           | Revenue | Matched Company Name from Dataset2 | Similarity Score
---|------------------------|---------|------------------------------------|-----------------
1  | Aroop Inc              | 2K      | Aroop Inc.                         | 99%
2  | Ganguly & Ganguly Corp | 3K      | Ganguly and Ganguly                | 75%
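
To make the intended result concrete, here is a minimal sketch in plain Python of the kind of best-match join described above. It assumes the character-level ratio from the standard library's `difflib.SequenceMatcher` as the similarity score, which is only one possible choice. The helper names `similarity` and `best_match_join` are hypothetical, and the scores it produces will not necessarily equal the illustrative percentages in the table.

```python
from difflib import SequenceMatcher

# Sample rows copied from the two example datasets.
dataset1 = [(1, "Aroop Inc"), (2, "Ganguly & Ganguly Corp")]
dataset2 = [
    ("1K", "aroop and sons"),
    ("2K", "Ganguly Corp"),
    ("3K", "Ganguly and Ganguly"),
    ("2K", "Aroop Inc."),
    ("6K", "Ganguly Corporation"),
]

def similarity(a, b):
    """Case-insensitive character-level similarity in [0, 1] (assumed metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_match_join(left, right):
    """For each left row, attach the right row with the highest similarity."""
    rows = []
    for id_, name in left:
        revenue, matched = max(right, key=lambda r: similarity(name, r[1]))
        rows.append((id_, name, revenue, matched, similarity(name, matched)))
    return rows

final = best_match_join(dataset1, dataset2)
for row in final:
    print(row)
```

This compares every name in the first dataset against every name in the second, so it only illustrates the desired output on small data. It is not a proposal for how to do this at scale.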

How should I proceed? (I have preprocessed the datasets to lowercase the 
names and remove non-essential words, such as pronouns and abbreviations like 
LTD or Co.)
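
A minimal sketch of that kind of preprocessing, assuming a small hypothetical stop list (`STOP_WORDS`) and a hypothetical helper `normalize`. The actual list of words to drop would depend on the data. Rewriting "&" as "and" is an extra assumption, not something stated above.

```python
import re

# Hypothetical stop list of company suffixes; tune this to the real data.
STOP_WORDS = {"ltd", "co", "inc", "corp", "corporation", "llc"}

def normalize(name):
    """Lowercase, spell out '&', strip punctuation, drop stop words."""
    name = name.lower().replace("&", " and ")
    name = re.sub(r"[^a-z0-9\s]", " ", name)  # punctuation becomes spaces
    tokens = [t for t in name.split() if t not in STOP_WORDS]
    return " ".join(tokens)

print(normalize("Ganguly & Ganguly Corp"))
print(normalize("Aroop Inc."))
```

With suffixes removed, "Ganguly Corp" and "Ganguly Corporation" both reduce to the same string, which is worth keeping in mind when picking a single best match.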

Thanks
Aroop
