Hi Team,

This is what I want to do:

1. I have 2 datasets, each with the schema id-number and company-name.
2. I ultimately want to link the 2 datasets (by a join or any other means) based on the similarity between their company-name fields.

Example:

Dataset 1

Id | Company Name
---|------------------------
1  | Aroop Inc
2  | Ganguly & Ganguly Corp

Dataset 2

Yo Revenue | Company Name
-----------|---------------------
1K         | aroop and sons
2K         | Ganguly Corp
3K         | Ganguly and Ganguly
2K         | Aroop Inc.
6K         | Ganguly Corporation

In the end I want to get a join based on a smart similarity score between the company names in the 2 datasets.

Final Dataset

Id | Company Name           | Revenue | Matched Company Name from Dataset2 | Similarity Score
---|------------------------|---------|------------------------------------|-----------------
1  | Aroop Inc              | 2K      | Aroop Inc.                         | 99%
2  | Ganguly & Ganguly Corp | 3K      | Ganguly and Ganguly                | 75%

How should I proceed?

(I have already preprocessed the datasets: the names are lowercased, and non-essential words such as pronouns and abbreviations like LTD or Co. are removed.)

Thanks,
Aroop
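
P.S. To make the expected output concrete, here is a rough sketch of the kind of result I am after on the toy data above. It is only an illustration, not a proposed solution: it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for the "smart" similarity score, the names `similarity` and `best_matches` are made up for this example, and it compares every pair of names, which would not scale to large datasets.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity in [0, 1]; a stand-in for a smarter score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_matches(left, right):
    """For each (id, name) in left, pick the most similar (revenue, name) in right.

    Brute force: every left name is compared against every right name.
    """
    rows = []
    for id_, name in left:
        revenue, match = max(right, key=lambda row: similarity(name, row[1]))
        rows.append((id_, name, revenue, match, round(similarity(name, match), 2)))
    return rows


# Toy data copied from the example tables.
dataset1 = [(1, "Aroop Inc"), (2, "Ganguly & Ganguly Corp")]
dataset2 = [
    ("1K", "aroop and sons"),
    ("2K", "Ganguly Corp"),
    ("3K", "Ganguly and Ganguly"),
    ("2K", "Aroop Inc."),
    ("6K", "Ganguly Corporation"),
]

results = best_matches(dataset1, dataset2)
for row in results:
    # Columns: id, company name, revenue, matched name, similarity score
    print(row)
```

On this toy data the sketch pairs "Aroop Inc" with "Aroop Inc." and "Ganguly & Ganguly Corp" with "Ganguly and Ganguly", which is the pairing shown in the Final Dataset table. The scores it produces are not the same as the percentages I wrote there, which were only meant as examples.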