We receive about 100 documents a day, of various sizes. Each document could pertain to any of the 40,000 contacts stored in our database, and a single document may mention more than one contact. For each file we have, we maintain a list of the contacts that are related to or involved in that file. I know it will never be exact, but I'd like to index the possible names found in each document's text, and then try to identify which files the document might pertain to by looking for files that are tied to the contacts named in the document.
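To make the idea concrete, here is a rough Python sketch of the kind of lookup I have in mind. Everything in it is an assumption for illustration: the sample contacts, the file IDs, and the helper names (`normalize`, `build_name_index`, `candidate_names`, `rank_files`) are made up, the name matching is a crude "two adjacent capitalized words" rule rather than the regex code I found, and it keeps everything in memory instead of in the database.

```python
import re
from collections import defaultdict

# Hypothetical sample data; the real contacts and file links live in the database.
CONTACTS = {1: "John Smith", 2: "Mary Jones", 3: "Alan Brown"}
FILE_CONTACTS = {"F-100": {1, 2}, "F-200": {2, 3}, "F-300": {3}}


def normalize(name):
    """Lowercase a name and collapse it to letters separated by single spaces."""
    return " ".join(re.findall(r"[a-z]+", name.lower()))


def build_name_index(contacts):
    """Map each normalized contact name to the set of contact IDs that use it."""
    index = defaultdict(set)
    for contact_id, name in contacts.items():
        index[normalize(name)].add(contact_id)
    return index


def build_contact_files(file_contacts):
    """Invert file -> contacts into contact -> files."""
    contact_files = defaultdict(set)
    for file_id, contact_ids in file_contacts.items():
        for contact_id in contact_ids:
            contact_files[contact_id].add(file_id)
    return contact_files


def candidate_names(text):
    """Yield every pair of adjacent capitalized words as a possible name."""
    words = re.findall(r"[A-Za-z]+", text)
    for first, second in zip(words, words[1:]):
        if first[0].isupper() and second[0].isupper():
            yield normalize(first + " " + second)


def rank_files(text, name_index, contact_files):
    """Score each file by how many of its contacts appear in the text."""
    found = set()
    for candidate in candidate_names(text):
        found |= name_index.get(candidate, set())
    scores = defaultdict(int)
    for contact_id in found:
        for file_id in contact_files.get(contact_id, ()):
            scores[file_id] += 1
    # Highest score first; ties broken by file ID so the order is stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


name_index = build_name_index(CONTACTS)
contact_files = build_contact_files(FILE_CONTACTS)
sample = "Letter from John Smith to Mary Jones regarding the estate."
ranked = rank_files(sample, name_index, contact_files)
```

With the sample data above, `ranked` lists the file tied to both named contacts first, followed by the file tied to only one of them. This ignores initials, "Smith, John" ordering, nicknames, and contacts who share a name, so it is only meant to show the shape of the lookup, not a way to handle 900,000 documents.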
I've found some regex code to parse names from the text, but does anyone have any ideas on how to set up the index? There are currently approximately 900,000 documents in our library. --Warren