What you want to do is basically named entity recognition. We have a very similar use case: medical/scientific documents in which we need to find disease names, drug names, MeSH terms, etc.
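
To make the idea concrete, here is a rough Python sketch of dictionary-based tagging that ignores word order, so your "breast cancer" and "cancer in breast" map to the same entry. It is only a toy illustration under my own assumptions: the stopword list, the function names (`normalize`, `build_index`, `annotate`) and the 6-token window are all made up for the example, and a plain in-memory dict like this is not something I have tried at the scale of 3 million terms.

```python
import re

# Assumption: a tiny hand-picked stopword list, used only for this example.
STOPWORDS = {"in", "of", "the", "a", "an"}
TOKEN = re.compile(r"[A-Za-z0-9]+")


def normalize(words):
    """Drop stopwords and sort the rest, so word order does not matter."""
    return tuple(sorted(w for w in words if w not in STOPWORDS))


def build_index(terms):
    """Map each order-insensitive key to the first dictionary term seen for it."""
    index = {}
    for term in terms:
        key = normalize(w.lower() for w in TOKEN.findall(term))
        if key:
            index.setdefault(key, term)
    return index


def annotate(text, index, max_window=6):
    """Return (start, end, dictionary_term) tuples, preferring the longest match."""
    tokens = [(m.group().lower(), m.start(), m.end()) for m in TOKEN.finditer(text)]
    annotations = []
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the widest window first so longer terms win over their sub-terms.
        for n in range(min(max_window, len(tokens) - i), 0, -1):
            window = tokens[i:i + n]
            # A match may contain stopwords but must not start or end with one.
            if window[0][0] in STOPWORDS or window[-1][0] in STOPWORDS:
                continue
            key = normalize(w for w, _, _ in window)
            if key in index:
                annotations.append((window[0][1], window[-1][2], index[key]))
                matched = n
                break
        i += matched if matched else 1
    return annotations


if __name__ == "__main__":
    index = build_index(["breast cancer", "back pain"])
    text = "Patient reports back pain and a history of cancer in breast."
    for start, end, term in annotate(text, index):
        print(text[start:end], "->", term)
```

Non-dictionary words are simply skipped, which matches the requirement of annotating only the medical terms. Note that sorting tokens also matches unrelated sentences that happen to contain the same words next to each other, so a real setup would need stricter rules than this.
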
Take a look at David Smiley's Solr Text Tagger (https://github.com/OpenSextant/SolrTextTagger), which we've been using with some success for this task.

best
-Simon

On Mon, Aug 24, 2015 at 2:13 PM, afrooz <afr.rahm...@gmail.com> wrote:

> Thanks Erick,
> I will explain the detail scenario so you might give me a solution:
> I want to annotate a medical document base on only medical dictionary. I
> don't need to annotate non medical words of document at all.
> The medical dictionary contains terms which contains multiple words, and
> these terms all together has a specific medical meanings. For example "back
> Pain", "back" and "pain" are two separate words but together they have
> another meaning. these terms might be using in different orders in a
> sentences but all with a same meaning. Ex "breast cancer" or "cancer in
> breast" should be consider the same...
> We have terms even more than 6 words also.
>
> So the question is that "I have a document with around 700 words and i need
> to annotate this document base on medical terminology of 3 million size in
> records"
> any idea how to do this?
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/how-to-index-document-with-multiple-words-phrases-and-words-permutation-tp4224919p4224970.html
> Sent from the Solr - User mailing list archive at Nabble.com.