Rohan,

You can do that with Lucene's tokenizers, which split the text into individual tokens/words, plus a HashMap whose keys are the words/tokens from the first document. You then tokenize the second document and look up each of its words in the HashMap.
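A minimal sketch of that approach in plain Java. The regex-based tokenize() is only a stand-in I am assuming here so the example runs without Lucene on the classpath; in practice a Lucene tokenizer/analyzer would take its place. The class and method names (WordOverlap, wordCounts, commonWords) are made up for illustration.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class WordOverlap {

    // Stand-in tokenizer (an assumption, not Lucene): lowercases the text and
    // splits on any run of characters that are not letters or digits.
    static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
    }

    // Builds the HashMap for the first document: word -> number of occurrences.
    static Map<String, Integer> wordCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokenize(text)) {
            if (!token.isEmpty()) { // split() can yield a leading empty string
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Returns the words of the second document that also occur in the first,
    // in order of first appearance in the second document.
    static Set<String> commonWords(String firstDoc, String secondDoc) {
        Map<String, Integer> firstCounts = wordCounts(firstDoc);
        Set<String> common = new LinkedHashSet<>();
        for (String token : tokenize(secondDoc)) {
            if (!token.isEmpty() && firstCounts.containsKey(token)) {
                common.add(token);
            }
        }
        return common;
    }

    public static void main(String[] args) {
        // Prints [quick, fox]
        System.out.println(commonWords("The quick brown fox", "A QUICK red fox jumps"));
    }
}
```

For large files you would feed the text in line by line instead of holding whole documents as Strings; only the map of distinct words from the first document has to stay in memory.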
Our Key Phrase Extractor ( http://sematext.com/products/key-phrase-extractor/index.html ) includes similar functionality: it works with 2 corpora (or 2 pieces of text, or 2 language models) and gets you the "overlap". I think it also takes term frequencies into consideration, which can be handy.

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message ----
> From: rohan rai <hiroha...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Thu, February 3, 2011 2:35:39 PM
> Subject: Re: Solr for finding similar word between two documents
>
> Let's say I have a document (file) which is large and contains words.
>
> And the 2nd document is also a text file.
>
> The problem is to find all those words in the 2nd document which are present
> in the first document, when both of the files are large.
>
> Regards
> Rohan
>
> On Fri, Feb 4, 2011 at 1:01 AM, openvictor Open <openvic...@gmail.com> wrote:
>
> > Rohan: what you want to do can be done with quite little effort if your
> > document has a limited size (up to a few MB) with common and basic
> > structures like HashMap.
> >
> > Do you have any additional information on your problem so that we can give
> > you more useful input?
> >
> > 2011/2/3 Gora Mohanty <g...@mimirtech.com>
> >
> > > On Thu, Feb 3, 2011 at 11:32 PM, rohan rai <hiroha...@gmail.com> wrote:
> > > > Is there a way to use Solr and get similar words between two documents
> > > > (files).
> > > [...]
> > >
> > > This is *way* too vague to make any sense out of. Could you elaborate,
> > > as I could have sworn that what you seem to want is the essential
> > > function of a search engine.
> > >
> > > Regards,
> > > Gora