Rohan,

You can do that with Lucene's tokenizers, which split text into individual 
tokens/words, plus a HashMap whose keys are the words/tokens from the first 
document.  You can then tokenize the second doc and look up each of its words 
in the HashMap.
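
Here is a rough sketch of that idea in plain Java, not a definitive 
implementation.  Two assumptions on my part: the tokenize() helper is a 
hypothetical stand-in for a real Lucene tokenizer (it just lowercases and 
splits on anything that is not a letter or digit), and I use a HashSet instead 
of a HashMap because only membership is checked here.  A HashMap from word to 
count would be the natural swap if you also want term frequencies.  The class 
and method names are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.Consumer;

final class WordOverlap {

    // Hypothetical stand-in for a Lucene tokenizer: reads line by line,
    // lowercases, and splits on runs of non-letter/non-digit characters.
    static void tokenize(Reader in, Consumer<String> sink) {
        try {
            BufferedReader reader = new BufferedReader(in);
            String line;
            while ((line = reader.readLine()) != null) {
                for (String token : line.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                    if (!token.isEmpty()) {
                        sink.accept(token);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the distinct words of the second document that also occur in
    // the first, in the order they first appear in the second document.
    static Set<String> commonWords(Reader first, Reader second) {
        // Pass 1: collect the vocabulary of the first document.
        Set<String> vocabulary = new HashSet<>();
        tokenize(first, vocabulary::add);

        // Pass 2: stream the second document and keep words seen in pass 1.
        Set<String> common = new LinkedHashSet<>();
        tokenize(second, token -> {
            if (vocabulary.contains(token)) {
                common.add(token);
            }
        });
        return common;
    }
}
```

Both files are read line by line, so only the distinct words of the first 
document (plus the overlap) are held in memory, not the files themselves.  
You would call it with something like 
WordOverlap.commonWords(new FileReader("first.txt"), new FileReader("second.txt")), 
where the file names are placeholders.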

Our Key Phrase Extractor ( 
http://sematext.com/products/key-phrase-extractor/index.html ) includes similar 
functionality that works with 2 corpora (or 2 pieces of text or 2 language 
models) and gets you the "overlap".  I think it also takes term frequencies 
into account, which can be handy.

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



----- Original Message ----
> From: rohan rai <hiroha...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Thu, February 3, 2011 2:35:39 PM
> Subject: Re: Solr for finding similar word between two documents
> 
> Let's say I have a document (file) which is large and contains words.
> 
> And the 2nd document also is a text file.
> 
> The problem is to find all those words in the 2nd document which are present
> in the first document, when both of the files are large.
> 
> Regards
> Rohan
> 
> On Fri, Feb 4, 2011 at 1:01 AM, openvictor Open <openvic...@gmail.com> wrote:
> 
> > Rohan: what you want to do can be done with quite little effort if your
> > document has a limited size (up to a few MB), using common and basic
> > structures like HashMap.
> >
> > Do you have any additional information on your problem so that we can give
> > you more useful input?
> >
> > 2011/2/3 Gora Mohanty <g...@mimirtech.com>
> >
> > > On Thu, Feb 3, 2011 at 11:32 PM, rohan rai <hiroha...@gmail.com> wrote:
> > > > Is there a way to use solr and get similar words between two documents
> > > > (files).
> > > [...]
> > >
> > > This is *way* too vague to make any sense out of. Could you elaborate,
> > > as I could have sworn that what you seem to want is the essential
> > > function of a search engine.
> > >
> > > Regards,
> > > Gora
> > >
> >
> 
