On Nov 1, 2008, at 3:04 PM, Jon Baer wrote:


On Nov 1, 2008, at 1:16 PM, Grant Ingersoll wrote:

How do you propose to distinguish those words from the other ones?

** They are field values from other documents

But so are many other words from that document. What separates [Lucene, PDF, HTML, Microsoft Word] from the rest? Your brain made the distinction, but what information exists in that document that would let a computer make it? (This is a leading question. I have some ideas, but I think hearing it from you will help me better understand what you are trying to do.)



The problem you are addressing is often called keyword extraction. In general, it's a difficult problem, but you may have domain knowledge that can help.

** I'm finding it hard to believe that Lucene can do an amazing job at search and yet has nothing to tell me whether a generated list of content is present in a resulting document.

I think it can, I think the thing I'm missing is where the generated list comes from. Given the list, I think it's just another search, right?
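
To make the "given the list, it's just another search" idea concrete, here is a rough sketch in plain Python rather than Lucene. The function names, the regex tokenizer and the sample text are all made up for illustration; a real setup would run the list entries as term or phrase queries through the same analyzer used at index time.

```python
import re

def tokenize(text):
    # Crude stand-in for an analyzer: lowercase, split on non-alphanumerics.
    return re.findall(r"[a-z0-9]+", text.lower())

def contains_phrase(doc_tokens, phrase_tokens):
    # True if phrase_tokens appears as a contiguous run inside doc_tokens.
    n = len(phrase_tokens)
    if n == 0:
        return False
    return any(doc_tokens[i:i + n] == phrase_tokens
               for i in range(len(doc_tokens) - n + 1))

def present_terms(candidates, document_text):
    # Return the candidates (single words or phrases) found in the document,
    # in the order they were given.
    doc_tokens = tokenize(document_text)
    return [c for c in candidates if contains_phrase(doc_tokens, tokenize(c))]

# Hypothetical list and document text.
candidates = ["Lucene", "PDF", "HTML", "Microsoft Word", "Excel"]
text = "Lucene can index PDF, HTML and Microsoft Word files."
found = present_terms(candidates, text)
```

The hard part is still where `candidates` comes from; once it exists, the check itself is cheap.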

So, I suppose you could get the TV (term vector) for your current document, along with the DF (doc freq) of each term, to learn which terms also occur in other documents. Then you could fetch those documents by searching for each of those terms.
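
A minimal sketch of that term vector plus doc freq idea, again in plain Python with in-memory stand-ins instead of the Lucene API. The `docs` collection, `build_index` and `shared_terms` are hypothetical names invented for this example; in Lucene the term vector and doc freq would come from the index itself.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    # Crude stand-in for an analyzer: lowercase, split on non-alphanumerics.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    # term_vectors: doc id -> Counter of term frequencies (the "TV").
    # postings: term -> set of doc ids; its size is the doc freq (the "DF").
    term_vectors = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    postings = defaultdict(set)
    for doc_id, tv in term_vectors.items():
        for term in tv:
            postings[term].add(doc_id)
    return term_vectors, postings

def shared_terms(doc_id, term_vectors, postings):
    # Map each term of doc_id that has DF > 1 to the other docs containing it.
    return {term: sorted(postings[term] - {doc_id})
            for term in term_vectors[doc_id]
            if len(postings[term]) > 1}

# Hypothetical three-document collection.
docs = {
    "doc1": "Lucene can index PDF and HTML files",
    "doc2": "Microsoft Word and PDF converters",
    "doc3": "Cooking pasta at home",
}
term_vectors, postings = build_index(docs)
related = shared_terms("doc1", term_vectors, postings)
```

Note that on this toy data "and" comes back alongside "pdf", which is the earlier question again: DF alone does not separate the interesting terms from the common ones, so some stopword list or DF ceiling would probably be needed.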

However, I still suspect I'm missing something, so I'd say give it a try! Maybe trying it out in code would be the best way to articulate it.

-Grant
