On Nov 1, 2008, at 3:04 PM, Jon Baer wrote:
On Nov 1, 2008, at 1:16 PM, Grant Ingersoll wrote:
How do you propose to distinguish those words from the other ones?
** They are field values from other documents
But so are many other words from that document. What separates out
[Lucene, PDF, HTML, Microsoft Word] from the rest? Your brain made
the distinction, but what info exists in that document such that a
computer can? (this is a leading question, I have some ideas, but I
think hearing it from you will help me better understand what you are
trying to do)
The problem you are addressing is often called keyword extraction.
In general, it's a difficult problem, but you may have domain
knowledge that can help.
** I'm finding it hard to believe that Lucene can do an amazing job at
search yet has nothing to tell me whether a generated list of content
is present in a resulting document.
I think it can, I think the thing I'm missing is where the generated
list comes from. Given the list, I think it's just another search,
right?
So, I suppose you could get the term vector (TV) for your current
document, along with the DF (doc freq) of each term, and so know which
terms occur in other documents. Then you could go get those documents
by searching for each of those terms.
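
To make that concrete, here is a rough sketch of the term vector plus
doc freq idea in plain Python. It is not the Lucene API: the in-memory
index, the tokenizer, the function names and the four sample documents
are all made up for illustration.

```python
from collections import Counter, defaultdict

PUNCT = ".,;:!?()[]"

def tokenize(text):
    # Naive whitespace tokenizer; a real analyzer would do much more.
    return [w.strip(PUNCT).lower() for w in text.split() if w.strip(PUNCT)]

def build_index(docs):
    # Toy inverted index: term -> set of ids of the documents containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def shared_terms(doc_id, docs, index):
    # "Term vector" of the current document: term -> frequency in that doc.
    term_vector = Counter(tokenize(docs[doc_id]))
    result = {}
    for term in term_vector:
        doc_freq = len(index[term])  # number of docs containing the term
        if doc_freq > 1:             # the term also occurs somewhere else
            # The "search" step: fetch the other docs containing the term.
            result[term] = sorted(index[term] - {doc_id})
    return result

# Hypothetical sample data: short documents whose whole content is a field value.
docs = {
    "a": "Lucene can index PDF and HTML",
    "b": "Lucene",
    "c": "PDF",
    "d": "Unrelated text",
}
index = build_index(docs)
related = shared_terms("a", docs, index)
```

Note that a doc freq above 1 also matches any ordinary word that two
documents happen to share, so this alone does not answer the question
of what separates the interesting terms from the rest.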
However, I still suspect I'm missing something, so I'd say give it a
try! Maybe trying it out in code would be the best way to articulate
it.
-Grant