Re: TermVectorComponent for tag generation?

Grant Ingersoll Sat, 01 Nov 2008 12:34:35 -0700



On Nov 1, 2008, at 3:04 PM, Jon Baer wrote:


On Nov 1, 2008, at 1:16 PM, Grant Ingersoll wrote:

How do you propose to distinguish those words from the other ones?


** They are field values from other documents

But so are many other words from that document, what separates out [Lucene, PDF, HTML, Microsoft Word] from the rest? Your brain made the distinction, but what info exists in that document such that a computer can? (this is a leading question, I have some ideas, but I think hearing it from you will help me better understand what you are trying to do)

The problem you are addressing is often called keyword extraction. In general, it's a difficult problem, but you may have domain knowledge that can help.
** I'm finding it hard to believe Lucene can do an amazing job at search and yet has nothing to tell me whether a generated list of content is present in a resulting document.

I think it can, I think the thing I'm missing is where the generated list comes from. Given the list, I think it's just another search, right?

So, I suppose you could get the TV (term vector) for your current document, along with the DF (doc freq), and know which terms occur in other documents. Then you could go get those documents by searching for each of those terms.
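
To make that concrete, here is a rough sketch of the idea in plain Python. It does not use Lucene or the TermVectorComponent at all. The toy corpus, the document ids, and the function names (term_vector, build_postings, related_by_shared_terms) are all made up for illustration, and the postings map only stands in for what the index would give you as DF and search results.

```python
from collections import Counter, defaultdict

# Toy corpus standing in for an index; ids and text are invented for illustration.
DOCS = {
    "doc1": "lucene indexes pdf and html files",
    "doc2": "extracting text from pdf and microsoft word",
    "doc3": "lucene search basics",
}

def term_vector(text):
    """Term -> frequency within one document (a stand-in for a stored term vector)."""
    return Counter(text.lower().split())

def build_postings(docs):
    """Term -> set of ids of documents containing it; the set size is the DF."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in term_vector(text):
            postings[term].add(doc_id)
    return postings

def related_by_shared_terms(doc_id, docs, postings):
    """For each term of doc_id that also occurs elsewhere, list those other documents."""
    related = {}
    for term in term_vector(docs[doc_id]):
        # "Searching for the term" is simulated by a postings lookup.
        others = postings[term] - {doc_id}
        if others:
            related[term] = sorted(others)
    return related

postings = build_postings(DOCS)
result = related_by_shared_terms("doc1", DOCS, postings)
print(result)
```

Note that on this toy data "and" comes back alongside "lucene" and "pdf", which is the same question as above: something beyond "occurs in other documents" is needed to separate the interesting terms from the rest.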

However, I still suspect I'm missing something, so I'd say give it a try! Maybe trying it out in code would be the best way to articulate it.


-Grant
