Hi,

On Thu, Aug 13, 2009 at 19:29, Mark Bennett <mbenn...@ideaeng.com> wrote:

> There are comments in the Solr materials about having an option to cluster
> based on the entire document set, and some warning about this being
> atypical and possibly slow.  And from what you're saying, for a big enough
> docset, it might go from "slow" to "impossible", I'm not sure.


For Carrot2, it would go to "impossible" I'd say. But as Grant mentioned
earlier, Mahout is developing clustering algorithms that should be able to
handle the whole-index types of docsets.

> And so my question was, *if* you were willing to spend that much time and
> effort to cluster all the text of all the documents (and if it were even
> possible), would the result perform better than the standard TF/IDF
> techniques?


Depends on the algorithm, really. In the case of Carrot2, we don't re-rank
documents within clusters; we simply keep whatever document order we got on
input. As far as I'm aware, most clustering algorithms do pretty much the
same: they concentrate on finding groups of documents and don't delve much
into the issue of ranking documents within clusters.


> In the application I'm considering, the queries tend to be longer than
> average, more like full sentences or more.  And they tend to be of a
> question and answer nature.  I've seen references in several search engines
> that QandA search sometimes benefits from alternative search techniques.
> And, from a separate email, the IDF part of the standard similarity may be
> causing a problem, so I'm casting a wide net for other ideas.  Just
> brainstorming here... :-)


Because of what I described above, clustering the whole index may not give
you the best results. But you can try something different. You could fetch
a bunch (100-500) of more or less relevant documents for the question
(MLT, i.e. MoreLikeThis, should be fine to start with), add your question as
an extra document, perform clustering and see where the question-document
ends up. If it doesn't end up in the Other Topics cluster, you could examine
whether the other documents from that cluster give an answer to the
question. In this scenario, Carrot2 should be fine, at least
performance-wise. I've not followed the QA literature very closely, so it's
hard to say what the results would be quality-wise, but it should be very
quick to try. The Carrot2 Clustering Workbench [1][2] may come in handy for
the experiments too.
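
A rough sketch of the experiment described above, in Python, just to show
the shape of it. This is not Carrot2 and uses none of its algorithms: the
clustering here is a naive single-pass grouping on term-frequency cosine
similarity, swapped in so the example runs with the standard library only.
The function names, the 0.3 threshold and the hard-coded candidate list are
all assumptions for illustration; in practice the candidates would be the
100-500 documents fetched from the index. A question that ends up alone in
its cluster is treated as the rough analogue of landing in Other Topics.

```python
import math
import re
from collections import Counter


def tf_vector(text):
    """Bag-of-words term frequencies (no stemming, no stop words)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(docs, threshold=0.3):
    """Naive single-pass clustering (an illustrative stand-in, NOT Carrot2).

    Each document joins the first cluster whose seed (first member) it
    resembles, otherwise it starts a new cluster. Members keep their input
    order; no re-ranking happens inside a cluster.
    """
    vectors = [tf_vector(d) for d in docs]
    clusters = []  # each cluster is a list of indices into docs
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def cluster_mates_of_question(question, candidates, threshold=0.3):
    """Add the question as an extra document, cluster everything, and
    return the candidates that share a cluster with the question.

    An empty list means the question sat alone, i.e. nothing to examine.
    """
    docs = list(candidates) + [question]
    q_index = len(docs) - 1
    for members in cluster(docs, threshold):
        if q_index in members:
            return [candidates[i] for i in members if i != q_index]
    return []


# Hypothetical candidate documents, standing in for the fetched result set.
candidates = [
    "how to configure solr clustering component",
    "mahout clustering for large indexes",
    "recipe for apple pie with cinnamon",
]
question = "how do I configure the clustering component in solr"
for doc in cluster_mates_of_question(question, candidates):
    print(doc)
```

The documents returned are the ones worth inspecting for an answer. With a
real clustering engine the same loop applies: only the cluster() step would
be replaced.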

S.

[1] http://download.carrot2.org/head/manual/#section.workbench
[2] http://download.carrot2.org/head/manual/#section.getting-started.xml-files
