Hi,

On Thu, Aug 13, 2009 at 19:29, Mark Bennett <mbenn...@ideaeng.com> wrote:
> There are comments in the Solr materials about having an option to cluster
> based on the entire document set, and some warning about this being atypical
> and possibly slow. And from what you're saying, for a big enough docset, it
> might go from "slow" to "impossible", I'm not sure.

For Carrot2, it would go to "impossible" I'd say. But as Grant mentioned
earlier, Mahout is developing clustering algorithms that should be able to
handle the whole-index types of docsets.

> And so my question was, *if* you were willing to spend that much time and
> effort to cluster all the text of all the documents (and if it were even
> possible), would the result perform better than the standard TF/IDF
> techniques?

Depends on the algorithm, really. In the case of Carrot2, we don't do
re-ranking of documents within clusters, we simply use whatever document
order we got on input. As far as I'm aware, most clustering algorithms do
pretty much the same: they concentrate on finding groups of documents and
don't delve much into the issues of ranking documents within clusters.

> In the application I'm considering, the queries tend to be longer than
> average, more like full sentences or more. And they tend to be of a
> question and answer nature. I've seen references in several search engines
> that QandA search sometimes benefits from alternative search techniques.
> And, from a separate email, the IDF part of the standard similarity may be
> causing a problem, so I'm casting a wide net for other ideas. Just
> brainstorming here... :-)

Because of what I described above, clustering the whole index may not give
you the best results. But you can try something different. You could try
fetching a bunch (100-500) of more or less relevant documents for the
question (MLT should be fine to start with), add your question as an extra
document, perform clustering and see where the question-document ends up.

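To make the idea concrete, here is a rough, self-contained sketch of the "add the question as an extra document" experiment. It does not use Carrot2 or Solr at all: it assumes the candidate documents have already been fetched (e.g. via MLT) and swaps in a very simple single-pass "leader" clustering over TF-IDF vectors as a stand-in for a real clustering algorithm. All function names and the 0.2 similarity threshold are made up for illustration, and a question that ends up alone in its own cluster plays roughly the role of landing in Other Topics.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(texts):
    """Return one L2-normalised TF-IDF vector (term -> weight) per text."""
    token_lists = [tokenize(t) for t in texts]
    n = len(texts)
    # Document frequency: in how many texts each term occurs.
    df = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        weights = {
            term: count * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in Counter(tokens).items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two already-normalised sparse vectors."""
    return sum(w * b.get(term, 0.0) for term, w in a.items())


def leader_clusters(vectors, threshold):
    """Single-pass clustering (a stand-in, NOT what Carrot2 does).

    Each vector joins the first cluster whose first member ("leader") is
    at least `threshold` similar to it; otherwise it starts a new cluster.
    """
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def question_cluster_mates(question, documents, threshold=0.2):
    """Cluster documents plus the question; return the indices of the
    documents sharing the question's cluster (empty if it is alone)."""
    # The question goes last so clusters form from the documents first.
    texts = list(documents) + [question]
    question_index = len(texts) - 1
    for cluster in leader_clusters(tfidf_vectors(texts), threshold):
        if question_index in cluster:
            return [i for i in cluster if i != question_index]
    return []


if __name__ == "__main__":
    docs = [
        "solr clustering component groups search results",
        "carrot2 clustering groups search results by topic",
        "chocolate cake recipe with butter and sugar",
        "butter cake recipe with vanilla sugar",
    ]
    question = "how does clustering of search results work"
    print(question_cluster_mates(question, docs))
```

The returned indices are the documents you would then read to see whether any of them answers the question. With a real clustering engine the grouping would of course differ; only the workflow is the same.
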
If the question-document doesn't end up in the Other Topics cluster, you
could examine whether the other documents from its cluster give an answer to
the question. In this scenario, Carrot2 should be fine, at least
performance-wise. I've not followed the QA literature very closely, so it's
hard to say what the results would be quality-wise, but it should be very
quick to try. The Carrot2 Clustering Workbench [1][2] may come in handy for
the experiments too.

S.

[1] http://download.carrot2.org/head/manual/#section.workbench
[2] http://download.carrot2.org/head/manual/#section.getting-started.xml-files