Hi, sorry for being late to the party. Let me try to clear up some doubts about Carrot2.
> Do you know under what circumstances or application should we cluster
> the whole corpus of documents vs just the search results?

I think it depends on what you're trying to achieve. If you'd like to give users an alternative way of exploring the search results by organizing them into semantically related groups (search results clustering), Carrot2 would be the appropriate tool. Its algorithms are designed to work with small inputs (up to ~1000 results) and try to provide meaningful labels for each cluster.

Currently, Carrot2 has two algorithms: an implementation of Suffix Tree Clustering (STC, a classic in search results clustering research, designed by O. Zamir, implemented by Dawid Weiss) and Lingo (designed and implemented by myself). STC is very fast compared to Lingo, but the latter will usually get you better clusters. A comparison of the algorithms is here: http://project.carrot2.org/algorithms.html, but ultimately I'd encourage you to experiment (e.g. using Clustering Workbench). For best results, I'd recommend feeding the algorithms contextual snippets generated based on the user's query. If the summaries consist of complete sentence(s) containing the query (as opposed to individual words delimited by "..."), you should get even nicer labels. There's a rough sketch of the API calls in the PS below.

One important thing about search results clustering is that it is done on-line, so it will add extra time to each search query your server handles. Plus, to get reasonable clusters, you'd need to fetch at least 50 documents from your index, which may put more load on the disks as well (sometimes clustering time is only a fraction of the time required to get the documents from the index).

Finally, to compare search results clustering with facets: UI-wise they may look similar, but I'd say they're two different things that complement each other. While the list of facets and their values is fairly static (brand names etc.), clusters are less "stable" -- they're generated dynamically for each search and will vary across queries. Plus, as with any other unsupervised machine learning technique, your clusters will never be 100% correct (as opposed to facets). Almost always you'll get one or two clusters that don't make much sense.

When it comes to clustering the whole collection, it might be useful in a couple of scenarios:

a) if you wanted to get a high-level overview of what's in your collection,

b) if you wanted to use clusters to, e.g., re-rank the search results presented to the user (implicit clustering: showing a few documents from each cluster),

c) if you wanted to distribute your index based on the semantics of the documents (a wild guess, I'm not sure if anyone has tried that in practice).

In general, I feel clustering the whole index is much harder than search results clustering, not only because of the different scale, but also because you'd need to tune the algorithm for your specific needs and data. For example, in scenario a) with a collection of 1M documents: how many top-level clusters do you generate? 10? 10000? If it's 10, the clusters may end up too general / meaningless, and it might be hard to describe them concisely. If it's 10000, the clusters are likely to be more focused, but hard to browse... I must admit I haven't followed Mahout too closely, maybe there is some nice way of resolving these problems.

If you have any other questions about Carrot2, I'll try to answer them here. Alternatively, feel free to join the Carrot2 mailing lists.

Thanks,

Staszek

--
http://www.carrot2.org
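PS. In case a concrete starting point helps, below is a minimal sketch of driving the clustering from Java. It assumes the Carrot2 3.x Java API (Controller, Document, ProcessingResult and friends from org.carrot2.core); the input documents are made-up stand-ins for your search hits, so treat this as an illustration rather than a drop-in implementation:

import java.util.ArrayList;
import java.util.List;

import org.carrot2.clustering.lingo.LingoClusteringAlgorithm;
import org.carrot2.core.Cluster;
import org.carrot2.core.Controller;
import org.carrot2.core.ControllerFactory;
import org.carrot2.core.Document;
import org.carrot2.core.ProcessingResult;

public class ClusterSearchResults {
    public static void main(String[] args) {
        // Carrot2 documents built from the top ~50-100 search hits.
        // Ideally each summary is a contextual snippet made of full
        // sentences containing the query (the data here is made up).
        List<Document> hits = new ArrayList<Document>();
        hits.add(new Document(
            "Data mining - an overview",
            "Data mining is the process of discovering patterns in large data sets.",
            "http://example.com/1"));
        hits.add(new Document(
            "Text clustering basics",
            "Clustering organizes documents into semantically related groups.",
            "http://example.com/2"));

        // A simple in-process controller is enough for experimenting.
        Controller controller = ControllerFactory.createSimple();

        // Cluster with Lingo; to compare with STC, pass
        // org.carrot2.clustering.stc.STCClusteringAlgorithm.class instead.
        ProcessingResult result = controller.process(
            hits, "data mining", LingoClusteringAlgorithm.class);

        for (Cluster cluster : result.getClusters()) {
            System.out.println(cluster.getLabel()
                + " (" + cluster.getAllDocuments().size() + " documents)");
        }
    }
}

Swapping LingoClusteringAlgorithm for STCClusteringAlgorithm is all it takes to compare the two algorithms on the same input, which is the quickest way to see the speed/quality trade-off on your data.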