Hi,

Sorry for being late to the party, let me try to clear some doubts about
Carrot2.

> Do you know under what circumstances or application should we cluster the
> whole corpus of documents vs just the search results?


I think it depends on what you're trying to achieve. If you'd like to give
the users some alternative way of exploring the search results by organizing
them into semantically related groups (search results clustering), Carrot2
would be the appropriate tool. Its algorithms are designed to work with
small input (up to ~1000 results) and try to provide meaningful labels for
each cluster. Currently, Carrot2 has two algorithms: an implementation of
Suffix Tree Clustering (STC, a classic in search results clustering
research, designed by O. Zamir, implemented by Dawid Weiss) and Lingo
(designed and implemented by myself). STC is very fast compared to Lingo,
but the latter will usually get you better clusters. Some comparison of the
algorithms is here: http://project.carrot2.org/algorithms.html, but
ultimately, I'd encourage you to experiment (e.g. using Clustering
Workbench). For best results, I'd recommend feeding the algorithms with
contextual snippets generated based on the user's query. If the summary
could consist of complete sentence(s) containing the query (as opposed to
individual words delimited by "..."), you should be getting even nicer
labels.
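To make that last point concrete, here is a rough sketch of what I mean by a sentence-based contextual snippet. This is my own illustrative Python, not Carrot2 code; the sentence splitting is deliberately naive:

```python
import re

def contextual_snippet(text, query, max_sentences=2):
    """Build a summary from complete sentences that contain the query,
    rather than "..."-delimited word fragments. Illustrative only."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text)
    query_re = re.compile(re.escape(query), re.IGNORECASE)
    hits = [s for s in sentences if query_re.search(s)]
    return " ".join(hits[:max_sentences])

doc = ("Carrot2 clusters search results. It ships with two algorithms. "
       "Lingo usually produces better clusters than STC. "
       "Both algorithms assign labels to clusters.")
print(contextual_snippet(doc, "clusters"))
# Carrot2 clusters search results. Lingo usually produces better clusters than STC.
```

Feeding whole sentences like these to the clustering algorithm gives it more grammatical context to work with, which tends to produce more readable cluster labels.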

One important thing for search results clustering is that it is done
on-line, so it will add extra time to each search query your server handles.
Plus, to get reasonable clusters, you'd need to fetch at least 50 documents
from your index, which may put more load on the disks as well (sometimes
clustering time may only be a fraction of the time required to get the
documents from the index).

Finally, to compare search results clustering with facets: UI-wise they may
look similar, but I'd say they're two different things that complement each
other. While the list of facets and their values is fairly static (brand
names etc.), clusters are less "stable" -- they're generated dynamically for
each search and will vary across queries. Plus, as for any other
unsupervised machine learning technique, your clusters will never be 100%
correct (as opposed to facets). Almost always you'll be getting one or two
clusters that don't make much sense.

When it comes to clustering the whole collection, it might be useful in a
couple of scenarios: a) if you wanted to get some high level overview of
what's in your collection, b) if you wanted to e.g. use clusters to
re-rank the search results presented to the user (implicit clustering:
showing a few documents from each cluster), c) if you wanted to distribute
your index based on the semantics of the documents (wild guess, I'm not sure
if anyone tried that in practice). In general, I feel clustering the whole
index is much harder than search results clustering not only because of the
different scale, but also because you'd need to tune the algorithm for your
specific needs and data. For example, in scenario a) and a collection of 1M
documents: how many top-level clusters do you generate? 10? 10000? If it's
10, the clusters may end up too general or meaningless, and it might be hard
to describe them concisely. If it's 10000, the clusters are likely to be more
focused, but hard to browse... I must admit I haven't followed Mahout too
closely, maybe there is some nice way of resolving these problems.
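As an aside, the re-ranking idea in scenario b) can be sketched very simply: interleave the top results from each cluster so the first page of results spans several topics instead of one. A minimal illustration in Python (my own sketch, assuming each cluster is just a list of result IDs already in relevance order):

```python
from itertools import zip_longest

def interleave_clusters(clusters):
    """Re-rank by taking one result from each cluster in turn
    (round-robin), so the top of the list covers multiple topics."""
    reranked = []
    for group in zip_longest(*clusters):
        # zip_longest pads shorter clusters with None; skip the padding.
        reranked.extend(doc for doc in group if doc is not None)
    return reranked

# Three clusters of result IDs, each in relevance order.
clusters = [["a1", "a2", "a3"], ["b1", "b2"], ["c1"]]
print(interleave_clusters(clusters))
# ['a1', 'b1', 'c1', 'a2', 'b2', 'a3']
```

The user never sees the clusters themselves here; they only see a result list whose top entries are more diverse, which is why this is sometimes called implicit clustering.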

If you have any other questions about Carrot2, I'll try to answer them here.
Alternatively, feel free to join Carrot2 mailing lists.

Thanks,

Staszek

--
http://www.carrot2.org
