With regard to my second question, about More Like This, I do see: "The MoreLikeThisHandler can also use a ContentStream to find similar documents. It will extract the "interesting terms" from the posted text." at http://wiki.apache.org/solr/MoreLikeThisHandler and it appears to pick those terms using TF/IDF weighting.
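For anyone who wants to try the ContentStream route, here is a minimal sketch of how the request URL might be built. The base URL, the `/mlt` handler path, and the field names `title` and `body` are assumptions for illustration; the handler has to be registered in solrconfig.xml, and depending on the Solr version `stream.body` may need to be enabled. `build_mlt_url` is a hypothetical helper, not part of Solr.

```python
from urllib.parse import urlencode


def build_mlt_url(base_url, text, fields, min_tf=1, min_df=1, rows=10):
    """Build a MoreLikeThisHandler URL that posts free text as a content stream.

    base_url and the "/mlt" path are assumptions; adjust to your own setup.
    """
    params = {
        "stream.body": text,             # the user's (lengthy) search text
        "mlt.fl": ",".join(fields),      # fields used to judge similarity
        "mlt.mintf": min_tf,             # minimum term frequency in the posted text
        "mlt.mindf": min_df,             # minimum document frequency in the index
        "mlt.interestingTerms": "list",  # ask Solr to report the terms it picked
        "rows": rows,
    }
    return base_url.rstrip("/") + "/mlt?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical host and field names, for illustration only.
    print(build_mlt_url(
        "http://localhost:8983/solr",
        "a few sentences describing what the user is looking for",
        ["title", "body"],
    ))
```

The low `mlt.mintf` and `mlt.mindf` defaults here are a guess at what suits short posted text, since a term in a few sentences rarely repeats; they would need tuning against a real index.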
Still wondering if anybody's tried MLT or Carrot clustering as a primary search entry point.

On Tue, Aug 11, 2009 at 9:44 AM, Mark Bennett <mbenn...@ideaeng.com> wrote:
> I'm going somewhere with this... be patient. :-) I had asked about this
> briefly at the SF meetup, but there was a lot going on.
>
> 1: Suppose you had Solr 1.4 and all the Carrot^2 DOCUMENT clustering was
> all in, and you had built the cluster index for all your docs.
>
> 2: Then, if you had a particular cluster, and one of the docs in that
> cluster happened to be your search, then the other documents in the
> cluster could be considered the results. In effect, the cluster is like
> the search results.
>
> 3: Now imagine you can take an arbitrary doc and find the clusters that
> document is in. (some clustering engines let you do this).
>
> 4: And then imagine that, when somebody submits a search, you quickly
> turn it into a document, add it to the index, redo the clusters, find the
> clusters this new temp doc is in, and use that as the results.
>
> Benefits?
>
> I'm not saying this would be practical, but would it be useful? Or, in
> particular, would it be more useful than the normal Solr/Lucene
> relevancy? As I recall Carrot^2 had 3 choices for clustering.
>
> And let's assume that the searches coming in are more than the 1.4 words
> average. Maybe a few sentences or something. I'm not sure a 1 word query
> would really benefit from this. :-)
>
> Some clustering algorithms don't allow you to find a cluster containing a
> specific document, so those wouldn't work as a "search engine".
>
> More Like This as a "cluster" search?
>
> A similar scenario could be made for the "more like this" feature. Take a
> user's search text (presumably lengthy), quickly index it, then use that
> new temp doc as a MLT seed doc. I haven't looked deep into the code, it
> might be that it uses essentially the same relevancy as a query.
>
> --
> Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
> Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
>
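
To make steps 1 to 4 of the quoted message concrete, here is a toy sketch in plain Python. It is not Carrot^2 and does not touch Solr: the clusters are assumed to be precomputed, and instead of re-clustering with the temp doc added (step 4), it assigns the query text to the nearest cluster centroid by cosine similarity over raw term counts. All names (`cluster_search`, `docs`, `clusters`) are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts; a stand-in for real analysis/tokenizing."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def cluster_search(query_text, docs, clusters):
    """Treat the query as a temp doc and return the cluster it falls into.

    docs:     {doc_id: text}
    clusters: {label: [doc_id, ...]}, assumed to be built beforehand
    Returns (label, doc_ids ranked by similarity to the query).
    """
    query_vec = vectorize(query_text)
    best_label, best_score = None, 0.0
    for label, doc_ids in clusters.items():
        centroid = Counter()
        for doc_id in doc_ids:
            centroid.update(vectorize(docs[doc_id]))  # sum of member vectors
        score = cosine(query_vec, centroid)
        if score > best_score:
            best_label, best_score = label, score
    if best_label is None:
        return None, []  # query shares no terms with any cluster
    ranked = sorted(
        clusters[best_label],
        key=lambda d: cosine(query_vec, vectorize(docs[d])),
        reverse=True,
    )
    return best_label, ranked
```

Note that every member of the winning cluster comes back as a result, including docs that share no terms with the query, which is the main way this differs from a normal relevancy-ranked search.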