Thanks Grant. *** mlb: comments inline
On Tue, Aug 11, 2009 at 12:40 PM, Grant Ingersoll <gsing...@apache.org> wrote:

> Inline...
>
> On Aug 11, 2009, at 12:44 PM, Mark Bennett wrote:
>
>> I'm going somewhere with this... be patient. :-) I had asked about this
>> briefly at the SF meetup, but there was a lot going on.
>>
>> 1: Suppose you had Solr 1.4 with all the Carrot^2 DOCUMENT clustering in,
>> and you had built the cluster index for all your docs.
>>
>> 2: Then, if you had a particular cluster, and one of the docs in that
>> cluster happened to be your search, then the other documents in the
>> cluster could be considered the results. In effect, the cluster is like
>> the search results.
>>
>> 3: Now imagine you can take an arbitrary doc and find the clusters that
>> document is in. (Some clustering engines let you do this.)
>>
>> 4: And then imagine that, when somebody submits a search, you quickly
>> turn it into a document, add it to the index, redo the clusters, find
>> the clusters this new temp doc is in, and use that as the results.
>
> I guess I'd argue that this is already what Lucene does, except for the
> part about adding the query into the document set. The Lucene Query is
> just your arbitrary document. Really, the primary difference as I see it,
> I think, is that you want the Carrot2 scoring mechanism instead of the
> existing Lucene one, no? Otherwise, I don't see much benefit to actually
> indexing the query, other than that it could potentially be used to skew
> results over time as people ask the same queries over and over again.

*** mlb: Yes, this is essentially what I'm suggesting. Carrot2 has several
pluggable algorithms to choose from, though I have no evidence that they're
"better" than Lucene's. Where TF/IDF is sort of a one-step algebraic
calculation, some clustering algorithms use iterative approaches, etc.

> Under a certain lens, couldn't you just argue that search is finding all
> the docs that cluster around your query?
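*** mlb: To make the "search is clustering around your query" framing concrete, here's a toy sketch (purely illustrative -- Python rather than the actual Lucene/Carrot2 Java APIs, and all document names and text are made up) of turning the query into just another document vector and ranking by TF/IDF cosine similarity, the shared "math underneath":

```python
# Toy sketch (hypothetical, not Lucene/Carrot2 code): the query becomes a
# document vector and the "results" are its nearest neighbours in TF/IDF space.
import math
from collections import Counter

docs = {
    "d1": "solr lucene search index query",
    "d2": "carrot2 clustering document cluster algorithm",
    "d3": "lucene query scoring tf idf relevancy",
}

# document frequencies over the corpus
df = Counter(t for text in docs.values() for t in set(text.split()))
n = len(docs)

def tfidf_vector(text):
    """TF/IDF weights for terms that appear in the corpus vocabulary."""
    tf = Counter(text.split())
    return {t: c * math.log(n / df[t]) for t, c in tf.items() if t in df}

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

vectors = {name: tfidf_vector(text) for name, text in docs.items()}

# The query is "turned into a document" and scored against the rest:
query_vec = tfidf_vector("lucene query relevancy")
ranked = sorted(docs, key=lambda name: cosine(query_vec, vectors[name]),
                reverse=True)
print(ranked[0])  # the doc "clustered" closest to the query
```

The one-shot version shown here is essentially the algebraic TF/IDF step; an iterative clustering algorithm would instead refine group assignments over several passes before reading off the query's neighbours.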
> (I know that isn't the traditional description, but regardless, the math
> underneath is often very similar.)

*** mlb: Yes, exactly. And so the question is: might some of these other
methods work better for certain applications, certain vocabularies, etc.?
So I guess it's about flexibility. Though you can plug in your own
Similarity class, that's still the one-shot algebraic model, regardless of
the specific formulas. Some of the newer machine learning algorithms have
other tricks up their sleeves that might fit some usage models better.

>> Benefits?
>>
>> I'm not saying this would be practical, but would it be useful? Or, in
>> particular, would it be more useful than the normal Solr/Lucene
>> relevancy? As I recall, Carrot^2 had 3 choices for clustering.
>>
>> And let's assume that the searches coming in are more than the 1.4-word
>> average. Maybe a few sentences or something. I'm not sure a 1-word query
>> would really benefit from this. :-)
>>
>> Some clustering algorithms don't allow you to find a cluster containing
>> a specific document, so those wouldn't work as a "search engine".
>>
>> More Like This as a "cluster" search?
>>
>> A similar scenario could be made for the "more like this" feature. Take
>> a user's search text (presumably lengthy), quickly index it, then use
>> that new temp doc as a MLT seed doc. I haven't looked deep into the
>> code; it might be that it uses essentially the same relevancy as a query.
>
> Again, I don't see the benefit of indexing it. You slightly perturb the
> corpus statistics, but other than that, how is it different from just
> submitting the query and getting back the results?

*** mlb: Yeah, actually I'm not wild about changing the index for the sake
of processing a search. And looking at MLT, they claim you can send in a
stream, so no need to update the index.
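*** mlb: The stream idea is roughly this (a hypothetical sketch in Python, not the actual Lucene MoreLikeThis code -- the corpus, term weighting, and helper names are all made up for illustration): pull the most distinctive terms out of the user's text and run them as an ordinary query, never touching the index:

```python
# Sketch (hypothetical): MLT-style "send in a stream". Extract the top
# tf*idf terms from free text, then use them as a plain query -- the index
# and its statistics are never modified.
import math
from collections import Counter

index = {
    "d1": "solr lucene search index query",
    "d2": "carrot2 clustering cluster algorithm document",
    "d3": "lucene scoring tf idf relevancy query",
}
df = Counter(t for text in index.values() for t in set(text.split()))
n = len(index)

def interesting_terms(stream, k=3):
    """Top-k terms of the stream by tf*idf weight, ignoring terms the
    corpus has never seen (they can't match anything anyway)."""
    tf = Counter(stream.split())
    scored = {t: c * math.log(n / df[t]) for t, c in tf.items() if t in df}
    return sorted(scored, key=scored.get, reverse=True)[:k]

def search(terms):
    """Rank indexed docs by how many of the seed terms they contain."""
    overlap = {name: sum(t in text.split() for t in terms)
               for name, text in index.items()}
    return sorted(index, key=overlap.get, reverse=True)

# The user's lengthy text acts as the MLT seed, streamed in, never indexed:
stream = "long user text about lucene relevancy scoring and tf idf weighting"
seed = interesting_terms(stream)
print(search(seed)[0])
```

So the corpus statistics stay untouched, which addresses the perturbation concern above: the stream only reads the existing df counts, it never writes to them.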