As soon as I started reading your message I started thinking "common grams", so that is what I would try first, esp. since somebody already did the work of porting that from Nutch to Solr (see Solr JIRA).

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
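To make the suggestion concrete: a common-grams filter keeps the normal unigrams, but wherever a very common word touches a neighbor it also emits the glued pair as an extra token, so a query made entirely of "low quality" words still produces some selective terms. Here's a rough Python sketch of that behavior (illustration only: the word list is invented, and the underscore separator is my assumption, not a promise about the Solr filter's exact output):

    # Sketch of a common-grams filter: keep each unigram, and wherever
    # a very common word touches a neighbor, also emit the glued pair.
    # "can_i" is far rarer, and so far more selective, than "can" or
    # "i" on their own.
    def common_grams(tokens, common_words, sep="_"):
        out = []
        for i, tok in enumerate(tokens):
            out.append(tok)                       # keep the original unigram
            if i + 1 < len(tokens):
                nxt = tokens[i + 1]
                if tok in common_words or nxt in common_words:
                    out.append(tok + sep + nxt)   # glue the pair
        return out

    COMMON = set(["how", "can", "i", "get", "a", "the", "do"])
    print(common_grams("how can i get a widget".split(), COMMON))
    # ['how', 'how_can', 'can', 'can_i', 'i', 'i_get', 'get',
    #  'get_a', 'a', 'a_widget', 'widget']

If I remember right, the query-side variant can go even further and keep only the pairs where one was formed, which is exactly what rescues an all-common-words query.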
----- Original Message ----
> From: Mark Bennett <mbenn...@ideaeng.com>
> To: solr-user@lucene.apache.org
> Sent: Thursday, August 6, 2009 2:27:52 PM
> Subject: Revisiting IDF Problems and Index Slices
>
> I'm investigating a problem I bet some of you have hit before, and exploring several options to address it. I suspect that this specific IDF scenario is common enough that it even has a name, though I'm not sure what it would be called.
>
> The scenario:
>
> Suppose you have a search application focused on the products that you sell:
> * You initially launch with just one product area for your "widget".
> * You add a stop words list for really common words, say "a", "the", "and", etc.
>
> Looking at the search logs you discover:
> * Other common English words are being used, such as "i", "how" and "can". These automatically get low IDF scores, since they are also common in your documents, so normally you wouldn't worry.
> * But you also notice that the word "widget" is getting a very low score, since it is mentioned in virtually every document. In fact, since "widget" is more common than the word "can", it actually gets a *lower* IDF score than "can", giving some very unsatisfying search results.
>
> Some queries include other "good" words like "replace" and "battery", so those searches are doing OK; those terms are relatively rare, so they properly get the most weight.
>
> But some searches have nothing but "bad" words, for example "how can I get a widget?"
> * The word "a" is stopped out.
> * The words "how", "can", "i" and "get" are common English words, so they get a low score.
> * And the word "widget", your flagship product, also gets a low score, since it's in every document.
> So some queries are ENTIRELY composed of "low quality" words, and the search engine gives somewhat random results.
>
> Some ideas to address this:
>
> Idea 1: Do nothing to IDF and just deploy the content for OTHER products ASAP; the IDF logic will automatically fix the problem.
>
> With content only about widgets, 100% of the docs contain that term.
>
> But if you deploy content for 9 more products, giving a total of 10 products, then only 1/10th of the documents will tend to have the word "widget". Common English words, though, will continue, on average, to appear in almost all documents, across all product areas, so they will continue to get low scores. Relative to the words "i" and "can", the word "widget" will automatically rise up out of the IDF muck.
>
> And when content for all 100 of your products is deployed, "widget" will be boosted even further compared to other common English words.
>
> Though long term I have faith in this strategy, suppose I only had content ready for two other products, say "cogs" and "hookas"; the term "widget" would get a bit of a boost, but not by as much as I'd like.
>
> Idea 2: Compare terms in product "slices" and artificially boost them.
>
> In the Widget content there are references to related accessories like the wBracket and wCase.
>
> Whereas in the Hooka content, I rarely see mention of Widgets or wCases, but I do still see the English terms "i" and "can", and lots of stuff about Hookas.
>
> 2a: So I do a histogram for each of the 3 products, and look for terms that are common in one but not in the others.
>
> 2b: Or I do a histogram of all terms, and then a histogram of terms for each of the 3 product slices. And I compare those lists.
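A quick aside on 2a/2b: here is a toy Python sketch of that comparison. In practice the per-slice document frequencies could be pulled with facet counts (something like q=*:*&fq=product:widget&facet=true&facet.field=body&facet.limit=100, one query per slice); the field name, the counts, and the 5x threshold below are all invented for illustration.

    # Sketch of ideas 2a/2b: compare per-slice document-frequency
    # histograms and flag terms that are much denser in one product
    # slice than in every other slice.
    def distinctive_terms(slice_df, slice_ndocs, ratio=5.0):
        picks = {}
        for s in slice_df:
            n = float(slice_ndocs[s])
            picks[s] = []
            for term, df in slice_df[s].items():
                home = df / n              # relative df in the home slice
                # add-one smoothing so unseen terms don't divide to zero
                elsewhere = max(
                    (slice_df[o].get(term, 0) + 1.0) / (slice_ndocs[o] + 1)
                    for o in slice_df if o != s)
                if home >= ratio * elsewhere:
                    picks[s].append(term)
        return picks

    df = {
        "widget": {"widget": 95, "wbracket": 40, "can": 60, "i": 70},
        "cog":    {"cog": 90,    "widget": 3,    "can": 55, "i": 66},
        "hooka":  {"hooka": 88,  "widget": 2,    "can": 57, "i": 69},
    }
    ndocs = {"widget": 100, "cog": 100, "hooka": 100}
    print(distinctive_terms(df, ndocs))
    # {'widget': ['widget', 'wbracket'], 'cog': ['cog'], 'hooka': ['hooka']}
    # "can" and "i" are not flagged: they are about equally dense in
    # every slice, so they look like ordinary English rather than
    # product vocabulary.

The flagged terms are exactly the candidates for idea 3 below: boost them a little in their home slice and a lot when they show up in someone else's slice. The add-one smoothing is just the cheapest way to avoid division by zero; a more principled weighting is the contingency-table question raised further down.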
> Of course we don't want to go overboard. If somebody is in the Widget forum and asks "how do I change the widget battery?" then, although the term "widget" is more important than "how" or "do", it is NOT as important as "battery" or "change", so I would not want to boost "widget" too much.
>
> BTW, Solr facets make this term info relatively easy to gather, thanks for the pointer Yonik! 1.4 also speeds up facets.
>
> Idea 3: (possibly in conjunction with 2) Compare terms by product slice, and do "complementary" IDF boosting.
>
> In other words, it's very common for content on the Widget forum to have the terms "widget", "wBracket" and "wCase".
>
> But if a document in the "Cog" forum mentions "widget", that's actually rather interesting. Mentions of "widget" in the Cog and Hooka content should be treated as "important", regardless of how common "widget" is in the Widget forum.
>
> So we compare the common terms in the Widget, Hooka and Cog areas. Terms that are specific to Widgets get a bigger boost in the Cog and Hooka areas, Cog terms get bigger boosts in the Hooka and Widget areas, and Hooka terms get extra credit in Cogs and Widgets.
>
> This extra boosting could be done in addition to the minor boosting widget terms get in the Widget area, just to keep them above the common English terms.
>
> Implementation of ideas 2 and 3:
>
> This "diffing" of IDF by product slices would seem to be a good idea, though a bit of coding to implement. I'd think somebody has already looked into this more formally?
>
> In terms of implementation, you could use an arbitrary cutoff of 100 terms for each product area, and then just use a Boolean for whether a term appears in the other lists or not. But after running some tests, common terms are likely to be mentioned in many lists, and it's more a question of where on the list they appear. For example, if Cogs and Widgets can interface with each other, then Widget content will occasionally mention Cogs, and vice versa.
>
> To me this seems like more of a Bayesian or contingency-table issue, with some weight given to the actual number of occurrences or the relative position on the lists. Any of you brainiacs have specific insights?
>
> There's also a question of scaling the histogram. If I compare term-in-document counts in the Widget area to the entire corpus, then since there are fewer documents in the Widget area (it's a subset of the total), by definition the number of documents with the word "i" in that area will be lower. You'd think the relative order wouldn't be affected, on average, but any calculation using the occurrence counts would need to take that into account. In other words, if the word "can" appears in more documents overall than it does in the Widget documents, is that just because there are more documents in the overall set, or because "can" is really more dense? Simple scaling can probably handle this issue...
>
> Even if I compare Widget to Cog documents, both of which are subsets of the total, there's likely to be a different number of documents in each area, and therefore the counts would be different. So I'll get differences in "how" and "can", in addition to "widget" and "cog", just because the sample sizes are different. Again, scaling or limiting the sample size might fix that too.
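On the scaling point just above, the simple version is easy to sketch: divide each term's document count by the number of documents in the sample, so slices of different sizes become comparable. All the numbers below are invented:

    # Raw df counts from samples of different sizes aren't comparable;
    # relative df (df / docs-in-sample) is.
    def rel_df(df, ndocs):
        return dict((t, c / float(ndocs)) for t, c in df.items())

    corpus = {"can": 540, "i": 620, "widget": 110}   # whole index: 1000 docs
    widget = {"can": 60,  "i": 70,  "widget": 95}    # Widget slice: 100 docs

    print(sorted(rel_df(corpus, 1000).items()))
    # [('can', 0.54), ('i', 0.62), ('widget', 0.11)]
    print(sorted(rel_df(widget, 100).items()))
    # [('can', 0.6), ('i', 0.7), ('widget', 0.95)]

The raw 540-vs-60 gap for "can" disappears after scaling (0.54 vs 0.6, about the same density), while "widget" stays nearly 9x denser in its own slice: a real difference, not a sample-size artifact. What scaling does not fix is the small-sample randomness raised next; with only 5 Cog documents, the scaled estimates themselves are shaky.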
> Presumably the ranking of the words "i" and "can" (which one is more popular than the other) would stay the same regardless of sample size and absolute counts, if I were looking at random English text. If those terms change position, that would be more interesting.
>
> However, I think randomness also plays a role here. If I do a histogram of 100 Widget documents and only 5 Cog documents, and I notice that the relative positions of "i" and "can" have changed places, I might suspect it's just a random fluctuation, because I only had 5 Cog documents. If I added 50 more, would the difference still be there?
>
> Sample size is certainly a well-established topic in statistics, though I haven't seen it applied to histograms.
>
> Another odd thing about histograms taken from different sample sizes is the low-frequency terms. In all search engine histograms you'll see terms that appear only 1 time, and there'll be a bunch of them. But presumably if I doubled the number of documents in the sample, some of those terms might then appear twice, whereas others would still only appear once. There are other breaks in histogram counts, like when a count goes from 2 to 1, that would presumably also shift around a bit. But I'm not sure you can reliably use sample size ratios to scale these things with confidence. This feels like Poisson distribution territory...
>
> I suspect all of these questions have been examined in detail before; I need to dig up my copy of that "Long Tail" book... Wondering if anybody has specific links or verbiage / author names to suggest?
>
> There are certainly other ideas for indirectly addressing the IDF problem by using other methods to compensate:
>
> 4: You could overlay user reviews / popularity or click-through rates or other social factors.
>
> 5: Use flexible proximity searches, in addition to phrase matching, to boost certain matches. Verity used to call this the "near" operator; Lucene/Solr talk about span queries and "phrase slop" (e.g. "how can i"~5). (There's a toy illustration at the very end of this message.)
>
> So in a question containing "how can I", documents with that exact phrase get a substantial boost, but documents where "how", "can" and "I" appear within 5 words of each other, in any order, also get a small boost. It's been suggested that this can help with Q&A type applications.
>
> 6: FAST's documentation mentions that verbs can be more important in question-and-answer type applications. Normally nouns are said to contain 60-70% of the relevance of a query (reference?), but their linguistics guide suggests verbs are more important in Q and A apps.
>
> 7: Another idea would be to boost word pairs using composite N-gram tokens, what Lucene and Solr call "shingles". (See the sketch just below.)
>
> Using my previous examples:
> Question: "How can I get a widget?"
>     Normal index: "how", "can", "i", "get", "widget"
>     Shingles: "how_can", "can_i", "i_get", "get_widget"
> Question: "How do I change the widget battery?"
>     Normal index: "how", "do", "i", "change", "widget", "battery"
>     Shingles: "how_do", "do_i", "i_change", "change_widget", "widget_battery"
>
> Both sets of tokens would be added to the index, in different fields, and possibly with different boosts.
>
> The idea is that tokens like "can_i" might be somewhat more statistically significant than "can" and "i" by themselves. So at least in the first question, which has all low-quality words, "can_i" might help a bit.
>
> I'd appreciate any comments on these ideas from y'all, or perhaps names of specific algorithms / authors.
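Here is the sketch promised under idea 7. The output matches the shingle lists above; in Solr this would presumably be a shingle filter on a separate field (the two-token window and the underscore separator are assumptions chosen to match the examples, not a statement about default settings):

    # Bigram shingles over an already-analyzed token stream, matching
    # the "how_can", "can_i" examples under idea 7.
    def shingles(tokens, sep="_"):
        return [a + sep + b for a, b in zip(tokens, tokens[1:])]

    q = ["how", "can", "i", "get", "widget"]   # "a" already stopped out
    print(shingles(q))
    # ['how_can', 'can_i', 'i_get', 'get_widget']

Indexing these into their own field with its own boost keeps them from distorting normal unigram scoring, which fits the "different fields, possibly different boosts" plan.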
>
> --
> Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
> Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
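And the toy illustration promised under idea 5. Real span and slop queries are more sophisticated than this (position-based, with closer matches scoring higher); the sliding-window check below only shows the matching idea, and all the data in it is made up:

    # Crude unordered proximity: do all the query terms occur together
    # inside some window of `window` consecutive tokens?
    def near(tokens, terms, window=5):
        terms = set(terms)
        w = min(window, len(tokens))
        for i in range(len(tokens) - w + 1):
            if terms <= set(tokens[i:i + w]):
                return True
        return False

    print(near("you can learn how i built it".split(), ["how", "can", "i"]))
    # True: "can", "how" and "i" all fall inside one 5-word window
    print(near("how to get a widget can opener i bought".split(),
               ["how", "can", "i"]))
    # False: all three terms are present, but never within 5 words
    # of each other

A document passing this looser test would get a small boost on top of whatever the exact phrase match contributes, per the Q&A suggestion in the message.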