I'm investigating a problem I bet some of you have hit before, and exploring several options to address it. I suspect this specific IDF scenario is common enough that it even has a name, though I'm not sure what it would be called.
The scenario: Suppose you have a search application focused on the products that you sell:

* You initially launch with just one product area, for your "widget".
* You add a stop words list for really common words, say "a", "the", "and", etc.

Looking at the search logs, you discover:

* Other common English words being used, such as "i", "how" and "can". These are automatically getting low IDF scores, since they are also common in your documents, so normally you wouldn't worry.
* But you also notice that the word "widget" is getting a very low score, since it is mentioned in virtually every document. In fact, since "widget" is more common than the word "can", it is actually getting a *lower* IDF score than "can", giving some very unsatisfying search results.

Some queries include other "good" words like "replace" and "battery", so those searches are doing OK; those terms are relatively rare, so they properly get the most weight. But some searches have nothing but "bad" words, for example "how can I get a widget ?":

* The word "a" is stopped out.
* The words "how", "can", "i" and "get" are common English words, so they get a low score.
* And the word "widget", your flagship product, is also getting a low score, since it's in every document.

So some queries are ENTIRELY composed of "low quality" words, and the search engine gives somewhat random results.

Some ideas to address this:

Idea 1: Do nothing to IDF and just deploy the content for OTHER products ASAP; the IDF logic will automatically fix the problem. With content only about widgets, 100% of the docs contain that term. But if you deploy content for 9 more products, giving a total of 10 products, then only 1/10th of the docs will tend to contain the word "widget". Common English words, on the other hand, will continue, on average, to appear in almost all documents across all product areas, so they will continue to get low scores. So relative to the words "i" and "can", the word "widget" will automatically rise up out of the IDF muck. And when content for all 100 of your products is deployed, "widget" will have been boosted even further compared to other common English words. Though long term I have faith in this strategy, suppose I only had content ready for two other products, say "cogs" and "hookas"; then the term "widget" will get a bit of a boost, but not by as much as I'd like.

Idea 2: Compare terms in product "slices" and artificially boost them. In the Widget content there are references to related accessories like the wBracket and wCase. Whereas in the Hooka content, I rarely see mention of Widgets or wCases, but I do still see the English terms "i" and "can", and lots of stuff about Hookas.

2a: So I do a histogram for each of the 3 products, and look for terms that are common in one but not in the others.

2b: Or I do a histogram of all terms, and then a histogram of terms for each of the 3 product slices, and compare those lists.

Of course we don't want to go overboard. If somebody is in the Widget forum and asks "how do I change the widget battery?" then, although the term "widget" is more important than "how" or "do", it is NOT as important as "battery" or "change", so I would not want to boost "widget" too much.

BTW, Solr facets make this term info relatively easy to gather; thanks for the pointer, Yonik! Solr 1.4 also speeds up facets.
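[Editor's note: a rough SolrJ sketch of gathering those per-slice term histograms via faceting. The URL, the "product" and "body" field names, and the filter values are illustrative assumptions, and the client class (CommonsHttpSolrServer, from Solr 1.x era SolrJ) is different in later Solr versions.]

    import java.util.List;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.FacetField;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class SliceTermHistogram {
        public static void main(String[] args) throws Exception {
            // Solr 1.x SolrJ client; newer releases use HttpSolrClient instead.
            SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

            // One histogram per product slice, plus one for the whole corpus ("*:*").
            String[] slices = { "product:widget", "product:cog", "product:hooka", "*:*" };

            for (String slice : slices) {
                SolrQuery q = new SolrQuery("*:*");
                if (!slice.equals("*:*")) {
                    q.addFilterQuery(slice);   // restrict counts to one product area
                }
                q.setRows(0);                  // we only want facet counts, not documents
                q.setFacet(true);
                q.addFacetField("body");       // facet on the tokenized text field
                q.setFacetLimit(100);          // keep the top 100 terms per slice
                q.setFacetMinCount(1);

                QueryResponse rsp = solr.query(q);
                FacetField ff = rsp.getFacetField("body");
                System.out.println("=== " + slice + " ===");
                List<FacetField.Count> counts = ff.getValues();
                for (FacetField.Count c : counts) {
                    // getCount() = number of docs in this slice containing the term
                    System.out.println(c.getName() + "\t" + c.getCount());
                }
            }
        }
    }

Running that once per product slice and once unfiltered produces exactly the 2a and 2b lists to diff.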
Idea 3: (possibly in conjunction with 2) Compare terms by product slice, and do "complementary" IDF boosting.

In other words, it's very common for content on the Widget forum to have the terms "widget", "wBracket" and "wCase". But if a document in the "Cog" forum mentions "widget", that's actually rather interesting. Mentions of "widget" in the Cog and Hooka content should be treated as "important", regardless of how common "widget" is in the Widget forum. So we compare the common terms in the Widget, Hooka and Cog areas. Terms that are specific to Widgets get a bigger boost in the Cog and Hooka areas, Cog terms get bigger boosts in the Hooka and Widget areas, and Hooka terms get extra credit in Cogs and Widgets. This extra boosting could be done in addition to the minor boosting widget terms get in the Widget area, just to keep them above the common English terms.

Implementation of ideas 2 and 3: This "diffing" of IDF by product slices would seem to be a good idea, though a bit of coding to implement. I'd think somebody has already looked into this more formally? In terms of implementation, you could do an arbitrary cutoff at 100 terms for each product area, and then just use a Boolean for whether a term appears in the other lists or not. But after running some tests, it looks like common terms are likely to be mentioned in many lists, and it's more a question of where on each list they fall. For example, if you can interface Cogs and Widgets, then Widget content will occasionally mention Cogs, and vice versa. To me this seems like more of a Bayesian or contingency-table issue, with some weight given to the actual number of occurrences or the relative position on the lists. Any of you brainiacs have specific insights?

There's also a question of scaling the histogram. If I compare term-in-document counts in the Widget area to the entire corpus, then since there are fewer documents in the Widget area (it's a subset of the total), by definition the number of documents with the word "i" in that area will be lower. You'd think the relative order wouldn't be affected, on average, but any calculation using the occurrence counts would need to take that into account. In other words, if the word "can" appears in more of the overall documents than it does in the Widget documents, is that just because there are more documents in the overall set, or because "can" is really more dense? Simple scaling can probably handle this issue...

Even if I compare Widget to Cog documents, both of which are subsets of the total, there's likely to be a different number of documents in each area, and therefore the counts will be different. So I'll get differences in "how" and "can", in addition to "widget" and "cog", just because the sample sizes are different. Again, scaling or limiting the sample size might fix that too. Presumably the ranking of the words "i" and "can", which one is more popular than the other, would stay the same regardless of sample size and absolute counts, if I were looking at random English text. If those terms change position, that would be more interesting. However, I think randomness also plays a role here. If I do a histogram of 100 Widget documents and only 5 Cog documents, and I notice that "i" and "can" have changed places, I'd be skeptical that the difference is real, since I only had 5 Cog documents; it could just be random fluctuation. If I added 50 more, would the difference still be there? Sample size is certainly a well-established topic in statistics, though I haven't seen it applied to term histograms.
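[Editor's note: one simple way to make counts from different-sized slices comparable is to work with document frequency as a fraction of the slice size rather than raw counts. Below is a minimal, self-contained sketch of that normalization plus the "complementary" boost from idea 3. The log-ratio formula, the smoothing constant and the toy counts are illustrative assumptions, not an established algorithm.]

    import java.util.HashMap;
    import java.util.Map;

    public class ComplementaryIdf {

        // Relative document frequency: fraction of docs in a slice containing the term.
        // Normalizing by slice size lets a 100-doc Widget slice be compared to a 5-doc Cog slice.
        static double relDf(Map<String, Integer> docFreq, int sliceSize, String term) {
            Integer df = docFreq.get(term);
            return (df == null ? 0.0 : df.doubleValue()) / sliceSize;
        }

        // "Complementary" boost for a term scored inside one slice ("home"):
        // a term that is dense in another slice but sparse here gets a positive boost,
        // e.g. "widget" mentioned in the Cog forum. The log-ratio is an arbitrary choice.
        static double complementaryBoost(double homeRelDf, double otherRelDf) {
            double smoothing = 0.001;   // avoid divide-by-zero and log(0)
            return Math.log((otherRelDf + smoothing) / (homeRelDf + smoothing));
        }

        public static void main(String[] args) {
            // Toy per-slice document frequencies (number of docs containing each term).
            Map<String, Integer> widgetDf = new HashMap<>();
            widgetDf.put("widget", 98);  widgetDf.put("can", 60);  widgetDf.put("battery", 12);
            int widgetDocs = 100;

            Map<String, Integer> cogDf = new HashMap<>();
            cogDf.put("cog", 5);  cogDf.put("can", 3);  cogDf.put("widget", 1);
            int cogDocs = 5;

            for (String term : new String[] { "widget", "can", "battery" }) {
                double inCog = relDf(cogDf, cogDocs, term);
                double inWidget = relDf(widgetDf, widgetDocs, term);
                // Score the term as it appears in a Cog-forum document:
                // "widget" gets a positive boost, "can" stays near zero.
                System.out.printf("%-8s relDf(cog)=%.2f relDf(widget)=%.2f boost-in-cog=%.2f%n",
                        term, inCog, inWidget, complementaryBoost(inCog, inWidget));
            }
        }
    }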
And another odd thing about histograms taken from different sample sizes is the low-frequency terms. In all search engine histograms you'll see terms that appear only 1 time, and there'll be a bunch of them. But presumably if I doubled the number of documents in the sample, some of those terms might then appear twice, whereas others would still only appear once. There are other breaks in histogram counts, like when a count goes from 2 to 1, that would presumably also shift around a bit. But I'm not sure you can reliably use sample size ratios to scale these things with confidence. This feels like Poisson distribution territory... I suspect all of these questions have been examined in detail before; I need to dig up my copy of that "Long Tail" book... Wondering if anybody has specific links or terminology / author names to suggest?

There are certainly other ideas for indirectly addressing the IDF problem by using other methods to compensate:

4: You could overlay user reviews / popularity, click-through rates or other social factors.

5: Use flexible proximity searches, in addition to phrase matching, to boost certain matches. Verity used to call this the "near" operator; Lucene/Solr talk about span queries and "phrase slop". So in a question containing "how can I", documents with that exact phrase get a substantial boost, but documents where "how", "can" and "I" appear within 5 words of each other, in any order, also get a small boost. It's been suggested that this can help with Q&A type applications.

6: FAST's documentation mentions that verbs can be more important in question-and-answer type applications. Normally nouns are said to contain 60-70% of the relevance of a query (reference?), but their linguistics guide suggests verbs are more important in Q&A apps.

7: Another idea would be to boost word pairs using composite N-gram tokens, what Lucene and Solr call "shingles". Using my previous examples:

Question: "How can I get a widget ?"
Normal index: "how", "can", "i", "get", "widget"
Shingles: "how_can", "can_i", "i_get", "get_widget"

Question: "How do I change the widget battery?"
Normal index: "how", "do", "i", "change", "widget", "battery"
Shingles: "how_do", "do_i", "i_change", "change_widget", "widget_battery"

Both sets of tokens would be added to the index, in different fields, and possibly with different boosts. The idea being that tokens like "can_i" might be somewhat more statistically significant than "can" and "i" by themselves. So at least in the first question, which has all low-quality words, the "can_i" might help a bit.

I'd appreciate any comments on these ideas from y'all, or perhaps names of specific algorithms / authors.

--
Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
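[Editor's note: as a postscript to idea 7, a minimal sketch of generating those shingle tokens with Lucene's ShingleFilter. The analysis API shown follows recent Lucene releases (constructors and setReader() differ in older versions); the underscore separator is set only to match the examples above, and no stopword or lowercase filtering is applied, so "a" still shows up in the output.]

    import java.io.StringReader;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.shingle.ShingleFilter;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class ShingleDemo {
        public static void main(String[] args) throws Exception {
            String question = "how can i get a widget";

            // Whitespace tokenization keeps the example simple; a real shingle field
            // would reuse the normal field's analyzer chain, minus stopword removal.
            Tokenizer tokens = new WhitespaceTokenizer();
            tokens.setReader(new StringReader(question));

            // Emit only the 2-word shingles; unigrams already live in the normal field.
            ShingleFilter shingles = new ShingleFilter(tokens, 2, 2);
            shingles.setOutputUnigrams(false);
            shingles.setTokenSeparator("_");

            CharTermAttribute term = shingles.addAttribute(CharTermAttribute.class);
            shingles.reset();
            while (shingles.incrementToken()) {
                // prints: how_can, can_i, i_get, get_a, a_widget ("a" is not stopped here)
                System.out.println(term.toString());
            }
            shingles.end();
            shingles.close();
        }
    }

In Solr this would normally be wired up as a separate field type using ShingleFilterFactory in the schema, with the shingle field given its own boost at query time.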