I'm investigating a problem I bet some of you have hit before, and
exploring
several options to address it. I suspect that this specific IDF
scenario is
common enough that it even has a name, though I'm not sure what it would be
called.
The scenario:
Suppose you have a search application focused on the products that
you sell:
* You initially launch with just one product area for your "widget".
* You add a stop words list for really common words, say "a", "the",
"and",
etc.
Looking at the search logs you discover:
* Other common English words being used, such as "i", "how" and "can".
These are automatically getting low IDF scores, since they are also
common
in your documents, so normally you wouldn't worry.
* But you also notice that the word "widget" is getting a very low
score,
since it is mentioned in virtually every document. In fact, since
widget is
more common than the word "can", it is actually getting a *lower*
IDF score
than the word "can", giving some very unsatisfying search results.
Some queries include other "good" words like "replace" and
"battery", so
those searches are doing OK; those terms are relatively rare, so they
properly get the most weight.
But some searches have nothing but "bad" words, for example "how can
I get a
widget ?"
* The word "a" is stopped out.
* The words "how", "can", "i" and "get" are common English words, so
they
get a low score.
* And the word "widget", your flagship product, is also getting a
low score,
since it's in every document.
So some queries are ENTIRELY composed of "low quality" words, and
the search
engine gives somewhat random results.
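To make the scoring issue concrete, here's a quick sketch of the classic
IDF formula (log of total docs over docs containing the term). All of the
document counts are made up, just to show how "widget" can end up scoring
below "can":

    import math

    N = 10000                       # total documents (invented number)
    doc_freq = {"can": 7000,        # common English word, in ~70% of docs
                "widget": 9800,     # the flagship product, in nearly every doc
                "battery": 400}     # a genuinely discriminating term

    for term, df in doc_freq.items():
        idf = math.log(N / df)      # classic idf = log(N / df)
        print(term, round(idf, 3))

    # "widget" comes out at ~0.02, below "can" at ~0.36, even though
    # "widget" is the term the user actually cares about.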
Some ideas to address this:
Idea 1: Do nothing to IDF and just deploy the content for OTHER
products
ASAP; the IDF logic will automatically fix the problem.
With content only about widgets, 100% of the docs contain that word.
But if you deploy content for 9 more products, giving a total of 10
products, then only 1/10th of them will tend to have the word
"widget". But
common English words will continue, on average, to appear in almost
all
documents, across all product areas. Therefore they will continue
to get
low scores. So relative to the words "i" and "can", the word
"widget" will
automatically rise up out of the IDF muck.
And when content for all 100 of your products is deployed, "widget" will
have been boosted even further compared to other common English words.
Though long term I have faith in this strategy, suppose I only have content
ready for two other products, say "cogs" and "hookas"; then the term "widget"
will get a bit of a boost, but not by as much as I'd like.
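Here's a rough back-of-the-envelope sketch of Idea 1, assuming every product
area is about the same size, "widget" stays confined to its own area, and
"can" stays spread across everything; all the counts are invented:

    import math

    docs_per_product = 1000         # assume each product area is about this size

    for num_products in (1, 3, 10, 100):
        N = docs_per_product * num_products
        df_widget = 980             # "widget" stays mostly in its own product area
        df_can = int(N * 0.7)       # "can" appears in ~70% of docs in every area
        print(num_products,
              round(math.log(N / df_widget), 2),   # idf("widget")
              round(math.log(N / df_can), 2))      # idf("can")

    # With 1 product "widget" sits below "can"; with 3 products (cogs and
    # hookas deployed) it's already above; by 100 products the gap is large.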
Idea 2: Compare terms in product "slices" and artificially boost them.
In the Widget content there are references to related accessories like
the
wBracket and wCase.
Whereas in the Hooka content, I rarely see mention of Widgets or
wCases, but
I do still see the English terms "i" and "can", and lots of stuff
about
Hookas.
2a: So I do a histogram for each of the 3 products, and look for
terms that
are common in one but not in the others.
2b: Or I do a histogram of all terms, and then a histogram of terms
for each
of the 3 product slices. And I compare those lists.
Of course we don't want to go overboard. If somebody is in the
Widget forum
and asks "how do I change the widget battery?" then, although the term
Widget is more important than "how" or "do", it is NOT as important as
"battery" or "change", so I would not want to boost "widget" too much.
BTW, Solr facets make this term info relatively easy to gather; thanks for
the pointer, Yonik! Solr 1.4 also speeds up faceting.
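Here's a rough sketch of how the per-slice histograms in 2a/2b might be
pulled out of Solr with facets. The core URL and the "product" and "body"
field names are assumptions about the schema, not anything real:

    import requests

    SOLR = "http://localhost:8983/solr/select"     # assumed core/handler
    PRODUCTS = ["widget", "cog", "hooka"]          # assumed values of a "product" field

    def top_terms(product, limit=100):
        """Facet on the (assumed) 'body' text field within one product slice."""
        params = {
            "q": "*:*",
            "fq": "product:%s" % product,
            "rows": 0,
            "facet": "true",
            "facet.field": "body",
            "facet.limit": limit,
            "wt": "json",
        }
        resp = requests.get(SOLR, params=params).json()
        # facet_fields comes back as a flat [term, count, term, count, ...] list
        flat = resp["facet_counts"]["facet_fields"]["body"]
        return dict(zip(flat[0::2], flat[1::2]))

    histograms = {p: top_terms(p) for p in PRODUCTS}

    # Terms that are frequent in the Widget slice but absent from the others:
    widget_only = [t for t in histograms["widget"]
                   if t not in histograms["cog"] and t not in histograms["hooka"]]
    print(widget_only[:20])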
Idea 3: (possibly in conjunction with 2) Compare terms by product
slice, and
do "complementary" IDF boosting.
In other words, it's very common for content on the Widget forum to
have the
terms "widget", "wBracket" and "wcase".
But if a document in the "Cog" forum mentions "widget", that's
actually
rather interesting. Mentions of "widgets" in the Cog and Hooka
content
should be treated as "important", regardless of how common "widget"
is in
the Widget forum.
So we compare the common terms in the Widget, Hooka and Cog areas.
Terms
that are specific to Widgets get a bigger boost in the Cog and Hooka
areas.
Cog terms get bigger boosts in Hooka and Widget areas, and Hooka
terms get
extra credit in Cogs and Widgets.
This extra boosting could be done in addition to the minor boosting
widget
terms get in the Widget area, just to keep them above the common
English
terms.
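Here's a rough sketch of what the "complementary" boost table in Idea 3
might look like, using toy per-slice term densities. The ratio-based score
and the cutoff are arbitrary choices, just one way of saying "common over
here, rare over there":

    # Toy per-slice densities: fraction of that slice's docs containing the term.
    slice_freq = {
        "widget": {"widget": 0.95, "wbracket": 0.30, "can": 0.70, "cog": 0.02},
        "cog":    {"cog": 0.90,    "widget": 0.03,   "can": 0.72},
        "hooka":  {"hooka": 0.92,  "can": 0.68},
    }

    def complementary_boosts(home, other, min_ratio=5.0):
        """Terms much denser in `home` than in `other` get a boost when
        they show up in `other`'s documents."""
        boosts = {}
        for term, f_home in slice_freq[home].items():
            f_other = slice_freq[other].get(term, 0.001)   # smooth missing terms
            ratio = f_home / f_other
            if ratio >= min_ratio:
                boosts[term] = ratio            # or log(ratio), capped, etc.
        return boosts

    # "widget" and "wbracket" get boosted inside the Cog slice;
    # "can" does not, because it's equally dense everywhere.
    print(complementary_boosts("widget", "cog"))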
Implementation of ideas 2 and 3:
This "diffing" of IDF by product slices would seem to be a good
idea, though
a bit of coding to implement. I'd think somebody would have done this
already looked into this more formally?
In terms of implementation, you could do an arbitrary cutoff at 100
terms
for each product area, and then just do a Boolean for whether it
appears in
the other lists or not. But after running some tests, I found that common
terms are likely to be mentioned in many lists, and it's more a question of
where on the list they appear. For example, if you can interface Cogs and
Widgets, then Widget content will occasionally mention Cogs, and vice versa.
To me this seems like more of a Bayesian or Contingency Table issue,
with
some weight given to the actual number of occurrences or the relative
position on lists. Any of you Brainiacs have specific insights?
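For the contingency table angle, one standard technique for "is this term
over-represented in slice A versus slice B?" is Dunning's log-likelihood
ratio (the 1993 "statistics of surprise and coincidence" paper). Here's a
small sketch of the 2x2 version with made-up counts; a nice property is that
it accounts for the two slices being different sizes, which ties into the
scaling question below:

    import math

    def log_likelihood(k1, n1, k2, n2):
        """Dunning's G2 for a term seen k1 times out of n1 tokens in slice A
        and k2 times out of n2 tokens in slice B. Larger = more surprising
        difference in density between the two slices."""
        def ll(k, n, p):
            # log-likelihood of k successes in n trials at rate p
            return k * math.log(p) + (n - k) * math.log(1 - p)
        p1, p2 = k1 / n1, k2 / n2
        p = (k1 + k2) / (n1 + n2)          # pooled rate under the null hypothesis
        return 2 * (ll(k1, n1, p1) + ll(k2, n2, p2) - ll(k1, n1, p) - ll(k2, n2, p))

    # Made-up counts: "widget" is dense in the Widget slice, rare in the Cog
    # slice, while "can" is about equally dense in both.
    # (No smoothing here; k = 0 or k = n would need guarding.)
    print(round(log_likelihood(9500, 100000, 60, 20000), 1))    # "widget": large G2
    print(round(log_likelihood(7000, 100000, 1450, 20000), 1))  # "can": small G2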
There's also a question of scaling the histogram. If I compare
term-in-document counts in the Widget area to the entire corpus, then since
there are fewer documents in the Widget area (it's a subset of the total),
by definition the number of documents with the word "i" in that area
will be lower. You'd think the relative order wouldn't be affected,
on
average, but any calculation using the occurrence counts would need
to take
that into account. In other words, if the word "can" appears more times in
the overall documents than it does in the Widget documents, is that just
because there are more documents in the overall set, or because "can" is
really denser there? Simple scaling can probably handle this issue...
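The "simple scaling" could be as little as dividing each count by the number
of documents in its slice before comparing, something like this sketch (the
numbers are invented):

    def relative_df(doc_counts, num_docs):
        """Convert raw document counts into the fraction of the slice's docs
        that contain each term, so slices of different sizes are comparable."""
        return {term: count / num_docs for term, count in doc_counts.items()}

    widget_slice = relative_df({"can": 700, "widget": 980}, 1000)
    whole_corpus = relative_df({"can": 2100, "widget": 1050}, 3000)
    # Now "can" is 0.70 vs 0.70 (same density in both), while "widget"
    # is 0.98 vs 0.35, which is the difference we actually care about.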
Even if I compare Widget to Cog documents, both of which are subsets
of the
total, there's likely to be a different number of documents in each
area,
and therefore the counts would be different. So I'll get
differences in
"how" and "can", in addition to "widget" and "cog", just because the
sample
sizes are different. Again, scaling or limiting the sample size
might fix
that too. Presumably the ranking of the words "i" and "can", which
one is
more popular than the other, would stay the same regardless of
sample size
and absolute counts, if I were looking at random English text. If
those
terms change position, then that would be more interesting.
However, I think randomness also plays a role here. If I do a histogram of
100 Widget documents and only 5 Cog documents, and I notice that the
relative positions of "i" and "can" have changed places, I might suspect
it's just a random fluctuation, because I only had 5 Cog documents. If I
added 50 more, would the difference still be there?
Sample size is certainly a well established topic in statistics, though I
haven't seen it applied to histograms.
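One cheap way to ask "is this rank swap real, or just noise from the tiny
Cog sample?" is to bootstrap over the documents and see how often the
ordering holds up across resamples. Here's a toy sketch with invented
per-document flags for the 5 Cog documents:

    import random

    # One (has_i, has_can) pair per Cog document -- invented data for 5 docs.
    cog_docs = [(True, True), (True, False), (False, True), (True, True), (False, True)]

    def prob_can_outranks_i(docs, trials=10000):
        """Fraction of bootstrap resamples in which "can" appears in more docs than "i"."""
        wins = 0
        for _ in range(trials):
            sample = [random.choice(docs) for _ in docs]   # resample with replacement
            n_i = sum(1 for has_i, _ in sample if has_i)
            n_can = sum(1 for _, has_can in sample if has_can)
            if n_can > n_i:
                wins += 1
        return wins / trials

    # If this comes out near 0.5, the observed ordering of "i" vs "can" is
    # basically a coin flip at this sample size; with 50 more documents the
    # estimate should stabilize one way or the other.
    print(prob_can_outranks_i(cog_docs))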
And another odd thing about histograms taken from different sample sizes is
the low frequency terms. In all search engine histograms you'll see
terms
that appear only 1 time, and there'll be a bunch of them. But
presumably if
I doubled the number of documents in the sample, some of those terms
might
then appear twice, whereas others would still only appear once.
There are
other breaks in histogram counts, like when a count goes from 2 to 1, that
would presumably also shift around a bit. But I'm not sure you can reliably
use the sample size ratio to scale these things with confidence. This
feels like
Poisson distribution territory...
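That Poisson hunch can be made a little more concrete. A term seen exactly
once in the original sample has an estimated rate of about one occurrence
per sample, so if I add an equally sized batch of new documents, the number
of new occurrences is roughly Poisson with lambda = 1. A toy sketch:

    import math

    def poisson_pmf(k, lam):
        return math.exp(-lam) * lam ** k / math.factorial(k)

    # Number of NEW occurrences of a former singleton after doubling the
    # sample, modeled as Poisson(1).
    lam_new = 1.0
    p_stays_singleton = poisson_pmf(0, lam_new)    # ~0.37: no new occurrences
    p_moves_up = 1 - p_stays_singleton             # ~0.63: now appears 2+ times
    print(round(p_stays_singleton, 2), round(p_moves_up, 2))

So roughly a third of the singletons would be expected to stay singletons
and the rest would move up, which matches the intuition that these low
counts don't scale cleanly with the sample size ratio.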
I suspect all of these questions have also been examined in detail
before; I
need to dig up my copy of that "Long Tail" book... Wondering if
anybody has
specific links or verbiage / author names to suggest?
There are certainly other ideas for indirectly addressing the IDF problem by
using other methods to compensate:
4: You could overlay user reviews / popularity or click-through
rates or
other social factors
5: Use flexible proximity searches, in addition to phrase matching,
to boost
certain matches. Verity used to call this the "near" operator; Lucene/Solr
talks about span queries and "phrase slop".
So in a question containing "how can I", documents with that exact
phrase
get a substantial boost. But documents where "how", "can" and "I" appear
within 5 words of each other, in any order, also get a small boost. It's
been suggested that this can help with Q&A type applications.
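As a sketch of how that might look using Solr's standard query syntax (the
"body" field name is made up; with the dismax handler the "pf" and "ps"
parameters do something similar without hand-building the query):

    import requests

    # Exact phrase gets a big boost, a sloppy (within-5-words, roughly any
    # order) phrase gets a smaller boost, and plain term matches still count.
    q = 'body:"how can i"^4 OR body:"how can i"~5^2 OR body:(how can i get widget)'
    results = requests.get("http://localhost:8983/solr/select",
                           params={"q": q, "wt": "json"}).json()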
6: FAST's doc mentions that verbs can be more important in question
and
answer type applications. Normally nouns are said to contain 60-70%
of the
relevance of a query (reference?), but their linguistics guide
suggests
verbs are more important in Q and A apps.
7: Another idea would be to boost word pairs using composite N-Gram
tokens,
what Lucene and Solr call "shingles".
Using my previous examples:
Question: "How can I get a widget ?"
Normal index: "how", "can", "i", "get", "widget"
Shingles: "how_can", "can_i", "i_get", "get_widget"
Question: "How do I change the widget battery?"
Normal index: "how", "do", "i", "change", "widget", "battery"
Shingles: "how_do", "do_i", "i_change", "change_widget",
"widget_battery"
And both sets of tokens would be added to the index, in different
fields,
and with possibly different boosts.
The idea being that tokens like "can_i" might be somewhat more
statistically
significant than "can" and "i" by themselves. So at least in the
first
question, which has nothing but low-quality words, the "can_i" shingle might
help a bit.
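A tiny sketch of the shingling step itself; in Solr this would live in the
analysis chain (the ShingleFilterFactory), but here it's just Python run over
the already-stopped token stream, reproducing the examples above:

    STOP_WORDS = {"a", "the", "and"}    # the stop list from the scenario

    def shingles(text):
        """Underscore-joined word pairs over the stopped token stream."""
        tokens = [t for t in text.lower().replace("?", " ").split()
                  if t not in STOP_WORDS]
        pairs = ["%s_%s" % (a, b) for a, b in zip(tokens, tokens[1:])]
        return tokens, pairs

    print(shingles("How can I get a widget ?"))
    # (['how', 'can', 'i', 'get', 'widget'],
    #  ['how_can', 'can_i', 'i_get', 'get_widget'])
    print(shingles("How do I change the widget battery?"))
    # (['how', 'do', 'i', 'change', 'widget', 'battery'],
    #  ['how_do', 'do_i', 'i_change', 'change_widget', 'widget_battery'])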
I'd appreciate any comments on these ideas from y'all, or perhaps
names of
specific algorithms / authors.