I'm investigating a problem I bet some of you have hit before, and
exploring
several options to address it. I suspect that this specific IDF
scenario is
common enough that it even has a name, though I'm not sure what it would be
called.
The scenario:
Suppose you have a search application focused on the products that
you sell:
* You initially launch with just one product area for your "widget".
* You add a stop words list for really common words, say "a", "the",
"and",
etc.
Looking at the search logs you discover:
* Other common English words being used, such as "i", "how" and "can".
These are automatically getting low IDF scores, since they are also
common
in your documents, so normally you wouldn't worry.
* But you also notice that the word "widget" is getting a very low
score,
since it is mentioned in virtually every document. In fact, since
widget is
more common than the word "can", it is actually getting a *lower*
IDF score
than the word "can", giving some very unsatisfying search results.
Some queries include other "good" words like "replace" and
"battery", so
those searches are doing OK; those terms are relatively rare, so they
properly get the most weight.
But some searches have nothing but "bad" words, for example "how can
I get a
widget ?"
* The word "a" is stopped out.
* The words "how", "can", "i" and "get" are common English words, so
they
get a low score.
* And the word "widget", your flagship product, is also getting a
low score,
since it's in every document.
So some queries are ENTIRELY composed of "low quality" words, and
the search
engine gives somewhat random results.
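To make the scoring issue concrete, here's a quick sketch of the classic
IDF formula (log of total docs over docs containing the term). All of the
document counts are made up, just to show how "widget" can end up scoring
below "can":

    import math

    N = 10000                       # total documents (invented number)
    doc_freq = {"can": 7000,        # common English word, in ~70% of docs
                "widget": 9800,     # the flagship product, in nearly every doc
                "battery": 400}     # a genuinely discriminating term

    for term, df in doc_freq.items():
        idf = math.log(N / df)      # classic idf = log(N / df)
        print(term, round(idf, 3))

    # "widget" comes out at ~0.02, below "can" at ~0.36, even though
    # "widget" is the term the user actually cares about.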
Some ideas to address this:
Idea 1: Do nothing to IDF and just deploy the content for OTHER
products
ASAP; the IDF logic will automatically fix the problem.
With content only about widgets, 100% of the docs contain that word.
But if you deploy content for 9 more products, giving a total of 10
products, then only 1/10th of them will tend to have the word
"widget". But
common English words will continue, on average, to appear in almost
all
documents, across all product areas. Therefore they will continue
to get
low scores. So relative to the words "i" and "can", the word
"widget" will
automatically rise up out of the IDF muck.
And when content for all 100 of your products is deployed, "widget" will
have been boosted even further compared to other common English words.
Though long term I have faith in this strategy, suppose I only have content
ready for two other products, say "cogs" and "hookas"; then the term "widget"
will get a bit of a boost, but not by as much as I'd like.
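Here's a rough back-of-the-envelope sketch of Idea 1, assuming every product
area is about the same size, "widget" stays confined to its own area, and
"can" stays spread across everything; all the counts are invented:

    import math

    docs_per_product = 1000         # assume each product area is about this size

    for num_products in (1, 3, 10, 100):
        N = docs_per_product * num_products
        df_widget = 980             # "widget" stays mostly in its own product area
        df_can = int(N * 0.7)       # "can" appears in ~70% of docs in every area
        print(num_products,
              round(math.log(N / df_widget), 2),   # idf("widget")
              round(math.log(N / df_can), 2))      # idf("can")

    # With 1 product "widget" sits below "can"; with 3 products (cogs and
    # hookas deployed) it's already above; by 100 products the gap is large.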
Idea 2: Compare terms in product "slices" and artificially boost them.
In the Widget content there are references to related accessories like
the
wBracket and wCase.
Whereas in the Hooka content, I rarely see mention of Widgets or
wCases, but
I do still see the English terms "i" and "can", and lots of stuff
about
Hookas.
2a: So I do a histogram for each of the 3 products, and look for
terms that
are common in one but not in the others.
2b: Or I do a histogram of all terms, and then a histogram of terms
for each
of the 3 product slices. And I compare those lists.
Of course we don't want to go overboard. If somebody is in the
Widget forum
and asks "how do I change the widget battery?" then, although the term
Widget is more important than "how" or "do", it is NOT as important as
"battery" or "change", so I would not want to boost "widget" too much.
BTW, Solr facets make this term info relatively easy to gather; thanks for
the pointer, Yonik! Solr 1.4 also speeds up faceting.
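Here's a rough sketch of how the per-slice histograms in 2a/2b might be
pulled out of Solr with facets. The core URL and the "product" and "body"
field names are assumptions about the schema, not anything real:

    import requests

    SOLR = "http://localhost:8983/solr/select"     # assumed core/handler
    PRODUCTS = ["widget", "cog", "hooka"]          # assumed values of a "product" field

    def top_terms(product, limit=100):
        """Facet on the (assumed) 'body' text field within one product slice."""
        params = {
            "q": "*:*",
            "fq": "product:%s" % product,
            "rows": 0,
            "facet": "true",
            "facet.field": "body",
            "facet.limit": limit,
            "wt": "json",
        }
        resp = requests.get(SOLR, params=params).json()
        # facet_fields comes back as a flat [term, count, term, count, ...] list
        flat = resp["facet_counts"]["facet_fields"]["body"]
        return dict(zip(flat[0::2], flat[1::2]))

    histograms = {p: top_terms(p) for p in PRODUCTS}

    # Terms that are frequent in the Widget slice but absent from the others:
    widget_only = [t for t in histograms["widget"]
                   if t not in histograms["cog"] and t not in histograms["hooka"]]
    print(widget_only[:20])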
Idea 3: (possibly in conjunction with 2) Compare terms by product
slice, and
do "complementary" IDF boosting.
In other words, it's very common for content on the Widget forum to
have the
terms "widget", "wBracket" and "wcase".
But if a document in the "Cog" forum mentions "widget", that's
actually
rather interesting. Mentions of "widgets" in the Cog and Hooka
content
should be treated as "important", regardless of how common "widget"
is in
the Widget forum.
So we compare the common terms in the Widget, Hooka and Cog areas.
Terms
that are specific to Widgets get a bigger boost in the Cog and Hooka
areas.
Cog terms get bigger boosts in Hooka and Widget areas, and Hooka
terms get
extra credit in Cogs and Widgets.
This extra boosting could be done in addition to the minor boosting
widget
terms get in the Widget area, just to keep them above the common
English
terms.
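Here's a rough sketch of what the "complementary" boost table in Idea 3
might look like, using toy per-slice term densities. The ratio-based score
and the cutoff are arbitrary choices, just one way of saying "common over
here, rare over there":

    # Toy per-slice densities: fraction of that slice's docs containing the term.
    slice_freq = {
        "widget": {"widget": 0.95, "wbracket": 0.30, "can": 0.70, "cog": 0.02},
        "cog":    {"cog": 0.90,    "widget": 0.03,   "can": 0.72},
        "hooka":  {"hooka": 0.92,  "can": 0.68},
    }

    def complementary_boosts(home, other, min_ratio=5.0):
        """Terms much denser in `home` than in `other` get a boost when
        they show up in `other`'s documents."""
        boosts = {}
        for term, f_home in slice_freq[home].items():
            f_other = slice_freq[other].get(term, 0.001)   # smooth missing terms
            ratio = f_home / f_other
            if ratio >= min_ratio:
                boosts[term] = ratio            # or log(ratio), capped, etc.
        return boosts

    # "widget" and "wbracket" get boosted inside the Cog slice;
    # "can" does not, because it's equally dense everywhere.
    print(complementary_boosts("widget", "cog"))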
Implementation of ideas 2 and 3:
This "diffing" of IDF by product slices would seem to be a good
idea, though
a bit of coding to implement. I'd think somebody would have done this
already looked into this more formally?
In terms of implementation, you could do an arbitrary cutoff at 100
terms
for each product area, and then just do a Boolean for whether it
appears in
the other lists or not. But after running some tests, I found that common
terms are likely to be mentioned in many lists, and it's more a question of
where on the list they appear. For example, if you can interface Cogs and
Widgets, then Widget content will occasionally mention Cogs, and vice versa.
To me this seems like more of a Bayesian or Contingency Table issue,
with
some weight given to the actual number of occurrences or the relative
position on lists. Any of you Brainiacs have specific insights?
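For the contingency table angle, one standard technique for "is this term
over-represented in slice A versus slice B?" is Dunning's log-likelihood
ratio (the 1993 "statistics of surprise and coincidence" paper). Here's a
small sketch of the 2x2 version with made-up counts; a nice property is that
it accounts for the two slices being different sizes, which ties into the
scaling question below:

    import math

    def log_likelihood(k1, n1, k2, n2):
        """Dunning's G2 for a term seen k1 times out of n1 tokens in slice A
        and k2 times out of n2 tokens in slice B. Larger = more surprising
        difference in density between the two slices."""
        def ll(k, n, p):
            # log-likelihood of k successes in n trials at rate p
            return k * math.log(p) + (n - k) * math.log(1 - p)
        p1, p2 = k1 / n1, k2 / n2
        p = (k1 + k2) / (n1 + n2)          # pooled rate under the null hypothesis
        return 2 * (ll(k1, n1, p1) + ll(k2, n2, p2) - ll(k1, n1, p) - ll(k2, n2, p))

    # Made-up counts: "widget" is dense in the Widget slice, rare in the Cog
    # slice, while "can" is about equally dense in both.
    # (No smoothing here; k = 0 or k = n would need guarding.)
    print(round(log_likelihood(9500, 100000, 60, 20000), 1))    # "widget": large G2
    print(round(log_likelihood(7000, 100000, 1450, 20000), 1))  # "can": small G2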
There's also a question of scaling the histogram. If I compare
term-in-document counts in the Widget area to the entire corpus, then since
there are fewer documents in the Widget area (it's a subset of the total),
by definition the number of documents with the word "i" in that area
will be lower. You'd think the relative order wouldn't be affected,
on
average, but any calculation using the occurrence counts would need
to take
that into account. In other words, if the word "can" appears more times in
the overall documents than it does in the Widget documents, is that just
because there are more documents in the overall set, or because "can" is
really denser there? Simple scaling can probably handle this issue...
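The "simple scaling" could be as little as dividing each count by the number
of documents in its slice before comparing, something like this sketch (the
numbers are invented):

    def relative_df(doc_counts, num_docs):
        """Convert raw document counts into the fraction of the slice's docs
        that contain each term, so slices of different sizes are comparable."""
        return {term: count / num_docs for term, count in doc_counts.items()}

    widget_slice = relative_df({"can": 700, "widget": 980}, 1000)
    whole_corpus = relative_df({"can": 2100, "widget": 1050}, 3000)
    # Now "can" is 0.70 vs 0.70 (same density in both), while "widget"
    # is 0.98 vs 0.35, which is the difference we actually care about.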
Even if I compare Widget to Cog documents, both of which are subsets
of the
total, there's likely to be a different number of documents in each
area,
and therefore the counts would be different. So I'll get
differences in
"how" and "can", in addition to "widget" and "cog", just because the
sample
sizes are different. Again, scaling or limiting the sample size
might fix
that too. Presumably the ranking of the words "i" and "can", which
one is
more popular than the other, would stay the same regardless of
sample size
and absolute counts, if I were looking at random English text. If
those
terms change position, then that would be more interesting.
However, I think randomness also plays a role here. If I do a histogram of
100 Widget documents and only 5 Cog documents, and I notice that the
relative positions of "i" and "can" have changed places, I might suspect
it's just a random fluctuation, because I only had 5 Cog documents. If I
added 50 more, would the difference still be there?
Sample size is certainly a well established topic in statistics, though I
haven't seen it applied to histograms.
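One cheap way to ask "is this rank swap real, or just noise from the tiny
Cog sample?" is to bootstrap over the documents and see how often the
ordering holds up across resamples. Here's a toy sketch with invented
per-document flags for the 5 Cog documents:

    import random

    # One (has_i, has_can) pair per Cog document -- invented data for 5 docs.
    cog_docs = [(True, True), (True, False), (False, True), (True, True), (False, True)]

    def prob_can_outranks_i(docs, trials=10000):
        """Fraction of bootstrap resamples in which "can" appears in more docs than "i"."""
        wins = 0
        for _ in range(trials):
            sample = [random.choice(docs) for _ in docs]   # resample with replacement
            n_i = sum(1 for has_i, _ in sample if has_i)
            n_can = sum(1 for _, has_can in sample if has_can)
            if n_can > n_i:
                wins += 1
        return wins / trials

    # If this comes out near 0.5, the observed ordering of "i" vs "can" is
    # basically a coin flip at this sample size; with 50 more documents the
    # estimate should stabilize one way or the other.
    print(prob_can_outranks_i(cog_docs))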
And another odd thing about histograms taken from different sample sizes is
the low frequency terms. In all search engine histograms you'll see
terms
that appear only 1 time, and there'll be a bunch of them. But
presumably if
I doubled the number of documents in the sample, some of those terms
might
then appear twice, whereas others would still only appear once.
There are
other breaks in histogram counts, like when a count goes from 2 to 1, that
would presumably also shift around a bit. But I'm not sure you can reliably
use the sample size ratio to scale these things with confidence. This
feels like
Poisson distribution territory...
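That Poisson hunch can be made a little more concrete. A term seen exactly
once in the original sample has an estimated rate of about one occurrence
per sample, so if I add an equally sized batch of new documents, the number
of new occurrences is roughly Poisson with lambda = 1. A toy sketch:

    import math

    def poisson_pmf(k, lam):
        return math.exp(-lam) * lam ** k / math.factorial(k)

    # Number of NEW occurrences of a former singleton after doubling the
    # sample, modeled as Poisson(1).
    lam_new = 1.0
    p_stays_singleton = poisson_pmf(0, lam_new)    # ~0.37: no new occurrences
    p_moves_up = 1 - p_stays_singleton             # ~0.63: now appears 2+ times
    print(round(p_stays_singleton, 2), round(p_moves_up, 2))

So roughly a third of the singletons would be expected to stay singletons
and the rest would move up, which matches the intuition that these low
counts don't scale cleanly with the sample size ratio.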
I suspect all of these questions have also been examined in detail
before; I
need to dig up my copy of that "Long Tail" book... Wondering if
anybody has
specific links or verbiage / author names to suggest?
There are certainly other ideas for indirectly addressing the IDF problem by
using other methods to compensate:
4: You could overlay user reviews / popularity or click-through
rates or
other social factors
5: Use flexible proximity searches, in addition to phrase matching,
to boost
certain matches. Verity used to call this the "near" operator; Lucene/Solr
talks about span queries and "phrase slop".
So in a question containing "how can I", documents with that exact
phrase
get a substantial boost. But documents where "how", "can" and "I" appear
within 5 words of each other, in any order, also get a small boost. It's
been suggested that this can help with Q&A type applications.
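As a sketch of how that might look using Solr's standard query syntax (the
"body" field name is made up; with the dismax handler the "pf" and "ps"
parameters do something similar without hand-building the query):

    import requests

    # Exact phrase gets a big boost, a sloppy (within-5-words, roughly any
    # order) phrase gets a smaller boost, and plain term matches still count.
    q = 'body:"how can i"^4 OR body:"how can i"~5^2 OR body:(how can i get widget)'
    results = requests.get("http://localhost:8983/solr/select",
                           params={"q": q, "wt": "json"}).json()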
6: FAST's doc mentions that verbs can be more important in question
and
answer type applications. Normally nouns are said to contain 60-70%
of the
relevance of a query (reference?), but their linguistics guide
suggests
verbs are more important in Q and A apps.
7: Another idea would be to boost word pairs using composite N-Gram
tokens,
what Lucene and Solr call "shingles".
Using my previous examples:
Question: "How can I get a widget ?"
Normal index: "how", "can", "i", "get", "widget"
Shingles: "how_can", "can_i", "i_get", "get_widget"
Question: "How do I change the widget battery?"
Normal index: "how", "do", "i", "change", "widget", "battery"
Shingles: "how_do", "do_i", "i_change", "change_widget",
"widget_battery"
And both sets of tokens would be added to the index, in different
fields,
and with possibly different boosts.
The idea being that tokens like "can_i" might be somewhat more
statistically
significant than "can" and "i" by themselves. So at least in the
first
question, which has nothing but low-quality words, the "can_i" shingle might
help a bit.
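A tiny sketch of the shingling step itself; in Solr this would live in the
analysis chain (the ShingleFilterFactory), but here it's just Python run over
the already-stopped token stream, reproducing the examples above:

    STOP_WORDS = {"a", "the", "and"}    # the stop list from the scenario

    def shingles(text):
        """Underscore-joined word pairs over the stopped token stream."""
        tokens = [t for t in text.lower().replace("?", " ").split()
                  if t not in STOP_WORDS]
        pairs = ["%s_%s" % (a, b) for a, b in zip(tokens, tokens[1:])]
        return tokens, pairs

    print(shingles("How can I get a widget ?"))
    # (['how', 'can', 'i', 'get', 'widget'],
    #  ['how_can', 'can_i', 'i_get', 'get_widget'])
    print(shingles("How do I change the widget battery?"))
    # (['how', 'do', 'i', 'change', 'widget', 'battery'],
    #  ['how_do', 'do_i', 'i_change', 'change_widget', 'widget_battery'])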
I'd appreciate any comments on these ideas from y'all, or perhaps
names of
specific algorithms / authors.