As soon as I started reading your message I thought "common grams", so
that is what I would try first, especially since somebody has already
done the work of porting that from Nutch to Solr (see the Solr JIRA).
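
For reference, a rough sketch of what the common-grams filter does, using
the Lucene class behind Solr's CommonGramsFilterFactory. Treat it as
illustrative rather than copy-paste: class locations and constructor
signatures have moved around between Lucene/Solr versions.

    import java.io.StringReader;
    import java.util.Arrays;

    import org.apache.lucene.analysis.CharArraySet;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.commongrams.CommonGramsFilter;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class CommonGramsDemo {
        public static void main(String[] args) throws Exception {
            // Words too common to carry weight alone, but useful when
            // glued to a neighbor ("can" vs. "can_i").
            CharArraySet common = new CharArraySet(
                    Arrays.asList("how", "can", "i", "get", "a"), true);

            WhitespaceTokenizer tok = new WhitespaceTokenizer();
            tok.setReader(new StringReader("how can i get a widget"));
            TokenStream ts = new CommonGramsFilter(tok, common);

            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                // Emits the original words plus grams, roughly:
                // how, how_can, can, can_i, i, i_get, get, get_a, a,
                // a_widget, widget
                System.out.println(term);
            }
            ts.end();
            ts.close();
        }
    }
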
 Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



----- Original Message ----
> From: Mark Bennett <mbenn...@ideaeng.com>
> To: solr-user@lucene.apache.org
> Sent: Thursday, August 6, 2009 2:27:52 PM
> Subject: Revisiting IDF Problems and Index Slices
> 
> I'm investigating a problem I bet some of you have hit before, and exploring
> several options to address it.  I suspect that this specific IDF scenario is
> common enough that it even has a name, though I'm not sure what it would be
> called.
> 
> The scenario:
> 
> Suppose you have a search application focused on the products that you sell:
> * You initially launch with just one product area for your "widget".
> * You add a stop words list for really common words, say "a", "the", "and",
> etc.
> 
> Looking at the search logs you discover:
> * Other common English words are being used, such as "i", "how" and "can".
> These automatically get low IDF scores, since they are also common in your
> documents, so normally you wouldn't worry.
> * But you also notice that the word "widget" is getting a very low score,
> since it is mentioned in virtually every document.  In fact, since "widget" is
> more common than the word "can", it actually gets a *lower* IDF score
> than "can", giving some very unsatisfying search results.
> 
> Some queries include other "good" words like "replace" and "battery", so
> those searches are doing OK; those terms are relatively rare, so they
> properly get the most weight.
> 
> But some searches have nothing but "bad" words, for example "how can I get a
> widget ?"
> * The word "a" is stopped out.
> * The words "how", "can", "i" and "get" are common English words, so they
> get a low score.
> * And the word "widget", your flagship product, is also getting a low score,
> since it's in every document.
> So some queries are ENTIRELY composed of "low quality" words, and the search
> engine gives somewhat random results.
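>
> To make "low quality" concrete, here's a toy calculation with the classic
> Lucene idf formula (DefaultSimilarity-era: idf = 1 + ln(N / (df + 1)));
> the doc-frequency numbers are invented:
>
>     public class IdfDemo {
>         static double idf(int numDocs, int docFreq) {
>             return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
>         }
>
>         public static void main(String[] args) {
>             int n = 1000;  // every doc is about widgets
>             System.out.println(idf(n, 600));  // "can" in 600 docs   -> ~1.51
>             System.out.println(idf(n, 990));  // "widget" in 990 docs -> ~1.01
>             // "widget" is the rarer word in English, but the denser word
>             // in this corpus, so it gets the LOWER weight of the two.
>         }
>     }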
> 
> Some ideas to address this:
> 
> Idea 1: Do nothing to IDF and just deploy the content for OTHER products
> ASAP; the IDF logic will automatically fix the problem.
> 
> With content only about widgets, 100% of the docs contain that term.
> 
> But if you deploy content for 9 more products, giving a total of 10
> products, then only 1/10th of them will tend to have the word "widget".  But
> common English words will continue, on average, to appear in almost all
> documents, across all product areas.  Therefore they will continue to get
> low scores.  So relative to the words "i" and "can", the word "widget" will
> automatically rise up out of the IDF muck.
> 
> And when content for all 100 of your products is deployed, "widget" will
> have risen even further relative to other common English words.
> 
> Though long term I have faith in this strategy, suppose I only have content
> ready for two other products, say "cogs" and "hookas".  Then the term
> "widget" will get a bit of a boost, but not as much as I'd like.
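>
> A quick sanity check on the dilution arithmetic, using the same toy idf
> formula as above (all counts invented):
>
>     public class DilutionDemo {
>         static double idf(int n, int df) {
>             return 1.0 + Math.log(n / (double) (df + 1));
>         }
>
>         public static void main(String[] args) {
>             // 3 products, 1,000 docs each; "widget" in ~1/3 of all docs:
>             System.out.println(idf(3000, 1000));    // widget -> ~2.10
>             System.out.println(idf(3000, 1800));    // can    -> ~1.51
>             // 100 products; "widget" now in ~1/100 of all docs:
>             System.out.println(idf(100000, 1000));  // widget -> ~5.60
>             System.out.println(idf(100000, 60000)); // can    -> ~1.51
>             // "widget" climbs well clear of the common English words,
>             // but with only 3 products the gap is still modest.
>         }
>     }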
> 
> Idea 2: Compare terms in product "slices" and artificially boost them.
> 
> In the Widget content there are references to related accessories like the
> wBracket and wCase.
> 
> Whereas in the Hooka content, I rarely see mention of Widgets or wCases, but
> I do still see the English terms "i" and "can", and lots of stuff about
> Hookas.
> 
> 2a: So I do a histogram for each of the 3 products, and look for terms that
> are common in one but not in the others.
> 
> 2b: Or I do a histogram of all terms, and then a histogram of terms for each
> of the 3 product slices.  And I compare those lists.
> 
> Of course we don't want to go overboard.  If somebody is in the Widget forum
> and asks "how do I change the widget battery?" then, although the term
> Widget is more important than "how" or "do", it is NOT as important as
> "battery" or "change", so I would not want to boost "widget" too much.
> 
> BTW, Solr facets make this term info relatively easy to gather -- thanks for
> the pointer, Yonik!  Solr 1.4 also speeds up faceting.
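>
> For the curious, a SolrJ sketch of pulling a per-slice term histogram with
> facets (this uses a later client API than Solr 1.4 shipped with, and the
> core, field and filter names here are made up):
>
>     import org.apache.solr.client.solrj.SolrQuery;
>     import org.apache.solr.client.solrj.impl.HttpSolrClient;
>     import org.apache.solr.client.solrj.response.FacetField;
>     import org.apache.solr.client.solrj.response.QueryResponse;
>
>     public class SliceHistogram {
>         public static void main(String[] args) throws Exception {
>             HttpSolrClient solr = new HttpSolrClient.Builder(
>                     "http://localhost:8983/solr/products").build();
>
>             SolrQuery q = new SolrQuery("*:*");
>             q.addFilterQuery("product:widget");  // one slice at a time
>             q.setFacet(true);
>             q.addFacetField("body");             // tokenized text field
>             q.setFacetLimit(100);                // top 100 terms
>             q.setFacetMinCount(1);
>             q.setRows(0);                        // counts only, no docs
>
>             QueryResponse rsp = solr.query(q);
>             for (FacetField.Count c :
>                     rsp.getFacetField("body").getValues()) {
>                 // term <tab> number of slice docs containing it
>                 System.out.println(c.getName() + "\t" + c.getCount());
>             }
>             solr.close();
>         }
>     }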
> 
> Idea 3: (possibly in conjunction with 2) Compare terms by product slice, and
> do "complementary" IDF boosting.
> 
> In other words, it's very common for content on the Widget forum to have the
> terms "widget", "wBracket" and "wCase".
> 
> But if a document in the "Cog" forum mentions "widget", that's actually
> rather interesting.  Mentions of "widgets" in the Cog and Hooka content
> should be treated as "important", regardless of how common "widget" is in
> the Widget forum.
> 
> So we compare the common terms in the Widget, Hooka and Cog areas.  Terms
> that are specific to Widgets get a bigger boost in the Cog and Hooka areas.
> Cog terms get bigger boosts in Hooka and Widget areas, and Hooka terms get
> extra credit in Cogs and Widgets.
> 
> This extra boosting could be done in addition to the minor boost widget
> terms get in the Widget area under Idea 2, just to keep them above the
> common English terms.
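>
> As a strawman, the "complementary" boost could key off the ratio of a
> term's global density to its density within the slice being searched.
> This scoring rule is just my guess, not an established algorithm:
>
>     public class ComplementaryBoost {
>         // dfInSlice: docs in this slice containing the term
>         // sliceSize: total docs in this slice
>         // dfGlobal / corpusSize: the same counts over the whole corpus
>         static double boost(int dfInSlice, int sliceSize,
>                             int dfGlobal, int corpusSize) {
>             double local  = (dfInSlice + 1.0) / (sliceSize + 1.0);
>             double global = (dfGlobal + 1.0) / (corpusSize + 1.0);
>             // Dense everywhere ("can"): ratio near 1, no boost.
>             // Dense globally but rare in this slice ("widget" in the Cog
>             // area): ratio far above 1, big boost (cap it in practice).
>             return Math.max(1.0, global / local);
>         }
>
>         public static void main(String[] args) {
>             System.out.println(boost(5, 1000, 40000, 100000));   // ~67
>             System.out.println(boost(600, 1000, 60000, 100000)); // 1.0
>         }
>     }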
> 
> Implementation of ideas 2 and 3:
> 
> This "diffing" of IDF by product slices would seem to be a good idea, though
> it would take a bit of coding to implement.  I'd think somebody has already
> looked into this more formally?
> 
> In terms of implementation, you could do an arbitrary cutoff at 100 terms
> for each product area, and then just use a Boolean for whether each term
> appears in the other lists or not.  But some quick tests suggest that
> common terms are likely to be mentioned in many lists, and it's more a
> question of where on the list they appear.  For example, if Cogs and
> Widgets can be used together, then Widget content will occasionally
> mention Cogs, and vice versa.
> 
> To me this seems like more of a Bayesian or Contingency Table issue, with
> some weight given to the actual number of occurrences or the relative
> position on lists.  Any of you Brainiacs have specific insights?
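>
> One established tool for exactly this kind of Contingency Table question
> is Dunning's log-likelihood ratio (G^2) over a 2x2 table: term vs. all
> other tokens, Widget slice vs. rest of corpus.  A nice property is that
> it copes with the unequal sample sizes I worry about below.  A sketch,
> with invented counts:
>
>     public class LogLikelihood {
>         // G^2 = 2 * sum over the 4 cells of O * ln(O / E),
>         // where E is the usual row-total * column-total / N.
>         static double g2(long a, long b, long c, long d) {
>             long n = a + b + c + d;
>             return 2.0 * (cell(a, (double) (a + b) * (a + c) / n)
>                         + cell(b, (double) (a + b) * (b + d) / n)
>                         + cell(c, (double) (c + d) * (a + c) / n)
>                         + cell(d, (double) (c + d) * (b + d) / n));
>         }
>
>         static double cell(long o, double e) {
>             return o == 0 ? 0.0 : o * Math.log(o / e);
>         }
>
>         public static void main(String[] args) {
>             // a = "widget" tokens in the Widget slice, b = in the rest,
>             // c / d = all other tokens in each.  Big G^2 = slice-specific.
>             System.out.println(g2(900, 120, 49100, 499880));    // large
>             // "can" occurs in proportion to slice size: G^2 near 0.
>             System.out.println(g2(3000, 30500, 47000, 469500)); // ~1
>         }
>     }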
> 
> There's also a question of scaling the histogram.  If I compare
> term-in-document counts in the Widget area to the entire corpus, then since
> there are fewer documents in the Widget area (it's a subset of the total),
> by definition the number of documents with the word "i" in that area
> will be lower.  You'd think the relative order wouldn't be affected, on
> average, but any calculation using the occurrence counts would need to take
> that into account.  In other words, if the word "can" appears more times in
> the overall documents than it does in the Widget documents, is that
> just because there are more documents in the overall set, or because "can"
> is really more dense?  Simple scaling can probably handle this issue...
> 
> Even if I compare Widget to Cog documents, both of which are subsets of the
> total, there's likely to be a different number of documents in each area,
> and therefore the counts would be different.  So I'll get differences in
> "how" and "can", in addition to "widget" and "cog", just because the sample
> sizes are different.  Again, scaling or limiting the sample size might fix
> that too.  Presumably the relative ranking of the words "i" and "can"
> (which one is more popular than the other) would stay the same regardless
> of sample size and absolute counts, if I were looking at random English
> text.  If those terms change position, that would be more interesting.
> 
> However, I think randomness also plays a role here.  If I do a histogram of
> 100 Widget documents and only 5 Cog documents, and I notice that the
> relative positions of "i" and "can" have changed places, I might suspect
> that it's just a random fluctuation, because I only had 5 Cog documents.
> If I added 50 more, would the difference still be there?
> 
> Sample size is certainly a well-established topic in statistics, though I
> haven't seen it applied to histograms.
> 
> And another odd thing about histograms taken from different sample sizes is
> the low-frequency terms.  In all search engine histograms you'll see terms
> that appear only once, and there'll be a bunch of them.  But presumably if
> I doubled the number of documents in the sample, some of those terms would
> then appear twice, whereas others would still only appear once.  There are
> other breaks in histogram counts, like where the count goes from 2 to 1,
> that would presumably also shift around a bit.  But I'm not sure you can
> reliably use the sample-size ratio to scale these things with confidence.
> This feels like Poisson distribution territory...
> 
> I suspect all of these questions have also been examined in detail before; I
> need to dig up my copy of that "Long Tail" book...  Wondering if anybody has
> specific links or verbiage / author names to suggest?
> 
> There are certainly other ideas for indirectly addressing the IDF problem by
> using other methods to compensate:
> 
> 4: You could overlay user reviews / popularity or click-through rates or
> other social factors
> 
> 5: Use flexible proximity searches, in addition to phrase matching, to boost
> certain matches.  Verity used to call this the "near" operator; Lucene/Solr
> talk about span queries and "phrase slop".
> 
> So in a question containing "how can I", documents with that exact phrase
> get a substantial boost.  But documents where "how", "can" and "I" appear
> within 5 words of each other, in any order, also get a small boost.  It's
> been suggested that this can help with Q&A type applications.
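>
> In Lucene terms, that "any order within 5 words" clause is a span query.
> A minimal sketch (package names vary by Lucene version):
>
>     import org.apache.lucene.index.Term;
>     import org.apache.lucene.search.spans.SpanNearQuery;
>     import org.apache.lucene.search.spans.SpanQuery;
>     import org.apache.lucene.search.spans.SpanTermQuery;
>
>     public class SlopDemo {
>         public static void main(String[] args) {
>             SpanQuery[] words = {
>                 new SpanTermQuery(new Term("body", "how")),
>                 new SpanTermQuery(new Term("body", "can")),
>                 new SpanTermQuery(new Term("body", "i")),
>             };
>             // slop = 5: at most 5 positions of slack between the terms;
>             // inOrder = false allows any order.
>             SpanNearQuery near = new SpanNearQuery(words, 5, false);
>             System.out.println(near);
>             // The in-order flavor is plain phrase slop in Solr syntax:
>             //   body:"how can i"~5
>         }
>     }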
> 
> 6: FAST's documentation mentions that verbs can be more important in
> question and answer type applications.  Normally nouns are said to carry
> 60-70% of the relevance of a query (reference?), but their linguistics
> guide suggests verbs are more important in Q&A apps.
> 
> 7: Another idea would be to boost word pairs using composite N-Gram tokens,
> what Lucene and Solr call "shingles".
> 
> Using my previous examples:
> Question: "How can I get a widget ?"
>     Normal index: "how", "can", "i", "get", "widget"
>     Shingles: "how_can", "can_i", "i_get", "get_widget"
> Question: "How do I change the widget battery?"
>     Normal index: "how", "do", "i", "change", "widget", "battery"
>     Shingles: "how_do", "do_i", "i_change", "change_widget", "widget_battery"
> 
> And both sets of tokens would be added to the index, in different fields,
> and with possibly different boosts.
> 
> The idea being that tokens like "can_i" might be somewhat more statistically
> significant than "can" and "i" by themselves.  So at least in the first
> question, which has all low quality words, the "can_i" might help a bit.
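>
> A hedged sketch of Idea 7 with Lucene's ShingleFilter (what Solr wraps as
> ShingleFilterFactory).  Note that without a stopword step it also emits
> grams over "a", unlike my hand-written example above:
>
>     import java.io.StringReader;
>
>     import org.apache.lucene.analysis.core.WhitespaceTokenizer;
>     import org.apache.lucene.analysis.shingle.ShingleFilter;
>     import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>
>     public class ShingleDemo {
>         public static void main(String[] args) throws Exception {
>             WhitespaceTokenizer tok = new WhitespaceTokenizer();
>             tok.setReader(new StringReader("how can i get a widget"));
>
>             // Bigrams only (min = max = 2), joined with "_".
>             ShingleFilter shingles = new ShingleFilter(tok, 2, 2);
>             shingles.setTokenSeparator("_");
>             shingles.setOutputUnigrams(false); // unigrams live in the other field
>
>             CharTermAttribute term =
>                     shingles.addAttribute(CharTermAttribute.class);
>             shingles.reset();
>             while (shingles.incrementToken()) {
>                 // how_can can_i i_get get_a a_widget
>                 System.out.println(term);
>             }
>             shingles.end();
>             shingles.close();
>         }
>     }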
> 
> I'd appreciate any comments on these ideas from y'all, or perhaps names of
> specific algorithms / authors.
> 
> --
> Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
> Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
