Re: Boost Strangeness

Erick Erickson Thu, 16 Jun 2011 18:47:57 -0700

Right, if you've only changed WordDelimiterFilterFactory (WDFF) in the query
analyzer, then the tokens you're analyzing may be split up. Try running some of
the terms through the admin/analysis page. Unless you have
catenateAll="1" in the definition, the whole term won't be there.
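
As a rough illustration of why the whole term can go missing, here is a
minimal Python sketch. It is not Solr's actual implementation: the function
name and the regex are assumptions of mine, and it only imitates splitting on
letter/digit transitions, with an optional catenateAll-style token appended.

```python
import re

def split_on_transitions(term, catenate_all=False):
    """Hypothetical stand-in for word-delimiter splitting (not Solr code)."""
    # Break the term into runs of letters and runs of digits.
    parts = re.findall(r"[A-Za-z]+|[0-9]+", term)
    tokens = list(parts)
    # Imitate catenateAll: also emit all the parts joined back together.
    if catenate_all and len(parts) > 1:
        tokens.append("".join(parts))
    return tokens

print(split_on_transitions("b006m86d"))
print(split_on_transitions("b006m86d", catenate_all=True))
```

Without the catenated token, only the fragments are available to match
against, so the full ID is never there as a single term. The admin/analysis
page shows what your real analysis chain produces.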


It becomes a question of why you even want WDFF in there in the first
place. Do you ever want to split these fields up this way? Maybe start
by just taking it out completely?

Best
Erick

On Thu, Jun 16, 2011 at 9:55 AM, Judioo <cont...@judioo.com> wrote:
> fascinating!!!!
>
> Thank you so much Erick, I'm slowly beginning to understand.
>
> So I've discovered that by defining 'splitOnNumerics="0"' on the filter
> class 'solr.WordDelimiterFilterFactory' ( for ONLY the query analyzer ) I
> can get *closer* to my required goal!
>
> Now something else odd is occurring.
>
> It only returns 2 results where there are over 70?
>
> Why is that? I can't find where this is explained :(
>
> query
>
> /solr/select?omitNorms=true&q=b006m86d&defType=dismax&qf=id^10%20parent_id^9%20brand_container_id^8%20series_container_id^8%20subseries_container_id^8%20clip_container_id^1%20clip_episode_id^1&debugQuery=on&fl=type,id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score&wt=json&indent=on&omitNorms=true
>
> output
>
> {
>   "responseHeader": {
>     "status": 0,
>     "QTime": 51,
>     "params": {
>       "debugQuery": "on",
>       "fl": "type,id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score",
>       "indent": "on",
>       "q": "b006m86d",
>       "qf": "id^10 parent_id^9 brand_container_id^8 series_container_id^8 subseries_container_id^8 clip_container_id^1 clip_episode_id^1",
>       "wt": "json",
>       "omitNorms": ["true", "true"],
>       "defType": "dismax"
>     }
>   },
>   "response": {
>     "numFound": 2,
>     "start": 0,
>     "maxScore": 13.473297,
>     "docs": [
>       {
>         "parent_id": "",
>         "id": "b006m86d",
>         "type": "brand",
>         "score": 13.473297
>       },
>       {
>         "series_container_id": "",
>         "id": "b00y1w9h",
>         "type": "episode",
>         "brand_container_id": "b006m86d",
>         "subseries_container_id": "",
>         "clip_episode_id": "",
>         "score": 11.437143
>       }
>     ]
>   },
>   "debug": {
>     "rawquerystring": "b006m86d",
>     "querystring": "b006m86d",
>     "parsedquery": "+DisjunctionMaxQuery((id:b006m86d^10.0 | clip_episode_id:b006m86d | subseries_container_id:b006m86d^8.0 | series_container_id:b006m86d^8.0 | clip_container_id:b006m86d | brand_container_id:b006m86d^8.0 | parent_id:b006m86d^9.0)) ()",
>     "parsedquery_toString": "+(id:b006m86d^10.0 | clip_episode_id:b006m86d | subseries_container_id:b006m86d^8.0 | series_container_id:b006m86d^8.0 | clip_container_id:b006m86d | brand_container_id:b006m86d^8.0 | parent_id:b006m86d^9.0) ()",
>     "explain": {
>       "b006m86d": " 13.473297 = (MATCH) sum of: 13.473297 = (MATCH) max of: 13.473297 = (MATCH) fieldWeight(id:b006m86d in 27636), product of: 1.0 = tf(termFreq(id:b006m86d)=1) 13.473297 = idf(docFreq=2, maxDocs=783800) 1.0 = fieldNorm(field=id, doc=27636) ",
>       "b00y1w9h": " 11.437143 = (MATCH) sum of: 11.437143 = (MATCH) max of: 11.437143 = (MATCH) weight(brand_container_id:b006m86d^8.0 in 61), product of: 0.82407516 = queryWeight(brand_container_id:b006m86d^8.0), product of: 8.0 = boost 13.878762 = idf(docFreq=1, maxDocs=783800) 0.007422088 = queryNorm 13.878762 = (MATCH) fieldWeight(brand_container_id:b006m86d in 61), product of: 1.0 = tf(termFreq(brand_container_id:b006m86d)=1) 13.878762 = idf(docFreq=1, maxDocs=783800) 1.0 = fieldNorm(field=brand_container_id, doc=61) "
>     },
>     "QParser": "DisMaxQParser",
>     "altquerystring": null,
>     "boostfuncs": null,
>     "timing": {
>       "time": 51,
>       "prepare": {
>         "time": 6,
>         "org.apache.solr.handler.component.QueryComponent": { "time": 5 },
>         "org.apache.solr.handler.component.FacetComponent": { "time": 0 },
>         "org.apache.solr.handler.component.MoreLikeThisComponent": { "time": 0 },
>         "org.apache.solr.handler.component.HighlightComponent": { "time": 1 },
>         "org.apache.solr.handler.component.StatsComponent": { "time": 0 },
>         "org.apache.solr.handler.component.DebugComponent": { "time": 0 }
>       },
>       "process": {
>         "time": 45,
>         "org.apache.solr.handler.component.QueryComponent": { "time": 27 },
>         "org.apache.solr.handler.component.FacetComponent": { "time": 0 },
>         "org.apache.solr.handler.component.MoreLikeThisComponent": { "time": 0 },
>         "org.apache.solr.handler.component.HighlightComponent": { "time": 0 },
>         "org.apache.solr.handler.component.StatsComponent": { "time": 0 },
>         "org.apache.solr.handler.component.DebugComponent": { "time": 18 }
>       }
>     }
>   }
> }
>
>
> On 15 June 2011 13:16, Erick Erickson <erickerick...@gmail.com> wrote:
>
>> First off, you didn't "violate groups etiquette". In fact, yours was
>> one of the better first posts in terms of providing enough information
>> for us to actually help!
>>
>> A very useful page is the admin/analysis page, which shows how the
>> analysis chain works. For instance, if you haven't changed the
>> field type (i.e. <fieldType name="text">), your input is
>> being broken up by WordDelimiterFilterFactory. Be sure to check
>> the "verbose" checkbox and enter text in both the query and
>> index boxes!
>>
>> Here's an invaluable page, though do note that it's not exhaustive:
>>
>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
>>
>>
>> But on to your problem:
>>
>> First, boosting isn't absolute; boosting terms just tends to
>> bubble things up, so you have to experiment with various weights.
>>
>> To get the full comparison for both documents you're curious about,
>> try using "explainOther". See:
>>
>> http://wiki.apache.org/solr/SolrRelevancyFAQ#Why_doesn.27t_document_id:juggernaut_appear_in_the_top_10_results_for_my_query
>>
>> If you use that against the two docs in question, you should
>> see (although it's a hard read!) the reason the docs got
>> their relative scores.
>>
>> Finally, your next e-mail hints at what's happening. If you're
>> putting multiple tokens in some of these fields, the length
>> normalization may be causing the matches to score lower. You can
>> try disabling those calculations (omitNorms="true" in your field
>> definition).
>> See:
>>
>> http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr
>>
>> String types accept spaces just fine, but you might want to define
>> the fields with 'multiValued="true" ' and index each as a separate
>> field (note that won't work with a field that's also your <uniqueKey>).
>>
>> Best
>> Erick
>>
>> On Wed, Jun 15, 2011 at 7:16 AM, Judioo <cont...@judioo.com> wrote:
>> >   <dynamicField name="*_id" type="text" indexed="true" stored="true"/>
>> >
>> > so all attributes except 'id' are of type text.
>> >
>> > I didn't know that about the string type. So is my problem as described
>> > (that partial matches are contributing to the calculation), and does
>> > defining the field type as string solve this problem?
>> >
>> > Or is my understanding completely incorrect?
>> >
>> > Thanks in advance
>> >
>> > On 15 June 2011 12:08, Ahmet Arslan <iori...@yahoo.com> wrote:
>> >
>> >> > /solr/select/?q=b007vty6&defType=dismax&qf=id^10%20parent_id^9%20brand_container_id^8%20series_container_id^8%20subseries_container_id^8%20clip_container_id^1%20clip_episode_id^1&debugQuery=on&fl=id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score&wt=json&indent=on
>> >> >
>> >> >
>> >> > same result ( just higher scores ). It's almost as if
>> >> > partial matches on
>> >> > brand|series_container_id and id are being considered in
>> >> > the 1st document.
>> >> > Surely this can't be right / expected?
>> >>
>> >> What is your fieldType definition? Don't you think it is better to use
>> >> string type which is not tokenized?
>> >>
>> >
>>
>
