Re: "index-time" over boosted

Jan Høydahl Thu, 19 Jan 2012 05:53:55 -0800

Hi,

The schema you pasted in your mail is NOT Solr 3.5's default example schema. Did
you get it from the Nutch project?


And the "omitNorms" parameter is supposed to go in the <field> tag in 
schema.xml, and the "content" field in the example schema does not have 
omitNorms="true". Try to change

       <field name="content" type="text" stored="false" indexed="true"/>
to
       <field name="content" type="text" stored="false" indexed="true"
              omitNorms="true"/>

and try again (you will most likely need to restart Solr and re-index for the
change to take effect). Please note that you SHOULD customize your schema. There
is really no "default" schema in Solr (or Nutch); it's only an example or
starting point. For your search application to work well, you will have to
invest some time in designing a schema, working with your queries, perhaps
exploring the DisMax query parser, and so on.
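
To see why the norm matters so much here, the arithmetic from the explain
output quoted further down can be redone by hand. The following is only a rough
Python sketch, not Solr code: the function name field_weight is made up for
illustration, and it assumes the formula the explain output itself shows
(fieldWeight = tf * idf * fieldNorm, with tf = sqrt(phraseFreq)) and that
omitting norms amounts to using fieldNorm = 1.0.

```python
import math

# idf(content: mobil=4922 broadband=2290), copied from the quoted explain output
IDF = 6.2444286


def field_weight(phrase_freq, field_norm, idf=IDF):
    """Hypothetical helper: fieldWeight = tf * idf * fieldNorm, tf = sqrt(freq)."""
    tf = math.sqrt(phrase_freq)
    return tf * idf * field_norm


# With norms enabled (fieldNorm values copied from the explain output)
a_with_norms = field_weight(2.0, 0.0012207031)    # page A, doc 18730
b_with_norms = field_weight(20.0, 3.0517578e-5)   # page B, doc 14391

# Assumption: with omitNorms="true" the norm factor is effectively 1.0
a_without_norms = field_weight(2.0, 1.0)
b_without_norms = field_weight(20.0, 1.0)

print("with norms:    A=%.9f B=%.9f" % (a_with_norms, b_with_norms))
print("without norms: A=%.6f B=%.6f" % (a_without_norms, b_without_norms))
```

Under these assumptions A outranks B only because its fieldNorm is about 40
times larger; with the norm factor fixed at 1.0, the higher phrase frequency
in B decides the order.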

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 19. jan. 2012, at 13:01, remi tassing wrote:

> Hello Jan,
> 
> My schema wasn't changed from the release 3.5.0. The content can be seen
> below:
> 
> <schema name="nutch" version="1.1">
>    <types>
>        <fieldType name="string" class="solr.StrField"
>            sortMissingLast="true" omitNorms="true"/>
>        <fieldType name="long" class="solr.LongField"
>            omitNorms="true"/>
>        <fieldType name="float" class="solr.FloatField"
>            omitNorms="true"/>
>        <fieldType name="text" class="solr.TextField"
>            positionIncrementGap="100">
>            <analyzer>
>                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>                <filter class="solr.StopFilterFactory"
>                    ignoreCase="true" words="stopwords.txt"/>
>                <filter class="solr.WordDelimiterFilterFactory"
>                    generateWordParts="1" generateNumberParts="1"
>                    catenateWords="1" catenateNumbers="1" catenateAll="0"
>                    splitOnCaseChange="1"/>
>                <filter class="solr.LowerCaseFilterFactory"/>
>                <filter class="solr.EnglishPorterFilterFactory"
>                    protected="protwords.txt"/>
>                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>            </analyzer>
>        </fieldType>
>        <fieldType name="url" class="solr.TextField"
>            positionIncrementGap="100">
>            <analyzer>
>                <tokenizer class="solr.StandardTokenizerFactory"/>
>                <filter class="solr.LowerCaseFilterFactory"/>
>                <filter class="solr.WordDelimiterFilterFactory"
>                    generateWordParts="1" generateNumberParts="1"/>
>                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>            </analyzer>
>        </fieldType>
>    </types>
>    <fields>
>        <field name="id" type="string" stored="true" indexed="true"/>
> 
>        <!-- core fields -->
>        <field name="segment" type="string" stored="true" indexed="false"/>
>        <field name="digest" type="string" stored="true" indexed="false"/>
>        <field name="boost" type="float" stored="true" indexed="false"/>
> 
>        <!-- fields for index-basic plugin -->
>        <field name="host" type="url" stored="false" indexed="true"/>
>        <field name="site" type="string" stored="false" indexed="true"/>
>        <field name="url" type="url" stored="true" indexed="true"
>            required="true"/>
>        <field name="content" type="text" stored="false" indexed="true"/>
>        <field name="title" type="text" stored="true" indexed="true"/>
>        <field name="cache" type="string" stored="true" indexed="false"/>
>        <field name="tstamp" type="long" stored="true" indexed="false"/>
> 
>        <!-- fields for index-anchor plugin -->
>        <field name="anchor" type="string" stored="true" indexed="true"
>            multiValued="true"/>
> 
>        <!-- fields for index-more plugin -->
>        <field name="type" type="string" stored="true" indexed="true"
>            multiValued="true"/>
>        <field name="contentLength" type="long" stored="true"
>            indexed="false"/>
>        <field name="lastModified" type="long" stored="true"
>            indexed="false"/>
>        <field name="date" type="string" stored="true" indexed="true"/>
> 
>        <!-- fields for languageidentifier plugin -->
>        <field name="lang" type="string" stored="true" indexed="true"/>
> 
>        <!-- fields for subcollection plugin -->
>        <field name="subcollection" type="string" stored="true"
>            indexed="true" multiValued="true"/>
> 
>        <!-- fields for feed plugin -->
>        <field name="author" type="string" stored="true" indexed="true"/>
>        <field name="tag" type="string" stored="true" indexed="true"/>
>        <field name="feed" type="string" stored="true" indexed="true"/>
>        <field name="publishedDate" type="string" stored="true"
>            indexed="true"/>
>        <field name="updatedDate" type="string" stored="true"
>            indexed="true"/>
>    </fields>
>    <uniqueKey>id</uniqueKey>
>    <defaultSearchField>content</defaultSearchField>
>    <solrQueryParser defaultOperator="OR"/>
> </schema>
> 
> Remi
> 
> On Thu, Jan 19, 2012 at 1:28 PM, Jan Høydahl <[email protected]> wrote:
> 
>> Hi,
>> 
>> Can you paste exactly both <fieldType> and <field> definitions from your
>> schema? omitNorms="true" should kill norms.
>> 
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> Solr Training - www.solrtraining.com
>> 
>> On 19. jan. 2012, at 08:18, remi tassing wrote:
>> 
>>> Hi,
>>> 
>>> Just some background on my setup: I'm crawling with Nutch-1.2, and I tried
>>> both Solr-1.4 and Solr-3.5, with the same result. Solr is still using the
>>> default settings.
>>> 
>>> I found this problem just by accident. I queried "mobile broadband"; page
>>> A has 2 occurrences and scores higher than page B, which has 19
>>> occurrences. I found it weird, and that's why I started investigating.
>>> 
>>> The debug results are given below. You can see that queryWeight, idf
>>> and queryNorm are the same, and tf is higher in B, as expected, but what
>>> makes the difference is clearly fieldNorm.
>>> 
>>> A: 0.010779975 = (MATCH) weight(content:"mobil broadband" in 18730), product of:
>>>      1.0 = queryWeight(content:"mobil broadband"), product of:
>>>        6.2444286 = idf(content: mobil=4922 broadband=2290)
>>>        0.16014275 = queryNorm
>>>      0.010779975 = fieldWeight(content:"mobil broadband" in 18730), product of:
>>>        1.4142135 = tf(phraseFreq=2.0)
>>>        6.2444286 = idf(content: mobil=4922 broadband=2290)
>>>        0.0012207031 = fieldNorm(field=content, doc=18730)
>>> 
>>> B: 8.5223187E-4 = (MATCH) weight(content:"mobil broadband" in 14391), product of:
>>>      1.0 = queryWeight(content:"mobil broadband"), product of:
>>>        6.2444286 = idf(content: mobil=4922 broadband=2290)
>>>        0.16014275 = queryNorm
>>>      8.5223187E-4 = fieldWeight(content:"mobil broadband" in 14391), product of:
>>>        4.472136 = tf(phraseFreq=20.0)
>>>        6.2444286 = idf(content: mobil=4922 broadband=2290)
>>>        3.0517578E-5 = fieldNorm(field=content, doc=14391)
>>> 
>>> Remi
>>> 
>>> On Wed, Jan 18, 2012 at 8:52 PM, Jan Høydahl <[email protected]> wrote:
>>> 
>>>>> I've come across a problem where newly indexed pages almost always come
>>>>> first even when the term frequency is relatively low.
>>>> 
>>>> There is no inherent index-time boost, so this must be something else.
>>>> Can you give us an example of a query? Which query parser do you use?
>>>> 
>>>>> I read the posts below on "fieldNorm" and "omitNorms" but setting
>>>>> "omitNorms=true" doesn't change anything for me on the calculation of
>>>>> fieldNorm.
>>>> 
>>>> Are you sure you have spelled omitNorms="true" correctly, then restarted
>>>> Solr (to refresh config)? The effect of Norms on your score will be that
>>>> shorter fields score higher than long fields.
>>>> 
>>>> Perhaps you can instead tell us your use case. What kind of ranking
>>>> are you trying to achieve? Then we can help suggest how to get there.
>>>> 
>>>> --
>>>> Jan Høydahl, search solution architect
>>>> Cominvent AS - www.cominvent.com
>>>> Solr Training - www.solrtraining.com
>> 
>>
