Re: Index-time Boosting

Tracey Jaquith Tue, 05 Dec 2006 11:41:35 -0800

Hi Yonik!

Yonik Seeley wrote:

On 12/5/06, Tracey Jaquith <[EMAIL PROTECTED]> wrote:

Quick intro.  Server Engineer at Internet Archive.
I just spent a mere 3 days porting nearly our entire site to use your
*wonderful* project!


I, too, am looking for a kind of "boosting".
If I understand your reply here, if I reindex *all* my documents with
   <field name="title" boost="100">i'm super, thanks for asking!</field>
and make sure that any subsequent incremental (re)indexing of documents
uses that same extra ' boost="100" ', then I should be making the relevance
of the title in our documents 100x (or whatever that translates to)
"heavier" than other non-title fields, correct?
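
For illustration, here is a minimal Python sketch of building such an update message so that the same boost attribute is emitted on every (re)index. The helper name build_add_doc and the "identifier" field are assumptions made for the example, not part of Solr or of the schema discussed here; only the <add>/<doc>/<field boost="..."> shape comes from the message above.

```python
import xml.etree.ElementTree as ET


def build_add_doc(fields, boosts=None):
    """Return a Solr XML <add> message for a single document.

    fields: list of (name, value) pairs.
    boosts: optional dict mapping a field name to its index-time boost.
    build_add_doc is a hypothetical helper, not a Solr API.
    """
    boosts = boosts or {}
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields:
        attrs = {"name": name}
        if name in boosts:
            # Emit the boost on every call so full and incremental
            # reindexing always send the same value.
            attrs["boost"] = str(boosts[name])
        field = ET.SubElement(doc, "field", attrs)
        field.text = value
    return ET.tostring(add, encoding="unicode")


# "identifier" is an assumed field name used only for this example.
xml_message = build_add_doc(
    [("identifier", "example-item"),
     ("title", "i'm super, thanks for asking!")],
    boosts={"title": 100},
)
print(xml_message)
```

Keeping the boosts in one dict means the full reindex and any later incremental updates cannot drift apart.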

I know this prolly isn't the relevant place to otherwise gush,
but THANK YOU for this fantastic (and maintained!) code
and we look forward to using this in the near future on our site!
Go opensource!


Welcome aboard!

From a "fresh" user perspective, what was your hardest or most
confusing part of starting to use Solr?

Thanks!  Well, we presently have a (very badly) homegrown version of an
SE that has lucene + jetty under the hood.  It locks up a lot (badly
threaded), hangs on updates, and generally has "persona non grata"
status with developers here, where no one wants to touch it.  So the
*easiest* thing about Solr was the fact that it uses lucene query syntax
(like ours).  The hardest parts were:

1) I tried to make ant run from the included ant.jar (w/o getting the
   latest ant from apache) (and spent an hour or so before getting the
   latest ant)
2) Our SE starts responses with document "1".  Initially (totally my
   oversight from going a little too fast) I just directly "translated"
   that concept, so I was crushed to find a lot of my documents weren't
   coming back like they should.  Once I figured out I needed to make
   "start=0", not "start=1", everything was great.
3) boosts!  I spent just about 2 days porting our entire site over (have
   a nice PHP toggle "define('SOLR', 1);" now in a single place to cut
   over to it); spent only 2 hours (clocktime) to index our site (about
   450K documents).  But now I've spent about 1-1/2 days experimenting
   and not quite getting the boosts right 8-)
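
The off-by-one in point 2) is easy to isolate in a single place when porting. A minimal Python sketch, assuming the old engine numbers results from 1 while Solr's "start" parameter counts from 0; the function name solr_start is hypothetical.

```python
def solr_start(first_result_number):
    """Convert a 1-based result number into Solr's 0-based start offset.

    solr_start is a hypothetical helper name used for illustration.
    """
    if first_result_number < 1:
        raise ValueError("1-based result numbers begin at 1")
    return first_result_number - 1


# The first result of the old scheme maps to start=0,
# and the first result of a second page of 10 maps to start=10.
first_page = solr_start(1)
second_page = solr_start(11)
print(first_page, second_page)
```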

[We are most interested in always having "title", "description", and a
few other fields boosted.  We have both user queries of phrases/words as
well as "field-specific" queries
(eg: "mediatype:movies AND collection:prelinger"),
so my thought is std might be better than dismax.


Yes, for the example above you want the standard request handler
because you are searching for different things in different fields
rather than the same thing in different fields.

However, there are multiple ways of doing everything...
It looks like at least some of your clauses are restrictions rather
than full-text queries, and can be more efficiently modeled as
filters.  Since filters are cached separately, this can lead to a
large increase in performance.

So in either the standard or dismax handlers, you could do
q="foo bar"&fq=mediatype:movies&fq=collection:prelinger

OK, great to know.  I'll prolly stick with our current "pass through" of
our queries in lucene syntax version, and in the future, for speedups,
start moving some of the filters to "&fq="....
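
The q/fq split suggested above can be assembled with ordinary URL encoding. This is a sketch only: the base URL is an assumed local test address and build_select_url is a hypothetical helper; the parameter names q, fq, start and rows are the ones used in this thread or standard Solr request parameters.

```python
from urllib.parse import urlencode


def build_select_url(base_url, query, filters=(), start=0, rows=10):
    """Build a select URL with one q parameter and one fq per restriction.

    build_select_url is a hypothetical helper, not part of Solr.
    """
    params = [("q", query), ("start", start), ("rows", rows)]
    # Each restriction becomes its own fq parameter, so each can be
    # cached separately as a filter.
    params.extend(("fq", f) for f in filters)
    return base_url + "?" + urlencode(params)


# The host, port and path below are assumptions for a local test server.
url = build_select_url(
    "http://localhost:8983/solr/select",
    '"foo bar"',
    filters=["mediatype:movies", "collection:prelinger"],
)
print(url)
```

Passing a list of pairs to urlencode, rather than a dict, is what allows the fq key to repeat.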

I've tried some experiments, adjusting the boosts at index time and
running the std handler to see the ordering of the results change for
"fieldless queries" (eg: "q=tracey+pooh").  I have 33 fields using
<copyField dest="text" source="..."/> (where "text" is our default field
to query) to allow for checking across most of our std XML fields.  I
gather that a boost applied to "title" on indexing a document must
somehow "propagate" to the "text" field?


Background: for an indexed field name there is a single boost value
per document.  This is true even if the field is multi-valued... all
values for that document "share" the same boost.  This is a Lucene
restriction so we can't fix it in Solr in any way.

ok, that's no problem for us -- our main two fields to boost are
"singletons" anyway.  the other two fields we boost can have multiple
values, but most of the time, in practice, they won't matter.  of course,
great to know.

Solr *does* propagate the index-time boost when doing copyField, but
this just ends up being multiplied into all the other boosts for
values for that document.   Matches on the resulting text field will
*always* score higher, regardless of which "part" matched.  Does that
make sense?

OK, that *mostly* is making sense.  Let me see if I'm understanding it
mostly.  I'm thinking (after still thrashing around a bit) that the way that
seems to be getting the results I *expect* (or at least, that we are likely
used to here with our current IA SE) is something like (std req handler):
    &q="commute" title:"commute"^10
where I did no index boosting, and "title" (and other fields) were being
copied into the default-to-search-for-unspecified-query "text" field.

That nicely makes items with "commute" in the title show up higher in
the results than those with commute only in the "text" field.

Were I to switch course and index boost each document with
  <field name="title" boost="10">
I would think the documents would come back in the same order for
   &q="commute"
as the first scheme, because the relevance of the title copied into
"text" boosted the document's relevance.
I could see that other queries, perhaps ones with more complex AND
clauses, could rank results differently in the two schemes above.

My new plan is something like:
for each "clause" we get in a raw search bar query, if it doesn't have a ":"
in it, "expand" it to:

q=text:"commute" title:"commute"^100 description:"commute"^15 collection:"commute"^10 language:"commute"^10

I think I could even then stop copyField-ing title, description,
collection, and language into "text".
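
A minimal Python sketch of that expansion plan, under stated assumptions: the field names and boosts are the ones listed above, the clause splitting is a simple regular expression (quoted phrases and whitespace-separated terms only), and wrapping each expansion in parentheses is a choice made here so that surrounding AND/OR operators still apply to the whole group. The names expand_clause and expand_query are hypothetical.

```python
import re

# Fields and query-time boosts taken from the plan above; None means no boost.
FIELD_BOOSTS = [("text", None), ("title", 100), ("description", 15),
                ("collection", 10), ("language", 10)]
OPERATORS = {"AND", "OR", "NOT"}
# A clause is an optionally field-prefixed quoted phrase, or a run of
# non-whitespace characters. This is a simplification of Lucene syntax.
CLAUSE_RE = re.compile(r'(?:\w+:)?"[^"]*"|\S+')


def expand_clause(clause):
    """Expand a fieldless clause across the boosted fields."""
    if ":" in clause or clause in OPERATORS:
        # Field-specific clauses and boolean operators pass through as-is.
        return clause
    phrase = clause if clause.startswith('"') else '"%s"' % clause
    parts = []
    for field, boost in FIELD_BOOSTS:
        part = "%s:%s" % (field, phrase)
        if boost is not None:
            part += "^%d" % boost
        parts.append(part)
    return "(" + " ".join(parts) + ")"


def expand_query(raw_query):
    """Expand every fieldless clause of a raw search-bar query."""
    return " ".join(expand_clause(c) for c in CLAUSE_RE.findall(raw_query))


expanded = expand_query("commute AND mediatype:movies")
print(expanded)
```

Note the limitation: a quoted phrase that itself contains a ":" is treated as field-specific and left unexpanded.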

Index time boosts can make sense if you want to boost the importance
of certain *documents*.  Query time boosts make more sense when you
want certain fields or certain search terms to count more than others.

So if you want to search across your general text field, while at the
same time boosting the title field, you could do:

q="foo bar" title:"foo bar"^10

Or you could search across all the fields individually, giving them
all different boosts:
q=subject:foo^3 title:foo^10 body:foo

thanks! these two examples were perfect and got me the approach that I
think will work for us!


The dismax handler has a different way of specifying fields to search
across and boosts:
q=foo&qf=subject^3,title^10,body

If you really want index-time boosts, there was a bug fix to
index-time field boosts on 11/3, so make sure you are using a later
version.

something about dismax (for me, or for my mangling of it) with various
attempts didn't seem to always be getting me every result I expected, so
I've mostly "chickened out" of dismax for now 8-)


Now I have one new mystery that's popped up for me.
With std req handler, this simple query
   q=title:commute
is *not* returning me all documents that have the word "commute" in the
title.  There must be some other filter/clause or something happening
that I'm not aware of?

(For example, if I do "indent=on&fl=title&q=commute" in a wget and grep
the results for <title> and then grep -i for commute, there are 23 hits.
But doing "&q=title:commute" only returns one of those hits..)

I can provide the url to our open test server so anyone interested can
look at our config/schema and the query results if need be.

Thanks much!
--tracey
(and as we're about 95% integrated, these will be my most verbose posts)

--Tracey Jaquith - http://www.archive.org/~tracey --
