Re: Relevancy Scoring

John Blythe Mon, 18 May 2015 12:03:13 -0700

Thanks again for the speediness, Doug.

Good to know on some of those things, not least of all the + indicating a
mandatory field and the parentheses. It seems like the escaping is pretty
robust in light of the product number.
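
For reference, the grouping and escaping discussed here can be generated
rather than typed by hand. This is a minimal Python sketch, not a
definitive implementation. The helper names (escape_terms, field_clause)
are made up for illustration, and the set of escaped characters is an
assumption based on the reserved characters of the Lucene query syntax.
The field names and values are the ones used in this thread.

```python
import re

# Characters with reserved meaning in the Lucene/Solr query syntax (assumed set).
# Whitespace is left alone on purpose so terms stay separate inside the group.
_SPECIAL = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/])')


def escape_terms(text):
    """Backslash-escape reserved characters in raw text."""
    return _SPECIAL.sub(r'\\\1', text)


def field_clause(field, text, required=False):
    """Attach every term in `text` to `field` by wrapping the terms in parens."""
    prefix = "+" if required else ""
    return f"{prefix}{field}:({escape_terms(text)})"


query = " ".join([
    field_clause("mfgname2", "Ben & Jerry's"),
    field_clause("descript1", "Strawberry Shortcake Ice Cream 1.5pt", required=True),
    field_clause("productnumber", "001-029-1298", required=True),
])
print(query)
```

The product number comes out as 001\-029\-1298, so the hyphens are no
longer read as NOT. A lone "&" is escaped as well, which is harmless.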


I'm thinking it has to be largely related to the analyzer. Check this out,
this time with more of a real world case for us. Searching for "descript2:
CANN SCREW PT 3.5X50MM" produces a top result that has "Cannulated screw PT
4.0x40mm" as its description. There is a document, though, that has the
description "Cannulated screw PT 3.5x50mm", which (apart from case) is
exactly what the analyzer produces for my query, per the /analysis page.
Why would 4.0x40 come up first? The top four results have 4.0x[Something].
It's not till the fifth result that you see a 3.5 something: "Cannulated
screw PT 3.5x105mm", at which point I'm saying WTF. So close, but then it
ignores the "50" for a "105" instead.

Further, adding parentheses around the phrase, as in "descript2: (CANN SCREW
PT 3.5X50MM)", produces top results that have the correct dimensions
(3.5x50) but the wrong type. Instead of "cannulated" screws we see
"cortical." I'm convinced Solr is trolling me at this point :p
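
One way to see why a result ranks where it does is to ask Solr how it
parsed the query and how it scored each hit. Below is a minimal Python
sketch under stated assumptions: the host, port, and core name
("products") are hypothetical, and debug_url is an illustrative helper.
debugQuery=true is the standard request parameter that adds the parsed
query and per-document score explanations to the response. The sketch
only builds the request URL and does not contact a server.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def debug_url(select_url, query, rows=5):
    """Build a /select URL that asks Solr for scores and debug output."""
    params = {
        "q": query,
        "fl": "*,score",       # return stored fields plus the relevancy score
        "rows": rows,
        "wt": "json",
        "debugQuery": "true",  # adds the parsed query and score explanations
    }
    return f"{select_url}?{urlencode(params)}"


# Hypothetical local Solr core; adjust to the real host and core name.
url = debug_url(
    "http://localhost:8983/solr/products/select",
    "descript2:(CANN SCREW PT 3.5X50MM)",
)
print(url)

# Round-trip the query string to confirm the parameters survived encoding.
params = parse_qs(urlsplit(url).query)
print(params["q"][0])
```

Comparing the parsed query for the grouped and ungrouped forms should
show which field each term actually ended up searching.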

-- 
*John Blythe*
Product Manager & Lead Developer

251.605.3071 | j...@curvolabs.com
www.curvolabs.com

58 Adams Ave
Evansville, IN 47713

On Mon, May 18, 2015 at 2:34 PM, Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> You might just need some syntax help. Not sure what the Solr admin escapes,
> but much of the text in your query actually has reserved meaning. Also,
> when a term appears without a fieldName: prefix directly in front of it, I
> believe it's going to search the default field (it's no longer attached to
> the field). You need to use parens to attach multiple terms to that field
> for search.
>
> I'd try to see if any of the following helps:
>
> Add parens to group terms to the field:
>
> mfgname2:(Ben & Jerry's) +descript1:(Strawberry Shortcake Ice Cream 1.5pt) +productnumber:(001-029-1298)
>
> Also keep in mind "+" means mandatory, and it's an operator on just one
> field. So in the above you're requiring description and product number
> match the provided terms.
>
> Further, you may need to escape the "-" as that means "NOT". You can do
> that with the following:
> mfgname2:(Ben & Jerry's) +descript1:(Strawberry Shortcake Ice Cream 1.5pt) +productnumber:(001\-029\-1298)
>
> You can read more in the article on Solr query syntax
> https://wiki.apache.org/solr/SolrQuerySyntax
>
> Hope that helps, for all I know your cut and paste didn't work and I'm
> assuming you have syntax issues :)
>
> -Doug
>
> On Mon, May 18, 2015 at 2:25 PM, John Blythe <j...@curvolabs.com> wrote:
>
> > Hey Doug,
> >
> > Thanks for the quick reply.
> >
> > No edismax just yet. Planning on getting there, but have been trying to
> > fine tune the 3 primary fields we use over the last week or so before
> > jumping into edismax and its nifty toolset to help push our accuracy and
> > precision even further (aside: is this a good strategy?)
> >
> > For now I'm querying directly in the admin interface, doing something like
> > this:
> > mfgname2: Ben & Jerry's + descript1: Strawberry Shortcake Ice Cream 1.5pt + productnumber: 001-029-1298
> >
> > versus
> > mfgname2: Ben & Jerry's + descript1: Strawberry Shortcake Ice Cream 1.5pt
> >
> > Another interesting and likely related factor is the description's lack of
> > help. With the product number in place it gets nailed even with stray
> > zeros, 4's instead of 1's, etc.
> >
> > Without it, though, the querying just flat out sucks. For instance, I just
> > saw something akin to this:
> > mfgname2: Ben & Jerry's + descript1: Straw Shortcake Ice Cream 1.5pt
> >
> > that got nowhere near what it should have. Straw would have a synonym to
> > map to strawberry and would match the document's description *exactly*, yet
> > Solr would push out all sorts of peripheral suggestions that didn't match
> > strawberry or were a different amount (.75pt, for instance). I know I'm no
> > expert, but I was thinking my analyzer was a bit better than that :p
> >
> >
> > On Mon, May 18, 2015 at 2:18 PM, Doug Turnbull <
> > dturnb...@opensourceconnections.com> wrote:
> >
> > > > The maxScore is 772 when I remove the description.
> > > > I suppose the actual question, then, is if a low relevancy score on one
> > > > field hurts the rest of them / the cumulative score,
> > >
> > > This depends a lot on how you're searching over these fields. Is this a
> > > (e)dismax query? Or a lucene query? Something else?
> > >
> > > Across fields there's query normalization, which attempts to take a sum of
> > > squares of IDFs of the search terms across the fields being searched.
> > > Adding/removing a field could impact query normalization.
> > >
> > > By removing a field, you also likely remove a boolean clause. By removing
> > > the clause, there's less of a chance the coordination factor (known as
> > > coord) would punish your relevancy score.
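
The two effects above can be put into numbers. This is a rough Python
sketch, assuming Lucene's classic TF-IDF similarity and leaving out term
and query boosts. The IDF values are hypothetical and the function names
are illustrative. In that model coord is the fraction of boolean clauses
a document matched, and queryNorm is 1 / sqrt(sum of squared IDFs).

```python
import math


def coord(matched_clauses, total_clauses):
    """Fraction of the query's boolean clauses that a document matched."""
    return matched_clauses / total_clauses


def query_norm(idfs):
    """1 / sqrt(sum of squared IDFs); term and query boosts are omitted."""
    return 1.0 / math.sqrt(sum(idf * idf for idf in idfs))


# Hypothetical IDF values, one term per field.
idf_mfg, idf_num, idf_desc = 2.0, 6.0, 3.0

three_field = query_norm([idf_mfg, idf_num, idf_desc])  # 1 / sqrt(49)
two_field = query_norm([idf_mfg, idf_num])              # 1 / sqrt(40)

print("coord, 2 of 3 clauses matched:", coord(2, 3))
print("coord, 2 of 2 clauses matched:", coord(2, 2))
print("queryNorm with description:   ", three_field)
print("queryNorm without description:", two_field)
```

Dropping the description clause raises both factors, so a higher
maxScore without it does not by itself mean a better match. queryNorm is
the same for every document within one query, so it changes the raw
numbers but not the ranking.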
> > >
> > > Otherwise, don't know -- perhaps you could give us more information on how
> > > you're searching your documents? Perhaps a sample Solr URL that shows how
> > > you're querying?
> > >
> > > Cheers,
> > > --
> > > *Doug Turnbull* | Search Relevance Consultant | OpenSource Connections,
> > > LLC | 240.476.9983 | http://www.opensourceconnections.com
> > > Author: Relevant Search <http://manning.com/turnbull> from Manning
> > > Publications
> > > This e-mail and all contents, including attachments, is considered to be
> > > Company Confidential unless explicitly stated otherwise, regardless
> > > of whether attachments are marked as such.
> > > On Mon, May 18, 2015 at 1:57 PM, John Blythe <j...@curvolabs.com> wrote:
> > >
> > > > Background:
> > > > I'm using Solr as a mechanism for search for users, but before even
> > > > getting to that point as a means of intelligent inference more or less.
> > > > Product data comes in and we're hoping to match it to the correct known
> > > > product without having to use the user for confirmation/search.
> > > >
> > > > Problem:
> > > > I get a maxScore (with the correct result at the top) of 618.22626 using
> > > > the manufacturer's name, the product number, and the product description.
> > > > All of these items are coming from a previous purchaser so we have to
> > > > account for manufacturer name variations, miskeying of product numbers,
> > > > and variances of descriptions. The maxScore is 772 when I remove the
> > > > description.
> > > >
> > > > My initial question is regarding relevancy scoring
> > > > (https://wiki.apache.org/solr/SolrRelevancyFAQ). I get that many of the
> > > > description's tokens will be found throughout the other documents, thus
> > > > keeping the relevancy at bay per the IDF portion of the relevancy score.
> > > > I suppose the actual question, then, is if a low relevancy score on one
> > > > field hurts the rest of them / the cumulative score, or if it simply
> > > > keeps that field's contribution lower than it'd otherwise be. I thought
> > > > it was the latter, but the results I mention above are making me think
> > > > that the first scenario is actually the case.
> > > >
> > > > Based on what I hear about the above, a follow up question may be what in
> > > > the world is wrong with my analyzer :)
> > > >
> > > > Thanks for any thoughts!
> > > >
> > > > Best,
> > > > John
> > > >
> > >
> >
