Re: Relevancy Scoring

John Blythe Mon, 18 May 2015 13:59:35 -0700

Doug,

A couple things quickly:
- I'll look into that. How would you go about testing things, with a direct URL?
If so, how would you compose one of the examples above?
- yup, I used it extensively before testing scores to ensure that I was
getting things parsed appropriately (segmenting off the unit of measure
[mm] whilst still maintaining the decimal instead of breaking it up was my
largest concern as of late)
- to that point, though, it looks like one of my blunders was in the
synonyms file. i just referenced /analysis/ again and realized "CANN" was
being mapped to "cannula" instead of "cannulated" #facepalm
- i'll be GLAD to use that! i'd been trying to use http://explain.solr.pl/
previously but it kept error'ing out on me :\
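For reference, a minimal sketch of how such a direct URL could be composed, following the advice quoted below to group terms per field with parentheses and to escape reserved characters. The base URL and core name (http://localhost:8983/solr/products) are placeholders that do not come from this thread, and the helper names are hypothetical. The debugQuery=true parameter asks Solr to include the score explanation in the response.

```python
from urllib.parse import urlencode, quote

# Characters with reserved meaning in the Lucene/Solr standard query syntax.
SPECIAL = set('+-&|!(){}[]^"~*?:\\/')


def escape_term(text):
    """Backslash-escape query-parser special characters in raw text."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in text)


def field_clause(field, text, required=False):
    """Attach every term to one field: field:(term1 term2 ...)."""
    prefix = "+" if required else ""
    return f"{prefix}{field}:({escape_term(text)})"


def select_url(base, clauses):
    """Build a /select URL; quote_via=quote encodes spaces as %20."""
    params = {"q": " ".join(clauses), "debugQuery": "true", "wt": "json"}
    return base + "/select?" + urlencode(params, quote_via=quote)


if __name__ == "__main__":
    # Hypothetical host and core name; substitute your own.
    base = "http://localhost:8983/solr/products"
    clauses = [
        field_clause("descript2", "CANN SCREW PT 3.5X50MM"),
        field_clause("productnumber", "001-029-1298", required=True),
    ]
    print(select_url(base, clauses))
```

Pasting the printed URL into a browser (or into Splainer) sidesteps any doubt about what the admin UI escapes, since the encoding is done explicitly here.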


thanks again, will report back!

-- 
*John Blythe*
Product Manager & Lead Developer

251.605.3071 | j...@curvolabs.com
www.curvolabs.com

58 Adams Ave
Evansville, IN 47713

On Mon, May 18, 2015 at 4:47 PM, Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> Hey John,
>
> I think you likely do need to think about escaping the query operators. I
> doubt the Solr admin could tell the difference.
>
> For analysis, have you looked at the handy analysis tool in the Solr Admin
> UI? It's pretty indispensable for figuring out if an analyzed query matches
> an analyzed field.
>
> Outside of that, I can selfishly plug Splainer (http://splainer.io) that
> gives you more insight into the Solr relevance explain. You would paste in
> something like
> http://solr.quepid.com/solr/statedecoded/select?q=text:(deer%20hunting).
>
> Cheers!
> -Doug
>
> On Mon, May 18, 2015 at 3:02 PM, John Blythe <j...@curvolabs.com> wrote:
>
> > Thanks again for the speediness, Doug.
> >
> > Good to know on some of those things, not least of all the + indicating a
> > mandatory field and the parentheses. It seems like the escaping is pretty
> > robust in light of the product number.
> >
> > I'm thinking it has to be largely related to the analyzer. Check this out,
> > this time with more of a real-world case for us. Searching for
> > "descript2: CANN SCREW PT 3.5X50MM" produces a top result that has
> > "Cannulated screw PT 4.0x40mm" as its description. There is a document,
> > though, that has the description "Cannulated screw PT 3.5x50mm", which is
> > exactly what the analyzer produces for the query (apart from case, per the
> > /analysis page). Why would 4.0x40 come up first? The top four results have
> > 4.0x[Something]. It's not till the fifth result that you see a 3.5
> > something: "Cannulated screw PT 3.5x105mm" at which point I'm saying WTF.
> > So close, but then it ignores the "50" for a "105" instead.
> >
> > Further, adding parentheses around the phrase, as in
> > "descript2: (CANN SCREW PT 3.5X50MM)", produces top results that have the
> > correct dimensions (3.5x50) but the wrong type. Instead of "cannulated"
> > screws we see "cortical." I'm convinced Solr is trolling me at this point :p
> >
> >
> > On Mon, May 18, 2015 at 2:34 PM, Doug Turnbull <
> > dturnb...@opensourceconnections.com> wrote:
> >
> > > You might just need some syntax help. Not sure what the Solr admin escapes,
> > > but many of the characters in your query actually have reserved meaning.
> > > Also, when a term appears without a fieldName: directly in front of it, I
> > > believe it's going to search the default field (it's no longer attached to
> > > the field). You need to use parens to attach multiple terms to that field
> > > for search.
> > >
> > > I'd try to see if doing any of the following helps:
> > >
> > > Add parens to group terms to the field:
> > >
> > > mfgname2:(Ben & Jerry's) +descript1:(Strawberry Shortcake Ice Cream 1.5pt) +productnumber:(001-029-1298)
> > >
> > > Also keep in mind "+" means mandatory, and it's an operator on just one
> > > field. So in the above you're requiring description and product number
> > > match the provided terms.
> > >
> > > Further, you may need to escape the "-" as that means "NOT". You can do
> > > that with the following:
> > > mfgname2:(Ben & Jerry's) +descript1:(Strawberry Shortcake Ice Cream 1.5pt) +productnumber:(001\-029\-1298)
> > >
> > > You can read more in the article on Solr query syntax
> > > https://wiki.apache.org/solr/SolrQuerySyntax
> > >
> > > Hope that helps, for all I know your cut and paste didn't work and I'm
> > > assuming you have syntax issues :)
> > >
> > > -Doug
> > >
> > > On Mon, May 18, 2015 at 2:25 PM, John Blythe <j...@curvolabs.com> wrote:
> > >
> > > > Hey Doug,
> > > >
> > > > Thanks for the quick reply.
> > > >
> > > > No edismax just yet. Planning on getting there, but have been trying to
> > > > fine tune the 3 primary fields we use over the last week or so before
> > > > jumping into edismax and its nifty toolset to help push our accuracy and
> > > > precision even further (aside: is this a good strategy?)
> > > >
> > > > For now I'm querying directly in the admin interface, doing something like
> > > > this:
> > > > mfgname2: Ben & Jerry's + descript1: Strawberry Shortcake Ice Cream 1.5pt + productnumber: 001-029-1298
> > > >
> > > > versus
> > > > mfgname2: Ben & Jerry's + descript1: Strawberry Shortcake Ice Cream 1.5pt
> > > >
> > > > Another interesting and likely related factor is the description's lack of
> > > > help. With the product number in place it gets nailed even with stray
> > > > zeros, 4's instead of 1's, etc.
> > > >
> > > > Without it, though, the querying just flat out sucks. For instance, I just
> > > > saw something akin to this:
> > > > mfgname2: Ben & Jerry's + descript1: Straw Shortcake Ice Cream 1.5pt
> > > >
> > > > that got nowhere near what it should have. Straw would have a synonym to
> > > > map to strawberry and would match the document's description *exactly*, yet
> > > > Solr would push out all sorts of peripheral suggestions that didn't match
> > > > strawberry or were a different amount (.75pt, for instance). I know I'm no
> > > > expert, but I was thinking my analyzer was a bit better than that :p
> > > >
> > > >
> > > > On Mon, May 18, 2015 at 2:18 PM, Doug Turnbull <
> > > > dturnb...@opensourceconnections.com> wrote:
> > > >
> > > > > > The maxScore is 772 when I remove the description.
> > > > > > I suppose the actual question, then, is if a low relevancy score on one
> > > > > > field hurts the rest of them / the cumulative score,
> > > > >
> > > > > This depends a lot on how you're searching over these fields. Is this a
> > > > > (e)dismax query? Or a lucene query? Something else?
> > > > >
> > > > > Across fields there's query normalization, which attempts to take a sum of
> > > > > squares of IDFs of the search terms across the fields being searched.
> > > > > Adding/removing a field could impact query normalization.
> > > > >
> > > > > By removing a field, you also likely remove a boolean clause. By removing
> > > > > the clause, there's less of a chance the coordination factor (known as
> > > > > coord) would punish your relevancy score.
> > > > >
> > > > > Otherwise, don't know -- perhaps you could give us more information on how
> > > > > you're searching your documents? Perhaps a sample Solr URL that shows how
> > > > > you're querying?
> > > > >
> > > > > Cheers,
> > > > > --
> > > > > *Doug Turnbull* | Search Relevance Consultant | OpenSource Connections,
> > > > > LLC | 240.476.9983 | http://www.opensourceconnections.com
> > > > > Author: Relevant Search <http://manning.com/turnbull> from Manning
> > > > > Publications
> > > > > This e-mail and all contents, including attachments, is considered to be
> > > > > Company Confidential unless explicitly stated otherwise, regardless
> > > > > of whether attachments are marked as such.
> > > > >
> > > > > On Mon, May 18, 2015 at 1:57 PM, John Blythe <j...@curvolabs.com> wrote:
> > > > >
> > > > > > Background:
> > > > > > I'm using Solr as a mechanism for search for users, but before even
> > > > > > getting to that point as a means of intelligent inference more or less.
> > > > > > Product data comes in and we're hoping to match it to the correct known
> > > > > > product without having to use the user for confirmation/search.
> > > > > >
> > > > > > Problem:
> > > > > > I get a maxScore (with the correct result at the top) of 618.22626 using
> > > > > > the manufacturer's name, the product number, and the product description.
> > > > > > All of these items are coming from a previous purchaser so we have to
> > > > > > account for manufacturer name variations, miskeying of product numbers,
> > > > > > and variances of descriptions. The maxScore is 772 when I remove the
> > > > > > description.
> > > > > >
> > > > > > My initial question is regarding relevancy scoring
> > > > > > (https://wiki.apache.org/solr/SolrRelevancyFAQ). I get that many of the
> > > > > > description's tokens will be found throughout the other documents, thus
> > > > > > keeping the relevancy at bay per the IDF portion of the relevancy score.
> > > > > > I suppose the actual question, then, is if a low relevancy score on one
> > > > > > field hurts the rest of them / the cumulative score, or if it simply
> > > > > > keeps that field's contribution lower than it'd otherwise be. I thought
> > > > > > it was the latter, but the results I mention above are making me think
> > > > > > that the first scenario is actually the case.
> > > > > >
> > > > > > Based on what I hear about the above, a follow up question may be what in
> > > > > > the world is wrong with my analyzer :)
> > > > > >
> > > > > > Thanks for any thoughts!
> > > > > >
> > > > > > Best,
> > > > > > John
> > > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > >
> >
>
>
>
>
