Awesome, following it now!

--
*John Blythe*
Product Manager & Lead Developer
251.605.3071 | j...@curvolabs.com
www.curvolabs.com

58 Adams Ave
Evansville, IN 47713

On Mon, May 18, 2015 at 8:21 PM, Doug Turnbull <dturnb...@opensourceconnections.com> wrote:

> Glad you figured things out and found splainer useful! Pull requests, bugs,
> feature requests welcome!
>
> https://github.com/o19s/splainer
>
> Doug
>
> On Monday, May 18, 2015, John Blythe <j...@curvolabs.com> wrote:
>
> > Doug,
> >
> > very very cool tool you've made there. thanks so much for sharing!
> >
> > i ended up removing the shinglefilterfactory and voila! things are back in
> > good, working order with some great matching. i'm not 100% certain as to
> > why shingling was so ineffective. i'm guessing the stacked terms created
> > lower relevancy due to IDF on the *joint* terms/token?
> >
> > On Mon, May 18, 2015 at 4:57 PM, John Blythe <j...@curvolabs.com> wrote:
> >
> > > Doug,
> > >
> > > A couple things quickly:
> > > - I'll check in to that. How would you go about testing things, direct
> > >   URL? If so, how would you compose one of the examples above?
> > > - yup, I used it extensively before testing scores to ensure that I was
> > >   getting things parsed appropriately (segmenting off the unit of measure
> > >   [mm] whilst still maintaining the decimal instead of breaking it up was
> > >   my largest concern as of late)
> > > - to that point, though, it looks like one of my blunders was in the
> > >   synonyms file. i just referenced /analysis/ again and realized "CANN"
> > >   was being transposed to "cannula" instead of "cannulated" #facepalm
> > > - i'll be GLAD to use that! i'd been trying to use http://explain.solr.pl/
> > >   previously but it kept error'ing out on me :\
> > >
> > > thanks again, will report back!
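One way to approach the "direct URL" question above is to build the select URL in code so that the query operators get URL-encoded. This is only a minimal sketch and not something from the thread: the host, the core name `products`, and the helper `solr_select_url` are hypothetical, while `q`, `wt`, and `debugQuery` are standard Solr request parameters.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, core, query, debug=True):
    """Build a /select URL with the query string safely URL-encoded.

    base_url and core are placeholders for illustration only.
    """
    params = {"q": query, "wt": "json"}
    if debug:
        # debugQuery=true asks Solr to include the relevance explain output.
        params["debugQuery"] = "true"
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


# Hypothetical host and core; the query is the descript2 example from the thread.
url = solr_select_url(
    "http://localhost:8983/solr",
    "products",
    "descript2:(CANN SCREW PT 3.5X50MM)",
)
print(url)
```

A URL of this shape should be suitable for pasting into a browser or an explain viewer, assuming the placeholders are swapped for a real host and core.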
> > >
> > > On Mon, May 18, 2015 at 4:47 PM, Doug Turnbull <dturnb...@opensourceconnections.com> wrote:
> > >
> > >> Hey John,
> > >>
> > >> I think you likely do need to think about escaping the query operators. I
> > >> doubt the Solr admin could tell the difference.
> > >>
> > >> For analysis, have you looked at the handy analysis tool in the Solr Admin
> > >> UI? It's pretty indispensable for figuring out if an analyzed query matches
> > >> an analyzed field.
> > >>
> > >> Outside of that, I can selfishly plug Splainer (http://splainer.io), which
> > >> gives you more insight into the Solr relevance explain. You would paste in
> > >> something like
> > >> http://solr.quepid.com/solr/statedecoded/select?q=text:(deer%20hunting)
> > >>
> > >> Cheers!
> > >> -Doug
> > >>
> > >> On Mon, May 18, 2015 at 3:02 PM, John Blythe <j...@curvolabs.com> wrote:
> > >>
> > >> > Thanks again for the speediness, Doug.
> > >> >
> > >> > Good to know on some of those things, not least of all the + indicating a
> > >> > mandatory field and the parentheses. It seems like the escaping is pretty
> > >> > robust in light of the product number.
> > >> >
> > >> > I'm thinking it has to be largely related to the analyzer. Check this out,
> > >> > this time with more of a real world case for us. Searching for
> > >> > "descript2: CANN SCREW PT 3.5X50MM" produces a top result that has
> > >> > "Cannulated screw PT 4.0x40mm" as its description. There is a document,
> > >> > though, that has the description "Cannulated screw PT 3.5x50mm", the exact
> > >> > same rendering (minus lowercasing) that the analyzer is producing (per the
> > >> > /analysis page).
> > >> > Why would 4.0x40 come up first? The top four results have
> > >> > 4.0x[Something]. It's not till the fifth result that you see a 3.5
> > >> > something: "Cannulated screw PT 3.5x105mm" at which point I'm saying WTF.
> > >> > So close, but then it ignores the "50" for a "105" instead.
> > >> >
> > >> > Further, adding parentheses around the phrase, "descript2: (CANN SCREW PT
> > >> > 3.5X50MM)", produces top results that have the correct dimensions (3.5x50)
> > >> > but the wrong type. Instead of "cannulated" screws we see "cortical." I'm
> > >> > convinced Solr is trolling me at this point :p
> > >> >
> > >> > On Mon, May 18, 2015 at 2:34 PM, Doug Turnbull <dturnb...@opensourceconnections.com> wrote:
> > >> >
> > >> > > You might just need some syntax help. Not sure what the Solr admin
> > >> > > escapes, but much of the text in your query actually has reserved
> > >> > > meaning. Also, when a term appears without a fieldName:value directly in
> > >> > > front of it, I believe it's going to search the default field (it's no
> > >> > > longer attached to the field). You need to use parens to attach multiple
> > >> > > terms to that field for search.
> > >> > >
> > >> > > I'd try to see if doing any of the following help:
> > >> > >
> > >> > > Add parens to group terms to the field:
> > >> > >
> > >> > > mfgname2:(Ben & Jerry's) +descript1:(Strawberry Shortcake Ice Cream 1.5pt) + productnumber:(001-029-1298)
> > >> > >
> > >> > > Also keep in mind "+" means mandatory, and it's an operator on just one
> > >> > > field.
> > >> > > So in the above you're requiring description and product number
> > >> > > match the provided terms.
> > >> > >
> > >> > > Further, you may need to escape the "-" as that means "NOT". You can do
> > >> > > that with the following:
> > >> > >
> > >> > > mfgname2:(Ben & Jerry's) +descript1:(Strawberry Shortcake Ice Cream 1.5pt) + productnumber:(001\-029\-1298)
> > >> > >
> > >> > > You can read more in the article on Solr query syntax
> > >> > > https://wiki.apache.org/solr/SolrQuerySyntax
> > >> > >
> > >> > > Hope that helps, for all I know your cut and paste didn't work and I'm
> > >> > > assuming you have syntax issues :)
> > >> > >
> > >> > > -Doug
> > >> > >
> > >> > > On Mon, May 18, 2015 at 2:25 PM, John Blythe <j...@curvolabs.com> wrote:
> > >> > >
> > >> > > > Hey Doug,
> > >> > > >
> > >> > > > Thanks for the quick reply.
> > >> > > >
> > >> > > > No edismax just yet. Planning on getting there, but have been trying to
> > >> > > > fine tune the 3 primary fields we use over the last week or so before
> > >> > > > jumping into edismax and its nifty toolset to help push our accuracy and
> > >> > > > precision even further (aside: is this a good strategy?)
> > >> > > >
> > >> > > > For now I'm querying directly in the admin interface, doing something
> > >> > > > like this:
> > >> > > > mfgname2: Ben & Jerry's + descript1: Strawberry Shortcake Ice Cream 1.5pt + productnumber: 001-029-1298
> > >> > > >
> > >> > > > versus
> > >> > > > mfgname2: Ben & Jerry's + descript1: Strawberry Shortcake Ice Cream 1.5pt
> > >> > > >
> > >> > > > Another interesting and likely related factor is the description's lack
> > >> > > > of help. With the product number in place it gets nailed even with stray
> > >> > > > zeros, 4's instead of 1's, etc.
> > >> > > >
> > >> > > > Without it, though, the querying just flat out sucks. For instance, I
> > >> > > > just saw something akin to this:
> > >> > > > mfgname2: Ben & Jerry's + descript1: Straw Shortcake Ice Cream 1.5pt
> > >> > > >
> > >> > > > that got nowhere near what it should have. Straw would have a synonym to
> > >> > > > map to strawberry and would match the document's description *exactly*,
> > >> > > > yet Solr would push out all sorts of peripheral suggestions that didn't
> > >> > > > match strawberry or were a different amount (.75pt, for instance). I know
> > >> > > > I'm no expert, but I was thinking my analyzer was a bit better than
> > >> > > > that :p
> > >> > > >
> > >> > > > On Mon, May 18, 2015 at 2:18 PM, Doug Turnbull <dturnb...@opensourceconnections.com> wrote:
> > >> > > >
> > >> > > > > > The maxScore is 772 when I remove the description.
> > >> > > > > > I suppose the actual question, then, is if a low relevancy score on
> > >> > > > > > one field hurts the rest of them / the cumulative score,
> > >> > > > >
> > >> > > > > This depends a lot on how you're searching over these fields. Is this a
> > >> > > > > (e)dismax query? Or a lucene query? Something else?
> > >> > > > >
> > >> > > > > Across fields there's query normalization, which attempts to take a sum
> > >> > > > > of squares of IDFs of the search terms across the fields being searched.
> > >> > > > > Adding/removing a field could impact query normalization.
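To make the query normalization point above (and the coord factor that comes up in the same reply) a bit more concrete, here is a rough sketch based on my understanding of Lucene's classic TF-IDF similarity, with boosts left out. The document counts and frequencies are made up for illustration, and the function names are my own, not Solr's.

```python
import math


def idf(num_docs, doc_freq):
    """Classic Lucene-style idf: rarer terms get larger weights."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def query_norm(term_weights):
    """1 / sqrt(sum of squared term weights), ignoring boosts."""
    sum_of_squares = sum(w * w for w in term_weights)
    return 1.0 / math.sqrt(sum_of_squares)


def coord(matched_clauses, total_clauses):
    """Fraction of the query's clauses that a document matched."""
    return matched_clauses / total_clauses


NUM_DOCS = 100_000  # made-up index size

# Made-up document frequencies: rare name/number terms, common description terms.
name_and_number = [idf(NUM_DOCS, 40), idf(NUM_DOCS, 1)]
description = [idf(NUM_DOCS, df) for df in (5_000, 12_000, 30_000)]

norm_without = query_norm(name_and_number)
norm_with = query_norm(name_and_number + description)

# Adding the description terms grows the sum of squares, so the norm shrinks.
print(norm_without, norm_with)

# A document matching 2 of 3 clauses is scaled down by coord.
print(coord(2, 3))
```

If this model is right, the norm is the same for every document within a single query, so it should not reorder results, but it does mean raw scores such as maxScore are hard to compare between a query with the description clause and one without it.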
> > >> > > > >
> > >> > > > > By removing a field, you also likely remove a boolean clause. By
> > >> > > > > removing the clause, there's less of a chance the coordinating factor
> > >> > > > > (known as coord) would punish your relevancy score.
> > >> > > > >
> > >> > > > > Otherwise, don't know -- perhaps you could give us more information on
> > >> > > > > how you're searching your documents? Perhaps a sample Solr URL that
> > >> > > > > shows how you're querying?
> > >> > > > >
> > >> > > > > Cheers,
> > >> > > > > --
> > >> > > > > *Doug Turnbull* | Search Relevance Consultant | OpenSource Connections,
> > >> > > > > LLC | 240.476.9983 | http://www.opensourceconnections.com
> > >> > > > > Author: Relevant Search <http://manning.com/turnbull> from Manning
> > >> > > > > Publications
> > >> > > > > This e-mail and all contents, including attachments, is considered to be
> > >> > > > > Company Confidential unless explicitly stated otherwise, regardless
> > >> > > > > of whether attachments are marked as such.
> > >> > > > >
> > >> > > > > On Mon, May 18, 2015 at 1:57 PM, John Blythe <j...@curvolabs.com> wrote:
> > >> > > > >
> > >> > > > > > Background:
> > >> > > > > > I'm using Solr as a mechanism for search for users, but before even
> > >> > > > > > getting to that point as a means of intelligent inference more or less.
> > >> > > > > > Product data comes in and we're hoping to match it to the correct known
> > >> > > > > > product without having to use the user for confirmation/search.
> > >> > > > > >
> > >> > > > > > Problem:
> > >> > > > > > I get a maxScore (with the correct result at the top) of 618.22626 using
> > >> > > > > > the manufacturer's name, the product number, and the product
> > >> > > > > > description.
> > >> > > > > > All of these items are coming from a previous purchaser so we have to
> > >> > > > > > account for manufacturer name variations, miskeying of product numbers,
> > >> > > > > > and variances of descriptions. The maxScore is 772 when I remove the
> > >> > > > > > description.
> > >> > > > > >
> > >> > > > > > My initial question is regarding relevancy scoring
> > >> > > > > > (https://wiki.apache.org/solr/SolrRelevancyFAQ). I get that many of the
> > >> > > > > > description's tokens will be found throughout the other documents, thus
> > >> > > > > > keeping the relevancy at bay per the IDF portion of the relevancy score.
> > >> > > > > > I suppose the actual question, then, is if a low relevancy score on one
> > >> > > > > > field hurts the rest of them / the cumulative score, or if it simply
> > >> > > > > > keeps that field's contribution lower than it'd otherwise be. I thought
> > >> > > > > > it was the latter, but the results I mention above are making me think
> > >> > > > > > that the first scenario is actually the case.
> > >> > > > > >
> > >> > > > > > Based on what I hear about the above, a follow up question may be what
> > >> > > > > > in the world is wrong with my analyzer :)
> > >> > > > > >
> > >> > > > > > Thanks for any thoughts!
> > >> > > > > >
> > >> > > > > > Best,
> > >> > > > > > John
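The grouping and escaping advice from earlier in the thread (parens to attach several terms to one field, "+" for a required clause, a backslash before "-") can be sketched in a few lines. This is a hedged illustration only: `escape_term` and `field_clause` are hypothetical helpers, and the character list is my reading of the Lucene/Solr standard query parser's reserved characters, so it is worth checking against the query syntax page linked above.

```python
import re

# Characters (and the two-character operators && and ||) that I understand to be
# reserved in the standard Lucene/Solr query parser. Treat this list as an assumption.
_SPECIAL = re.compile(r'(&&|\|\||[+\-!(){}\[\]^"~*?:\\/])')


def escape_term(text):
    """Put a backslash in front of each reserved character or operator."""
    return _SPECIAL.sub(r"\\\1", text)


def field_clause(field, text, required=False):
    """Group all terms under one field with parens; '+' marks the clause required."""
    prefix = "+" if required else ""
    return f"{prefix}{field}:({escape_term(text)})"


# Field names and values are the ones used in the thread.
query = " ".join(
    [
        field_clause("mfgname2", "Ben & Jerry's"),
        field_clause("descript1", "Strawberry Shortcake Ice Cream 1.5pt", required=True),
        field_clause("productnumber", "001-029-1298", required=True),
    ]
)
print(query)
```

Whitespace is deliberately left unescaped so that each word stays a separate term inside the parens; escaping the spaces would change how the field's analyzer sees the input.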