For a relevancy problem I would repeat what my colleagues already pointed out: data is key. We need to understand our data first, before we can understand what is relevant and what is not. The first step is to establish a baseline that makes sense; your basic approach, plus the proper schema configuration already suggested and a properly configured request handler, seems a good start to me.
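As a minimal sketch of such a starting point (not a definitive setup), the Python below builds the request parameters for two variants you could compare side by side. It assumes the edismax query parser and a hypothetical collection called "products" on a local Solr. The field names, boosts, mm and ps are the ones from your original message; the helper names and the boosts in the name/brand-only variant are only illustrative.

```python
from urllib.parse import urlencode


def build_edismax_params(user_query, field_boosts, mm="100%", ps=5, rows=10):
    """Return Solr request parameters for an edismax query.

    field_boosts maps a field name to its boost, e.g. {"name": 5}.
    """
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    return {
        "defType": "edismax",
        "q": user_query,
        "qf": qf,
        "mm": mm,
        "ps": ps,
        "rows": rows,
    }


def select_url(base_url, params):
    """Build the /select URL for a collection (base_url is an assumption)."""
    return f"{base_url}/select?{urlencode(params)}"


# Boosts quoted from the original question.
CURRENT_BOOSTS = {"name": 5, "brand": 2, "category": 3, "merchant": 2, "description": 1}

# Narrower variant: search only name and brand (boost values are placeholders).
NAME_BRAND_BOOSTS = {"name": 5, "brand": 2}

if __name__ == "__main__":
    # Hypothetical host and collection name.
    base = "http://localhost:8983/solr/products"
    for label, boosts in [("current", CURRENT_BOOSTS), ("name+brand", NAME_BRAND_BOOSTS)]:
        print(label, select_url(base, build_edismax_params("apple ipod", boosts)))
```

Running the same set of test queries against both URLs gives you a fixed baseline to judge every later boost change against.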
At this point, if you are still not happy with the relevancy (i.e. you are not happy with the boosts you assigned), my strongest suggestion is to move to machine learning. You need a good amount of data to feed the learner and make it your Super Business Expert. I have recently been working with the Bloomberg Learning to Rank plugin [1]. In my opinion it will be key for any business that has many features in play that can help to evaluate a proper ranking. For that you need to be able to collect and process signals, and you need to carefully tune the features of interest. But the results could be surprising.

[1] https://issues.apache.org/jira/browse/SOLR-8542
[2] Learning to Rank in Solr <https://www.youtube.com/watch?v=M7BKwJoh96s>

Cheers

On Thu, Mar 17, 2016 at 10:15 AM, Robert Brown <r...@intelcompute.com> wrote:

> Thanks Scott and John,
>
> As luck would have it I've got a PhD graduate coming for an interview
> today, who just happened to do her research thesis on information retrieval
> with quantum theory and machine learning :)
>
> John, it sounds like you're describing my system! Shopping products from
> multiple sources. (De-duplication is going to be fun soon).
>
> I already copy fields like merchant, brand, category, to string fields to
> use them as facets/filters. I was contemplating removing the description
> due to the spammy issue you mentioned, I didn't know about the
> RemoveDuplicatesTokenFilterFactory, so I'm sure that's going to be a huge
> help.
>
> Thanks a lot,
> Rob
>
>
>
> On 03/17/2016 10:01 AM, John Smith wrote:
>
>> Hi,
>>
>> For once I might be of some help: I've had a similar configuration
>> (large set of products from various sources). It's very difficult to
>> find the right balance between all parameters and requires a lot of
>> tweaking, most often in the dark unfortunately.
>>
>> What I've found is that omitNorms=true is a real breakthrough: without
>> it results tend to favor small texts, which is not what's wanted for
>> product names. I also added a RemoveDuplicatesTokenFilterFactory for the
>> name as it's a common practice for spammers to repeat some key words in
>> order to be better placed in results. Stemming and custom stop words
>> (e.g. "cheap", "sale", ...) are other potential ideas.
>>
>> I've also ended up in removing the description field as it's often too
>> broad, and name is now the only field left: brand, category and merchant
>> (as well as other fields) are offered as additional filters using
>> facets. Note that you'd have to re-index them as plain strings.
>>
>> It's more difficult to achieve but popularity boost can also be useful:
>> you can measure it by sales or by number of clicks. I use a combination
>> of both, and store those values using partial updates.
>>
>> Hope it helps,
>> John
>>
>>
>> On 17/03/16 09:36, Robert Brown wrote:
>>
>>> Hi,
>>>
>>> I currently have an index of ~50m docs representing shopping products:
>>> name, description, brand, category, etc.
>>>
>>> Our "qf" is currently setup as:
>>>
>>> name^5
>>> brand^2
>>> category^3
>>> merchant^2
>>> description^1
>>>
>>> mm: 100%
>>> ps: 5
>>>
>>> I'm getting complaints from the business concerning relevancy, and was
>>> hoping to get some constructive ideas/thoughts on whether these boosts
>>> look semi-sensible or not, I think they were put in place pretty much
>>> at random.
>>>
>>> I know it's going to be a case of rounds upon rounds of testing, but
>>> maybe there's a good starting point that will save me some time?
>>>
>>> My initial thoughts right now are to actually just search on the name
>>> field, and maybe the brand (for things like "Apple Ipod").
>>>
>>> Has anyone got a similar setup that could share some direction?
>>>
>>> Many Thanks,
>>> Rob
>>>
>>>
>

--
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England