To follow up on Charlie’s points.

Looks like your primary source is a web or site crawl with Nutch. Once you are 
in the territory of unstructured text mixed with PDF/Word docs, spread across 
multiple subdomains, and perhaps lots of old «garbage» content, you are looking 
at a very different search problem from a clean, structured DB search.

You need to deal with everything from HTML cleansing, false last-updated dates 
from various web servers, bad or missing metadata, content that uses different 
terminology than your search users, strange old documents popping up at the 
top of your result list for no apparent reason (other than perhaps an IDF 
boost on a word in the title), and so on. If you go multilingual you face even 
more challenges.

And from where do you collect «page rank» data, i.e. the authority of a page? 
From where do you collect link text? Do you have enough quality link text to 
even start boosting on it? What should be deemed a landing page, and how much 
boost should it get versus a page that matches the content very well through 
hundreds of keyword hits?
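
As a rough sketch of what the boosting side could look like (assuming you 
manage to extract anchor text into a field called anchor and an inlink count 
into a numeric field called inlink_count - both hypothetical names), an 
edismax handler might combine keyword matching with a mild authority boost:

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">edismax</str>
      <!-- weight title and anchor text above body content -->
      <str name="qf">title^5 anchor^3 content</str>
      <!-- multiplicative boost; the +10 keeps log() >= 1 for pages with no inlinks -->
      <str name="boost">log(sum(inlink_count,10))</str>
    </lst>
  </requestHandler>

Every one of those numbers is a guess you will have to tune against real 
queries.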

The first thing to understand is that building and maintaining such a 
web-crawl index will cost you considerable time and skill.
Next, you must realize you’ll never achieve «Google quality» on a typical 
budget.
But that does not mean building such a search in-house is a bad idea. Over 
time you will be able to address many of the issues you face, though some of 
them may require custom tooling.

I’m not familiar with the commercial systems on the market today. Some of them 
may of course have a toolbox that gets you to an acceptable level much more 
quickly. But when you hit the wall of what the product can do, you are likely 
stuck :)

Jan


> On 21 Apr 2020, at 15:13, Charlie Hull <char...@flax.co.uk> wrote:
> 
> Hi Matt,
> 
> On 21/04/2020 13:41, matthew sporleder wrote:
>> Sorry for the vague question and I appreciate the book recommendations
>> -- I actually think I am mostly confused about suggest vs spellcheck
>> vs morelikethis as they relate to what I referred to as "expected"
>> behavior (like from a typed-in search bar).
> Suggest - here are some results that might match based on what you've typed so 
> far (usually powered by a behind-the-scenes search of the index with some 
> restrictions). Note the difference between this and autocompletion, which 
> suggests complete search terms from the index based on the partial word 
> you've typed so far.
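> 
> As a minimal solrconfig.xml sketch (the title field and analyzer type are 
> placeholders for whatever your schema actually uses):
> 
>   <searchComponent name="suggest" class="solr.SuggestComponent">
>     <lst name="suggester">
>       <str name="name">mySuggester</str>
>       <!-- infix lookup: matches mid-title, not just prefixes -->
>       <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
>       <!-- build suggestions from stored field values -->
>       <str name="dictionaryImpl">DocumentDictionaryFactory</str>
>       <str name="field">title</str>
>       <str name="suggestAnalyzerFieldType">text_general</str>
>     </lst>
>   </searchComponent>
> 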
> Spellcheck - the word you typed isn't anywhere in the index, so I've used an 
> edit-distance algorithm to suggest a few words you might have meant that are 
> in the index. (Note this isn't spelling correction, as the engine doesn't 
> necessarily have the corrected form in its index.)
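> 
> A rough solrconfig.xml sketch (again, the field name is a placeholder):
> 
>   <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
>     <lst name="spellchecker">
>       <str name="name">default</str>
>       <!-- the indexed field that supplies candidate words -->
>       <str name="field">text</str>
>       <!-- computes edit distance against the main index; no side index to rebuild -->
>       <str name="classname">solr.DirectSolrSpellChecker</str>
>     </lst>
>   </searchComponent>
> 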
> Morelikethis - here are some results that share some characteristics with the 
> document you're looking at, e.g. they're indexed by some of the same terms.
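> 
> A request might look something like this (the mlt.fl fields are illustrative, 
> and this assumes the MoreLikeThis search component is enabled on the handler):
> 
>   /select?q=id:1234&mlt=true&mlt.fl=title,body&mlt.mintf=2&mlt.mindf=5
> 
> which returns, alongside the document itself, others sharing its significant 
> terms.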
>> 
>> For reference, we have been using Solr as search in some form for
>> almost 10 years and it's always been great at finding things based on
>> clear keywords, programmatic-type discovery, and as a NoSQL/distributed k:v
>> store (actually really, really good at this), but it has always fallen short
>> (imho, and also our fault, obviously) in the "typed in a search query"
>> experience.
> I'm guessing you're bumping into the problem that most people type very 
> little into a search bar, and expect the engine to magically know what they 
> meant. It doesn't, of course, so it has to offer the user ways to provide more 
> specific information - facets for example, or some of the 
> features above.
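> 
> For example, a faceted request (assuming your schema has a category field):
> 
>   /select?q=ipod&facet=true&facet.field=category&facet.mincount=1
> 
> gives the user counts per category to click on, which you then apply as fq 
> filters to narrow the results.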
>> 
>> We are in the midst of re-developing our internal content ranking
>> system, and it has me grasping at how to *really* elevate our game in
>> terms of providing excellent human-driven discovery vs our current
>> behavior of: "here is everything we have that contains those words,
>> minus the ones I took out".
> 
> I think you need to look at several angles:
> 
> - What defines a 'good' result in your world/for your content?
> - Who judges this? How do you record this? Human/clicks/both?
> - What Solr features *could* help - and how are you going to test that they 
> actually do using the two lines above?
> 
> We think that building up this measurement-driven, experimental process is 
> absolutely key to improving relevance.
> 
> Cheers
> 
> Charlie
> 
>> 
>> 
>> On Tue, Apr 21, 2020 at 5:35 AM Charlie Hull <char...@flax.co.uk> wrote:
>>> Hi Matt,
>>> 
>>> Are you looking for a good, general purpose schema and config for Solr?
>>> Well, there's the problem: you need to define what you mean by general
>>> purpose. Every search application will have its own requirements and
>>> they'll be slightly different to every other application. Yes, there
>>> will be some commonalities too. I guess by "as a human might expect one
>>> to behave" you mean "a bit like how Google works" but unfortunately
>>> Google is a poor example: you won't have Google's money or staff or
>>> platform in your company, nor are you likely to be building a
>>> massive-scale web search engine, so at best you can just take
>>> inspiration from it, not replicate it.
>>> 
>>> In practice, what a lot of people do is start with an example setup
>>> (perhaps from one of the examples supplied with Solr, e.g.
>>> 'techproducts') and adapt it: or they might start with the Solr
>>> configset provided by another framework, e.g. Drupal (yay! Pink
>>> Ponies!). Unfortunately the standard example configsets are littered
>>> with comments that say things like 'Here is how you *could* do XYZ but
>>> please don't actually attempt it this way' and other config sections
>>> that, if you uncomment them, may just get you into further trouble. It's
>>> grown rather than been built, and to my mind there's a good argument for
>>> starting with an absolutely minimal Solr configset and only adding
>>> things in as you need them and understand them (see
>>> https://lucene.472066.n3.nabble.com/minimal-solrconfig-example-td4322977.html
>>> for some background and a great presentation from Alex Rafalovitch on
>>> the examples).
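>>> 
>>> To give a feel for "absolutely minimal" (a sketch only, not a recommendation 
>>> for your data), a schema can shrink to little more than:
>>> 
>>>   <schema name="minimal" version="1.6">
>>>     <uniqueKey>id</uniqueKey>
>>>     <field name="id" type="string" indexed="true" stored="true" required="true"/>
>>>     <field name="text" type="text_general" indexed="true" stored="true"/>
>>>     <fieldType name="string" class="solr.StrField"/>
>>>     <fieldType name="text_general" class="solr.TextField">
>>>       <analyzer>
>>>         <tokenizer class="solr.StandardTokenizerFactory"/>
>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>       </analyzer>
>>>     </fieldType>
>>>   </schema>
>>> 
>>> and then grow it only as each feature earns its place.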
>>> 
>>> You're also going to need some background on *why* all these features
>>> should be used, and for that I'd recommend my colleague Doug's book
>>> Relevant Search https://www.manning.com/books/relevant-search - or maybe
>>> our training (quick plug: we're running some online training in a couple
>>> of weeks
>>> https://opensourceconnections.com/blog/2020/05/05/tlre-solr-remote/ )
>>> 
>>> Hope this helps,
>>> 
>>> Cheers
>>> 
>>> Charlie
>>> 
>>> On 20/04/2020 23:43, matthew sporleder wrote:
>>>> Is there a comprehensive/big set of tips for making Solr into a
>>>> search engine that behaves as a human would expect one to? I poked
>>>> around in the Nutch GitHub for a minute and found this:
>>>> https://github.com/apache/nutch/blob/9e5ae7366f7dd51eaa76e77bee6eb69f812bd29b/src/plugin/indexer-solr/schema.xml
>>>> but I was wondering if I was missing a very obvious document
>>>> somewhere.
>>>> 
>>>> I guess I'm looking for things like:
>>>> use suggester here, use spelling there, use DocValues around here, DIY
>>>> pagerank, etc
>>>> 
>>>> Thanks,
>>>> Matt
>>> 
>>> --
>>> Charlie Hull
>>> OpenSource Connections, previously Flax
>>> 
>>> tel/fax: +44 (0)8700 118334
>>> mobile:  +44 (0)7767 825828
>>> web: www.o19s.com
>>> 
> 
> -- 
> Charlie Hull
> OpenSource Connections, previously Flax
> 
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.o19s.com
