Looking for tips on improving full-text search quality in Postgres

2022-03-04 Thread Bayer, Samuel

Hi all -

When I have a need for both sophisticated database querying and full-text 
search, I'd rather not stand up a technology stack with multiple tools (e.g., 
Postgres and Apache Solr, or Postgres and ElasticSearch with a zomboDB bridge). 
So I've been looking at the Postgres full-text search capability, and comparing 
it to Apache Solr. My experience so far - which has not been entirely 
anecdotal, but hasn't amounted to a formal TREC-style evaluation - is that 
Postgres full-text search, in any ranking/normalization configuration I can 
create, is reliably worse than Solr. Now, I understand that the whole point of 
Solr is search, and this is a sideline for Postgres, but I'd like to figure out 
how close Postgres can get, and while I'm knowledgeable about search 
technologies, I'm not an expert. And I've looked for information on the Web 
about comparing Postgres search to other search capabilities, and everything 
I've found so far is extremely basic.

Does anybody have any pointers to resources (people, sites, journal articles, 
blogs, etc.) which are deeply knowledgeable about this comparison?

Thanks in advance -
Sam Bayer
The MITRE Corporation
s...@mitre.org




Re: [EXT] Re: Looking for tips on improving full-text search quality in Postgres

2022-03-04 Thread Bayer, Samuel

Thanks for replying. My problem is that I can't provide enough guidance on what 
isn't working, because (a) I don't have good enough intuitions about how the 
normalization options are expected to affect the results, and (b) I can't 
identify a specific missing function - I'm just observing that I can't make the 
results as high-quality as Solr.

My apologies.

Sam

On 3/4/22 10:25 AM, Bruce Momjian wrote:

On Fri, Mar 4, 2022 at 08:10:48AM -0500, Bayer, Samuel wrote:

Hi all -

When I have a need for both sophisticated database querying and
full-text search, I'd rather not stand up a technology stack with
multiple tools (e.g., Postgres and Apache Solr, or Postgres and
ElasticSearch with a zomboDB bridge). So I've been looking at the
Postgres full-text search capability, and comparing it to Apache
Solr. My experience so far - which has not been entirely anecdotal,
but hasn't amounted to a formal TREC-style evaluation - is that
Postgres full-text search, in any ranking/normalization configuration
I can create, is reliably worse than Solr. Now, I understand that the
whole point of Solr is search, and this is a sideline for Postgres,
but I'd like to figure out how close Postgres can get, and while I'm
knowledgeable about search technologies, I'm not an expert. And I've
looked for information on the Web about comparing Postgres search
to other search capabilities, and everything I've found so far is
extremely basic.

Does anybody have any pointers to resources (people, sites, journal
articles, blogs, etc.) which are deeply knowledgeable about this
comparison?


Uh, most of our full text seach is done by Russian developers, who are
obviously very good at it.  It would be helpful if you could list
exactly what is missing and then we can have a discussion the hackers
list to see what is possible.  I think it would be helpful  if we just
document what we _don't_ have.






Re: [EXT] Re: Looking for tips on improving full-text search quality in Postgres

2022-03-04 Thread Bayer, Samuel

Fair question. Not worried so much about speed. Looking, essentially, at 
precision by rank (i.e., average precision and variants). I have not explored 
the contrasts between the default English language configuration in Postgres 
and the one in Solr - I have no reason to believe that there's anything odd 
going on there. My problem is that I can't provide specific performance 
numbers, or the corpus in question, but my overall impression is that the top N 
(10, 20) results from Postgres, no matter how I configure the ranking, aren't 
as relevant to the query, as a group, than the ones from Solr.

Example anecdote: the documents I'm searching come with metadata (e.g., title), which I'm 
not indexing specially (not a separate field, just part of the raw text of the document). 
When I search even for single terms, and look at the titles of the results, the titles in 
the Solr results more frequently contain that term than the titles in the Postgres 
results. I also FEEL like I've noticed that the problem is more apparent in 
"OR" queries; if I search for a disjunction of terms, the documents that 
contain all the terms are more likely to be high in the Solr rankings than in the 
Postgres rankings.

I apologize for not being able to be more specific.

Thanks in advance, again.

On 3/4/22 10:30 AM, Atri Sharma wrote:

Can you define what "high quality" is?

Are you referring to precision? Or recall? Or speed? Or query dialect?

On Fri, Mar 4, 2022 at 8:59 PM Bayer, Samuel  wrote:


Thanks for replying. My problem is that I can't provide enough guidance on what 
isn't working, because (a) I don't have good enough intuitions about how the 
normalization options are expected to affect the results, and (b) I can't 
identify a specific missing function - I'm just observing that I can't make the 
results as high-quality as Solr.

My apologies.

Sam

On 3/4/22 10:25 AM, Bruce Momjian wrote:

On Fri, Mar 4, 2022 at 08:10:48AM -0500, Bayer, Samuel wrote:

Hi all -

When I have a need for both sophisticated database querying and
full-text search, I'd rather not stand up a technology stack with
multiple tools (e.g., Postgres and Apache Solr, or Postgres and
ElasticSearch with a zomboDB bridge). So I've been looking at the
Postgres full-text search capability, and comparing it to Apache
Solr. My experience so far - which has not been entirely anecdotal,
but hasn't amounted to a formal TREC-style evaluation - is that
Postgres full-text search, in any ranking/normalization configuration
I can create, is reliably worse than Solr. Now, I understand that the
whole point of Solr is search, and this is a sideline for Postgres,
but I'd like to figure out how close Postgres can get, and while I'm
knowledgeable about search technologies, I'm not an expert. And I've
looked for information on the Web about comparing Postgres search
to other search capabilities, and everything I've found so far is
extremely basic.

Does anybody have any pointers to resources (people, sites, journal
articles, blogs, etc.) which are deeply knowledgeable about this
comparison?


Uh, most of our full text seach is done by Russian developers, who are
obviously very good at it.  It would be helpful if you could list
exactly what is missing and then we can have a discussion the hackers
list to see what is possible.  I think it would be helpful  if we just
document what we _don't_ have.












Re: [EXT] Re: Looking for tips on improving full-text search quality in Postgres

2022-03-04 Thread Bayer, Samuel

I've tried both ranking functions. I've tried a variety of the normalization 
settings. I'm using the standard English language configuration. Postgres 13.

I do understand your FTS philosophy - I suppose I'm looking for guidance about 
how best to approximate the search capability in Solr using the FTS pieces you 
have. One concrete question, I suppose, is: the classic TF/IDF search strategy 
relies on inverse document frequency, which looks across the corpus. I can't 
tell whether that corpus-wide frequency information is taken into account in 
either ranking function.

I don't know if Solr weights earlier tokens more heavily, but I wouldn't be 
surprised if it does.

On 3/4/22 11:09 AM, Tom Lane wrote:

Bruce Momjian  writes:

On Fri, Mar 4, 2022 at 10:41:16AM -0500, Bayer, Samuel wrote:

I apologize for not being able to be more specific.



I know it is hard to quantify.  Is it possible that Postgres is treating
all the terms equally, while Solr is prioritizing terms that are earlier
in the document?


A few basic questions:

* which ranking function are you using?

* with what options?

* which PG version exactly?

As far as I can see from a quick look at the docs, neither
ts_rank() nor ts_rank_cd() consider "earlier in the document"
to be an interesting consideration.  They do have the ability
to prefer terms that have been marked as having a higher weight,
but you'd need to do some setup work to make that useful ---
basically, you'd have to separate out the title or other metadata
and apply setweight() to it while building the tsvectors.

I wouldn't be surprised if Solr has some well-tuned default
heuristics that mean that you don't have to work hard to get
good results from it.  The current state of our FTS features
is more like "here's all the parts, but you have to build the
behavior you want".

ISTM that our FTS features have basically been on autopilot
since they went in.  I'd sort of hoped that we'd see more
parsers, more ranking functions, etc, over time ... but nothing
like that has happened.  I'm not sure if that's just lack of
interest, or if people find the code too difficult to work with.

regards, tom lane