insert and query performance on big string table with pg_trgm
  (actual time=2888.846..2888.846 rows=16434 loops=1)
        Index Cond: ((value)::text ~~* '%keyword%'::text)
 Planning time: 0.252 ms
 Execution time: 58442.413 ms
(8 rows)

Thanks for reading this and letting me know any recommendations.

Sincerely,
Matthew Hall
Re: insert and query performance on big string table with pg_trgm
Hi Jeff,

Thanks so much for writing. You've got some great points.

> On Nov 20, 2017, at 5:42 PM, Jeff Janes wrote:
> While I have not done exhaustive testing, from the tests I have done I've
> never found gist to be better than gin with trgm indexes.

Thanks, this helps considerably, as the documentation was kind of confusing and I didn't want to get it wrong if I could avoid it.

> Do you really need the artificial primary key, when you already have another
> column that would be used as the primary key? If you need to use this as a
> foreign key in another table, then you very well might. But maintaining two
> unique indexes doesn't come free.

OK, fair enough, I'll test with it removed and see what happens.

> Are all indexes present at the time you insert? It will probably be much
> faster to insert without the gin index (at least) and build it after the load.

There is some flexibility on the initial load, but future updates will require the de-duplication capability. I'm willing to accept a somewhat slower load process in exchange for accurate updates, provided we can try to meet the read-side goal I wrote about, or at least figure out why it's impossible, so I can understand what I need to fix to make it possible.

> Without knowing this key fact, it is hard to interpret the rest of your data.

I'm assuming you're referring to the part about the need for the primary key and the indexes during loading? I did try to describe that in the earlier mail, but obviously I'm new at writing these, so sorry if I didn't make it clearer. I can get rid of the bigserial PK, and the indexes could be built separately, but I would need a way to de-duplicate on future reloading... that's why I had the ON CONFLICT DO NOTHING expression on the INSERT. So we'd still want to learn why the INSERT is slow, to fix up the update processes that would happen in the future.

> * maintenance_work_mem 512 MB
>
> Building a gin index in bulk could benefit from more memory here.

Fixed it; I will re-test w/ 1 GB. Have you got any recommended values so I don't screw it up?

> * synchronous_commit off
>
> If you already are using unlogged tables, this might not be so helpful, but
> does increase the risk of the rest of your system.

Fixed it; the unlogged mode change came later than this did.

>   PID USER     PR NI VIRT    RES    SHR    S %CPU %MEM   TIME+   COMMAND
> 29578 postgres 20  0 6575672 6.149g 6.139g R 86.0 39.7 45:24.97 postgres
>
> You should expand the command line (by hitting 'c', at least in my version of
> top) so we can see which postgres process this is.

Good point, I'll write back once I retry w/ your other advice.

> explain (analyze, buffers), please. And hopefully with track_io_timing=on.

track_io_timing was missing because, sadly, I had only found it in one document at the very end of the investigation, after doing the big job which generated all of the material posted.
It's there now, so here is some better output on the query:

explain (analyze, buffers) select * from huge_table where value ilike '%canada%';

 Bitmap Heap Scan on huge_table  (cost=273.44..61690.09 rows=16702 width=33) (actual time=5701.511..76469.688 rows=110166 loops=1)
   Recheck Cond: ((value)::text ~~* '%canada%'::text)
   Rows Removed by Index Recheck: 198
   Heap Blocks: exact=66657
   Buffers: shared hit=12372 read=56201 dirtied=36906
   I/O Timings: read=74195.734
   ->  Bitmap Index Scan on huge_table_value_trgm  (cost=0.00..269.26 rows=16702 width=0) (actual time=5683.032..5683.032 rows=110468 loops=1)
         Index Cond: ((value)::text ~~* '%canada%'::text)
         Buffers: shared hit=888 read=1028
         I/O Timings: read=5470.839
 Planning time: 0.271 ms
 Execution time: 76506.949 ms

I will work some more on the insert piece.

> If you repeat the same query, is it then faster, or is it still slow?

If you keep the expression exactly the same, it still takes a few seconds, as could be expected for such a torture-test query, but it's still WAY faster than the first such query. If you change it out to a different expression, it's longer again, of course. There does seem to be a low-to-medium correlation between the number of rows found and the query completion time.

> Cheers,
> Jeff

Thanks,
Matthew.
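For readers trying to reproduce this setup, the load-then-index pattern suggested earlier in the thread would look roughly like the sketch below. The table, column, and index names follow the thread, but the exact DDL and the example values are assumptions, not the poster's original schema.

    -- Assumed reconstruction of the schema discussed in this thread.
    CREATE EXTENSION IF NOT EXISTS pg_trgm;

    CREATE UNLOGGED TABLE huge_table (
        value text NOT NULL
    );

    -- Keep only the unique index during the bulk load, so ON CONFLICT
    -- can de-duplicate future reloads:
    CREATE UNIQUE INDEX huge_table_value_idx ON huge_table (value);

    INSERT INTO huge_table (value)
    VALUES ('example-1'), ('example-2')      -- batched, e.g. 5000 rows per statement
    ON CONFLICT (value) DO NOTHING;

    -- Build the trigram index after the load, with extra memory for the build:
    SET maintenance_work_mem = '1GB';
    CREATE INDEX huge_table_value_trgm ON huge_table
        USING gin (value gin_trgm_ops);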
Re: pg_dump 3 times as slow after 8.4 -> 9.5 upgrade
> On Nov 21, 2017, at 10:18 PM, Henrik Cednert (Filmlance) wrote:
>
> What's the normal way to deal with compression? Dump uncompressed and use
> something that threads better to compress the dump?

I would say most likely your zlib is screwed up somehow, like maybe it didn't get optimized right by the C compiler, or something else is wrong with the compression settings. The CPU should easily blast away at that faster than the disks can read.

I did some studies of this some years ago, and I found gzip -6 offered the best ratio between size reduction and CPU time out of a very wide range of formats, but at the time xz was not yet available.

If I were you, I would first pipe the uncompressed output through a separate compression command; then you can experiment with the flags and threads, and you already get another separate process for the kernel to put on other CPUs as an automatic bonus for multi-core, with minimal work. After that, xz is a GNU standard now and has xz -T for cranking up some threads, with little extra effort for the user. But it can be kind of slow, so you'll probably need to lower the compression level somewhat, depending a bit on some timing tests. I would try it on a medium-sized DB table, a bit over the size of system RAM, instead of dumping this great big DB, in order to benchmark a couple of times until it looks happy.

Matthew
Re: pg_dump 3 times as slow after 8.4 -> 9.5 upgrade
> On Nov 22, 2017, at 5:06 AM, Henrik Cednert (Filmlance) wrote:
>
> When investigating the zlib lead I looked at the 8.4 installation and the 9.5
> installation. 9.5 includes zlib.h (/Library/PostgreSQL//9.5/include/zlib.h),
> but 8.4 doesn't. But that's a header file and I have no idea how that really
> works and if that's the one used by pgres9.5 or not. The version in it says
> 1.2.8 and that's what the Instruments are showing when I monitor pg_dump
> while running.
>
> Guess I'll have to install Instruments in a dev env and do a pg_dump with 8.4
> to see the difference. Tedious. =/

I would also check the library linkages of the pg_dump binaries. See if one is using an embedded zlib and the other a system zlib. Then you could imagine one didn't get compiled with the best-performing CFLAGS, etc.

Matthew.
Re: insert and query performance on big string table with pg_trgm
On Nov 21, 2017, at 12:05 AM, Matthew Hall wrote:
>> Do you really need the artificial primary key, when you already have another
>> column that would be used as the primary key? If you need to use this as a
>> foreign key in another table, then you very well might. But maintaining two
>> unique indexes doesn't come free.
>
> OK, fair enough, I'll test with it removed and see what happens.

With the integer primary key removed, it still takes ~9 hours to load the table, so it didn't seem to make a big difference.

> Fixed it; I will re-test w/ 1 GB. Have you got any recommended values so I
> don't screw it up?

I also took this step for maintenance_work_mem. Queries on the table still take a long time with the PK removed:

# explain (analyze, buffers) select * from huge_table where value ilike '%yahoo%';

 Bitmap Heap Scan on huge_table  (cost=593.72..68828.97 rows=18803 width=25) (actual time=3224.100..70059.839 rows=20909 loops=1)
   Recheck Cond: ((value)::text ~~* '%yahoo%'::text)
   Rows Removed by Index Recheck: 17
   Heap Blocks: exact=6682
   Buffers: shared hit=544 read=6760 dirtied=4034
   I/O Timings: read=69709.611
   ->  Bitmap Index Scan on huge_table_value_trgm_idx  (cost=0.00..589.02 rows=18803 width=0) (actual time=3216.545..3216.545 rows=20926 loops=1)
         Index Cond: ((value)::text ~~* '%yahoo%'::text)
         Buffers: shared hit=352 read=270
         I/O Timings: read=3171.872
 Planning time: 0.283 ms
 Execution time: 70065.157 ms
(12 rows)

The slow process during inserts is:

postgres: username dbname [local] INSERT

The slow statement example is:

2017-12-06 04:27:11 UTC [16085-10378] username@dbname LOG:  duration: 5028.190 ms  plan:
        Query Text: INSERT INTO huge_table (value) VALUES [5000 values at once] ... ON CONFLICT (value) DO NOTHING
        Insert on huge_table  (cost=0.00..75.00 rows=5000 width=40)
          Conflict Resolution: NOTHING
          Conflict Arbiter Indexes: huge_table_value_idx
          ->  Values Scan on "*VALUES*"  (cost=0.00..75.00 rows=5000 width=40)

> What is the size of the table and the gin index?

The table is 10 GB. The gin index is 5.8 GB.

> [From Gabor Szucs] [H]ow about adding a hash value column and creating the
> unique index on that one? May block some false duplicates but the unique
> index would be way smaller, speeding up inserts.

The mean length of the input items is about 18 bytes. The max length of the input items is about 67 bytes. The size of the md5 would of course be 16 bytes. I'm testing it now, and I'll write another update.

Matthew.
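For reference, the hash-based de-duplication idea being tested here would look something like the sketch below. This is a reconstruction, not the exact statements used in the thread, and md5() is just one possible digest.

    -- Replace the wide unique index on the raw value with one on its md5 digest:
    CREATE UNIQUE INDEX huge_table_value_md5_idx ON huge_table (md5(value));

    -- ON CONFLICT can then infer the expression index as its arbiter:
    INSERT INTO huge_table (value)
    VALUES ('example')
    ON CONFLICT (md5(value)) DO NOTHING;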
Re: insert and query performance on big string table with pg_trgm
> On Dec 5, 2017, at 11:23 PM, Sergei Kornilov wrote:
> You have very slow (or busy) disks; this is not a postgresql issue. Reading
> 6760 * 8KB in 70 seconds is a very bad result.
>
> For better performance you need better disks, at least raid10 (not raid5).
> Much more memory in shared_buffers can help with read performance and so
> reduce disk utilization, but write operations will still be slow.
>
> Sergei

Sergei,

Thanks so much for confirming; this really helps a lot in knowing what to do. I thought the disk could be some of my issue, but I wanted to make sure I did all of the obvious tuning first. I have learned some very valuable things from this exercise which I'll be able to use on future challenges.

Based on this advice from everyone, I'm setting up a box with more RAM, lots of SSDs, and RAID 10. I'll write back in a few more days after I've completed it.

I can also confirm that the previous advice about using a hash / digest based unique index seemed to make the loading process slower for me, not faster, which is an interesting result to consider for future users following this thread (if any). I don't yet have specific data on how much slower, because it's actually still going!

Sincerely,
Matthew.
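A hedged example of the shared_buffers change Sergei mentions above; the value is purely illustrative (a common starting point on a dedicated box is roughly 25% of RAM), and the setting only takes effect after a server restart.

    ALTER SYSTEM SET shared_buffers = '16GB';
    -- then restart the server for the new value to apply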
Re: Latest advice on SSD?
The most critical bit of advice I've found is setting this preference (see the sketch after the quoted message below):

https://amplitude.engineering/how-a-single-postgresql-config-change-improved-slow-query-performance-by-50x-85593b8991b0

I'm using 4 512GB Samsung 850 EVOs in a hardware RAID 10 on a 1U server with about 144 GB RAM and 8 Xeon cores. I usually burn up CPU more than disks or RAM now, whereas with magnetic disks I had horrible IO wait percentages, so it seems to be performing quite well so far.

Matthew Hall

> On Apr 9, 2018, at 7:36 PM, Craig James wrote:
>
> One of our four "big iron" (spinning disks) servers went belly up today.
> (Thanks, Postgres and pgbackrest! Easy recovery.) We're planning to move to a
> cloud service at the end of the year, so bad timing on this. We didn't want
> to buy any more hardware, but now it looks like we have to.
>
> I followed the discussions about SSD drives when they were first becoming
> mainstream; at that time, the Intel devices were king. Can anyone recommend
> what's a good SSD configuration these days? I don't think we want to buy a
> new server with spinning disks.
>
> We're replacing:
> 8 core (Intel)
> 48GB memory
> 12-drive 7200 RPM 500GB
> RAID1 (2 disks, OS and WAL log)
> RAID10 (8 disks, postgres data dir)
> 2 spares
> Ubuntu 16.04
> Postgres 9.6
>
> The current system peaks at about 7000 TPS from pgbench.
>
> Our system is a mix of non-transactional searching (customers) and
> transactional data loading (us).
>
> Thanks!
> Craig
>
> --
> Craig A. James
> Chief Technology Officer
> eMolecules, Inc.
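The setting referred to above is the planner's random page fetch cost (the same tunable mentioned later in this thread), which defaults to a spinning-disk value. A minimal sketch of lowering it for SSDs; the exact value is an assumption to be tuned for your own hardware.

    ALTER SYSTEM SET random_page_cost = 1.1;   -- default is 4.0, sized for spinning disks
    SELECT pg_reload_conf();                   -- picked up without a restart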
Re: Simple Query Elapsed Time ~ 3 hrs Using Bitmap Index/Heap Scan
Probably the cardinality of "vclf_number" is really bad, so the scan on that index is returning many millions or billions of rows, and then you get a recheck which takes nearly forever. You need an index on cl_value, or on both vclf_number and cl_value (see the sketch after the quoted message below). If you know some properties of the values actually stored in those columns, that will help.

Matthew Hall

> On Jun 5, 2018, at 7:17 AM, Fred Habash wrote:
>
> Trying to optimize the Elapsed Time (ET) of this query. Currently, it is
> hovering around 3 hrs.
>
> Running a 'vacuum analyze' had no effect on ET. Even forcing an 'index only'
> scan by disabling 'enable_seqscan' stays around the 3 hrs.
> The table is around 4.6B rows.
>
> explain select cit_id, cl_value from reflink.citation_locators where
> cl_value = '1507617681' and vclf_number = 1 ;
>                                        QUERY PLAN
> -----------------------------------------------------------------------------------------
>  Bitmap Heap Scan on citation_locators  (cost=5066559.01..50999084.79 rows=133 width=23)
>    Recheck Cond: (vclf_number = 1)
>    Filter: (cl_value = '1507617681'::text)
>    ->  Bitmap Index Scan on cl_indx_fk02  (cost=0.00..5066558.97 rows=493984719 width=0)
>          Index Cond: (vclf_number = 1)
> (5 rows)
>
>                              Table "reflink.citation_locators"
>       Column       |           Type           | Modifiers | Storage  | Stats target | Description
> -------------------+--------------------------+-----------+----------+--------------+-------------
>  cl_id             | bigint                   | not null  | plain    |              |
>  cl_value          | text                     | not null  | extended |              |
>  vclf_number       | integer                  | not null  | plain    |              |
>  cit_id            | bigint                   | not null  | plain    |              |
>  cl_date_created   | timestamp with time zone | not null  | plain    |              |
>  cl_date_modified  | timestamp with time zone |           | plain    |              |
> Indexes:
>     "cl_pk" PRIMARY KEY, btree (cl_id)
>     "cl_cnst_uk01" UNIQUE CONSTRAINT, btree (cit_id, vclf_number, cl_value)
>     "cl_indx_fk01" btree (cit_id)
>     "cl_indx_fk02" btree (vclf_number)
> Foreign-key constraints:
>     "cl_cnst_fk01" FOREIGN KEY (cit_id) REFERENCES citations(cit_id) NOT VALID
>     "cl_cnst_fk02" FOREIGN KEY (vclf_number) REFERENCES valid_cit_locator_fields(vclf_number)
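A sketch of the indexes suggested above; the index names are invented, and CONCURRENTLY is used only because building a plain index on a 4.6B-row table would lock out writes for a long time.

    CREATE INDEX CONCURRENTLY cl_indx_value
        ON reflink.citation_locators (cl_value);

    -- or, covering both predicates in the quoted query from one index:
    CREATE INDEX CONCURRENTLY cl_indx_vclf_value
        ON reflink.citation_locators (vclf_number, cl_value);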
Re: Slow query when pg_trgm is in inner lopp
Is there a reason you used GiST rather than GIN for your pg_trgm indexes? In my tests, and in previous posts on here, GiST nearly always performs worse (see the GIN sketch after the quoted message below). Also, did you check whether the storage is really SSD, and set random_page_cost accordingly?

Matthew Hall

> On Jun 20, 2018, at 8:21 AM, Sasa Vilic wrote:
>
> Hi everyone,
>
> we have a new query that performs badly with specific input parameters. We
> get the worst performance when the input data is most restrictive. I have
> partially identified the problem: it always happens when an index scan is done
> in the inner loop and the index type is pg_trgm. We also noticed that for a
> simple query
> (
>   select * from point where identifier = 'LOWW' vs
>   select * from point where identifier LIKE 'LOWW'
> )
> the difference between a btree index and a pg_trgm index can be quite high:
> 0.009 ms vs 32.0 ms.
>
> What I would like to know is whether the query planner is aware that some
> index types are more expensive than others, and whether it can take that into
> account.
>
> I will describe the background first, then give you the query and its
> analysis for different parameters, and at the end I will write about all the
> required information regarding the setup (Postgres version, schema, metadata,
> hardware, etc.).
>
> I would like to know whether this is a bug in the query planner or not, and
> what we could do about it.
>
> # Background
>
> We have a database with navigational data for civil aviation.
> The current query works on two tables: point and route.
> Point represents a navigational point on Earth and route describes a route
> between two points.
>
> The query that we have finds all routes between two sets of points. A set is
> dynamically/loosely defined by a pattern given by the user input. So, for
> example, if a user wants to find all routes from international airports in
> Austria to London Heathrow, he or she would use 'LOW%' as
> :from_point_identifier and 'EGLL' as :to_point_identifier. Please keep in
> mind that this is a simple case, and that the user is allowed to define the
> search term any way he/she sees fit, e.g. '%OW%', 'EG%'.
>
> SELECT
>   r.*
> FROM navdata.route r
>   INNER JOIN navdata.point op ON r.frompointguid = op.guid
>   INNER JOIN navdata.point dp ON r.topointguid = dp.guid
> WHERE
>   r.routeidentifier ILIKE :route_identifier
>   AND tsrange(r.startvalid, r.endvalid) @> :validity :: TIMESTAMP
>   AND (NOT :use_sources :: BOOLEAN OR r.source = ANY (:sources :: VARCHAR []))
>   AND CONCAT(op.identifier, '') ILIKE :from_point_identifier
>   AND op.type = ANY (:point_types :: VARCHAR [])
>   AND tsrange(op.startvalid, op.endvalid) @> :validity :: TIMESTAMP
>   AND dp.identifier ILIKE :to_point_identifier :: VARCHAR
>   AND dp.type = ANY (:point_types :: VARCHAR [])
>   AND tsrange(dp.startvalid, dp.endvalid) @> :validity :: TIMESTAMP
> ORDER BY r.routeidentifier
> LIMIT 1000
>
> Most of the tables we have follow this layout principle:
> * uid - the primary key
> * guid - a globally unique key (i.e. London Heathrow could for example change
>   its identifier EGLL, but our internal guid will stay the same)
> * startvalid, endvalid - define the period for which an entry is valid.
>   Entries with the same guid should not have overlapping validity.
>
> We don't use foreign keys for two reasons:
> * We need to do live migration without downtime. Creating a foreign key on a
>   huge dataset could take quite some time.
> * Relationships between entities are defined based on guid and not on uid
>   (primary key).
> # Query analysis
>
> # Case 1: We search for all outgoing routes from Vienna International Airport
>
> EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
> SELECT
>   r.*
> FROM navdata.route r
>   INNER JOIN navdata.point op ON r.frompointguid = op.guid
>   INNER JOIN navdata.point dp ON r.topointguid = dp.guid
> WHERE
>   r.routeidentifier ILIKE '%'
>   AND tsrange(r.startvalid, r.endvalid) @> now() :: TIMESTAMP
>   AND (NOT false :: BOOLEAN OR r.source = ANY (ARRAY[] :: VARCHAR []))
>   AND op.identifier ILIKE '%LOWW%'
>   AND op.type = ANY (ARRAY['PA'] :: VA
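Picking up the GIN suggestion from the top of this message, a minimal sketch of what that index change would look like; the column choice follows the quoted query, and the index name is invented.

    CREATE EXTENSION IF NOT EXISTS pg_trgm;

    -- GIN trigram index to support ILIKE '%...%' searches on point identifiers:
    CREATE INDEX point_identifier_trgm_gin
        ON navdata.point USING gin (identifier gin_trgm_ops);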
Re: Query with "ILIKE ALL" does not use the index
On Jul 26, 2018, at 9:44 AM, Tom Lane wrote:
> Nicolas Even writes:
>> However when I run the same (as far as I understand it) query but with
>> the ALL operator, the index is not used:
>> explain analyze select name from totoz where name ilike all(array['%tot%']);
>
> There's only index support for "op ANY (array)", not "op ALL (array)".
>
>             regards, tom lane

Nicolas,

Could you work around the limitation with a two-clause WHERE: first clause ANY, second clause ALL? I've done similar tricks on similar sorts of queries. A sketch is below.

Matthew.
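A minimal sketch of that workaround, using the table from Nicolas's example; the second array element is just illustrative.

    SELECT name
    FROM totoz
    WHERE name ILIKE ANY (ARRAY['%tot%', '%oz%'])   -- has index support, narrows the rows
      AND name ILIKE ALL (ARRAY['%tot%', '%oz%']);  -- enforces the original semantics as a filter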
Re: Would SSD improve Index Only Scan performance by a lot?
For indexes, SSDs are at least 4X faster, but you won't see that happen unless you fix the planner tunable for the random page fetch cost (random_page_cost) first. It's a super important change for SSDs.

Matthew Hall

> On Oct 8, 2019, at 5:12 PM, Rick Otten wrote:
>
>> On Tue, Oct 8, 2019 at 7:37 PM Arya F wrote:
>> As my table has gotten bigger, it takes longer to get a single row back when
>> querying a row by its btree index.
>>
>> Right now the database is running on a traditional HDD. SSDs have a much
>> faster seek time than traditional HDDs.
>>
>> Would switching to an SSD improve "Index Only Scan" time greatly? By at
>> least 3-4 times?
>
> *If* your query is disk I/O bound, SSD can help a lot.
>
> If your data is already in memory, or in the file system cache, and your
> query is bound by CPU, bloated/corrupted indexes, or some query inefficiency,
> then faster disks really won't do anything.
>
> Depending on the data type and size of the data, you may be able to help your
> query performance by choosing an index type other than the out-of-the-box
> btree as well (such as a hash or brin index), or maybe even a different sort
> order on the index, or a partial index.
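Hypothetical examples of the alternative index types mentioned in the quoted reply above; the table and column names are invented purely for illustration.

    CREATE INDEX ON events USING brin (created_at);           -- tiny index, good for naturally ordered, append-only data
    CREATE INDEX ON events USING hash (session_id);           -- equality lookups only
    CREATE INDEX ON events (user_id) WHERE deleted = false;   -- partial index covering just the hot subset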
