insert and query performance on big string table with pg_trgm
  (actual time=2888.846..2888.846 rows=16434 loops=1)
        Index Cond: ((value)::text ~~* '%keyword%'::text)
 Planning time: 0.252 ms
 Execution time: 58442.413 ms
(8 rows)

Thanks for reading this and letting me know any recommendations.

Sincerely,
Matthew Hall
Re: insert and query performance on big string table with pg_trgm
Hi Jeff,

Thanks so much for writing. You've got some great points.

> On Nov 20, 2017, at 5:42 PM, Jeff Janes wrote:
> While I have not done exhaustive testing, from the tests I have done I've
> never found gist to be better than gin with trgm indexes.

Thanks, this helps considerably, as the documentation was kind of confusing and I didn't want to get it wrong if I could avoid it.

> Do you really need the artificial primary key, when you already have another
> column that would be used as the primary key? If you need to use this as a
> foreign key in another table, then you very well might. But maintaining two
> unique indexes doesn't come free.

OK, fair enough, I'll test with it removed and see what happens.

> Are all indexes present at the time you insert? It will probably be much
> faster to insert without the gin index (at least) and build it after the load.

There is some flexibility on the initial load, but future updates will require the de-duplication capability. I'm willing to accept a somewhat slower load process in exchange for accurate updates, provided we can try to meet the read-side goal I wrote about, or at least figure out why it's impossible, so I can understand what I need to fix to make it possible.

> Without knowing this key fact, it is hard to interpret the rest of your data.

I'm assuming you're referring to the part about the need for the primary key and the indexes during loading? I did try to describe that in the earlier mail, but obviously I'm new at writing these, so sorry if I didn't make it clearer. I can get rid of the bigserial PK, and the indexes could be built separately, but I would need a way to de-duplicate on future reloading... that's why I had the ON CONFLICT DO NOTHING expression on the INSERT. So we'd still want to learn why the INSERT is slow, to fix up the update processes that would happen in the future.

> * maintenance_work_mem 512 MB
>
> Building a gin index in bulk could benefit from more memory here.

Fixed it; I will re-test w/ 1 GB. Have you got any recommended values so I don't screw it up?

> * synchronous_commit off
>
> If you already are using unlogged tables, this might not be so helpful, but
> does increase the risk of the rest of your system.

Fixed it; the unlogged mode change came later than this did.

>   PID USER     PR NI VIRT    RES    SHR    S %CPU %MEM   TIME+   COMMAND
> 29578 postgres 20  0 6575672 6.149g 6.139g R 86.0 39.7 45:24.97 postgres
>
> You should expand the command line (by hitting 'c', at least in my version of
> top) so we can see which postgres process this is.

Good point, I'll write back once I retry w/ your other advice.

> explain (analyze, buffers), please. And hopefully with track_io_timing=on.

track_io_timing was missing because, sadly, I had only found it in one document at the very end of the investigation, after doing the big job which generated all of the material posted.
It's there now, so here is some better output on the query:

explain (analyze, buffers) select * from huge_table where value ilike '%canada%';

 Bitmap Heap Scan on huge_table  (cost=273.44..61690.09 rows=16702 width=33) (actual time=5701.511..76469.688 rows=110166 loops=1)
   Recheck Cond: ((value)::text ~~* '%canada%'::text)
   Rows Removed by Index Recheck: 198
   Heap Blocks: exact=66657
   Buffers: shared hit=12372 read=56201 dirtied=36906
   I/O Timings: read=74195.734
   ->  Bitmap Index Scan on huge_table_value_trgm  (cost=0.00..269.26 rows=16702 width=0) (actual time=5683.032..5683.032 rows=110468 loops=1)
         Index Cond: ((value)::text ~~* '%canada%'::text)
         Buffers: shared hit=888 read=1028
         I/O Timings: read=5470.839
 Planning time: 0.271 ms
 Execution time: 76506.949 ms

I will work some more on the insert piece.

> If you repeat the same query, is it then faster, or is it still slow?

If you keep the expression exactly the same, it still takes a few seconds, as could be expected for such a torture-test query, but it's still WAY faster than the first such query. If you change it out to a different expression, it's longer again, of course. There does seem to be a low-to-medium correlation between the number of rows found and the query completion time.

> Cheers,
> Jeff

Thanks,
Matthew.
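For readers trying to reproduce this setup, the load-then-index pattern suggested earlier in the thread would look roughly like the sketch below. The table, column, and index names follow the thread, but the exact DDL and the example values are assumptions, not the poster's original schema.

    -- Assumed reconstruction of the schema discussed in this thread.
    CREATE EXTENSION IF NOT EXISTS pg_trgm;

    CREATE UNLOGGED TABLE huge_table (
        value text NOT NULL
    );

    -- Keep only the unique index during the bulk load, so ON CONFLICT
    -- can de-duplicate future reloads:
    CREATE UNIQUE INDEX huge_table_value_idx ON huge_table (value);

    INSERT INTO huge_table (value)
    VALUES ('example-1'), ('example-2')      -- batched, e.g. 5000 rows per statement
    ON CONFLICT (value) DO NOTHING;

    -- Build the trigram index after the load, with extra memory for the build:
    SET maintenance_work_mem = '1GB';
    CREATE INDEX huge_table_value_trgm ON huge_table
        USING gin (value gin_trgm_ops);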
Re: pg_dump 3 times as slow after 8.4 -> 9.5 upgrade
> On Nov 21, 2017, at 10:18 PM, Henrik Cednert (Filmlance) wrote:
>
> What's the normal way to deal with compression? Dump uncompressed and use
> something that threads better to compress the dump?

I would say most likely your zlib is screwed up somehow, like maybe it didn't get optimized right by the C compiler, or something else is wrong with the compression settings. The CPU should easily blast away at that faster than the disks can read.

I did some studies of this some years ago, and I found gzip -6 offered the best ratio between size reduction and CPU time out of a very wide range of formats, but at the time xz was not yet available.

If I were you, I would first pipe the uncompressed output through a separate compression command; then you can experiment with the flags and threads, and you already get another separate process for the kernel to put on other CPUs as an automatic bonus for multi-core, with minimal work. After that, xz is a GNU standard now and has xz -T for cranking up some threads, with little extra effort for the user. But it can be kind of slow, so you'll probably need to lower the compression level somewhat, depending a bit on some timing tests. I would try it on a medium-sized DB table, a bit over the size of system RAM, instead of dumping this great big DB, in order to benchmark a couple of times until it looks happy.

Matthew
Re: pg_dump 3 times as slow after 8.4 -> 9.5 upgrade
> On Nov 22, 2017, at 5:06 AM, Henrik Cednert (Filmlance) wrote:
>
> When investigating the zlib lead I looked at the 8.4 installation and the 9.5
> installation. 9.5 includes zlib.h (/Library/PostgreSQL//9.5/include/zlib.h),
> but 8.4 doesn't. But that's a header file and I have no idea how that really
> works and if that's the one used by pgres9.5 or not. The version in it says
> 1.2.8 and that's what the Instruments are showing when I monitor pg_dump
> while running.
>
> Guess I'll have to install Instruments in a dev env and do a pg_dump with 8.4
> to see the difference. Tedious. =/

I would also check the library linkages of the pg_dump binaries. See if one is using an embedded zlib and the other a system zlib. Then you could imagine one didn't get compiled with the best-performing CFLAGS, etc.

Matthew.
Re: insert and query performance on big string table with pg_trgm
On Nov 21, 2017, at 12:05 AM, Matthew Hall wrote:
>> Do you really need the artificial primary key, when you already have another
>> column that would be used as the primary key? If you need to use this as a
>> foreign key in another table, then you very well might. But maintaining two
>> unique indexes doesn't come free.
>
> OK, fair enough, I'll test with it removed and see what happens.

With the integer primary key removed, it still takes ~9 hours to load the table, so it didn't seem to make a big difference.

> Fixed it; I will re-test w/ 1 GB. Have you got any recommended values so I
> don't screw it up?

I also took this step for maintenance_work_mem. Queries on the table still take a long time with the PK removed:

# explain (analyze, buffers) select * from huge_table where value ilike '%yahoo%';

 Bitmap Heap Scan on huge_table  (cost=593.72..68828.97 rows=18803 width=25) (actual time=3224.100..70059.839 rows=20909 loops=1)
   Recheck Cond: ((value)::text ~~* '%yahoo%'::text)
   Rows Removed by Index Recheck: 17
   Heap Blocks: exact=6682
   Buffers: shared hit=544 read=6760 dirtied=4034
   I/O Timings: read=69709.611
   ->  Bitmap Index Scan on huge_table_value_trgm_idx  (cost=0.00..589.02 rows=18803 width=0) (actual time=3216.545..3216.545 rows=20926 loops=1)
         Index Cond: ((value)::text ~~* '%yahoo%'::text)
         Buffers: shared hit=352 read=270
         I/O Timings: read=3171.872
 Planning time: 0.283 ms
 Execution time: 70065.157 ms
(12 rows)

The slow process during inserts is:

postgres: username dbname [local] INSERT

The slow statement example is:

2017-12-06 04:27:11 UTC [16085-10378] username@dbname LOG:  duration: 5028.190 ms  plan:
        Query Text: INSERT INTO huge_table (value) VALUES [5000 values at once] ... ON CONFLICT (value) DO NOTHING
        Insert on huge_table  (cost=0.00..75.00 rows=5000 width=40)
          Conflict Resolution: NOTHING
          Conflict Arbiter Indexes: huge_table_value_idx
          ->  Values Scan on "*VALUES*"  (cost=0.00..75.00 rows=5000 width=40)

> What is the size of the table and the gin index?

The table is 10 GB. The gin index is 5.8 GB.

> [From Gabor Szucs] [H]ow about adding a hash value column and creating the
> unique index on that one? May block some false duplicates but the unique
> index would be way smaller, speeding up inserts.

The mean length of the input items is about 18 bytes. The max length of the input items is about 67 bytes. The size of the md5 would of course be 16 bytes. I'm testing it now, and I'll write another update.

Matthew.
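For reference, the hash-based de-duplication idea being tested here would look something like the sketch below. This is a reconstruction, not the exact statements used in the thread, and md5() is just one possible digest.

    -- Replace the wide unique index on the raw value with one on its md5 digest:
    CREATE UNIQUE INDEX huge_table_value_md5_idx ON huge_table (md5(value));

    -- ON CONFLICT can then infer the expression index as its arbiter:
    INSERT INTO huge_table (value)
    VALUES ('example')
    ON CONFLICT (md5(value)) DO NOTHING;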
Re: insert and query performance on big string table with pg_trgm
> On Dec 5, 2017, at 11:23 PM, Sergei Kornilov wrote:
> You have very slow (or busy) disks; this is not a postgresql issue. Reading
> 6760 * 8KB in 70 seconds is a very bad result.
>
> For better performance you need better disks, at least raid10 (not raid5).
> Much more memory in shared_buffers can help with read performance and so
> reduce disk utilization, but write operations will still be slow.
>
> Sergei

Sergei,

Thanks so much for confirming; this really helps a lot in knowing what to do. I thought the disk could be some of my issue, but I wanted to make sure I did all of the obvious tuning first. I have learned some very valuable things from this exercise which I'll be able to use on future challenges.

Based on this advice from everyone, I'm setting up a box with more RAM, lots of SSDs, and RAID 10. I'll write back in a few more days after I've completed it.

I can also confirm that the previous advice about using a hash / digest based unique index seemed to make the loading process slower for me, not faster, which is an interesting result to consider for future users following this thread (if any). I don't yet have specific data on how much slower, because it's actually still going!

Sincerely,
Matthew.
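A hedged example of the shared_buffers change Sergei mentions above; the value is purely illustrative (a common starting point on a dedicated box is roughly 25% of RAM), and the setting only takes effect after a server restart.

    ALTER SYSTEM SET shared_buffers = '16GB';
    -- then restart the server for the new value to apply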
Re: Latest advice on SSD?
The most critical bit of advice I've found is setting this preference (see the sketch after the quoted message below):

https://amplitude.engineering/how-a-single-postgresql-config-change-improved-slow-query-performance-by-50x-85593b8991b0

I'm using 4 512GB Samsung 850 EVOs in a hardware RAID 10 on a 1U server with about 144 GB RAM and 8 Xeon cores. I usually burn up CPU more than disks or RAM now, whereas with magnetic disks I had horrible IO wait percentages, so it seems to be performing quite well so far.

Matthew Hall

> On Apr 9, 2018, at 7:36 PM, Craig James wrote:
>
> One of our four "big iron" (spinning disks) servers went belly up today.
> (Thanks, Postgres and pgbackrest! Easy recovery.) We're planning to move to a
> cloud service at the end of the year, so bad timing on this. We didn't want
> to buy any more hardware, but now it looks like we have to.
>
> I followed the discussions about SSD drives when they were first becoming
> mainstream; at that time, the Intel devices were king. Can anyone recommend
> what's a good SSD configuration these days? I don't think we want to buy a
> new server with spinning disks.
>
> We're replacing:
> 8 core (Intel)
> 48GB memory
> 12-drive 7200 RPM 500GB
> RAID1 (2 disks, OS and WAL log)
> RAID10 (8 disks, postgres data dir)
> 2 spares
> Ubuntu 16.04
> Postgres 9.6
>
> The current system peaks at about 7000 TPS from pgbench.
>
> Our system is a mix of non-transactional searching (customers) and
> transactional data loading (us).
>
> Thanks!
> Craig
>
> --
> Craig A. James
> Chief Technology Officer
> eMolecules, Inc.
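The setting referred to above is the planner's random page fetch cost (the same tunable mentioned later in this thread), which defaults to a spinning-disk value. A minimal sketch of lowering it for SSDs; the exact value is an assumption to be tuned for your own hardware.

    ALTER SYSTEM SET random_page_cost = 1.1;   -- default is 4.0, sized for spinning disks
    SELECT pg_reload_conf();                   -- picked up without a restart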
Re: Simple Query Elapsed Time ~ 3 hrs Using Bitmap Index/Heap Scan
Probably the cardinality of "vclf_number" is really bad, so the scan on that index is returning many millions or billions of rows, and then you get a recheck which takes nearly forever. You need an index on cl_value, or on both vclf_number and cl_value (see the sketch after the quoted message below). If you know some properties of the values actually stored in those columns, that will help.

Matthew Hall

> On Jun 5, 2018, at 7:17 AM, Fred Habash wrote:
>
> Trying to optimize the Elapsed Time (ET) of this query. Currently, it is
> hovering around 3 hrs.
>
> Running a 'vacuum analyze' had no effect on ET. Even forcing an 'index only'
> scan by disabling 'enable_seqscan' stays around the 3 hrs.
> The table is around 4.6B rows.
>
> explain select cit_id, cl_value from reflink.citation_locators where
> cl_value = '1507617681' and vclf_number = 1 ;
>                                        QUERY PLAN
> -----------------------------------------------------------------------------------------
>  Bitmap Heap Scan on citation_locators  (cost=5066559.01..50999084.79 rows=133 width=23)
>    Recheck Cond: (vclf_number = 1)
>    Filter: (cl_value = '1507617681'::text)
>    ->  Bitmap Index Scan on cl_indx_fk02  (cost=0.00..5066558.97 rows=493984719 width=0)
>          Index Cond: (vclf_number = 1)
> (5 rows)
>
>                              Table "reflink.citation_locators"
>       Column       |           Type           | Modifiers | Storage  | Stats target | Description
> -------------------+--------------------------+-----------+----------+--------------+-------------
>  cl_id             | bigint                   | not null  | plain    |              |
>  cl_value          | text                     | not null  | extended |              |
>  vclf_number       | integer                  | not null  | plain    |              |
>  cit_id            | bigint                   | not null  | plain    |              |
>  cl_date_created   | timestamp with time zone | not null  | plain    |              |
>  cl_date_modified  | timestamp with time zone |           | plain    |              |
> Indexes:
>     "cl_pk" PRIMARY KEY, btree (cl_id)
>     "cl_cnst_uk01" UNIQUE CONSTRAINT, btree (cit_id, vclf_number, cl_value)
>     "cl_indx_fk01" btree (cit_id)
>     "cl_indx_fk02" btree (vclf_number)
> Foreign-key constraints:
>     "cl_cnst_fk01" FOREIGN KEY (cit_id) REFERENCES citations(cit_id) NOT VALID
>     "cl_cnst_fk02" FOREIGN KEY (vclf_number) REFERENCES valid_cit_locator_fields(vclf_number)
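A sketch of the indexes suggested above; the index names are invented, and CONCURRENTLY is used only because building a plain index on a 4.6B-row table would lock out writes for a long time.

    CREATE INDEX CONCURRENTLY cl_indx_value
        ON reflink.citation_locators (cl_value);

    -- or, covering both predicates in the quoted query from one index:
    CREATE INDEX CONCURRENTLY cl_indx_vclf_value
        ON reflink.citation_locators (vclf_number, cl_value);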
Re: Slow query when pg_trgm is in inner lopp
Is there a reason you used GiST rather than GIN for your pg_trgm indexes? In my tests, and in previous posts on here, GiST nearly always performs worse (see the GIN sketch after the quoted message below). Also, did you check whether the storage is really SSD, and set random_page_cost accordingly?

Matthew Hall

> On Jun 20, 2018, at 8:21 AM, Sasa Vilic wrote:
>
> Hi everyone,
>
> we have a new query that performs badly with specific input parameters. We
> get the worst performance when the input data is most restrictive. I have
> partially identified the problem: it always happens when an index scan is done
> in the inner loop and the index type is pg_trgm. We also noticed that for a
> simple query
> (
>   select * from point where identifier = 'LOWW' vs
>   select * from point where identifier LIKE 'LOWW'
> )
> the difference between a btree index and a pg_trgm index can be quite high:
> 0.009 ms vs 32.0 ms.
>
> What I would like to know is whether the query planner is aware that some
> index types are more expensive than others, and whether it can take that into
> account.
>
> I will describe the background first, then give you the query and its
> analysis for different parameters, and at the end I will write about all the
> required information regarding the setup (Postgres version, schema, metadata,
> hardware, etc.).
>
> I would like to know whether this is a bug in the query planner or not, and
> what we could do about it.
>
> # Background
>
> We have a database with navigational data for civil aviation.
> The current query works on two tables: point and route.
> Point represents a navigational point on Earth and route describes a route
> between two points.
>
> The query that we have finds all routes between two sets of points. A set is
> dynamically/loosely defined by a pattern given by the user input. So, for
> example, if a user wants to find all routes from international airports in
> Austria to London Heathrow, he or she would use 'LOW%' as
> :from_point_identifier and 'EGLL' as :to_point_identifier. Please keep in
> mind that this is a simple case, and that the user is allowed to define the
> search term any way he/she sees fit, e.g. '%OW%', 'EG%'.
>
> SELECT
>   r.*
> FROM navdata.route r
>   INNER JOIN navdata.point op ON r.frompointguid = op.guid
>   INNER JOIN navdata.point dp ON r.topointguid = dp.guid
> WHERE
>   r.routeidentifier ILIKE :route_identifier
>   AND tsrange(r.startvalid, r.endvalid) @> :validity :: TIMESTAMP
>   AND (NOT :use_sources :: BOOLEAN OR r.source = ANY (:sources :: VARCHAR []))
>   AND CONCAT(op.identifier, '') ILIKE :from_point_identifier
>   AND op.type = ANY (:point_types :: VARCHAR [])
>   AND tsrange(op.startvalid, op.endvalid) @> :validity :: TIMESTAMP
>   AND dp.identifier ILIKE :to_point_identifier :: VARCHAR
>   AND dp.type = ANY (:point_types :: VARCHAR [])
>   AND tsrange(dp.startvalid, dp.endvalid) @> :validity :: TIMESTAMP
> ORDER BY r.routeidentifier
> LIMIT 1000
>
> Most of the tables we have follow this layout principle:
> * uid - the primary key
> * guid - a globally unique key (i.e. London Heathrow could for example change
>   its identifier EGLL, but our internal guid will stay the same)
> * startvalid, endvalid - define the period for which an entry is valid.
>   Entries with the same guid should not have overlapping validity.
>
> We don't use foreign keys for two reasons:
> * We need to do live migration without downtime. Creating a foreign key on a
>   huge dataset could take quite some time.
> * Relationships between entities are defined based on guid and not on uid
>   (primary key).
> # Query analysis
>
> # Case 1: We search for all outgoing routes from Vienna International Airport
>
> EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
> SELECT
>   r.*
> FROM navdata.route r
>   INNER JOIN navdata.point op ON r.frompointguid = op.guid
>   INNER JOIN navdata.point dp ON r.topointguid = dp.guid
> WHERE
>   r.routeidentifier ILIKE '%'
>   AND tsrange(r.startvalid, r.endvalid) @> now() :: TIMESTAMP
>   AND (NOT false :: BOOLEAN OR r.source = ANY (ARRAY[] :: VARCHAR []))
>   AND op.identifier ILIKE '%LOWW%'
>   AND op.type = ANY (ARRAY['PA'] :: VA
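Picking up the GIN suggestion from the top of this message, a minimal sketch of what that index change would look like; the column choice follows the quoted query, and the index name is invented.

    CREATE EXTENSION IF NOT EXISTS pg_trgm;

    -- GIN trigram index to support ILIKE '%...%' searches on point identifiers:
    CREATE INDEX point_identifier_trgm_gin
        ON navdata.point USING gin (identifier gin_trgm_ops);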
Re: Query with "ILIKE ALL" does not use the index
On Jul 26, 2018, at 9:44 AM, Tom Lane wrote:
> Nicolas Even writes:
>> However when I run the same (as far as I understand it) query but with
>> the ALL operator, the index is not used:
>> explain analyze select name from totoz where name ilike all(array['%tot%']);
>
> There's only index support for "op ANY (array)", not "op ALL (array)".
>
>             regards, tom lane

Nicolas,

Could you work around the limitation with a two-clause WHERE: first clause ANY, second clause ALL? I've done similar tricks on similar sorts of queries. A sketch is below.

Matthew.
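A minimal sketch of that workaround, using the table from Nicolas's example; the second array element is just illustrative.

    SELECT name
    FROM totoz
    WHERE name ILIKE ANY (ARRAY['%tot%', '%oz%'])   -- has index support, narrows the rows
      AND name ILIKE ALL (ARRAY['%tot%', '%oz%']);  -- enforces the original semantics as a filter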
Re: Would SSD improve Index Only Scan performance by a lot?
For indexes, SSDs are at least 4X faster, but you won't see that happen unless you fix the planner tunable for the random page fetch cost (random_page_cost) first. It's a super important change for SSDs.

Matthew Hall

> On Oct 8, 2019, at 5:12 PM, Rick Otten wrote:
>
>> On Tue, Oct 8, 2019 at 7:37 PM Arya F wrote:
>> As my table has gotten bigger, it takes longer to get a single row back when
>> querying a row by its btree index.
>>
>> Right now the database is running on a traditional HDD. SSDs have a much
>> faster seek time than traditional HDDs.
>>
>> Would switching to an SSD improve "Index Only Scan" time greatly? By at
>> least 3-4 times?
>
> *If* your query is disk I/O bound, SSD can help a lot.
>
> If your data is already in memory, or in the file system cache, and your
> query is bound by CPU, bloated/corrupted indexes, or some query inefficiency,
> then faster disks really won't do anything.
>
> Depending on the data type and size of the data, you may be able to help your
> query performance by choosing an index type other than the out-of-the-box
> btree as well (such as a hash or brin index), or maybe even a different sort
> order on the index, or a partial index.
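Hypothetical examples of the alternative index types mentioned in the quoted reply above; the table and column names are invented purely for illustration.

    CREATE INDEX ON events USING brin (created_at);           -- tiny index, good for naturally ordered, append-only data
    CREATE INDEX ON events USING hash (session_id);           -- equality lookups only
    CREATE INDEX ON events (user_id) WHERE deleted = false;   -- partial index covering just the hot subset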
