Re: How to use full-text search URL parser to filter query results by domain name?

2019-04-10 Thread Jess Wren
On 4/8/19 4:50 AM, Arthur Zakirov wrote:
> I think it is normal to use ts_parse(). And I suppose you might use
> windows functions.
>
> For example, you have table links:
>
> =# create table links (score int, link text);
> =# insert into links values
>   (1, 'http://www.foo.com/bar'),
>   (2, 'http://www.foo.com/foo'),
>   (2, 'http://www.bar.com/foo'),
>   (1, 'http://www.bar.com/bar');
>
> You can use the following query:
>
> =# with l as (
>   select score, token, link,
>     rank() over (partition by token order by score) as rank
>   from links,
>     lateral ts_parse('default', link)
>   where tokid = 6)
> select score, token, link from l where rank = 1;
>  score |    token    |          link
> -------+-------------+------------------------
>      1 | www.bar.com | http://www.bar.com/bar
>      1 | www.foo.com | http://www.foo.com/bar
>

Thank you very much Arthur. Your suggestion led me to a query that is at
least returning correct result set. I could not figure out how to get
your rank() function to work with my query, but building on your answer
(and others from IRC etc), I ended up with the following solution:

First I created the following views:

CREATE VIEW scored_pages AS (
    SELECT crawl_results.crawl_id, crawl_results.score,
           crawl_results.page_id, pages.url
    FROM crawl_results
    JOIN pages ON crawl_results.page_id = pages.id
);

CREATE VIEW scored_links AS (
    SELECT scored_pages.score, links.source, links.target, links.link_text
    FROM links
    JOIN scored_pages ON scored_pages.url = links.source
);

Then, using these views, I did the following query to extract the links
from the lowest scored pages in the results:

SELECT score, host, target
FROM (
    SELECT DISTINCT ON (token)
           token AS host, score, target
    FROM scored_links,
         LATERAL ts_parse('default', target)
    WHERE tokid = 6
    ORDER BY token, score
) AS x
WHERE NOT EXISTS (SELECT pp.id FROM pages pp WHERE target = pp.url)
ORDER BY score;

Does this seem like a reasonable approach? When running EXPLAIN on this
query, I get the following:

                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Sort  (cost=1252927.46..1252927.47 rows=1 width=100)
   Sort Key: crawl_results.score
   ->  Hash Anti Join  (cost=1248297.18..1252927.45 rows=1 width=100)
         Hash Cond: ((links.target)::text = (pp.url)::text)
         ->  Unique  (cost=1247961.08..1252591.28 rows=5 width=100)
               ->  Sort  (cost=1247961.08..1250276.18 rows=926040 width=100)
                     Sort Key: ts_parse.token, crawl_results.score
                     ->  Gather  (cost=1449.79..1054897.20 rows=926040 width=100)
                           Workers Planned: 2
                           ->  Hash Join  (cost=449.79..961293.20 rows=385850 width=100)
                                 Hash Cond: ((links.source)::text = (pages.url)::text)
                                 ->  Nested Loop  (cost=0.00..955091.41 rows=378702 width=144)
                                       ->  Parallel Seq Scan on links  (cost=0.00..4554.40 rows=75740 width=112)
                                       ->  Function Scan on ts_parse  (cost=0.00..12.50 rows=5 width=32)
                                             Filter: (tokid = 6)
                                 ->  Hash  (cost=404.67..404.67 rows=3609 width=63)
                                       ->  Hash Join  (cost=336.10..404.67 rows=3609 width=63)
                                             Hash Cond: (crawl_results.page_id = pages.id)
                                             ->  Seq Scan on crawl_results  (cost=0.00..59.09 rows=3609 width=12)
                                             ->  Hash  (cost=291.60..291.60 rows=3560 width=59)
                                                   ->  Seq Scan on pages  (cost=0.00..291.60 rows=3560 width=59)
         ->  Hash  (cost=291.60..291.60 rows=3560 width=55)
               ->  Seq Scan on pages pp  (cost=0.00..291.60 rows=3560 width=55)
(23 rows)


I am wondering if there is a more efficient way to do things. Some
people on IRC mentioned that it might be better to declare a scalar
function to return the host from ts_parse instead of the LATERAL query
... but I couldn't figure out how to do that, or whether it was even
preferable to the above from a performance standpoint. Any ideas on
how I could improve the above?
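
For reference, a minimal sketch of the scalar-function idea mentioned on IRC
might look like the following (the function name and the IMMUTABLE marking are
assumptions, not something from the thread):

    -- Extract the host token (tokid = 6) from a URL using the default parser.
    -- Marked IMMUTABLE so it could back an expression index; verify that is
    -- acceptable for your parser configuration before relying on it.
    CREATE OR REPLACE FUNCTION url_host(url text) RETURNS text AS $$
        SELECT token
        FROM ts_parse('default', url)
        WHERE tokid = 6
        LIMIT 1;
    $$ LANGUAGE sql IMMUTABLE STRICT;

    -- It can then stand in for the LATERAL join, e.g.:
    SELECT score, url_host(target) AS host, target FROM scored_links;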




pg_indexes doesn't show indexes for partitioned tables - bug or intended?

2019-04-10 Thread Thomas Kellerer
In Postgres 11.2, indexes defined on partitioned tables do not show up in 
pg_indexes (the actual indexes for the partitions however do show up).

E.g.: 

CREATE TABLE base_table
(
  column1 varchar(50) NOT NULL,
  column2 integer NOT NULL,
  column3 integer not null,
  part_key bigint NOT NULL
)
PARTITION BY HASH (part_key);

CREATE UNIQUE INDEX idx_one ON base_table (column1, column2, part_key);

The following select returns nothing:

  select *
  from pg_indexes
  where tablename = 'base_table';

This is caused by the fact that pg_indexes only returns information for regular 
tables and materialized views ("relkind in ('r','m')") and regular indexes 
(relkind = 'i').

If the relkind condition for the "table class" is extended to include 'p' as well, 
and the relkind condition for the "index class" is changed to 'i' and 'I', then 
those indexes are listed in pg_indexes as well:

 SELECT n.nspname AS schemaname,
c.relname AS tablename,
i.relname AS indexname,
t.spcname AS tablespace,
pg_get_indexdef(i.oid) AS indexdef
   FROM pg_index x
 JOIN pg_class c ON c.oid = x.indrelid
 JOIN pg_class i ON i.oid = x.indexrelid
 LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
 LEFT JOIN pg_tablespace t ON t.oid = i.reltablespace
  WHERE (c.relkind in ('r','m','p')) --<< add 'p' to the list
AND i.relkind in ('i', 'I') --<< add 'I' to the list



Is leaving out the indexes defined on the partitioned table intended or a bug? 

Regards
Thomas





Re: PK and FK using Hash index

2019-04-10 Thread Олег Самойлов


> On 22 March 2019, at 22:38, PegoraroF10 wrote:
> 
> In versions 10 and 11 the hash index was redesigned and now seems to be better
> than btree.
> Its speed, size and memory consumption are better, so ...
> Why is it not possible to have all PKs and FKs just using hash indexes? The only
> thing I need on a PK and FK is search and join, so a hash index responds very
> well.
> 
> I know a hash index doesn't accept a unique constraint, so I cannot create a
> primary key on it.
> But I think it would be great, wouldn't it?
> 
> Is there something wrong with my thinking?
> If not, is there a way to have a PK/FK structure using hash indexes?
> 
> 
> 
> --
> Sent from: http://www.postgresql-archive.org/PostgreSQL-general-f1843780.html
> 

The hash index is not as good as expected. Some strange behaviour is observed even 
in version 10.
https://www.postgresql.org/message-id/A841C4BC-A878-497E-AD9B-4DE0830DCC68%40ya.ru
Also, btree keeps the values inside the index, so the uniqueness check can be done 
by an index-only scan. The hash index cannot do that.
Also, hash indexes don't support multiple columns, without an obvious reason as far 
as I can see. This could be done simply: just hash every column and XOR the results.

I think the hash index is perfect for a natural primary key (text based, for 
instance), but only theoretically. The PostgreSQL implementation still has too many 
problems.
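
For illustration, a small hedged sketch of the limitation being discussed (table
and index names are made up; behaviour as of PostgreSQL 10/11):

    -- A plain hash index for equality lookups works:
    CREATE TABLE items (sku text);
    CREATE INDEX items_sku_hash ON items USING hash (sku);

    -- But hash indexes cannot enforce uniqueness, so this fails with an error
    -- along the lines of: access method "hash" does not support unique indexes
    -- CREATE UNIQUE INDEX items_sku_uq ON items USING hash (sku);

    -- A primary key therefore still has to be backed by a btree:
    ALTER TABLE items ADD PRIMARY KEY (sku);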



Re: Performance of ByteA: ascii vs binary

2019-04-10 Thread Олег Самойлов
Just a theoretical assumption. PostgreSQL sometimes optimises the internal 
format of data and can compress it (TOAST compression). ASCII data compresses 
better than random binary data. PostgreSQL may also decide to store a column 
out of line if it is still too big after compression. I don't know what exactly 
happens in your case, but this could be the reason.
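
A quick, hedged way to check that theory is to compare the stored size of a highly
compressible ASCII value against random bytes of the same length (this sketch
assumes the pgcrypto extension for gen_random_bytes(); names are illustrative):

    CREATE EXTENSION IF NOT EXISTS pgcrypto;
    CREATE TABLE bytea_test (ascii_val bytea, random_val bytea);

    INSERT INTO bytea_test
    SELECT convert_to(repeat('abcdefgh', 125000), 'UTF8'),         -- ~1 MB of ASCII
           (SELECT string_agg(gen_random_bytes(1024), '')
            FROM generate_series(1, 1000));                        -- ~1 MB of random bytes

    SELECT octet_length(ascii_val)    AS raw_ascii,
           pg_column_size(ascii_val)  AS stored_ascii,
           octet_length(random_val)   AS raw_random,
           pg_column_size(random_val) AS stored_random
    FROM bytea_test;

    -- If stored_ascii is much smaller than raw_ascii while stored_random is not,
    -- TOAST compression accounts for at least part of the difference.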

> On 18 March 2019, at 17:33, Thomas Güttler wrote:
> 
> I did some benchmarking and in my setup there was major
> performance difference.
> 
> I tested a ByteA column.
> 
> If I used ascii data the tests took 52 seconds.
> If I used random binary data the test took 250 seconds.
> 
> binary data is (roughly) five times slower than ascii data?
> 
> Is this a know fact, or is there something wrong with my benchmark?
> 
> I used Python and psycopg2.
> 
> Regards,
>  Thomas Güttler
> 
> 
> -- 
> Thomas Guettler http://www.thomas-guettler.de/
> I am looking for feedback: https://github.com/guettli/programming-guidelines
> 





Re: New timeline when starting with a restored data dir

2019-04-10 Thread Олег Самойлов
It’s normal behaviour to create a new timeline when a PostgreSQL slave becomes 
a new master. This is what timelines were designed for.

> On 19 November 2018, at 0:50, Yuri Kanivetsky wrote:
> 
> Hi,
> 
> I'm trying to set up Barman, and I've noticed that PostgreSQL 10
> creates a new timeline when starting with a restored data dir.
> Particularly, when WAL files are delivered via restore_command. When
> WAL files are copied to pg_wal dir, no new timeline is created. Is it
> to be expected? Can you explain?
> 
> I decided to ask here first, since you might know better how
> PostgreSQL operates.
> 
> Thanks in advance.
> 
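
For context, a minimal sketch of the two startup modes being compared (the path is
illustrative; PostgreSQL 10 style recovery.conf):

    # Archive recovery: recovery.conf is present and WAL is fetched via
    # restore_command. When recovery ends, the server switches to a new timeline.
    restore_command = 'cp /path/to/wal/archive/%f %p'

    # By contrast, if the WAL files are simply copied into pg_wal and there is no
    # recovery.conf, the server performs ordinary crash recovery and stays on the
    # same timeline.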





Re: Invoking user of the function with SECURITY DEFINER

2019-04-10 Thread Олег Самойлов
Looked like a bug.

> On 25 November 2018, at 14:50, Madan Kumar wrote:
> 
> Got it..
> In my case I was getting session_user in the DECLARE section and trying to 
> validate it later, which always resulted in the function owner.
>  DECLARE
>   user text := SESSION_USER;
> 
> So using it within the BEGIN; ...; END; clause worked for me.
> Thanks.
> 
> 
> Warm Regards,
> Madan Kumar K
>  
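
A minimal sketch of the workaround Madan describes, i.e. assigning SESSION_USER in
the function body rather than in the DECLARE section (the function name is
illustrative):

    CREATE OR REPLACE FUNCTION whoami() RETURNS text
    LANGUAGE plpgsql SECURITY DEFINER AS $$
    DECLARE
        invoker text;
    BEGIN
        invoker := session_user;   -- assigned inside BEGIN ... END, per the workaround
        RETURN invoker;
    END;
    $$;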





Re: Does pg_stat_get_live_tuples() matter?

2019-04-10 Thread Sherrylyn Branchaw
> Hmmm ... what was in reltuples again?

Reltuples had the correct number, and the query plans were showing the
correct estimates.

> it's a bit hard to believe that it could get to be off by 1000X.  Have
you suppressed autovacuum on this table?

No, but here are some things I've observed:

1)  Any time pg_stat_reset() gets run, pg_stat_user_tables.n_live_tup gets
reset to 0. pg_class.reltuples is untouched.

2) If new tuples get inserted or deleted after pg_stat_reset(),
pg_stat_user_tables.n_live_tup will match pg_stat_user_tables.n_tup_ins -
pg_stat_user_tables.n_tup_del up until the next analyze or autoanalyze.

3) Once the next (auto)analyze happens, pg_stat_user_tables.n_live_tup will
be updated to match pg_class.reltuples.

4) If a table is very large, it may be a while before
autovacuum_analyze_scale_factor is reached. We have ours set to 0.01, which
is an order of magnitude larger than 1.8K on a 1.8M row table. (I would
like to tune our settings more, but I have a list of priorities from
management on which this isn't high.)

5) Given 1-4, pg_stat_user_tables.n_live_tup may spend quite a long time
matching pg_stat_user_tables.n_tup_ins - pg_stat_user_tables.n_tup_del
instead of pg_class.reltuples. For example, if a table has 1.8 million
rows, and you insert 5 and delete 4 after a stats reset, n_live_tup will
report that the table has 1 tuple.

6)  Failovers, at least in Aurora, apparently cause pg_stat_reset() to be
run, at least judging by the timestamp I'm seeing in
pg_stat_bgwriter.stat_reset. We haven't done a failover in the data center
in a while, and it's less trivial for me to test there atm, so I'm not
certain whether open-source Postgres failovers also reset statistics.
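
The divergence described in points 1-5 can be watched with a query along these
lines (a hedged sketch; run it before and after a stats reset or an ANALYZE on a
table of interest):

    SELECT s.relname,
           s.n_live_tup,
           s.n_tup_ins - s.n_tup_del AS ins_minus_del,
           c.reltuples::bigint       AS reltuples
    FROM pg_stat_user_tables s
    JOIN pg_class c ON c.oid = s.relid
    ORDER BY c.reltuples DESC
    LIMIT 20;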

> I don't see anything in the current core code that pays attention to
n_live_tuples.  reltuples definitely does matter to the planner, and some
of the sibling counters like n_dead_tuples drive autovacuum, but nothing is
examining n_live_tuples AFAICS.

That's what I thought, but I wanted to make sure I wasn't missing anything
obvious. Thanks!

> some of the sibling counters like n_dead_tuples drive autovacuum

So that's interesting. I knew that, but hadn't thought about the
implications. If pg_stat_reset() is executed by failovers, and a failover
happens just before a table is ready to be vacuumed--say it has 0.009 *
reltuples dead tuples (I'm simplifying the formula here)--then n_dead_tup
gets reset to 0 and the counting starts all over again. Regular failovers
could thus increase bloat by delaying the autovacuum daemon from
recognizing that a table needs to be vacuumed, am I right?

Is it recommended practice to manually VACUUM ANALYZE the whole database
after a failover? Or is resetting stats after a failover just an Aurora
thing? I'm sorry I'm asking the latter question instead of testing, but
I've been ordered not to spend time on improving our vacuuming and
statistics until 5 other large projects are done, and I'm spending a
minimal amount of time anyway just to see how badly frequent failovers
might be affecting us and if there's any action we need to take.

Thanks,
Sherrylyn


Re: pg_indexes doesn't show indexes for partitioned tables - bug or intended?

2019-04-10 Thread David Rowley
On Thu, 11 Apr 2019 at 00:39, Thomas Kellerer  wrote:
>
> In Postgres 11.2, indexes defined on partitioned tables do not show up in 
> pg_indexes (the actual indexes for the partitions however do show up).

> Is leaving out the indexes defined on the partitioned table intended or a bug?

Overlooked for PG11. You'll see them in PG12 per [1]. It's not really
possible to backpatch a fix for that since these views are created
during initdb.

[1] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=f94cec64476f2752e91b10d7928a2fcd105e9fc3

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: pg_indexes doesn't show indexes for partitioned tables - bug or intended?

2019-04-10 Thread Alvaro Herrera
On 2019-Apr-11, David Rowley wrote:

> On Thu, 11 Apr 2019 at 00:39, Thomas Kellerer  wrote:
> >
> > In Postgres 11.2, indexes defined on partitioned tables do not show up in 
> > pg_indexes (the actual indexes for the partitions however do show up).
> 
> > Is leaving out the indexes defined on the partitioned table intended or a 
> > bug?
> 
> Overlooked for PG11. You'll see them in PG12 per [1]. It's not really
> possible to backpatch a fix for that since these views are created
> during initdb.

(You can, of course, adjust the view definition yourself.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pg_indexes doesn't show indexes for partitioned tables - bug or intended?

2019-04-10 Thread Thomas Kellerer

David Rowley wrote on 10.04.2019 at 17:57:
>> In Postgres 11.2, indexes defined on partitioned tables do not show up in
>> pg_indexes (the actual indexes for the partitions however do show up).
>>
>> Is leaving out the indexes defined on the partitioned table intended or a bug?
>
> Overlooked for PG11. You'll see them in PG12 per [1]. It's not really
> possible to backpatch a fix for that since these views are created
> during initdb.

Thanks.

I don't understand though why it's not back-patched - at least that would fix 
the bug for new installations.

Thomas




Re: stale WAL files?

2019-04-10 Thread Rob Sargent




As per your configuration :
max_wal_size = 50GB
this seems to be the cause for the WAL files piling up.

this has been declared twice, the last one is taking effect.
--
El genio es 1% inspiración y 99% transpiración.
Thomas Alva Edison
http://pglearn.blogspot.mx/

I've managed to generate another 359 WAL files in a 10-minute span 
yesterday (now only 357 remain, and I suspect they will wither away as 
before).  Are these being held simply because of the high max_wal_size 
value?


This is a development environment, wherein I'm loading 4M+ records, 
first into 41 staging tables at 100K rows each.  In a loop over each staging 
table, the data is then placed into application tables via selects: 
first select * into the "matching table", then select id into an intersection 
record (id, fixed groupId).  Each such iteration is in its own 
transaction.  I have dropped and recreated this same database numerous 
times, working my way up from 100K to 4M records, dialing in application 
parameters according to the number of primary records. I have not, however, 
dropped the last incarnation.




Re: How to use full-text search URL parser to filter query results by domain name?

2019-04-10 Thread Michel Pelletier
On Wed, Apr 10, 2019 at 1:58 AM Jess Wren  wrote:

> ->  Parallel Seq Scan on links  (cost=0.00..4554.40 rows=75740 width=112)
>
> ->  Function Scan on ts_parse  (cost=0.00..12.50 rows=5 width=32)
>       Filter: (tokid = 6)
> (23 rows)
>
>
>
> I am wondering if there is a more efficient way to do things? Some people
> on IRC mentioned that it might be better to declare a scalar function to
> return the host from ts_parse instead of the LATERAL query ... but I
> couldn't figure out how to do that, or if it was even preferable to the
> above from a performance standpoint ... any ideas on how I could improve
> the above.
>

You might try indexing the parsed expression to avoid the seq scan on links,
something like:

create index on links (ts_parse('default', target));

and then run the explain (or explain analyze) to see if that improves
things.  Certainly as the links table gets bigger this should help.


both force order of evaluation and hit index

2019-04-10 Thread Justin Pryzby
I wrote a script to import CSV query logs as described here:
https://www.postgresql.org/docs/current/runtime-config-logging.html#RUNTIME-CONFIG-LOGGING-CSVLOG

I also created indices for searching:
|CREATE INDEX ON postgres_log_2019_04_09_2300 ((regexp_replace(message, '^duration: ([.0-9]+) ms.*', '\1')::float)) WHERE message~'^duration:'

I'm having an issue that queries on long duration 1) need to specify "WHERE
message~'^duration'" to hit the index; 2) need to *evaluate* that condition
before the next condition:
|regexp_replace(message, '^duration: ([.0-9]+) ms.*', '\1')::float))>

I've tried using CASE and OFFSET 0 and WITH, it seems like any effort I make to
force evaluation of "duration>" to happen before "message~'^duration'" also
causes a seq scan rather than index scan on query duration.

I'm hesitating to use ORDER BY just to encourage index scan, since it doesn't
even guarantee that it's going to avoid the error.

This is doing index scan on other/"wrong" index:
|postgres=# explain SELECT date_trunc('minute',log_time)log_time, error_severity, session_id, user_name, database, left(message,99), left(query,99)
|FROM postgres_log_2019_04_10_1000 WHERE log_time>'2019-04-10 10:51' AND
|CASE WHEN message~'^duration:' THEN regexp_replace(message, '^duration: ([.0-9]+) ms.*', '\1')::float ELSE NULL END >  AND
|(application_name!='psql' OR user_name='postgres') ORDER BY 1 DESC;

| Sort  (cost=74908.17..74955.87 rows=19080 width=101)
|   Sort Key: (date_trunc('minute'::text, log_time)) DESC
|   ->  Bitmap Heap Scan on postgres_log_2019_04_10_1000  (cost=22.24..73551.61 rows=19080 width=101)
|         Recheck Cond: (log_time > '2019-04-10 10:51:00-04'::timestamp with time zone)
|         Filter: (((application_name <> 'psql'::text) OR (user_name = 'postgres'::text)) AND (CASE WHEN (message ~ '^duration:'::text) THEN (regexp_replace(message, '^duration: ([.0-9]+) ms.*'::text, '\1'::text))::double precision ELSE NULL::double precision END > ''::double precision))
|         ->  Bitmap Index Scan on postgres_log_2019_04_10_1000_log_time_idx  (cost=0.00..17.47 rows=58197 width=0)
|               Index Cond: (log_time > '2019-04-10 10:51:00-04'::timestamp with time zone)

This gives a nice plan, but doesn't avoid errors due to casting to float lines
which don't match the duration regex.

postgres=# explain SELECT date_trunc('minute',log_time)log_time, dur, error_severity, session_id, user_name, database, left(message,99), left(query,99)
FROM (SELECT *,regexp_replace(message, '^duration: ([.0-9]+) ms.*', '\1')::float AS dur FROM postgres_log_2019_04_10_1000 WHERE log_time>'2019-04-10 10:50' AND message~'^duration:')x WHERE
dur> AND (application_name!='psql' OR user_name='postgres') ;

| Bitmap Heap Scan on postgres_log_2019_04_10_1000  (cost=5214.86..50141.68 rows=19742 width=109)
|   Recheck Cond: ((log_time > '2019-04-10 10:50:00-04'::timestamp with time zone) AND ((regexp_replace(message, '^duration: ([.0-9]+) ms.*'::text, '\1'::text))::double precision > ''::double precision) AND (message ~ '^duration:'::text))
|   Filter: ((application_name <> 'psql'::text) OR (user_name = 'postgres'::text))
|   ->  BitmapAnd  (cost=5214.86..5214.86 rows=19819 width=0)
|         ->  Bitmap Index Scan on postgres_log_2019_04_10_1000_log_time_idx  (cost=0.00..17.50 rows=59462 width=0)
|               Index Cond: (log_time > '2019-04-10 10:50:00-04'::timestamp with time zone)
|         ->  Bitmap Index Scan on postgres_log_2019_04_10_1000_duration_idx  (cost=0.00..5187.23 rows=238241 width=0)
|               Index Cond: ((regexp_replace(message, '^duration: ([.0-9]+) ms.*'::text, '\1'::text))::double precision > ''::double precision)
|(8 rows)

|ERROR:  invalid input syntax for type double precision: "temporary file: path "pg_tblspc/3292386340/PG_11_201809051/pgsql_tmp/pgsql_tmp16493.0", size 603619328"

If I write it as a CTE or an OFFSET 0 subquery, that doesn't allow scanning the "duration"
index, and it scans the time index instead:

|postgres=# explain analyze WITH x AS (SELECT *,regexp_replace(message, '^duration: ([.0-9]+) ms.*', '\1')::float AS dur FROM postgres_log_2019_04_10_1000 WHERE log_time>'2019-04-10 10:50' AND message~'^duration:') SELECT date_trunc('minute',log_time)log_time, dur, error_severity, session_id, user_name, database, left(message,99), left(query,99)
FROM x WHERE dur> AND (application_name!='psql' OR user_name='postgres') ;

|postgres=# explain analyze SELECT date_trunc('minute',log_time)log_time, dur, error_severity, session_id, user_name, database, left(message,99), left(query,99)
FROM (SELECT *,regexp_replace(message, '^duration: ([.0-9]+) ms.*', '\1')::float AS dur FROM postgres_log_2019_04_10_1000 WHERE log_time>'2019-04-10 10:50' AND message~'^duration:' OFFSET 0)x WHERE
dur> AND (application_name!='psql' OR user_name='postgres') ;

Thanks in advance for any advice.

Justin
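
Not an answer from the thread, but one hedged way to sidestep both problems is to
fold the regex guard into a helper function and index that expression, so the cast
can never run on non-duration lines regardless of evaluation order (the function
and index names below are illustrative assumptions):

    CREATE OR REPLACE FUNCTION log_duration_ms(msg text) RETURNS float8
    LANGUAGE sql IMMUTABLE AS $$
        SELECT CASE WHEN msg ~ '^duration: '
                    THEN regexp_replace(msg, '^duration: ([.0-9]+) ms.*', '\1')::float8
               END
    $$;

    CREATE INDEX ON postgres_log_2019_04_10_1000 (log_duration_ms(message));

    -- Queries can then filter on log_duration_ms(message) > some_threshold and use
    -- the expression index without relying on the order in which predicates run.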




Re: pg_indexes doesn't show indexes for partitioned tables - bug or intended?

2019-04-10 Thread Tom Lane
Thomas Kellerer  writes:
> David Rowley schrieb am 10.04.2019 um 17:57:
>> Overlooked for PG11. You'll see them in PG12 per [1]. It's not really
>> possible to backpatch a fix for that since these views are created
>> during initdb.

> I don't undertand though why it's not back patched - at least that would fix 
> the bug for new installations

David overstated the situation --- it'd be *possible* to back-patch a
fix for that, if we thought that the bug was of sufficient importance.
But the costs of such changes are way way higher than "change a couple
of lines in system_views.sql".  We would also have to write documentation
about how to fix it manually, along the lines of the first bullet point in
[1], and then users would either have to jump through that hoop or decide
that it wasn't worth it to them (which it wouldn't be, for most).  And,
having different behaviors in different "v11" installations is not really
all that nice, especially for something that's only debatably a bug.
So I concur with the decision not to back-patch.

regards, tom lane

[1] https://www.postgresql.org/docs/9.6/release-9-6-5.html




Re: stale WAL files?

2019-04-10 Thread Rene Romero Benavides
What's your current max_wal_size parameter?
SHOW max_wal_size;
If it's 8GB as your configuration's previous value, you would get a
constant share of 512 WAL files. If it's a development environment set it
to the desired size, the smaller the value, the more frequent the
checkpoints, but your checkpoint_timeout value is 300 (5 minutes) which is
likely to be happening first, and thus being the one triggering checkpoints
that often.
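
A hedged way to keep an eye on this from SQL (assuming PostgreSQL 10 or later for
pg_ls_waldir()):

    SHOW max_wal_size;
    SHOW checkpoint_timeout;

    -- Count and total size of WAL segment files currently kept in pg_wal:
    SELECT count(*) AS wal_files,
           pg_size_pretty(sum(size)) AS total_size
    FROM pg_ls_waldir()
    WHERE name ~ '^[0-9A-F]{24}$';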

On Wed, Apr 10, 2019 at 1:12 PM Rob Sargent  wrote:

>
>
> As per your configuration :
> max_wal_size = 50GB
> this seems to be the cause for the WAL files piling up.
>
> this has been declared twice, the last one is taking effect.
> --
> El genio es 1% inspiración y 99% transpiración.
> Thomas Alva Edison
> http://pglearn.blogspot.mx/
>
> I've manage to generate another 359 WAL files in a 10 minute span
> yesterday (now only 357 remain and I suspect they will wither away as
> before).  Are these being held simply because of the high max_wal_size
> value?
>
> This is a development environment, wherein I'm loading 4M+ records, first
> into 41 staging tables 100K rows per.  In a loop over each staging table,
> the data is then placed into application tables via selects. First select *
> into "matching table" then select id into intersection record (id, fixed
> groupId).  Each such iteration is in it's own transaction.  I have dropped
> and recreate this same database numerous times working my way up from 100K
> to 4M records, dialing in application parameters according to number of
> primary records.  I have not, however, dropped the last incarnation.
>


-- 
El genio es 1% inspiración y 99% transpiración.
Thomas Alva Edison
http://pglearn.blogspot.mx/


Re: stale WAL files?

2019-04-10 Thread Rob Sargent


On 4/10/19 6:03 PM, Rene Romero Benavides wrote:

What's your current max_wal_size parameter?
SHOW max_wal_size;
If it's 8GB as your configuration's previous value, you would get a 
constant share of 512 WAL files. If it's a development environment set 
it to the desired size, the smaller the value, the more frequent the 
checkpoints, but your checkpoint_timeout value is 300 (5 minutes) 
which is likely to be happening first, and thus being the one 
triggering checkpoints that often.


On Wed, Apr 10, 2019 at 1:12 PM Rob Sargent wrote:





As per your configuration :
max_wal_size = 50GB
this seems to be the cause for the WAL files piling up.

this has been declared twice, the last one is taking effect.
-- 
El genio es 1% inspiración y 99% transpiración.

Thomas Alva Edison
http://pglearn.blogspot.mx/


I've manage to generate another 359 WAL files in a 10 minute span
yesterday (now only 357 remain and I suspect they will wither away
as before).  Are these being held simply because of the high
max_wal_size value?

This is a development environment, wherein I'm loading 4M+
records, first into 41 staging tables 100K rows per.  In a loop
over each staging table, the data is then placed into application
tables via selects. First select * into "matching table" then
select id into intersection record (id, fixed groupId).  Each such
iteration is in it's own transaction.  I have dropped and recreate
this same database numerous times working my way up from 100K to
4M records, dialing in application parameters according to number
of primary records.  I have not, however, dropped the last
incarnation.



We have not yet reconfigured the max_wal_size parameter; it is still 50GB:

postgres-# show max_wal_size
postgres-# ;
 max_wal_size
--
 50GB
(1 row)

I'm sorry, I don't follow your thesis that this setting would lead to 
"a constant" 512 WAL files.  There was a distinct burst of WALs in a 10-minute 
period yesterday, and no accumulation since then (though a manual 
checkpoint does generate another WAL).






os upgrade 7.3 to 7.5 (postgres version 10.5)

2019-04-10 Thread Prakash Ramakrishnan
Hi Team,

We are planning to upgrade the OS from RHEL 7.3 to 7.5. We are currently running
PostgreSQL community version 10.5, and there are some extensions in use in the
database. We need your input: if we upgrade the OS, will it impact anything in the
current settings or extensions? Please do the needful. Thanks in advance.

-- 


Regards,
Prakash.R
Postgresql DBA


Re: os upgrade 7.3 to 7.5 (postgres version 10.5)

2019-04-10 Thread Thomas Munro
On Thu, Apr 11, 2019 at 2:47 PM Prakash Ramakrishnan
 wrote:
> We are planning to upgrade the OS from RHEL 7.3 to 7.5. We are currently running
> PostgreSQL community version 10.5, and there are some extensions in use in the
> database. We need your input: if we upgrade the OS, will it impact anything in the
> current settings or extensions? Please do the needful. Thanks in advance.

One thing to look into when upgrading the OS is libc changes that
might affect collations.  I don't have specific information on the
glibc versions that ship with 7.3 and 7.5 and which collations might
have changed.  Unfortunately the compatibility of collation
definitions is not directly exposed to us (something I'd personally
really like to fix, somehow) so we don't have great information about
when you need to run REINDEX on your btrees.  The word on the street
is, for example, that glibc 2.28 corrupts indexes even for English
language collations (because the sort order of ' ', '-' and some other
symbols moved around), and other versions have affected individual
particular languages (I forget which one corrupted German language
indexes a few years back).  It'll probably just work, though.
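
As a rough, hedged way to scope the REINDEX question, you could list indexes whose
key columns are of collatable text types (a heuristic only; it does not catch
expression indexes, and not every listed index is necessarily affected):

    SELECT DISTINCT i.indexrelid::regclass AS index_name
    FROM pg_index i
    JOIN pg_attribute a
      ON a.attrelid = i.indrelid
     AND a.attnum = ANY (i.indkey)
    WHERE a.atttypid IN ('text'::regtype, 'varchar'::regtype, 'bpchar'::regtype);

    -- If in doubt after the glibc upgrade:
    -- REINDEX INDEX index_name;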

-- 
Thomas Munro
https://enterprisedb.com




Re: os upgrade 7.3 to 7.5 (postgres version 10.5)

2019-04-10 Thread Prakash Ramakrishnan
Thanks for the information. So we can upgrade the OS then, right, Thomas?

On Thu, Apr 11, 2019, 09:39 Thomas Munro  wrote:

> On Thu, Apr 11, 2019 at 2:47 PM Prakash Ramakrishnan
>  wrote:
> > We are planning to upgrade the OS from RHEL 7.3 to 7.5. We are currently
> > running PostgreSQL community version 10.5, and there are some extensions in
> > use in the database. We need your input: if we upgrade the OS, will it impact
> > anything in the current settings or extensions? Please do the needful. Thanks
> > in advance.
>
> One thing to look into when upgrading the OS is libc changes that
> might affect collations.  I don't have specific information on the
> glibc versions that ship with 7.3 and 7.5 and which collations might
> have changed.  Unfortunately the compatibility of collation
> definitions is not directly exposed to us (something I'd personally
> really like to fix, somehow) so we don't have great information about
> when you need to run REINDEX on your btrees.  The word on the street
> is, for example, that glibc 2.28 corrupts indexes even for English
> language collations (because the sort order of ' ', '-' and some other
> symbols moved around), and other versions have affected individual
> particular languages (I forget which one corrupted German language
> indexes a few years back).  It'll probably just work, though.
>
> --
> Thomas Munro
> https://enterprisedb.com
>


Re: Recommendation to run vacuum FULL in parallel

2019-04-10 Thread Perumal Raj
Thanks Kevin for the inputs.

In my case there are 500+ tables, the biggest chunk being 30GB (table only) plus
its indexes. So I created 6 batches and executed them in parallel. All my scripts
completed in 2 hours and my DB size came down from 500GB to 300GB.

Yes, I do see a CPU spike, but I did the whole activity with full application
downtime.

Going forward I am going to run vacuum on a daily basis to maintain the DB size.

Also, table/DB age came down drastically.

Thanks
Raj

On Thu, Apr 4, 2019 at 12:53 PM Kevin Brannen  wrote:

> *From:* Perumal Raj 
>
> So conclude the requirement here , The only way to parallelism is multiple
> script. And no need to do REINDEX exclusively.
>
> Question : Do we need to consider  Table dependencies while preparing
> script in order to avoid table locks during vacuum full ?
>
>
>
> We have a small bash script (see below) that gets the list of tables and
> their sizes, sorted smallest to largest, and does “vacuum full” one at a time
> because (as someone else pointed out) this is very I/O intensive. That
> order also helps to ensure we finish because some of our installs are at
> the edge of running out of space (an issue we’re dealing with). I probably
> wouldn’t have a problem doing 2 at a time, but we do this in the middle of
> the night when activity is lowest and it only takes 1-2 hours, so we’re
> good with it. It sounds like you have a lot more data though.
>
>
>
> You might also consider putting the data into different tablespaces which
> are spread over multiple disks to help I/O. If you can, use SSD drives,
> they help with speed quite a bit. 😊
>
>
>
> Don’t worry about table dependencies. This is a physical operation, not a
> data operation.
>
>
>
> HTH,
>
> Kevin
>
>
>
> $PGPATH/psql -t -c "
>
> WITH s AS (SELECT nspname || '.' || relname AS TABLE_NAME,
> pg_total_relation_size(c.oid) AS total_bytes
>
>   FROM pg_class c
>
>   LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
>
>   WHERE relkind = 'r' AND nspname NOT IN ( 'pg_catalog',
> 'information_schema' )
>
>   ORDER BY 2 )
>
> SELECT table_name FROM s
>
> " |
>
> while read t ; do echo "" ; echo $t; $PGPATH/vacuumdb -w -z -f -t $t ;
> done
>