Re: Loading 500m json files to database

2020-03-24 Thread Peter J. Holzer
On 2020-03-23 17:18:45 -0700, pinker wrote:
> Christopher Browne-3 wrote
> > Well, you're paying for a lot of overhead in that, as you're
> > establishing a psql command, connecting to a database, spawning a
> > backend process, starting a transaction, committing a transaction,
> > closing the backend process, disconnecting from the database, and
> > cleaning up after the launching of the psql command.  And you're
> > doing that 500 million times.
> > 
> > The one thing I left off of that was the loading of a single tuple into
> > json_parts.
[...]
> > Reducing that overhead is the single most important thing you can do.
> 
> Yes, I was thinking about that, but I have no idea how to do it right now.

Do you know any programming language (Python, Perl, ...)? You'll
probably get a huge gain from writing a script that just opens the
connection once and then inserts each file. 

COPY is usually faster still by a fair amount, but since you have to
read the data for each row from a different file (and, if I understood
you correctly, a remote one at that), the additional speedup is
probably not that great in this case.
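
In shell alone it could look something like this (a minimal, untested
sketch: json_parts is from your earlier mail; the js column, the
paths, and the database name are assumptions):

#!/bin/bash
# One psql session: one connection, one backend, commands streamed in
# (each \copy still commits separately). Assumes each file holds a
# single-line JSON document and file names contain no spaces/quotes.
find /data/json -name '*.json' |
sed 's/^/\\copy json_parts (js) from /' |
psql -q -d mydb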

Splitting the work into batches and executing several batches in
parallel probably helps, too.
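
For example, building on the sketch above (same assumptions; batch
size and parallelism are knobs to tune):

#!/bin/bash
# Split the file list into batches of 1000 and run up to 8 loader
# processes at a time, one psql session per batch.
find /data/json -name '*.json' -print0 |
xargs -0 -n 1000 -P 8 sh -c '
    for f in "$@"; do
        printf "\\\\copy json_parts (js) from %s\n" "$f"
    done | psql -q -d mydb
' _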

hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | h...@hjp.at        |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"




avoid WAL for refresh of materialized view

2020-03-24 Thread Remund Alain
Hi all

We have PostgreSQL 9.6 running and started to work with materialized views. To 
refresh the materialized views, we set up a cron job that refreshes the 
materialized views on a fixed schedule.
Since our materialized views cache quite a lot of data, we noticed a considerable 
increase in WAL files. It seems that every refresh of a materialized view is 
logged in the WAL.

We tried to figure out how we can alter the materialized view to set it to 
"UNLOGGED" but this does not seem possible.
--> "alter materialized view xyz set UNLOGGED;" leads to "ERROR:  "xyz" is not 
a table, SQL state: 42809"

Is there another way to avoid logging a refresh of a materialized view in the 
WAL?

Kind regards,
Alain Remund


Re: avoid WAL for refresh of materialized view

2020-03-24 Thread Guillaume Lelarge
On Tue, 24 Mar 2020 at 17:00, Remund Alain  wrote:

> Hi all
>
> We have PostgreSQL 9.6 running and started to work with materialized
> views. To refresh the materialized views, we set up a cron job that
> refreshes the materialized views on a fixed schedule.
>
> Since our materialized views cache quite a lot of data, we noticed a
> considerable increase in WAL files. It seems that every refresh of a
> materialized view is logged in the WAL.
>
> We tried to figure out how we can alter the materialized view to set it to
> "UNLOGGED" but this does not seem possible.
>
> --> "alter materialized view xyz set UNLOGGED;" leads to "ERROR:  "xyz" is
> not a table, SQL state: 42809"
>
> Is there another way to avoid logging a refresh of a materialized view in
> the WAL?

As you say, there is no unlogged materialized view. So, no, a refresh
will always write to the WAL.


-- 
Guillaume.


Re: avoid WAL for refresh of materialized view

2020-03-24 Thread Jerry Sievers
Remund Alain  writes:

> Hi all
>
> We have PostgreSQL 9.6 running and started to work with materialized
> views. To refresh the materialized views, we set up a cron job that
> refreshes the materialized views on a fixed schedule.
>
> Since our materialized views cache quite a lot of data, we noticed a
> considerable increase in WAL files. It seems that every refresh of a
> materialized view is logged in the WAL.
>
> We tried to figure out how we can alter the materialized view to set
> it to "UNLOGGED" but this does not seem possible.
>
> --> "alter materialized view xyz set UNLOGGED;" leads to "ERROR:
> "xyz" is not a table, SQL state: 42809"
>
> Is there another way to avoid logging a refresh of a materialized
> view in the WAL?

The workaround for this is to not use a materialized view at all, but
instead materialize the output into an unlogged table that you truncate
before every refresh.
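
A minimal sketch of that, assuming a made-up table and defining query
(run from cron like your matview refresh was):

#!/bin/bash
# Cron sketch of the unlogged-table workaround. One-time setup would
# replace the matview with something like:
#   CREATE UNLOGGED TABLE my_report AS
#     SELECT customer_id, sum(amount) AS total
#     FROM orders GROUP BY customer_id;
psql -d mydb <<'SQL'
BEGIN;
TRUNCATE my_report;
INSERT INTO my_report
    SELECT customer_id, sum(amount) AS total
    FROM orders GROUP BY customer_id;
COMMIT;
SQL

Note that an unlogged table is emptied after a crash and is not
replicated to standbys; skipping the WAL is exactly why.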

HTH

> Kind regards,
> Alain Remund

-- 
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consult...@comcast.net




RE: Loading 500m json files to database

2020-03-24 Thread Kevin Brannen
From: pinker 

> it's a cloud and no plpythonu extension available unfortunately


You're misunderstanding him. See David's post for an example, but the point was 
that you can control all of this from an *external* Perl, Python, Bash, 
or whatever program run from the shell.

In pseudo-code, probably fed by a "find" command piping filenames to it:

while more files
do { read in a file name & add to list } while (list.length < 1000);
process entire list with \copy commands to 1 psql command

I've left all kinds of checks out of that, but that's the basic thing you 
need; implement it in whatever scripting language you're comfortable with.
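
For instance, a rough bash rendering of the above (untested; the
table, column, paths, and batch size of 1000 are assumptions):

#!/bin/bash
# Read file names from find, collect them into batches, and hand each
# batch of \copy commands to a single psql invocation.
batch=()
flush() {
    for f in "${batch[@]}"; do
        printf '\\copy json_parts (js) from %s\n' "$f"
    done | psql -q -d mydb
    batch=()
}
while read -r f; do
    batch+=("$f")
    (( ${#batch[@]} >= 1000 )) && flush
done < <(find /data/json -name '*.json')
(( ${#batch[@]} )) && flush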

HTH,
Kevin




Re: Loading 500m json files to database

2020-03-24 Thread Rob Sargent




On 3/24/20 11:29 AM, Kevin Brannen wrote:
> From: pinker 
>
>> it's a cloud and no plpythonu extension available unfortunately
>
> You're misunderstanding him. See David's post for an example, but the
> point was that you can control all of this from an *external* Perl,
> Python, Bash, or whatever program run from the shell.
>
> In pseudo-code, probably fed by a "find" command piping filenames to it:
>
> while more files
> do { read in a file name & add to list } while (list.length < 1000);
> process entire list with \copy commands to 1 psql command
>
> I've left all kinds of checks out of that, but that's the basic thing
> you need; implement it in whatever scripting language you're
> comfortable with.
>
> HTH,
> Kevin


Sorry if I missed it, but have we seen the size range of these json files?


RE: Loading 500m json files to database

2020-03-24 Thread Kevin Brannen
From: Rob Sargent 

> Sorry if I missed it, but have we seen the size range of these json files?

Not that I've seen, but that's an implementation detail for whoever is doing 
the work. As someone else pointed out, pick the value as needed, whether that's 
10, 100, 1000, or whatever. But issuing 1000 "\copy file"-style commands at a 
time isn't a big deal by itself. OTOH, you have a good point that 
1000 could be too much work for the server to handle, especially if the "-1" 
(single-transaction) flag is also used. As always: test, test, test... 😊
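
E.g., assuming each batch has been written out to a SQL file of \copy
commands first (file name made up):

# Run the whole batch in one transaction; note that a single bad
# file then rolls back the entire batch.
psql -1 -d mydb -f batch_0001.sql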

Kevin


Re: Loading 500m json files to database

2020-03-24 Thread Rob Sargent




On 3/24/20 11:53 AM, Kevin Brannen wrote:
> From: Rob Sargent 
>
>> Sorry if I missed it, but have we seen the size range of these json
>> files?
>
> Not that I've seen, but that's an implementation detail for whoever is
> doing the work. As someone else pointed out, pick the value as needed,
> whether that's 10, 100, 1000, or whatever. But issuing 1000 "\copy
> file"-style commands at a time isn't a big deal by itself. OTOH, you
> have a good point that 1000 could be too much work for the server to
> handle, especially if the "-1" (single-transaction) flag is also used.
> As always: test, test, test... 😊

My fear is this: the community helps the OP load 500M "records" in a 
reasonable timeframe, then the OP's organization complains that Postgres is 
slow once they start using it... because the transition from file system to 
RDBMS was ill conceived. Are the json files large documents of arbitrary 
content, or are they small data structures of recognizable types? And so on.





Re: PG12 autovac issues

2020-03-24 Thread Andres Freund
Hi,

On 2020-03-24 15:12:38 +0900, Michael Paquier wrote:
> > Well, at the normal log level there's no logging of autovacuum
> > launchers that don't do anything due to the "skipping redundant"
> > logic. If somehow the horizon logic of autovacuum workers gets out of
> > whack with what vacuumlazy.c does, then you'd not get any vacuums.
> > But a usage-level triggered analyze could still happen on such a
> > table, I think.
> 
> What surprised me the most is that the same table happened to be
> analyzed again and again after the launcher began its blackout.

Well, if there's an issue with the "redundant" logic, that would be a
not-too-surprising outcome. It's quite plausible that one or two tables
in the database would get enough changes to occasionally need to be
analyzed. If the workload is steady, that could e.g. work out to every
~17 minutes. All tables that autovacuum thinks are not
wraparound-threatened will be skipped, but ones that are will get both
a vacuum and an analyze queued. The redundant logic could then entirely
skip all vacuums - but there's no equivalent for analyze.

> > While looking at this issue I found a few problems, btw. That seems more
> > like a -hackers discussion, so I started:
> > https://postgr.es/m/20200323235036.6pje6usrjjx22zv3%40alap3.anarazel.de
> 
> Yes, let's discuss there.

Cool. It would also be good if you could expand on that in the thread
introducing the "redundant" logic.

Greetings,

Andres Freund




PG 12: Partitioning across a FDW?

2020-03-24 Thread Chris Morris
Is it even possible to use PG partitioning across a Foreign Server?


PostgreSQL 13: native JavaScript Procedural Language support ?

2020-03-24 Thread Marius Andreiana
Hi and hope this finds you well,

PostgreSQL has state-of-the-art JSON(B) data type and query support, and
basic update functions as well, e.g. jsonb_set().

How about enabling users to do more advanced processing of JSON
data inside the database itself, using JavaScript?

There's the PLV8 extension https://github.com/plv8/plv8
but it's not available from many major db providers (e.g. Heroku), making it
hard to adopt. Having it supported in the standard PostgreSQL
distribution, besides the existing 4 languages (
https://www.postgresql.org/docs/12/xplang.html), would be awesome!
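
For illustration, here's the kind of thing PLV8 makes possible where
it is available (a made-up example; names are illustrative only):

psql -d mydb <<'SQL'
CREATE EXTENSION IF NOT EXISTS plv8;

-- walk a dotted path through a JSON document, in JavaScript
CREATE FUNCTION jsonb_path_get(doc jsonb, path text) RETURNS jsonb AS $$
    return path.split('.').reduce(
        (o, k) => (o == null ? o : o[k]), doc);
$$ LANGUAGE plv8 IMMUTABLE;

SELECT jsonb_path_get('{"a": {"b": 42}}'::jsonb, 'a.b');  -- 42
SQL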

Would you kindly consider this?

Thanks & stay safe,
Marius