invalid memory alloc request size 576460752438159360

2017-12-31 Thread Ibrahim Edib Kokdemir
Hi,

We have been getting the same error a lot, for more than a day, from different
schemas in the same db.
< user=myuser db=mydb host=mydbip pid=18883 app=[unknown] time=2017-12-31
14:28:16.056 +03 > ERROR:  invalid memory alloc request size
576460752438159360

CentOS Linux release 7.4.1708 (Core)
DB version: 9.6.6
Memory: 256GB
CPU 2.2 Ghz 24 core
Disk: SAN Multipath

* write_cache is disabled
* there is no incorrect work_mem parameter setting.
* logical dump is working, (maybe) no corruption in data.
* there is streaming replication; we do not see the error repeated on the
replicas. (replicas are on different minor versions, 9.6.4 and 9.6.3 respectively)
* we have large_object fields; the logical dump also works with the
large_object fields.

Any idea?


Re: invalid memory alloc request size 576460752438159360

2017-12-31 Thread Peter Geoghegan
On Sun, Dec 31, 2017 at 1:50 PM, Ibrahim Edib Kokdemir
 wrote:
> * write_cache is disabled
> * there is no incorrect work_mem parameter setting.
> * logical dump is working, (maybe) no corruption in data.
> * there is streaming replication, we do not repeat the error in the
> replicas. (replicas in different minor versions, 9.6.4, 9.6.3 accordingly)
> * we have large_object field, logical_dump also works with large_objects
> fields.
>
> Any idea?

This is very likely to be corruption. It's important to determine the
cause and extent of this corruption. I suggest using amcheck for this,
which is available for those Postgres versions from:

https://github.com/petergeoghegan/amcheck

Note that there are Debian and Redhat packages available.

You'll definitely want to use the "heapallindexed" option here, at
least for primary key indexes (pass "pg_index.indisprimary" as the
"heapallindexed" argument, while generalizing from the example SQL
query for bt_index_check()). This process has a good chance of
isolating the problem, especially if you let this list see any errors
raised by the tool.
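
A rough sketch of such a query (untested; adjust the filters to taste):

  SELECT bt_index_check(index => c.oid, heapallindexed => i.indisprimary),
         c.relname
  FROM pg_index i
  JOIN pg_class c ON i.indexrelid = c.oid
  JOIN pg_am am ON c.relam = am.oid
  WHERE am.amname = 'btree'
  -- Don't check temp tables, which may be from another session:
  AND c.relpersistence != 't'
  -- Function may throw an error when this is omitted:
  AND i.indisready AND i.indisvalid;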

-- 
Peter Geoghegan



Re: Does PostgreSQL check database integrity at startup?

2017-12-31 Thread Stephen Frost
Jan, all,

* Jan Wieck (j...@wi3ck.info) wrote:
> On Sat, Dec 30, 2017 at 10:27 AM, Stephen Frost  wrote:
> > The checksums included in PG are page-level and therefore there simply
> > isn't one to look at if the file is zero bytes.
> 
> And even if the file wasn't zero bytes, you can't tell from the per-page
> CRCs if you have all the pages you should have. You could have extra pages
> that aren't supposed to be there, or be missing some (or any mix of the two).
> A per-page CRC is useless for those cases.

Right, which is why it makes sense to have whole-file checksums when
doing things like backups, when you're already reading all of the file
and can grab a checksum in stream, to detect if anything bad happens
to the backed-up file after the backup has completed.

It's unclear if that would have helped here or not, since we don't know
when the file ended up being zeroed out, as I understand it.

Thanks!

Stephen




Find duplicates in a column then print Information to a file

2017-12-31 Thread Sherman Willden
Development Platform: Ubuntu 17.10 mainly command line work

Tools: perl 5.26 and postgresql 9.6

Goal: Display duplicate aria titles on screen and to a local file

Database name: arias

Table name: aria_precis

csv delimiter: the # symbol

arias=# \d aria_precis
   Table "public.aria_precis"
   Column| Type | Modifiers
-+--+---
 id  | text | not null
 aria| text |
 artist  | text |
 a_artist| text |
 album_title | text |
Indexes:
"aria_precis_pkey" PRIMARY KEY, btree (id)

Problems:
  1. Can't connect aria title to id
  2. Can't write discovered information to file

I know about this link but I probably didn't understand what I read there.
https://www.postgresql.org/docs/current/static/sql-select.html#SQL-WITH

What I know how to do displays the 46 duplicate aria titles out of 413 entries,
but not the id or artist:
SELECT aria FROM aria_precis WHERE aria IN (SELECT aria FROM aria_precis
GROUP BY aria  HAVING COUNT(aria)>1);

When I perform the following I get (0 rows):
SELECT aria FROM aria_precis WHERE aria IN (SELECT aria FROM aria_precis
GROUP BY id, aria  HAVING COUNT(aria)>1);
 aria
--
(0 rows)

After I get the above information, how do I write it to a file? The only
thing I know how to do writes the entire table to a file:
 \COPY aria_precis TO '/home/sherman/aria_precis_2.csv' WITH DELIMITER '#'
CSV HEADER;

Thank you;

Sherman


Re: Find duplicates in a column then print Information to a file

2017-12-31 Thread Martin Moore
 

From: Sherman Willden 
Date: Sunday, 31 December 2017 at 18:19
To: 
Subject: Find duplicates in a column then print Information to a file

 

Development Platform: Ubuntu 17.10 mainly command line work

Tools: perl 5.26 and postgresql 9.6

Goal: Display duplicate aria titles on screen and to a local file

Database name: arias

Table name: aria_precis

csv delimiter: the # symbol

arias=# \d aria_precis
   Table "public.aria_precis"
   Column| Type | Modifiers
-+--+---
 id  | text | not null
 aria| text |
 artist  | text |
 a_artist| text |
 album_title | text |
Indexes:
"aria_precis_pkey" PRIMARY KEY, btree (id)

Problems:
  1. Can't connect aria title to id
  2. Can't write discovered information to file

I know about this link but I probably didn't understand what I read there.
https://www.postgresql.org/docs/current/static/sql-select.html#SQL-WITH

What I know displays the 46 duplicate aria titles of 413 entries but not the id 
or artist:
SELECT aria FROM aria_precis WHERE aria IN (SELECT aria FROM aria_precis GROUP 
BY aria  HAVING COUNT(aria)>1);

When I perform the following I get (0 rows):
SELECT aria FROM aria_precis WHERE aria IN (SELECT aria FROM aria_precis GROUP 
BY id, aria  HAVING COUNT(aria)>1);
 aria
--
(0 rows)

After I get the above information how do I write that information to a file? 
The only thing I know writes the entire database to a file:
 \COPY aria_precis TO '/home/sherman/aria_precis_2.csv' WITH DELIMITER '#,' CSV 
HEADER;

Thank you;

Sherman

On the cmdline just do:

psql arias -c 'SELECT aria FROM aria_precis WHERE aria IN (SELECT aria
FROM aria_precis GROUP BY aria HAVING COUNT(aria)>1)' > outfile.txt

Martin.



Re: Find duplicates in a column then print Information to a file

2017-12-31 Thread Francisco Olarte
M, I notice a faint homework smell here ;-> , but the question is
nicely asked so:

On Sun, Dec 31, 2017 at 7:19 PM, Sherman Willden
 wrote:
...
> SELECT aria FROM aria_precis WHERE aria IN (SELECT aria FROM aria_precis
> GROUP BY aria  HAVING COUNT(aria)>1);

The outer select is fully redundant. Just use the inner one. In fact,
it is not just redundant but harmful, as it will spit out many copies of
aria for the duplicated ones. If you add id (and/or artist) to the outer
query then it is useful again.


> When I perform the following I get (0 rows):
> SELECT aria FROM aria_precis WHERE aria IN (SELECT aria FROM aria_precis
> GROUP BY id, aria  HAVING COUNT(aria)>1);

As expected. Look at the inner query: you are grouping by id, which is the
primary key, so by the definition of a PK all counts are going to be one.
No results, so nothing goes to the outer query.

Use the first query, but adding id and artist to the OUTER level.
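
For instance, a sketch (untested) with id and artist added to the outer level:

SELECT id, aria, artist FROM aria_precis
WHERE aria IN (SELECT aria FROM aria_precis GROUP BY aria HAVING COUNT(aria) > 1)
ORDER BY aria;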

> After I get the above information how do I write that information to a file?
> The only thing I know writes the entire database to a file:
>  \COPY aria_precis TO '/home/sherman/aria_precis_2.csv' WITH DELIMITER '#,'
> CSV HEADER;

If the copy format is OK for you, then since you already know \copy, look at
the docs; at https://www.postgresql.org/docs/9.6/static/app-psql.html you will
read:

>>>
\copy { table [ ( column_list ) ] | ( query ) } { from | to } {
'filename' | program 'command' | stdin | stdout | pstdin | pstdout } [
[ with ] ( option [, ...] ) ]

Performs a frontend (client) copy. This is an operation that runs an
SQL COPY command, but instead of the server reading or writing the
specified file, psql reads or writes the file and routes the data
between the server and the local file system. This means that file
accessibility and privileges are those of the local user, not the
server, and no SQL superuser privileges are required.
<<<

Notice the ( query ) option? Your copy is using the table + optional
column-list format, (aria_precis); just change it to the "( query )"
format (NOTICE THE MANDATORY PARENS, this is how copy knows what to
do: "t" => table, "t(c1)" => table + columns, "(xx)" => query, i.e.
nothing before the opening paren).
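
For example, a sketch (the output file name is just illustrative; \copy wants the whole command on one line):

\copy (SELECT id, aria, artist FROM aria_precis WHERE aria IN (SELECT aria FROM aria_precis GROUP BY aria HAVING COUNT(aria) > 1) ORDER BY aria) TO '/home/sherman/duplicate_arias.csv' WITH DELIMITER '#' CSV HEADER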

Happy new year.

Francisco Olarte.



Re: Find duplicates in a column then print Information to a file

2017-12-31 Thread Sherman Willden
Thank you for the replies. I will start working on them now. Not a student
but since I now have the time I may look into it. I am 71 retired working
at Home Depot. I have a collection of CDs by various artists and I have the
time to create and maintain my own database concerning these subjects. I
retired from USAF, SUN Microsystems, and HP where I worked with the UNIX
systems. As I said my main question right now is which duplicate songs do I
have and who performs them. I really appreciate the support I get here.

Sherman

On Sun, Dec 31, 2017 at 11:53 AM, Francisco Olarte 
wrote:

> M, I notice a faint homework smell here ;-> , but the question is
> nicely asked so:
>
> On Sun, Dec 31, 2017 at 7:19 PM, Sherman Willden
>  wrote:
> ...
> > SELECT aria FROM aria_precis WHERE aria IN (SELECT aria FROM aria_precis
> > GROUP BY aria  HAVING COUNT(aria)>1);
>
> The outer select is fully redundant. Just use the inner one. In fact,
> not redundant, harmful, as it will spit many copies of aria for the
> duplicated ones. If you add id (and/or artist) to the outer query then
> it is useful again.
>
>
> > When I perform the following I get (0 rows):
> > SELECT aria FROM aria_precis WHERE aria IN (SELECT aria FROM aria_precis
> > GROUP BY id, aria  HAVING COUNT(aria)>1);
>
> As expected. See the inner query, you are grouping by ID which is the
> primary key, by PK definition all counts are going to be one, so no
> results, nothing goes to the outer query.
>
> Use the first query, but adding id and artist to the OUTER level.
>
> > After I get the above information how do I write that information to a
> file?
> > The only thing I know writes the entire database to a file:
> >  \COPY aria_precis TO '/home/sherman/aria_precis_2.csv' WITH DELIMITER
> '#,'
> > CSV HEADER;
>
> If copy format is ok to you, knowing \copy look at the docs, looking
> at https://www.postgresql.org/docs/9.6/static/app-psql.html you will
> read:
>
> >>>
> \copy { table [ ( column_list ) ] | ( query ) } { from | to } {
> 'filename' | program 'command' | stdin | stdout | pstdin | pstdout } [
> [ with ] ( option [, ...] ) ]
>
> Performs a frontend (client) copy. This is an operation that runs an
> SQL COPY command, but instead of the server reading or writing the
> specified file, psql reads or writes the file and routes the data
> between the server and the local file system. This means that file
> accessibility and privileges are those of the local user, not the
> server, and no SQL superuser privileges are required.
> <<<
>
> Notice the (query) option? Your copy is using the table+optional
> column list format, (aria_precis), just change it to the "( query )"
> format ( NOTICE THE MANDATORY PARENS, this is how copy knows what to
> do. "t" => table, "t(c1)"=>table+columns, "(xx)" => query ( nothing
> before the opening parens ). )
>
> Happy new year.
>
> Francisco Olarte.
>


Re: Find duplicates in a column then print Information to a file

2017-12-31 Thread Rob Sargent
Keep in mind there is a quick write-to-file in psql with ‘\o filename’. And
don’t forget to turn it off with ‘\o’.
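
A minimal sketch (the output file name is illustrative):

\o duplicates.txt
SELECT id, aria, artist FROM aria_precis
WHERE aria IN (SELECT aria FROM aria_precis GROUP BY aria HAVING COUNT(aria) > 1);
\o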

> On Dec 31, 2017, at 12:04 PM, Sherman Willden  wrote:
> 
> Thank you for the replies. I will start working on them now. Not a student 
> but since I now have the time I may look into it. I am 71 retired working at 
> Home Depot. I have a collection of CDs by various artists and I have the time 
> to create and maintain my own database concerning these subjects. I retired 
> from USAF, SUN Microsystems, and HP where I worked with the UNIX systems. As 
> I said my main question right now is which duplicate songs do I have and who 
> performs them. I really appreciate the support I get here.
> 
> Sherman
> 
>> On Sun, Dec 31, 2017 at 11:53 AM, Francisco Olarte  
>> wrote:
>> M, I notice a faint homework smell here ;-> , but the question is
>> nicely asked so:
>> 
>> On Sun, Dec 31, 2017 at 7:19 PM, Sherman Willden
>>  wrote:
>> ...
>> > SELECT aria FROM aria_precis WHERE aria IN (SELECT aria FROM aria_precis
>> > GROUP BY aria  HAVING COUNT(aria)>1);
>> 
>> The outer select is fully redundant. Just use the inner one. In fact,
>> not redundant, harmful, as it will spit many copies of aria for the
>> duplicated ones. If you add id (and/or artist) to the outer query then
>> it is useful again.
>> 
>> 
>> > When I perform the following I get (0 rows):
>> > SELECT aria FROM aria_precis WHERE aria IN (SELECT aria FROM aria_precis
>> > GROUP BY id, aria  HAVING COUNT(aria)>1);
>> 
>> As expected. See the inner query, you are grouping by ID which is the
>> primary key, by PK definition all counts are going to be one, so no
>> results, nothing goes to the outer query.
>> 
>> Use the first query, but adding id and artist to the OUTER level.
>> 
>> > After I get the above information how do I write that information to a 
>> > file?
>> > The only thing I know writes the entire database to a file:
>> >  \COPY aria_precis TO '/home/sherman/aria_precis_2.csv' WITH DELIMITER '#,'
>> > CSV HEADER;
>> 
>> If copy format is ok to you, knowing \copy look at the docs, looking
>> at https://www.postgresql.org/docs/9.6/static/app-psql.html you will
>> read:
>> 
>> >>>
>> \copy { table [ ( column_list ) ] | ( query ) } { from | to } {
>> 'filename' | program 'command' | stdin | stdout | pstdin | pstdout } [
>> [ with ] ( option [, ...] ) ]
>> 
>> Performs a frontend (client) copy. This is an operation that runs an
>> SQL COPY command, but instead of the server reading or writing the
>> specified file, psql reads or writes the file and routes the data
>> between the server and the local file system. This means that file
>> accessibility and privileges are those of the local user, not the
>> server, and no SQL superuser privileges are required.
>> <<<
>> 
>> Notice the (query) option? Your copy is using the table+optional
>> column list format, (aria_precis), just change it to the "( query )"
>> format ( NOTICE THE MANDATORY PARENS, this is how copy knows what to
>> do. "t" => table, "t(c1)"=>table+columns, "(xx)" => query ( nothing
>> before the opening parens ). )
>> 
>> Happy new year.
>> 
>> Francisco Olarte.
> 


Re: Find duplicates in a column then print Information to a file

2017-12-31 Thread Tom Lane
Rob Sargent  writes:
> Keep in mind there is a quick write-to-file in psql with ‘\o filename’. And 
> don’t forget to turn it off with ‘\o’

See also "\g filename", for one-shot output.
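
For example, a sketch (file name illustrative), ending the query with \g instead of a semicolon:

SELECT id, aria, artist FROM aria_precis
WHERE aria IN (SELECT aria FROM aria_precis GROUP BY aria HAVING COUNT(aria) > 1) \g duplicates.txt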

regards, tom lane



Re: invalid memory alloc request size 576460752438159360

2017-12-31 Thread Ibrahim Edib Kokdemir
Hi Peter,
I just installed and ran amcheck_next. I used your sample query from the git
page (changing the schema name); it listed all the indexes in the different
schemas and produced the same kind of output as yours, with the bt_index_check
field empty, which I take to mean no errors.
Am I doing this right?


2017-12-31 16:58 GMT+03:00 Peter Geoghegan :

> On Sun, Dec 31, 2017 at 1:50 PM, Ibrahim Edib Kokdemir
>  wrote:
> > * write_cache is disabled
> > * there is no incorrect work_mem parameter setting.
> > * logical dump is working, (maybe) no corruption in data.
> > * there is streaming replication, we do not repeat the error in the
> > replicas. (replicas in different minor versions, 9.6.4, 9.6.3
> accordingly)
> > * we have large_object field, logical_dump also works with large_objects
> > fields.
> >
> > Any idea?
>
> This is very likely to be corruption. It's important to determine the
> cause and extent of this corruption. I suggest using amcheck for this,
> which is available for those Postgres versions from:
>
> https://github.com/petergeoghegan/amcheck
>
> Note that there are Debian and Redhat packages available.
>
> You'll definitely want to use the "heapallindexed" option here, at
> least for primary key indexes (pass "pg_index.indisprimary" as
> "heapallindexed" argument, while generalizing from the example SQL
> query for bt_index_check()). This process has a good chance of
> isolating the problem, especially if you let this list see any errors
> raised by the tool.
>
> --
> Peter Geoghegan
>


Re: invalid memory alloc request size 576460752438159360

2017-12-31 Thread Peter Geoghegan
On Sun, Dec 31, 2017 at 1:10 PM, Ibrahim Edib Kokdemir
 wrote:
> I just installed and used amcheck_next, I have used your sample query on the
> git page (changed the schema name) and that listed all indexes different
> schemes and produced same outputs like yours with bt_index_check field as
> empty, that means no error.
> Am I doing right?

You should give the argument of heapallindexed as 'true'. So:

  SELECT bt_index_check(index => c.oid, heapallindexed => true),
  c.relname,
  c.relpages
  FROM pg_index i
  JOIN pg_opclass op ON i.indclass[0] = op.oid
  JOIN pg_am am ON op.opcmethod = am.oid
  JOIN pg_class c ON i.indexrelid = c.oid
  JOIN pg_namespace n ON c.relnamespace = n.oid
  WHERE am.amname = 'btree' AND n.nspname = 'pg_catalog'
  -- Don't check temp tables, which may be from another session:
  AND c.relpersistence != 't'
  -- Function may throw an error when this is omitted:
  AND i.indisready AND i.indisvalid
  ORDER BY c.relpages DESC LIMIT 10;

As I mentioned earlier, if this takes too long, you could only do
heapallindexed checking once per table (not once per index) by giving
"indisprimary" as the heapallindexed argument. That way, only primary
keys would be verified against the heap, which is potentially a lot
faster.
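
That is, a sketch of the change: the first line of the query above becomes

  SELECT bt_index_check(index => c.oid, heapallindexed => i.indisprimary),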

-- 
Peter Geoghegan



Re: invalid memory alloc request size 576460752438159360

2017-12-31 Thread Peter Geoghegan
On Sun, Dec 31, 2017 at 1:39 PM, Peter Geoghegan  wrote:
>   SELECT bt_index_check(index => c.oid, heapallindexed => true),
>   c.relname,
>   c.relpages
>   FROM pg_index i
>   JOIN pg_opclass op ON i.indclass[0] = op.oid
>   JOIN pg_am am ON op.opcmethod = am.oid
>   JOIN pg_class c ON i.indexrelid = c.oid
>   JOIN pg_namespace n ON c.relnamespace = n.oid
>   WHERE am.amname = 'btree' AND n.nspname = 'pg_catalog'
>   -- Don't check temp tables, which may be from another session:
>   AND c.relpersistence != 't'
>   -- Function may throw an error when this is omitted:
>   AND i.indisready AND i.indisvalid
>   ORDER BY c.relpages DESC LIMIT 10;
>
> As I mentioned earlier, if this takes too long, you could only do
> heapallindexed checking once per table (not once per index) by giving
> "indisprimary" as the heapallindexed argument. That way, only primary
> keys would be verified against the heap, which is potentially a lot
> faster.

Oh, and I think that you should remove the "ORDER BY c.relpages DESC
LIMIT 10", too.


-- 
Peter Geoghegan



Re: invalid memory alloc request size 576460752438159360

2017-12-31 Thread Ibrahim Edib Kokdemir
>
> > As I mentioned earlier, if this takes too long, you could only do
> > heapallindexed checking once per table (not once per index) by giving
> > "indisprimary" as the heapallindexed argument. That way, only primary
> > keys would be verified against the heap, which is potentially a lot
> > faster.
>
> Oh, and I think that you should remove the "ORDER BY c.relpages DESC
> LIMIT 10", too.


My db is not that big right now, so the queries did not take too long. I did
everything you said and got only long lists with no errors, which is actually
a good thing for me.
Thanks a lot.


is single row update improved with function

2017-12-31 Thread Rob Sargent
I must update 3M of 100M records, with tuple-specific modifications.  I can
generate the necessary sql, but I’m wondering whether files of simple update
statements, each affecting a single row, are more effective than files of calls
to a function doing the same update given the necessary values, including the
where-clause restrictions.  The query plan set by the first should be decent for
the remainder.

Alternatively, would a bulk load into a table of replacement values and join 
info be the fastest way?  

Either way I can break the updates into roughly 393 transactions (7500 rows 
affected per tx) or 8646 transactions (350 rows  per tx) if less is more in 
this world.

I’ll be the only user during this work.

OS=centos 7, 4core virtual 64G memory; pg=10.0;


Re: is single row update improved with function

2017-12-31 Thread David Rowley
On 1 January 2018 at 12:06, Rob Sargent  wrote:
> I must update 3M of 100M records, with tuple specific modifications.  I can 
> generate the necessary sql, but I’m wondering if files of simple update 
> statements affecting a single row is more effective than files of a function 
> call doing the same update given the necessary values, including where clause 
> restrictions?  The query plan set by the first should be decent for the 
> remainder.
>
> Alternatively, would a bulk load into a table of replacement values and join 
> info be the fastest way?

It's probably also worth thinking about this table's usage pattern. If
this table is an otherwise static table, then you may wish to think
about the little bit of bloat that doing the UPDATEs in a single
transaction would cause.

> Either way I can break the updates into roughly 393 transactions (7500 rows 
> affected per tx) or 8646 transactions (350 rows  per tx) if less is more in 
> this world.

If you were to perform the UPDATEs in batches it would allow you to
run a VACUUM between the UPDATEs. However, it might not be so
important as 3 million rows in 100 million is just 3%, so assuming all
your rows are the same size, then even doing this as a single
transaction would only cause 3% churn on the table. Possibly some of
the UPDATEs would reuse existing free space within the table, but if
they don't then it would only mean an extra 3% bloat.

As for which is faster. It's most likely going to depend on the query
plan for the UPDATE statements. If you need to perform 3 million seq
scans on the table, by doing 3 million individual statements, that's
likely not going to perform well. 3 million statements is likely not a
good option in any case as it means parsing and planning 3 million
UPDATE statements. Even your 393 statements might not be very good if
each of those UPDATEs must perform a seq scan on the 100 million row
table, but if each of those 393 statements can make use of an index to
easily get those 7500 rows, then that might be a better option than
doing the single UPDATE join method you mentioned.

It does sound like something you could take offline and benchmark if
performance is that critical. It's not really possible for us to tell
which is faster without seeing the schema, UPDATE statements and query
plans chosen.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: is single row update improved with function

2017-12-31 Thread Rob Sargent

> On Dec 31, 2017, at 4:31 PM, David Rowley  
> wrote:
> 
> On 1 January 2018 at 12:06, Rob Sargent  wrote:
>> I must update 3M of 100M records, with tuple specific modifications.  I can 
>> generate the necessary sql, but I’m wondering if files of simple update 
>> statements affecting a single row is more effective than files of a function 
>> call doing the same update given the necessary values, including where 
>> clause restrictions?  The query plan set by the first should be decent for 
>> the remainder.
>> 
>> Alternatively, would a bulk load into a table of replacement values and join 
>> info be the fastest way?
> 
> It's probably also worth thinking about this table's usage pattern. If
> this table is an otherwise static table, then you may wish to think
> about the little bit of bloat that doing the UPDATEs in a single
> transaction would cause.
> 
>> Either way I can break the updates into roughly 393 transactions (7500 rows 
>> affected per tx) or 8646 transactions (350 rows  per tx) if less is more in 
>> this world.
> 
> If you were to perform the UPDATEs in batches it would allow you to
> run a VACUUM between the UPDATEs. However, it might not be so
> important as 3 million rows in 100 million is just 3%, so assuming all
> your rows are the same size, then even doing this as a single
> transaction would only cause 3% churn on the table. Possibly some of
> the UPDATEs would reuse existing free space within the table, but if
> they don't then it would only mean an extra 3% bloat.
> 
> As for which is faster. It's most likely going to depend on the query
> plan for the UPDATE statements. If you need to perform 3 million seq
> scans on the table, by doing 3 million individual statements, that's
> likely not going to perform well. 3 million statements is likely not a
> good option in any case as it means parsing and planning 3 million
> UPDATE statements. Even your 393 statements might not be very good if
> each of those UPDATEs must perform a seq scan on the 100 million row
> table, but if each of those 393 statements can make use of an index to
> easily get those 7500 rows, then that might be a better option than
> doing the single UPDATE join method you mentioned.
> 
> It does sound like something you could take offline and benchmark if
> performance is that critical. It's not really possible for us to tell
> which is faster without seeing the schema, UPDATE statements and query
> plans chosen.
> 
Thank you, and understood.  Small clarification: the batches (393 or 8646) are
not single update statements, but rather groupings of separate updates sharing
common restriction values. I don’t have the hardware to replicate the problem.

The table being updated is described below.  Due to an embarrassing cut/paste 
error, I’ll be updating the three ‘events’ columns based on probandset_id, 
chrom, startbase, endbase (matching the unique constraint).  The first two 
fields (probandset_id and probandset_id+chrom) are the possible partitions - 
strictly from a file generation point of view.  The updates are unique and 
atomic and could be clumped into transactions arbitrarily. The function would 
have to take seven args, the key plus the replacement values.

Ultimate sql: update segment set events_less = i, events_equal = j,
events_greater = k
where probandset_id = id and chrom = ch and startbase = sb
and endbase = eb;
/* I might check that an update is needed, e.g. events_less != i and events_equal != j */

Table "segment"
 Column |  Type   | Collation | Nullable | Default 
+-+---+--+-
 id | uuid|   | not null | 
 chrom  | integer |   | not null | 
 markerset_id   | uuid|   | not null | 
 probandset_id  | uuid|   | not null | 
 startbase  | integer |   | not null | 
 endbase| integer |   | not null | 
 firstmarker| integer |   | not null | 
 lastmarker | integer |   | not null | 
 events_less| bigint  |   | not null | 0
 events_equal   | bigint  |   | not null | 0
 events_greater | bigint  |   | not null | 0
Indexes:
"segment_pkey" PRIMARY KEY, btree (id)
"useg" UNIQUE CONSTRAINT, btree (probandset_id, chrom, startbase, endbase)
"segment_markerset_id_chrom_firstmarker_idx" btree (markerset_id, chrom, 
firstmarker)
Foreign-key constraints:
"segment_probandset_id_fkey" FOREIGN KEY (probandset_id) REFERENCES 
probandset(id)
Referenced by:
TABLE "segmentset_member" CONSTRAINT "segmentset_member_segment_id_fkey" 
FOREIGN KEY (segment_id) REFERENCES segment(id)
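
For comparison, here is a sketch of the bulk-load-and-join alternative discussed upthread (the staging table name and input file are illustrative):

CREATE TEMP TABLE segment_fix (
    probandset_id  uuid,
    chrom          integer,
    startbase      integer,
    endbase        integer,
    events_less    bigint,
    events_equal   bigint,
    events_greater bigint
);

-- load the 3M replacement rows in one shot
\copy segment_fix FROM 'segment_fix.csv' WITH CSV

-- single UPDATE joining on the "useg" unique key
UPDATE segment s
   SET events_less    = f.events_less,
       events_equal   = f.events_equal,
       events_greater = f.events_greater
  FROM segment_fix f
 WHERE s.probandset_id = f.probandset_id
   AND s.chrom     = f.chrom
   AND s.startbase = f.startbase
   AND s.endbase   = f.endbase;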


> -- 
> David Rowley   http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services




Re: is single row update improved with function

2017-12-31 Thread Rob Sargent

> On Dec 31, 2017, at 4:31 PM, David Rowley  
> wrote:
> 
> On 1 January 2018 at 12:06, Rob Sargent  wrote:
>> I must update 3M of 100M records, with tuple specific modifications.  I can 
>> generate the necessary sql, but I’m wondering if files of simple update 
>> statements affecting a single row is more effective than files of a function 
>> call doing the same update given the necessary values, including where 
>> clause restrictions?  The query plan set by the first should be decent for 
>> the remainder.
>> 
>> Alternatively, would a bulk load into a table of replacement values and join 
>> info be the fastest way?
> 
> It's probably also worth thinking about this table's usage pattern. If
> this table is an otherwise static table, then you may wish to think
> about the little bit of bloat that doing the UPDATEs in a single
> transaction would cause.
> 
Sorry, I didn’t address the question about the table's usage.  Currently we’re
in a data-loading phase and this table is almost complete.  Thereafter (post
vacuum analyze) it will be 99.99% read-only.  The remainder will be updated in
much the same fashion as described earlier (set events_x = events_x + increment).

This table will be analyzed a couple of ways, mainly to determine significance
thresholds across various slices, each of which examines roughly one tenth of
the records.

>> Either way I can break the updates into roughly 393 transactions (7500 rows 
>> affected per tx) or 8646 transactions (350 rows  per tx) if less is more in 
>> this world.
> 
> If you were to perform the UPDATEs in batches it would allow you to
> run a VACUUM between the UPDATEs. However, it might not be so
> important as 3 million rows in 100 million is just 3%, so assuming all
> your rows are the same size, then even doing this as a single
> transaction would only cause 3% churn on the table. Possibly some of
> the UPDATEs would reuse existing free space within the table, but if
> they don't then it would only mean an extra 3% bloat.
> 
> As for which is faster. It's most likely going to depend on the query
> plan for the UPDATE statements. If you need to perform 3 million seq
> scans on the table, by doing 3 million individual statements, that's
> likely not going to perform well. 3 million statements is likely not a
> good option in any case as it means parsing and planning 3 million
> UPDATE statements. Even your 393 statements might not be very good if
>> each of those UPDATEs must perform a seq scan on the 100 million row
> table, but if each of those 393 statements can make use of an index to
> easily get those 7500 rows, then that might be a better option than
> doing the single UPDATE join method you mentioned.
> 
> It does sound like something you could take offline and benchmark if
> performance is that critical. It's not really possible for us to tell
> which is faster without seeing the schema, UPDATE statements and query
> plans chosen.
> 
> -- 
> David Rowley   http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services