invalid memory alloc request size 576460752438159360
Hi,

We have been getting the same error repeatedly for more than a day, from different schemas in the same database:

< user=myuser db=mydb host=mydbip pid=18883 app=[unknown] time=2017-12-31 14:28:16.056 +03 > ERROR: invalid memory alloc request size 576460752438159360

Environment:
CentOS Linux release 7.4.1708 (Core)
DB version: 9.6.6
Memory: 256GB
CPU: 2.2 GHz, 24 core
Disk: SAN multipath

What we have checked so far:
* write_cache is disabled.
* there is no incorrect work_mem parameter setting.
* a logical dump works, so there is (maybe) no corruption in the data.
* there is streaming replication, and we cannot reproduce the error on the replicas (the replicas are on different minor versions, 9.6.4 and 9.6.3 respectively).
* we have large_object fields, and the logical dump also works with the large object fields.

Any idea?
Re: invalid memory alloc request size 576460752438159360
On Sun, Dec 31, 2017 at 1:50 PM, Ibrahim Edib Kokdemir wrote:
> * write_cache is disabled
> * there is no incorrect work_mem parameter setting.
> * logical dump is working, (maybe) no curruption in data.
> * there is streaming replication, we do not repeat the error in the
> replicas. (replicas in different minor versions, 9.6.4, 9.6.3 accordingly)
> * we have large_object field, logical_dump also works with large_objects
> fields.
>
> Any idea?

This is very likely to be corruption. It's important to determine the cause and extent of this corruption. I suggest using amcheck for this, which is available for those Postgres versions from:

https://github.com/petergeoghegan/amcheck

Note that there are Debian and Redhat packages available.

You'll definitely want to use the "heapallindexed" option here, at least for primary key indexes (pass "pg_index.indisprimary" as "heapallindexed" argument, while generalizing from the example SQL query for bt_index_check()). This process has a good chance of isolating the problem, especially if you let this list see any errors raised by the tool.

--
Peter Geoghegan
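[Editor's note: for a quick single-index spot check before running the catalog-wide query that appears later in this thread, something along these lines should work with the GitHub build, which installs as the amcheck_next extension (the index name below is just a placeholder):

-- Verify one B-Tree index and confirm every heap tuple is present in it.
-- 'my_table_pkey' is a hypothetical index name; substitute your own.
CREATE EXTENSION IF NOT EXISTS amcheck_next;

SELECT bt_index_check(index => 'my_table_pkey'::regclass,
                      heapallindexed => true);

If the index is corrupt, bt_index_check raises an error rather than returning a row.]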
Re: Does PostgreSQL check database integrity at startup?
Jan, all,

* Jan Wieck (j...@wi3ck.info) wrote:
> On Sat, Dec 30, 2017 at 10:27 AM, Stephen Frost wrote:
> > The checksums included in PG are page-level and therefore there simply
> > isn't one to look at if the file is zero bytes.
>
> And even if the file wasn't zero bytes you can't tell from the per page
> CRCs if you have all the pages you should have. You could have extra pages
> that aren't supposed to be there or missing some (or any mix of the two).
> A per page CRC is useless for those cases.

Right, which is why it makes sense to have whole-file checksums when doing things like backups: you're already reading all of the file and can grab a checksum in stream, to detect if anything bad happens to the backed-up file after the backup has completed. It's unclear whether that would have helped here, since we don't know when the file ended up being zeroed out, as I understand it.

Thanks!

Stephen
Find duplicates in a column then print Information to a file
Development Platform: Ubuntu 17.10, mainly command line work
Tools: perl 5.26 and postgresql 9.6
Goal: Display duplicate aria titles on screen and to a local file
Database name: arias
Table name: aria_precis
csv delimiter: the # symbol

arias=# \d aria_precis
       Table "public.aria_precis"
   Column    | Type | Modifiers
-------------+------+-----------
 id          | text | not null
 aria        | text |
 artist      | text |
 a_artist    | text |
 album_title | text |
Indexes:
    "aria_precis_pkey" PRIMARY KEY, btree (id)

Problems:
1. Can't connect aria title to id
2. Can't write discovered information to file

I know about this link but I probably didn't understand what I read there.
https://www.postgresql.org/docs/current/static/sql-select.html#SQL-WITH

What I know displays the 46 duplicate aria titles of 413 entries but not the id or artist:

SELECT aria FROM aria_precis
WHERE aria IN (SELECT aria FROM aria_precis GROUP BY aria HAVING COUNT(aria)>1);

When I perform the following I get (0 rows):

SELECT aria FROM aria_precis
WHERE aria IN (SELECT aria FROM aria_precis GROUP BY id, aria HAVING COUNT(aria)>1);

 aria
------
(0 rows)

After I get the above information how do I write that information to a file? The only thing I know writes the entire database to a file:

\COPY aria_precis TO '/home/sherman/aria_precis_2.csv' WITH DELIMITER '#,' CSV HEADER;

Thank you;
Sherman
Re: Find duplicates in a column then print Information to a file
From: Sherman Willden
Date: Sunday, 31 December 2017 at 18:19
To:
Subject: Find duplicates in a column then print Information to a file

Development Platform: Ubuntu 17.10, mainly command line work
Tools: perl 5.26 and postgresql 9.6
Goal: Display duplicate aria titles on screen and to a local file
Database name: arias
Table name: aria_precis
csv delimiter: the # symbol

arias=# \d aria_precis
       Table "public.aria_precis"
   Column    | Type | Modifiers
-------------+------+-----------
 id          | text | not null
 aria        | text |
 artist      | text |
 a_artist    | text |
 album_title | text |
Indexes:
    "aria_precis_pkey" PRIMARY KEY, btree (id)

Problems:
1. Can't connect aria title to id
2. Can't write discovered information to file

I know about this link but I probably didn't understand what I read there.
https://www.postgresql.org/docs/current/static/sql-select.html#SQL-WITH

What I know displays the 46 duplicate aria titles of 413 entries but not the id or artist:

SELECT aria FROM aria_precis
WHERE aria IN (SELECT aria FROM aria_precis GROUP BY aria HAVING COUNT(aria)>1);

When I perform the following I get (0 rows):

SELECT aria FROM aria_precis
WHERE aria IN (SELECT aria FROM aria_precis GROUP BY id, aria HAVING COUNT(aria)>1);

After I get the above information how do I write that information to a file? The only thing I know writes the entire database to a file:

\COPY aria_precis TO '/home/sherman/aria_precis_2.csv' WITH DELIMITER '#,' CSV HEADER;

Thank you;
Sherman

On the cmdline just do:

psql arias -c 'SELECT aria FROM aria_precis WHERE aria IN (SELECT aria FROM aria_precis GROUP BY aria HAVING COUNT(aria)>1)' > outfile.txt

Martin.
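[Editor's note: if you want the file in the same #-delimited, header-free style as the rest of your CSV work, psql's unaligned-output switches may be closer to what you're after. A sketch only; the output filename is arbitrary:

-- -A = unaligned output, -t = tuples only (no header/footer), -F '#' = field separator
psql arias -A -t -F '#' -c 'SELECT aria FROM aria_precis WHERE aria IN (SELECT aria FROM aria_precis GROUP BY aria HAVING COUNT(aria)>1)' > duplicates.txt
]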
Re: Find duplicates in a column then print Information to a file
M, I notice a faint homework smell here ;-> , but the question is nicely asked, so:

On Sun, Dec 31, 2017 at 7:19 PM, Sherman Willden wrote:
...
> SELECT aria FROM aria_precis WHERE aria IN (SELECT aria FROM aria_precis
> GROUP BY aria HAVING COUNT(aria)>1);

The outer select is fully redundant. Just use the inner one. In fact, not just redundant but harmful, as it will spit out many copies of aria for the duplicated ones. If you add id (and/or artist) to the outer query then it is useful again.

> When I perform the following I get (0 rows):
> SELECT aria FROM aria_precis WHERE aria IN (SELECT aria FROM aria_precis
> GROUP BY id, aria HAVING COUNT(aria)>1);

As expected. Look at the inner query: you are grouping by id, which is the primary key, so by definition of a PK all counts are going to be one, no groups survive HAVING, and nothing goes to the outer query.

Use the first query, but add id and artist to the OUTER level.

> After I get the above information how do I write that information to a file?
> The only thing I know writes the entire database to a file:
> \COPY aria_precis TO '/home/sherman/aria_precis_2.csv' WITH DELIMITER '#,'
> CSV HEADER;

If COPY format is OK for you, since you already know \copy, look at the docs. At https://www.postgresql.org/docs/9.6/static/app-psql.html you will read:

>>>
\copy { table [ ( column_list ) ] | ( query ) } { from | to } { 'filename' | program 'command' | stdin | stdout | pstdin | pstdout } [ [ with ] ( option [, ...] ) ]

Performs a frontend (client) copy. This is an operation that runs an SQL COPY command, but instead of the server reading or writing the specified file, psql reads or writes the file and routes the data between the server and the local file system. This means that file accessibility and privileges are those of the local user, not the server, and no SQL superuser privileges are required.
<<<

Notice the ( query ) option? Your copy is using the table + optional column list format, (aria_precis); just change it to the "( query )" format (NOTICE THE MANDATORY PARENS, this is how copy knows what to do: "t" => table, "t(c1)" => table + columns, "(xx)" => query, with nothing before the opening paren).

Happy new year.

Francisco Olarte.
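[Editor's note: putting those two pieces of advice together, a sketch of what the corrected query and the \copy call could look like. The output path is only an example, and remember that a \copy meta-command must stay on a single line:

-- Duplicates, now with the id and artist of each matching row:
SELECT id, artist, aria
FROM aria_precis
WHERE aria IN (SELECT aria
               FROM aria_precis
               GROUP BY aria
               HAVING COUNT(aria) > 1)
ORDER BY aria, artist;

-- The same query in \copy's ( query ) form, written out as a #-delimited CSV file:
\copy (SELECT id, artist, aria FROM aria_precis WHERE aria IN (SELECT aria FROM aria_precis GROUP BY aria HAVING COUNT(aria) > 1) ORDER BY aria, artist) TO '/home/sherman/duplicate_arias.csv' WITH DELIMITER '#' CSV HEADER
]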
Re: Find duplicates in a column then print Information to a file
Thank you for the replies. I will start working on them now. Not a student but since I now have the time I may look into it. I am 71 retired working at Home Depot. I have a collection of CDs by various artists and I have the time to create and maintain my own database concerning these subjects. I retired from USAF, SUN Microsystems, and HP where I worked with the UNIX systems. As I said my main question right now is which duplicate songs do I have and who performs them. I really appreciate the support I get here. Sherman On Sun, Dec 31, 2017 at 11:53 AM, Francisco Olarte wrote: > M, I notice a faint homework smell here ;-> , but the question is > nicely asked so: > > On Sun, Dec 31, 2017 at 7:19 PM, Sherman Willden > wrote: > ... > > SELECT aria FROM aria_precis WHERE aria IN (SELECT aria FROM aria_precis > > GROUP BY aria HAVING COUNT(aria)>1); > > The outer select is fully redundant. Just use the inner one. In fact, > not redundnat, harmful, as it will spit many copies of aria for the > duplicated ones. If you add id (and/or artist) to the outer query then > it is useful again. > > > > When I perform the following I get (0 rows): > > SELECT aria FROM aria_precis WHERE aria IN (SELECT aria FROM aria_precis > > GROUP BY id, aria HAVING COUNT(aria)>1); > > As expected. See the inner query, you are grouping by ID which is the > primary key, by PK definition all counts are going to be one, so no > results, nothing goes to the outer query. > > Use the first query, but adding id and artist to the OUTER level. > > > After I get the above information how do I write that information to a > file? > > The only thing I know writes the entire database to a file: > > \COPY aria_precis TO '/home/sherman/aria_precis_2.csv' WITH DELIMITER > '#,' > > CSV HEADER; > > If copy format is ok to you, knowing \copy look at the docs, looking > at https://www.postgresql.org/docs/9.6/static/app-psql.html you will > read: > > >>> > \copy { table [ ( column_list ) ] | ( query ) } { from | to } { > 'filename' | program 'command' | stdin | stdout | pstdin | pstdout } [ > [ with ] ( option [, ...] ) ] > > Performs a frontend (client) copy. This is an operation that runs an > SQL COPY command, but instead of the server reading or writing the > specified file, psql reads or writes the file and routes the data > between the server and the local file system. This means that file > accessibility and privileges are those of the local user, not the > server, and no SQL superuser privileges are required. > <<< > > Notice the (query) option? Your copy is using the table+optional > column list format, (aria_precis), just change it to the "( query )" > format ( NOTICE THE MANDATORY PARENS, this is how copy knows what to > do. "t" => table, "t(c1)"=>table+columns, "(xx)" => query ( nothing > before the opening parens ). ) > > Happy new year. > > Francisco Olarte. >
Re: Find duplicates in a column then print Information to a file
Keep in mind there is a quick write-to-file in psql with ‘\o ’. And don’t forget to turn it off with ‘\o’ > On Dec 31, 2017, at 12:04 PM, Sherman Willden wrote: > > Thank you for the replies. I will start working on them now. Not a student > but since I now have the time I may look into it. I am 71 retired working at > Home Depot. I have a collection of CDs by various artists and I have the time > to create and maintain my own database concerning these subjects. I retired > from USAF, SUN Microsystems, and HP where I worked with the UNIX systems. As > I said my main question right now is which duplicate songs do I have and who > performs them. I really appreciate the support I get here. > > Sherman > >> On Sun, Dec 31, 2017 at 11:53 AM, Francisco Olarte >> wrote: >> M, I notice a faint homework smell here ;-> , but the question is >> nicely asked so: >> >> On Sun, Dec 31, 2017 at 7:19 PM, Sherman Willden >> wrote: >> ... >> > SELECT aria FROM aria_precis WHERE aria IN (SELECT aria FROM aria_precis >> > GROUP BY aria HAVING COUNT(aria)>1); >> >> The outer select is fully redundant. Just use the inner one. In fact, >> not redundnat, harmful, as it will spit many copies of aria for the >> duplicated ones. If you add id (and/or artist) to the outer query then >> it is useful again. >> >> >> > When I perform the following I get (0 rows): >> > SELECT aria FROM aria_precis WHERE aria IN (SELECT aria FROM aria_precis >> > GROUP BY id, aria HAVING COUNT(aria)>1); >> >> As expected. See the inner query, you are grouping by ID which is the >> primary key, by PK definition all counts are going to be one, so no >> results, nothing goes to the outer query. >> >> Use the first query, but adding id and artist to the OUTER level. >> >> > After I get the above information how do I write that information to a >> > file? >> > The only thing I know writes the entire database to a file: >> > \COPY aria_precis TO '/home/sherman/aria_precis_2.csv' WITH DELIMITER '#,' >> > CSV HEADER; >> >> If copy format is ok to you, knowing \copy look at the docs, looking >> at https://www.postgresql.org/docs/9.6/static/app-psql.html you will >> read: >> >> >>> >> \copy { table [ ( column_list ) ] | ( query ) } { from | to } { >> 'filename' | program 'command' | stdin | stdout | pstdin | pstdout } [ >> [ with ] ( option [, ...] ) ] >> >> Performs a frontend (client) copy. This is an operation that runs an >> SQL COPY command, but instead of the server reading or writing the >> specified file, psql reads or writes the file and routes the data >> between the server and the local file system. This means that file >> accessibility and privileges are those of the local user, not the >> server, and no SQL superuser privileges are required. >> <<< >> >> Notice the (query) option? Your copy is using the table+optional >> column list format, (aria_precis), just change it to the "( query )" >> format ( NOTICE THE MANDATORY PARENS, this is how copy knows what to >> do. "t" => table, "t(c1)"=>table+columns, "(xx)" => query ( nothing >> before the opening parens ). ) >> >> Happy new year. >> >> Francisco Olarte. >
Re: Find duplicates in a column then print Information to a file
Rob Sargent writes: > Keep in mind there is a quick write-to-file in psql with ‘\o ’. And > don’t forget to turn it off with ‘\o’ See also "\g filename", for one-shot output. regards, tom lane
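[Editor's note: for anyone following along, a quick sketch of both psql features mentioned above; the file names are arbitrary:

-- \o redirects all subsequent query output to a file until reset with a bare \o:
\o /home/sherman/duplicates.txt
SELECT aria FROM aria_precis GROUP BY aria HAVING COUNT(aria) > 1;
\o

-- \g sends just the current query's result to a file, one-shot:
SELECT aria FROM aria_precis GROUP BY aria HAVING COUNT(aria) > 1 \g /home/sherman/duplicates.txt
]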
Re: invalid memory alloc request size 576460752438159360
Hi Peter,

I just installed and used amcheck_next. I used your sample query from the git page (with the schema name changed), and it listed all the indexes in the different schemas and produced the same kind of output as yours, with the bt_index_check field empty; that means no error, I assume. Am I doing this right?

2017-12-31 16:58 GMT+03:00 Peter Geoghegan:
> On Sun, Dec 31, 2017 at 1:50 PM, Ibrahim Edib Kokdemir wrote:
> > * write_cache is disabled
> > * there is no incorrect work_mem parameter setting.
> > * logical dump is working, (maybe) no curruption in data.
> > * there is streaming replication, we do not repeat the error in the
> > replicas. (replicas in different minor versions, 9.6.4, 9.6.3 accordingly)
> > * we have large_object field, logical_dump also works with large_objects
> > fields.
> >
> > Any idea?
>
> This is very likely to be corruption. It's important to determine the
> cause and extent of this corruption. I suggest using amcheck for this,
> which is available for those Postgres versions from:
>
> https://github.com/petergeoghegan/amcheck
>
> Note that there are Debian and Redhat packages available.
>
> You'll definitely want to use the "heapallindexed" option here, at
> least for primary key indexes (pass "pg_index.indisprimary" as
> "heapallindexed" argument, while generalizing from the example SQL
> query for bt_index_check()). This process has a good chance of
> isolating the problem, especially if you let this list see any errors
> raised by the tool.
>
> --
> Peter Geoghegan
Re: invalid memory alloc request size 576460752438159360
On Sun, Dec 31, 2017 at 1:10 PM, Ibrahim Edib Kokdemir wrote:
> I just installed and used amcheck_next, I have used your sample query on the
> git page (changed the schema name) and that listed all indexes different
> schemes and produced same outputs like yours with bt_index_check field as
> empty, that means no error.
> Am I doing right?

You should give the argument of heapallindexed as 'true'. So:

SELECT bt_index_check(index => c.oid, heapallindexed => true),
       c.relname,
       c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree' AND n.nspname = 'pg_catalog'
-- Don't check temp tables, which may be from another session:
AND c.relpersistence != 't'
-- Function may throw an error when this is omitted:
AND i.indisready AND i.indisvalid
ORDER BY c.relpages DESC LIMIT 10;

As I mentioned earlier, if this takes too long, you could do heapallindexed checking only once per table (not once per index) by giving "indisprimary" as the heapallindexed argument. That way, only primary keys would be verified against the heap, which is potentially a lot faster.

--
Peter Geoghegan
Re: invalid memory alloc request size 576460752438159360
On Sun, Dec 31, 2017 at 1:39 PM, Peter Geoghegan wrote: > SELECT bt_index_check(index => c.oid, heapallindexed => true), > c.relname, > c.relpages > FROM pg_index i > JOIN pg_opclass op ON i.indclass[0] = op.oid > JOIN pg_am am ON op.opcmethod = am.oid > JOIN pg_class c ON i.indexrelid = c.oid > JOIN pg_namespace n ON c.relnamespace = n.oid > WHERE am.amname = 'btree' AND n.nspname = 'pg_catalog' > -- Don't check temp tables, which may be from another session: > AND c.relpersistence != 't' > -- Function may throw an error when this is omitted: > AND i.indisready AND i.indisvalid > ORDER BY c.relpages DESC LIMIT 10; > > As I mentioned earlier, if this takes too long, you could only do > heapallindexed checking once per table (not once per index) by giving > "indisprimary" as the heapallindexed argument. That way, only primary > keys would be verified against the heap, which is potentially a lot > faster. Oh, and I think that you should remove the "ORDER BY c.relpages DESC LIMIT 10", too. -- Peter Geoghegan
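[Editor's note: putting both of Peter's refinements together, the once-per-table variant would presumably look like the query below: the same example query, with i.indisprimary passed as the heapallindexed argument and the "ORDER BY c.relpages DESC LIMIT 10" removed. The n.nspname = 'pg_catalog' filter is still the one from the example; substitute the schema you actually want to check, or drop that condition to check every schema:

SELECT bt_index_check(index => c.oid, heapallindexed => i.indisprimary),
       c.relname,
       c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree' AND n.nspname = 'pg_catalog'
-- Don't check temp tables, which may be from another session:
AND c.relpersistence != 't'
-- Function may throw an error when this is omitted:
AND i.indisready AND i.indisvalid;
]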
Re: invalid memory alloc request size 576460752438159360
> > As I mentioned earlier, if this takes too long, you could only do
> > heapallindexed checking once per table (not once per index) by giving
> > "indisprimary" as the heapallindexed argument. That way, only primary
> > keys would be verified against the heap, which is potentially a lot
> > faster.
>
> Oh, and I think that you should remove the "ORDER BY c.relpages DESC
> LIMIT 10", too.

My db is not big enough right now. Queries did not take too long. I definitely did all the things you said. And I got only long lists without errors. That is actually a good thing for me. Thanks a lot.
is single row update improved with function
I must update 3M of 100M records, with tuple specific modifications. I can generate the necessary sql, but I’m wondering if files of simple update statements affecting a single row is more effective than files of a function call doing the same update given the necessary values, including where clause restrictions? The query plan set by the first should be decent for the remainder. Alternatively, would a bulk load into a table of replacement values and join info be the fastest way? Either way I can break the updates into roughly 393 transactions (7500 rows affected per tx) or 8646 transactions (350 rows per tx) if less is more in this world. I’ll be the only user during this work. OS=centos 7, 4core virtual 64G memory; pg=10.0;
Re: is single row update improved with function
On 1 January 2018 at 12:06, Rob Sargent wrote:
> I must update 3M of 100M records, with tuple specific modifications. I can
> generate the necessary sql, but I’m wondering if files of simple update
> statements affecting a single row is more effective than files of a function
> call doing the same update given the necessary values, including where clause
> restrictions? The query plan set by the first should be decent for the
> remainder.
>
> Alternatively, would a bulk load into a table of replacement values and join
> info be the fastest way?

It's probably also worth thinking about this table's usage pattern. If this table is an otherwise static table, then you may wish to think about the little bit of bloat that doing the UPDATEs in a single transaction would cause.

> Either way I can break the updates into roughly 393 transactions (7500 rows
> affected per tx) or 8646 transactions (350 rows per tx) if less is more in
> this world.

If you were to perform the UPDATEs in batches it would allow you to run a VACUUM between the UPDATEs. However, it might not be so important, as 3 million rows in 100 million is just 3%, so assuming all your rows are the same size, even doing this as a single transaction would only cause 3% churn on the table. Possibly some of the UPDATEs would reuse existing free space within the table, but if they don't then it would only mean an extra 3% bloat.

As for which is faster: it's most likely going to depend on the query plan for the UPDATE statements. If you need to perform 3 million seq scans on the table, by doing 3 million individual statements, that's likely not going to perform well. 3 million statements is likely not a good option in any case, as it means parsing and planning 3 million UPDATE statements. Even your 393 statements might not be very good if each of those UPDATEs must perform a seq scan on the 100 million row table, but if each of those 393 statements can make use of an index to easily get those 7500 rows, then that might be a better option than doing the single UPDATE join method you mentioned.

It does sound like something you could take offline and benchmark if performance is that critical. It's not really possible for us to tell which is faster without seeing the schema, UPDATE statements and query plans chosen.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
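[Editor's note: for the "bulk load into a table of replacement values and join info" option, a minimal sketch of the shape that approach could take. Every name here (replacement_values, big_table, key_col, value_col, the file path) is a placeholder rather than anything from the thread:

-- Hypothetical staging table holding the join keys and the new values:
CREATE TEMP TABLE replacement_values (
    key_col   bigint PRIMARY KEY,
    new_value bigint NOT NULL
);

-- Bulk load the ~3M replacement rows from a file (psql client-side copy):
\copy replacement_values FROM '/tmp/replacements.csv' WITH CSV

-- One join UPDATE instead of millions of single-row statements:
UPDATE big_table AS t
SET    value_col = r.new_value
FROM   replacement_values AS r
WHERE  t.key_col = r.key_col;

As David says, whether this beats batched single-row UPDATEs depends on the plan; if the individual statements can use an index on the key columns, the per-batch approach can also be competitive.]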
Re: is single row update improved with function
> On Dec 31, 2017, at 4:31 PM, David Rowley wrote:
>
> On 1 January 2018 at 12:06, Rob Sargent wrote:
>> I must update 3M of 100M records, with tuple specific modifications. I can
>> generate the necessary sql, but I’m wondering if files of simple update
>> statements affecting a single row is more effective than files of a function
>> call doing the same update given the necessary values, including where
>> clause restrictions? The query plan set by the first should be decent for
>> the remainder.
>>
>> Alternatively, would a bulk load into a table of replacement values and join
>> info be the fastest way?
>
> It's probably also worth thinking about this table's usage pattern. If
> this table is an otherwise static table, then you may wish to think
> about the little bit of bloat that doing the UPDATEs in a single
> transaction would cause.
>
>> Either way I can break the updates into roughly 393 transactions (7500 rows
>> affected per tx) or 8646 transactions (350 rows per tx) if less is more in
>> this world.
>
> If you were to perform the UPDATEs in batches it would allow you to
> run a VACUUM between the UPDATEs. However, it might not be so
> important as 3 million rows in 100 million is just 3%, so assuming all
> your rows are the same size, then even doing this as a single
> transaction would only cause 3% churn on the table. Possibly some of
> the UPDATEs would reuse existing free space within the table, but if
> they don't then it would only mean an extra 3% bloat.
>
> As for which is faster. It's most likely going to depend on the query
> plan for the UPDATE statements. If you need to perform 3 million seq
> scans on the table, by doing 3 million individual statements, that's
> likely not going to perform well. 3 million statements is likely not a
> good option in any case as it means parsing and planning 3 million
> UPDATE statements. Even your 393 statements might not be very good if
> each of those UPDATEs must perform a seq scans on the 100 million row
> table, but if each of those 393 statements can make use of an index to
> easily get those 7500 rows, then that might be a better option than
> doing the single UPDATE join method you mentioned.
>
> It does sound like something you could take offline and benchmark if
> performance is that critical. It's not really possible for us to tell
> which is faster without seeing the schema, UPDATE statements and query
> plans chosen.

Thank you, and understood. Small clarification: the batches (393 or 8646) are not single update statements, rather groupings of separate updates sharing common restriction values. I don’t have the hardware to replicate the problem.

The table being updated is described below. Due to an embarrassing cut/paste error, I’ll be updating the three ‘events’ columns based on probandset_id, chrom, startbase, endbase (matching the unique constraint). The first two fields (probandset_id and probandset_id+chrom) are the possible partitions - strictly from a file generation point of view. The updates are unique and atomic and could be clumped into transactions arbitrarily. The function would have to take seven args, the key plus the replacement values.
Ultimate sql:

update segment
set events_less = i, events_equal = j, events_greater = k
where probandset_id = id and chrom = ch and startbase = sb and endbase = eb;
/* I might check that an update is needed, e.g. events_less != i and events_equal != j */

                      Table "segment"
     Column     |  Type   | Collation | Nullable | Default
----------------+---------+-----------+----------+---------
 id             | uuid    |           | not null |
 chrom          | integer |           | not null |
 markerset_id   | uuid    |           | not null |
 probandset_id  | uuid    |           | not null |
 startbase      | integer |           | not null |
 endbase        | integer |           | not null |
 firstmarker    | integer |           | not null |
 lastmarker     | integer |           | not null |
 events_less    | bigint  |           | not null | 0
 events_equal   | bigint  |           | not null | 0
 events_greater | bigint  |           | not null | 0
Indexes:
    "segment_pkey" PRIMARY KEY, btree (id)
    "useg" UNIQUE CONSTRAINT, btree (probandset_id, chrom, startbase, endbase)
    "segment_markerset_id_chrom_firstmarker_idx" btree (markerset_id, chrom, firstmarker)
Foreign-key constraints:
    "segment_probandset_id_fkey" FOREIGN KEY (probandset_id) REFERENCES probandset(id)
Referenced by:
    TABLE "segmentset_member" CONSTRAINT "segmentset_member_segment_id_fkey" FOREIGN KEY (segment_id) REFERENCES segment(id)

> --
> David Rowley http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
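[Editor's note: since the function route keeps coming up, here is one possible shape for the seven-argument function Rob describes. This is only a sketch against the schema above; the function name and parameter names are made up:

-- Hypothetical helper: one call per row to be corrected.
CREATE OR REPLACE FUNCTION update_segment_events(
    p_probandset_id uuid,
    p_chrom         integer,
    p_startbase     integer,
    p_endbase       integer,
    p_less          bigint,
    p_equal         bigint,
    p_greater       bigint)
RETURNS void
LANGUAGE plpgsql
AS $$
BEGIN
    UPDATE segment
    SET    events_less    = p_less,
           events_equal   = p_equal,
           events_greater = p_greater
    WHERE  probandset_id = p_probandset_id
      AND  chrom         = p_chrom
      AND  startbase     = p_startbase
      AND  endbase       = p_endbase;
END;
$$;

Each generated line then becomes a SELECT update_segment_events(...) call with the seven values filled in. PL/pgSQL should cache and reuse the plan for the embedded UPDATE within a session, which avoids re-planning each statement, though it does not change the cost of the index lookup itself; whether that beats the staged join UPDATE is exactly the benchmarking question David raises.]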
Re: is single row update improved with function
> On Dec 31, 2017, at 4:31 PM, David Rowley > wrote: > > On 1 January 2018 at 12:06, Rob Sargent wrote: >> I must update 3M of 100M records, with tuple specific modifications. I can >> generate the necessary sql, but I’m wondering if files of simple update >> statements affecting a single row is more effective than files of a function >> call doing the same update given the necessary values, including where >> clause restrictions? The query plan set by the first should be decent for >> the remainder. >> >> Alternatively, would a bulk load into a table of replacement values and join >> info be the fastest way? > > It's probably also worth thinking about this table's usage pattern. If > this table is an otherwise static table, then you may wish to think > about the little bit of bloat that doing the UPDATEs in a single > transaction would cause. > Sorry, I didn’t address the question about the table's usage. Currently we’re in a data loading phase and this table is almost completed. Thereafter (post vacuum analyze) it will be 99.99% read-only. The remainder will be updated in much the same fashion as described early (set events_x = events_x + increment). This table will be analyzed a couple ways, mainly determining significance threshold across various slices, each of which examines roughly one tenth of the records. >> Either way I can break the updates into roughly 393 transactions (7500 rows >> affected per tx) or 8646 transactions (350 rows per tx) if less is more in >> this world. > > If you were to perform the UPDATEs in batches it would allow you to > run a VACUUM between the UPDATEs. However, it might not be so > important as 3 million rows in 100 million is just 3%, so assuming all > your rows are the same size, then even doing this as a single > transaction would only cause 3% churn on the table. Possibly some of > the UPDATEs would reuse existing free space within the table, but if > they don't then it would only mean an extra 3% bloat. > > As for which is faster. It's most likely going to depend on the query > plan for the UPDATE statements. If you need to perform 3 million seq > scans on the table, by doing 3 million individual statements, that's > likely not going to perform well. 3 million statements is likely not a > good option in any case as it means parsing and planning 3 million > UPDATE statements. Even your 393 statements might not be very good if > each of those UPDATEs must perform a seq scans on the 100 million row > table, but if each of those 393 statements can make use of an index to > easily get those 7500 rows, then that might be a better option than > doing the single UPDATE join method you mentioned. > > It does sound like something you could take offline and benchmark if > performance is that critical. It's not really possible for us to tell > which is faster without seeing the schema, UPDATE statements and query > plans chosen. > > -- > David Rowley http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training & Services