How to keep format of views source code as entered?
Hi! Switching from Oracle SLQ to PostgreSQL I am facing the issue that the SQL code the system returns when I open a views source code is different from the code I entered. The code is formatted differently, comments are gone and e.g. all text constants got an explicit cast to ::text added. (see sample below). I want the SLQ code of my views stored as I entered it. Is there any way to achieve this? Or will I be forced to maintain my views SQL code outside of PostgreSQL views? Any hints welcome! Here is an example: I enter this code to define a simple view: create or replace view myview as select product_id, product_acronym from products -- my comment here where product_acronym = 'ABC' ; However, when I open the view my SQL client (DBeaver) again, this is what I get: CREATE OR REPLACE VIEW myview AS SELECT product_id, product_acronym FROM products WHERE product_acronym = 'ABC'::text; So, the formatting changed, keywords are capitalized, the comment I added in the from-part has gone and the text constant 'ABC' changed to 'ABC'::text. Verizon Deutschland GmbH - Sebrathweg 20, 44149 Dortmund, Germany - Amtsgericht Dortmund, HRB 14952 - Geschäftsführer: Detlef Eppig - Vorsitzender des Aufsichtsrats: Francesco de Maio
RE: How to keep format of views source code as entered?
So, it looks like PostgreSQL does support saving the original source code of a view. What's best practise to use as a code repository? I would expect support of multi-user access, access-right management and perhaps versioning as well…? Thanks for your help! Ingolf From: Markhof, Ingolf [mailto:ingolf.mark...@de.verizon.com] Sent: 07 January 2021 17:19 To: pgsql-general@lists.postgresql.org Subject: [E] How to keep format of views source code as entered? Hi! Switching from Oracle SLQ to PostgreSQL I am facing the issue that the SQL code the system returns when I open a views source code is different from the code I entered. The code is formatted differently, comments are gone and e.g. all text constants got an explicit cast to ::text added. (see sample below). I want the SLQ code of my views stored as I entered it. Is there any way to achieve this? Or will I be forced to maintain my views SQL code outside of PostgreSQL views? Any hints welcome! Here is an example: I enter this code to define a simple view: create or replace view myview as select product_id, product_acronym from products -- my comment here where product_acronym = 'ABC' ; However, when I open the view my SQL client (DBeaver) again, this is what I get: CREATE OR REPLACE VIEW myview AS SELECT product_id, product_acronym FROM products WHERE product_acronym = 'ABC'::text; So, the formatting changed, keywords are capitalized, the comment I added in the from-part has gone and the text constant 'ABC' changed to 'ABC'::text. Verizon Deutschland GmbH - Sebrathweg 20, 44149 Dortmund, Germany - Amtsgericht Dortmund, HRB 14952 - Geschäftsführer: Detlef Eppig - Vorsitzender des Aufsichtsrats: Francesco de Maio
RE: How to keep format of views source code as entered?
Thanks for your comments and thoughts. I am really surprised that PostgreSQL is unable to keep the source text of a view. Honestly, for me the looks like an implementation gap. Consider software development. You are writing code in C++ maybe on a UNIX host. And whenever you feed you source code into the compiler, it will delete it, keeping the resulting executable, only. And you could not even store your source code on the UNIX system. Instead, you'd be forced to do so in a separate system, like GitHub. Stupid, isn't it? Right. There are good reasons to store the source code on GitHub or alike anyhow. Especially when working on larger project and when collaborating with many people. But in case of rather small project with a few people only, this might be an overkill. It shouldn't be rocket science to enable PostgreSQL to store the original source code as well. It's weird PostgreSQL is not doing it. Regards, Ingolf Verizon Deutschland GmbH - Sebrathweg 20, 44149 Dortmund, Germany - Amtsgericht Dortmund, HRB 14952 - Geschäftsführer: Detlef Eppig - Vorsitzender des Aufsichtsrats: Francesco de Maio
RE: How to keep format of views source code as entered?
Tom, all, when I change a tables column name in Oracle SQL , SQLdeveloper (a SQL client) marks views that refer to the table using the old column name as erroneous. So, I can easily identify these cases. And of course I, as a user, I am acting in my context, i.e. my schema. So it is perfectly clear what table I am referring to. Please note: I am not developing any PL/SQL code. I don't have big development projects. I have more the role of an data analyst. I just create rather complex SQL queries which, from time to time, may need to be adopted to some new requirements. Or peers want to (re-) use (part of) my SQL queries. There is not really much versioning required. What I understood so far is: I can use e.g. DBeaver to interact with PostgreSQL, to develop my SQL code. But I finally need to copy the SQL code into e.g. Github. Which is a manual process. I'd mark the SQL code in the DBeaver editor window and copy&paste it into some file in e.g. GitHub. Using Github, I'd get version control and other enhanced collaboration features which I don't really need. At the price that code transfer from SQL (DBeaver) to the code repository and vice versa is complete manually?! This doesn't really look like an enhancement. Most likely, there are more professional ways to do that. I'd be glad to get advice. What I would like to have is something that would automatically update the SQL code in the software repository when I run a CREATE OR REPLACE VIEW. Ingolf Verizon Deutschland GmbH - Sebrathweg 20, 44149 Dortmund, Germany - Amtsgericht Dortmund, HRB 14952 - Geschäftsführer: Detlef Eppig - Vorsitzender des Aufsichtsrats: Francesco de Maio
Re: How to keep format of views source code as entered?
Hi! Today, I made an astonishing / disappointing experience related to that source code topic: You may not be able to delete tables / views that are referenced by other users objects, e.g. views. Unless you add the CASCADE option which will cause all depending views to be deleted as well. And the CASCASE will work and delete the other users view even when you don't have the permission to drop that other users view! Looks like the Oracle concept of marking views as invalid makes some sense... Regards, Ingolf -Original Message- From: Paul Förster [mailto:paul.foers...@gmail.com] Sent: 14 January 2021 07:16 To: Cybertec Schönig & Schönig GmbH Cc: Adam Brusselback ; David G. Johnston ; raf ; pgsql-generallists.postgresql.org Subject: [E] Re: How to keep format of views source code as entered? Hi Laurenz, > On 14. Jan, 2021, at 04:59, Laurenz Albe wrote: > > If PostgreSQL were to store the original text, either that text would become > wrong, or you would have to forbid renaming of anything that is referenced > by a view. this is why views, procedures, functions and packages can become invalid in Oracle, which I really hate because as a DBA, it's almost impossible to quickly see (or in fact see at all) why this happens, or having to debug applications that you don't know and/or can't fix anyway. Oracle's invalid object concept doesn't make sense. So, I'm not at all in favor of saving the original statement text. Cheers, Paul Verizon Deutschland GmbH - Sebrathweg 20, 44149 Dortmund, Germany - Amtsgericht Dortmund, HRB 14952 - Geschäftsführer: Detlef Eppig - Vorsitzender des Aufsichtsrats: Francesco de Maio
Foreign table performance issue / PostgreSQK vs. ORACLE
Hi! I am struggling with the slow performance when running queries referring to foreign tables. - Yes, I know... - Please read the whole story! The set-up basically is a production database and a reporting database. As names indicate, the production database is used for production, the reporting database is for analysis. On the reporting database, the only way to access product data is via foreign tables that link to the related production tables. Now, while some queries on the reporting service run fine, some don't even return any data after hours. However, the same set-up worked fine in Oracle before. Reporting wasn't always fast, but it delivered results in acceptable time. A query executed on the Oracle reporting server returns data in e.g. 30 seconds. But running the query translated to PostgreSQL on the PostgreSQL DB does not deliver a single row after hours (!) of run time. So, I wonder: Is there a fundamental difference between Oracle database links and foreign tables in PostgreSQL that could explain the different run times? Could there be some tuning option in PostgreSQL to make queries via foreign tables faster (e.g. I heard about option fetch_size)? Your pointes welcome! Verizon Deutschland GmbH - Sebrathweg 20, 44149 Dortmund, Germany - Amtsgericht Dortmund, HRB 14952 - Geschäftsführer: Detlef Eppig - Vorsitzender des Aufsichtsrats: Francesco de Maio
RE: Foreign table performance issue / PostgreSQK vs. ORACLE
Hi! My PostgreSQL version is 11.8. The query I am running is referring to a number of foreign tables. The first one (table1) has to IDs, let's say ID_A and ID_B. While ID_A is unique, ID_B is not. In my case, I am pulling formation for a value of IB_B for which about 800 rows (with unique ID_A) exist. I found: While select * from my_view where id_b='some value'; seemingly runs "forever" (I cancelled execution after a few hours), the following completes in about 1 hr: select * from my_view where ia_a in ( select id_a from table1 where id_b='some value' ); So, I tried smaller chunks of ID_a and found the execution time is non-linear with respect to number of IDs. For e.g. 50 ID_A's, it was completed in about 12 sec. Next I have split the ~800 ID_A's into chunks of 50 and submitted these 16 queries one after another. They all completed in about 12 secs, each. I then found the option fetch_size, e.g. ALTER SERVER some_server OPTIONS (fetch_size '5'). A chunk of 50 now executes in 2 seconds (instead of 12 before). So, I found the "size" of the query has a serious impact to the execution time. I don't really understand why execution 16*50 takes 16*2 secs only, but executing 1*800 takes about 3000 seconds... Regards, Ingolf -Original Message- From: Sebastian Dressler [mailto:sebast...@swarm64.com] Sent: 30 January 2021 11:45 To: Markhof, Ingolf Cc: pgsql-general@lists.postgresql.org Subject: [E] Re: Foreign table performance issue / PostgreSQK vs. ORACLE Hi Ingolf, > On 29. Jan 2021, at 13:56, Markhof, Ingolf > wrote: > > Hi! > > I am struggling with the slow performance when running queries referring to > foreign tables. – Yes, I know… - Please read the whole story! Done and it rings a bell or two. > The set-up basically is a production database and a reporting database. As > names indicate, the production database is used for production, the reporting > database is for analysis. On the reporting database, the only way to access > product data is via foreign tables that link to the related production tables. > > Now, while some queries on the reporting service run fine, some don't even > return any data after hours. > > However, the same set-up worked fine in Oracle before. Reporting wasn't > always fast, but it delivered results in acceptable time. A query executed on > the Oracle reporting server returns data in e.g. 30 seconds. But running the > query translated to PostgreSQL on the PostgreSQL DB does not deliver a single > row after hours (!) of run time. > > So, I wonder: Is there a fundamental difference between Oracle database links > and foreign tables in PostgreSQL that could explain the different run times? > Could there be some tuning option in PostgreSQL to make queries via foreign > tables faster (e.g. I heard about option fetch_size)? You did not explicitly mention it, but I assume you are using postgres_fdw to connect from reporting (R) to production (P). Thomas and Tom already mentioned incomplete/non-existing/non-applicable filter pushdowns. I want to add another probable root cause to the list explaining the behavior you experience. The postgres_fdw uses a CURSOR on P to execute the query. While this guarantees transaction safety, it also prohibits parallelism (PostgreSQL server-side cursors enforce a sequential plan). As a result, depending on the size of tables, indexes, and filters pushed down (or not), this probably results in slow-running queries. IMO, the worst-case scenario is that a sequential table scan without any filtering, and a single worker runs on the target. Of course, you can try to optimize schemas on P and queries on R, enabling more filter pushdown and eventually a faster execution. However, I believe this does not work with your entire workload, i.e. there will always be performance gaps. The parallelism issue is theoretically fixable by utilizing partitions on P. R then connects to P with multiple postgres_fdw-backed child tables. However, this will only work with a patch to postgres_fdw to implement "IsForeignScanParallelSafe" (see [1] for a possible implementation). Without this method, there will be no parallelism again. Without, the partitions scan occurs sequentially, not showing a performance gain. I want to mention there are proprietary options available (re-)enabling PostgreSQL parallelism with cursors. Such an extension can potentially fix your performance issue. However, I have not tried it so far with a setup similar to yours. Cheers, Sebastian [1]: https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_swarm64_parallel-2Dpostgres-2Dfdw-2Dpatch&d=DwIGaQ&c=udBTRvFvXC5Dhqg7UHpJlPps3mZ3LRxpb6__0PomBTQ&r=ivZWA-ECVj3XrXBe0obDwKY7Ui7K5Nj9oD2KKWLm0Bw&
dealing with dependencies
Hi! I recently switched from Oracle SQL to PostgreSQL. In Oracle, I can easily delete a table or view that is used by existing views. The system marks the affected views. I can then re-define the deleted table or view and have all dependent views easily re-compiled. Done. PostgreSQL instead is preventing inconsistency. It simply refuses to delete a view or table that is referenced by other views. Consequently, I need to delete all dependent views first, re-define the one I want to change and then create all dependent views deleted before... - Which is much more difficult to handle. What I especially dislike is that you cannot even insert a column into an existing view if that view is used by some other views. E.g.: create table my_table (col1 text, col2 text); create view my_view1 as select col1, col2 from my_table; create view my_view2 as select col1, col2 from my_view1; create or replace view my_view1 as select col1, col1||col2, col2 from my_table; --> ERROR: Cannot change name of view column "col2" to .. The create or replace of view 2 fails. Clear, the manual states about create or replace view: "the new query must generate the same columns that were generated by the existing view query (that is, the same column names in the same order and with the same data types), but it may add additional columns to the end of the list.". Obviously, the columns are internally referenced by index, not by name. But if I want my new column between to exiting ones, I need to deleted my_view2, first... I wonder how you deal with it in a professional way. Sounds like some type of "make" (that UNIX tool dealing with dependencies in the context of e.g. programming in C) would be helpful... So, in an environment of rapid prototyping, if you develop the data-base design and view for tables etc. and you then need to make changes to a base table that affect all the views using it, there should be another way than doing all this manually...?! Thx for your pointers! I. == Verizon Deutschland GmbH - Sebrathweg 20, 44149 Dortmund, Germany - Amtsgericht Dortmund, HRB 14952 - Geschäftsführer: Detlef Eppig - Vorsitzender des Aufsichtsrats: Francesco de Maio
Regexp_replace bug / does not terminate on long strings
BRIEF: regexp_replace(source,pattern,replacement,flags) needs very (!) long to complete or does not complete at all (?!) for big input strings (a few k characters). (Oracle SQL completes the same in a few ms) VERBOSE Given a comma-separated list of "words" (whereas a word is any sequence of characters not containing a comma) I want to delete sequences of duplicates, e.g. 'one,one,one,two,two,three' --> 'one,two,three' I use this regexp_replace to delete duplicate "words" in a comma separated string: select regexp_replace( 'one,one,one,two,two,three', -- input string '([^,]+)(,\1)*($|,)', -- pattern '\1\3', -- replacement 'g' -- apply globally (all matches) ); However, I found that this regexp_replace seemingly runs "endless" (I cancelled the query after a few minutes) for bigger input strings. Such as: '1/1,1/1,1/1,1/1,1/1,1/1,1/1,1/1,1/1,1/1,1/1,2/1,2/1,2/1,2/1,2/2,2/1,2/1,2/2,2/2,2/1,2/1,2/2,2/2,2/1,2/2,2/1,2/1,2/2,2/1,2/1,2/2,3/1,3/1,3/2,3/1,3/1,3/1,3/1,3/1,3/1,3/3,3/1,3/1,3/1,3/3,3/1,3/1,3/2,3/1,3/1,3/1,3/3,3/3,3/1,3/1,4/1,4/1,4/1,4/1,4/1,4/1,5/2,5/5,5/1,5/5,5/5,5/2,5/1,5/1,5/5,5/1,6/1,6/1,6/1,6/1,6/3,6/6,6/1,6/1,6/1,6/3,6/1,6/1,6/1,6/1,6/1,6/1,6/6,6/1,7/1,7/3,7/1,7/1,7/1,7/5,7/1,7/1,7/1,7/1,7/1,7/1,7/1,7/5,7/1,7/3,7/1,8/1,8/1,8/2,8/2,8/1,8/6,8/1,8/1,8/1,8/1,8/6,8/1,8/1,8/1,9/2,9/1,9/1,9/2,10/4,10/2,10/2,10/2,10/2,10/1,10/4,10/10,10/10,10/2,10/1,10/1,10/2,10/1,10/8,10/1,10/3,10/2,10/5,10/10,10/2,10/10,10/2,10/3,10/1,10/1,10/1,10/1,10/8,10/5,12/5,12/3,12/5,12/5,12/1,12/5,12/1,12/3,12/1,12/1,12/5,12/2,12/1,12/0.768,12/1,12/2,12/2,12/2,12/2,12/2,12/1,12/1,14/3,15/1,15/10,15/1,15/2,15/3,15/2,15/1,15/15,15/1,15/2,15/4,15/15,15/5,15/1,15/2,15/15,15/1,15/5,15/1,15/3,15/5,15/5,15/1,15/10,15/4,15/2,15/2,15/15,15/3,15/2,15/2,15/3,15/3,16/3,16/3,18/4,18/3,18/1,18/2,18/2,18/2,18/4,18/2,18/2,18/1,18/3,20/20,20/5,20/0.896,20/5,20/1,20/2,20/1,20/3,20/4,20/5,20/10,20/20,20/10,20/5,20/1,20/1,20/4,20/2,20/1,20/1,20/3,20/4,20/20,20/2,20/20,20/2,20/20,20/20,24/3,24/4,24/2,24/4,24/3,24/3,24/3,24/3,25/3,25/4,25/10,25/25,25/6,25/1,25/7,25/2,25/5,25/2,25/2,25/25,25/3,25/25,25/25,25/10,25/3,25/10,25/25,25/6,25/4,25/25,25/5,25/5,25/3,25/1,25/5,25/10,25/25,25/7,25/14,25/5,25/5,25/5,25/3,25/5,25/14,25/2,30/2,30/7,30/3,30/8,30/15,30/1,30/4,30/7,30/2,30/5,30/30,30/8,30/5,30/5,30/5,30/8,30/8,30/1,30/10,30/3,30/30,30/10,30/4,30/30,30/30,30/3,30/1,30/15,30/3,30/5,30/3,35/5,35/12,35/5,35/10,35/3,35/3,35/4,35/10,35/12,35/5,35/5,35/4,40/15,40/4,40/40,40/2,40/10,40/10,40/5,40/3,40/3,40/40,40/4,40/15,40/10,40/2,40/5,45/6,45/6,45/6,45/6,45/6,50/8,50/4,50/50,50/8,50/15,50/7,50/3,50/20,50/25,50/50,50/5,50/50,50/12,50/7,50/4,50/15,50/10,50/8,50/3,50/2,50/20,50/25,50/10,50/8,50/50,50/5,50/5,50/50,50/10,50/10,50/10,50/5,50/5,50/4,50/10,50/50,50/5,50/50,50/8,50/8,50/50,50/50,50/10,50/12,50/2,50/5,55/10,55/10,55/5,55/5,60/3,60/25,60/4,60/60,60/30,60/25,60/6,60/6,60/10,60/5,60/5,60/3,60/10,60/5,60/5,60/10,60/30,60/6,60/5,60/10,60/60,70/10,70/10,75/15,75/3,75/4,75/10,75/20,75/5,75/6,75/8,75/6,75/7,75/75,75/7,75/15,75/25,75/15,75/7,75/3,75/15,75/8,75/30,75/75,75/8,75/8,75/5,75/20,75/75,75/6,75/30,75/15,75/75,75/8,75/25,75/15,75/7,75/75,75/10,75/4,80/5,80/5,90/50,90/50,100/7,100/10,100/10,100/10,100/10,100/10,100/8,100/20,100/10,100/20,100/20,100/5,100/25,100/8,100/5,100/100,100/8,100/20,100/8,100/10,100/20,100/7,100/6,100/50,100/15,100/10,100/2,100/35,100/10,100/10,100/35,100/30,100/100,100/5,100/40,100/35,100/100,100/50,100/35,100/30,100/7,100/10,100/10,100/7,100/25,100/100,100/40,100/5,100/15,100/6,100/7,100/20,100/10,100/2,100/20,105/10,105/20,105/10,105/20,120/15,120/10,120/15,120/15,150/150,150/150,150/75,150/20,150/20,150/20,150/30,150/75,150/25,150/15,150/25,150/150,150/15,150/6,150/6,150/30,150/20,200/15,200/15,200/20,200/10,200/40,200/15,200/40,200/50,200/7,200/20,200/15,200/25,200/20,200/7,200/15,200/7,200/15,200/20,200/20,200/200,200/15,200/50,200/10,200/20,200/20,200/20,200/200,200/20,200/25,200/7,240/15,240/15,250/20,250/50,250/20,250/10,250/10,250/25,250/250,250/250,250/25,250/50,250/25,300/20,300/20,300/30,300/7,300/20,300/300,300/300,300/20,300/30,300/20,300/7,300/10,300/20,300/20,300/30,300/20,300/7,300/7,300/20,300/30,300/30,300/300,300/50,300/300,300/30,300/10,300/300,300/300,300/50,300/300,400/20,400/20,400/25,400/25,450/50,450/50,500/500,500/50,500/50,500/50,500/50,500/500,500/500,500/500,500/35,500/25,500/35,500/25,500/500,600/40,600/40,1000/20,1000/40,1000/20,1000/1000,1000/20,1000/35,1000/1000,1000/20,1000/50,1000/500,1000/50,1000/50,1000/1000,1000/500,1000/1000,1000/35,1000/40' I also run the same in Oracle SQL where the same completes "immediately" (with in a few ms). There seems to be some issue in the implementation of regexp_replace or regular expressions in general in PostgeSQL... FYI: I have: select version(); PostgreSQL 11.9 on aarch64-unknown-linux-gnu, compiled by aarch64-unknown-linux-gnu-gcc (GCC) 7.4.0, 64-bit Any comments? ==
string_agg distinct order by
I am looking for something like string_agg(distinct col_x order by col_y) Unfortunately, you can either have the distinct, but then the order by needs to be identical to what's aggregated, or you can have the order be determined by another column. Not both... Here is the playground Given: create table sites ( state text, city text, col_a text, col_b text ); insert into sites values ('Texas','Dallas','green','green'); insert into sites values ('Texas','Houston','green','green'); insert into sites values ('Texas','Austin','yellow','green'); insert into sites values ('Texas','Waco','yellow','yellow'); insert into sites values ('Texas','Midland','red','red'); insert into sites values ('Texas','Amarillo','red','yellow'); For each city, there is a status denoted by colour combination, e.g. 'green / green'. This is stored in two different columns in the table. There is an order in the colours: create table colours ( colour text, value integer ); insert into colours values ('red', 1); insert into colours values ('yellow', 2); insert into colours values ('green', 3); So, red first, yellow second, green last. I want an aggregated view showing for each state the list of existing status combinations, such as: with site_status as ( select state, city, col_a || '/' || col_b as status, ca.value as val_a, cb.value as val_b from sites a join colours ca on ca.colour=a.col_a join colours cb on cb.colour=a.col_b ) select state, string_agg(distinct status,',') as list from site_status group by state ; This results in: Texas green/green,red/red,red/yellow,yellow/green,yellow/yellow By using distinct in the string_agg, I avoid double entries. Fine. But now, I want the data ordered. E.g. in the order of the first colour. I SQL, this could read... string_agg(distinct status,',' order by val_a) as list but this doesn't work. I get: SQL Error [42P10]: ERROR: in an aggregate with DISTINCT, ORDER BY expressions must appear in argument list So, I could say: string_agg(distinct status,',' order by status) as list but this is not what I want: 'green' would get first, red second, yellow last... I could also drop the distinct and say: string_agg(status,',' order by val_a) as list This would return the list in correct order, but with double values ('green/green') showing up twice. I tried to delete the double entries via regexp_replace, but found this doesn't work for rather long strings (where a single regexp_replace can run many minutes!) Any pointers? Thank you very much for any idea. == Verizon Deutschland GmbH - Sebrathweg 20, 44149 Dortmund, Germany - Amtsgericht Dortmund, HRB 14952 - Geschäftsführer: Detlef Eppig - Vorsitzender des Aufsichtsrats: Francesco de Maio
Re: [E] Re: Regexp_replace bug / does not terminate on long strings
Hi Tom, thank you very much for your reply. Actually, I was assuming all these regular expressions are based on the same core implementation. Interestingly, this doesn't seem to be true... I am also surprised that you say the (\1)+ subpattern is computationally expensive. Regular expressions are greedy by default. I.e. in case of a* matching against a string of 1000 a's, the system will not try a, aa, aaa, ... and so on, right? Instead, it will consume all the a's in one go. Likewise, I was expecting that the system would eat all the repetitions of a word in one go. Your proposal to use (\1)? instead at the first glance seems to require more effort, because the same word and it's matching successor will need to be matched again and again and again. Roughly, 2N matches are to be done instead of just N. However, you are perfectly right: When I use (\1)? instead of (\1)+, the expression is evaluates quickly! Thank you very much for looking into this and for proposing the alternative approach which is working fine. Regards Ingolf On Fri, Aug 20, 2021 at 12:42 AM Tom Lane wrote: > "Markhof, Ingolf" writes: > > BRIEF: > > regexp_replace(source,pattern,replacement,flags) needs very (!) long to > > complete or does not complete at all (?!) for big input strings (a few k > > characters). (Oracle SQL completes the same in a few ms) > > Regexps containing backrefs are inherently hard --- every engine has > strengths and weaknesses. I doubt it'd be hard to find cases where > our engine is orders of magnitude faster than Oracle's; but you've > hit on a case where the opposite is true. > > The core of the problem is that it's hard to tell how much of the > string could be matched by the (,\1)* subpattern. In principle, *all* > of the remaining string could be, if it were N repetitions of the > initial word. Or it could be N-1 repetitions followed by one other > word, and so on. The difficulty is that since our engine guarantees > to find the longest feasible match, it tries these options from > longest to shortest. Usually the actual match (if any) will be pretty > short, so that you have O(N) wasted work per word, making the runtime > at least O(N^2). > > I think your best bet is to not try to eliminate multiple duplicates > at a time. Get rid of one dup at a time, say by > str := regexp_replace(str, '([^,]+)(,\1)?($|,)', '\1\3', 'g'); > and repeat till the string doesn't get any shorter. > > I did come across a performance bug [1] while poking at this, but > alas fixing it doesn't move the needle very much for this example. > > regards, tom lane > > [1] > https://urldefense.proofpoint.com/v2/url?u=https-3A__www.postgresql.org_message-2Did_1808998.1629412269-2540sss.pgh.pa.us&d=DwIBAg&c=udBTRvFvXC5Dhqg7UHpJlPps3mZ3LRxpb6__0PomBTQ&r=ivZWA-ECVj3XrXBe0obDwKY7Ui7K5Nj9oD2KKWLm0Bw&m=q11bVTHCxVx8BQu2pjn6-3nOY8aN8hORXofVK38HqF8&s=hJxrzmTT6G7AUomoeFgh0IGDO3NcUP4gB9kvYHnt3m0&e= > == Verizon Deutschland GmbH - Sebrathweg 20, 44149 Dortmund, Germany - Amtsgericht Dortmund, HRB 14952 - Geschäftsführer: Detlef Eppig - Vorsitzender des Aufsichtsrats: Francesco de Maio
Re: [E] Re: Regexp_replace bug / does not terminate on long strings
Thank you very much for all your proposals! Ingolf == Verizon Deutschland GmbH - Sebrathweg 20, 44149 Dortmund, Germany - Amtsgericht Dortmund, HRB 14952 - Geschäftsführer: Detlef Eppig - Vorsitzender des Aufsichtsrats: Francesco de Maio
Re: [E] Re: Regexp_replace bug / does not terminate on long strings
Right. Considering a longer sequence of a's, "(a*)\1" allows a wide variety of matches. But in fact, this is not what I was trying to use. I was more looking at "(a)\1*" which shall match exactly what "a+" matches. As matching is greedy, "(a)\1*" shall consume all a's in a sequence in one go, just like "a+" does...?! Regards, Ingolf On Fri, Aug 20, 2021 at 6:52 PM Tom Lane wrote: > "Markhof, Ingolf" writes: > > thank you very much for your reply. Actually, I was assuming all these > > regular expressions are based on the same core implementation. > > They are not. There are at least three fundamentally different > implementation technologies (DFA, NFA, hybrid). Friedl's "Mastering > Regular Expressions" cites multiple different programs using each > of those, every one of which behaves a bit differently when you start > poking at corner cases. And that's just in the open-source world; > I don't know what Oracle is using, but I bet it ain't open source. > > > I am also surprised that you say the (\1)+ subpattern is computationally > > expensive. Regular expressions are greedy by default. I.e. in case of a* > > matching against a string of 1000 a's, the system will not try a, aa, > aaa, > > ... and so on, right? Instead, it will consume all the a's in one go. > > "a*" is easy. "(a*)\1" is less easy --- if you let the a* consume the > whole string, you will not get a match, even though one is possible. > In general, backrefs create a mess in what would otherwise be a pretty > straightforward concept :-(. > > regards, tom lane > == Verizon Deutschland GmbH - Sebrathweg 20, 44149 Dortmund, Germany - Amtsgericht Dortmund, HRB 14952 - Geschäftsführer: Detlef Eppig - Vorsitzender des Aufsichtsrats: Francesco de Maio
Re: [E] Re: string_agg distinct order by
Tom, I see your point. Which is valid considering there could be any value in col_y for some value in col_x. But in my case, col_y is a a function of col_x, i.e. two rows with the same value in row_x will have the same value in row_y as well. Consider, you need to store some length values. Like this: create table items ( id text, len integer, unit text ); insert into items values (1,1,'mm'); insert into items values (2,5,'mm'); insert into items values (3,5,'mm'); insert into items values (4,1,'cm'); insert into items values (5,1,'cm'); insert into items values (6,1,'m'); insert into items values (7,1,'m'); insert into items values (7,2,'m'); insert into items values (8,2,'m'); insert into items values (9,5,'m'); With the view... create view vu_items as select id, len || unit as descr, len*case unit when 'mm' then 1 when 'cm' then 10 when 'm' then 1000 end as len_mm from items; ...I now want to have a list of all distinct descr ordered by length. But... select string_agg(descr,',' order by len_mm) from vu_items; ...creates a list with duplicates, only: 1mm,5mm,5mm,1cm,1cm,1m,1m,2m,2m,5m And... select string_agg(distinct descr,',' order by descr) from vu_items; ...gives a list of distinct values, but in the wrong order: 1cm,1m,1mm,2m,5m,5mm My solution to this (now) is: select regexp_replace( string_agg(descr,',' order by len_mm), '([^,]+)(,\1)?($|,)', '\1\3', 'g' ) from vu_items; Thx again for your hint in the regexp_replacy issue in my other post... Regards, Ingolf On Thu, Aug 19, 2021 at 6:30 PM Tom Lane wrote: > "Markhof, Ingolf" writes: > > I am looking for something like > > string_agg(distinct col_x order by col_y) > > > Unfortunately, you can either have the distinct, but then the order by > > needs to be identical to what's aggregated, or you can have the order be > > determined by another column. Not both... > > The reason for that restriction is that the case you propose is > ill-defined. If we combine rows with the same col_x, which row's > value of col_y is to be used to sort the merged row? I think once > you answer that question, a suitable query will suggest itself. > > regards, tom lane > == Verizon Deutschland GmbH - Sebrathweg 20, 44149 Dortmund, Germany - Amtsgericht Dortmund, HRB 14952 - Geschäftsführer: Detlef Eppig - Vorsitzender des Aufsichtsrats: Francesco de Maio
Re: [E] Re: Regexp_replace bug / does not terminate on long strings
Argh... Yes, When I use (\1)? instead of (\1)+, the expression is evaluated quickly, but it doesn't return what I want. Once a word is written, it is not subject to matching again. i.e. select regexp_replace( --> remove double entries 'one,one,one,two,two,three,three', '([^,]+)(,\1)?($|,)', '\1\3', 'g' ) as res; returns 'one,one,two,three'. The first 'one' is found, followed by a ',one, so the 'one,one,' is replaced by a 'one,'. It looks like the system is 'reading' from an input string and writing to output string. And the 'one,' send to the output is not matched again. So the system instead finds just another 'one,' in the input string and writes another 'one,' to the output string. Honestly, this behaviour seems to be incorrect for me. Once the system replaces the first two 'one,one,' by a single 'one,', I'd expect to match this replaced one 'one,' with the next 'one,' following, replacing these two by another, single 'one,', again... Regards, Ingolf Ingolf Markhof International Network Product - International Access < inp-intlacc...@verizon.com> Sebrathweg 20, 44149 Dortmund, Germany Office: +49 231 972 1475 | Vnet: 317-1475 On Fri, Aug 20, 2021 at 6:11 PM Markhof, Ingolf < ingolf.mark...@de.verizon.com> wrote: > Hi Tom, > > thank you very much for your reply. Actually, I was assuming all these > regular expressions are based on the same core implementation. > Interestingly, this doesn't seem to be true... > > I am also surprised that you say the (\1)+ subpattern is computationally > expensive. Regular expressions are greedy by default. I.e. in case of a* > matching against a string of 1000 a's, the system will not try a, aa, aaa, > ... and so on, right? Instead, it will consume all the a's in one go. > Likewise, I was expecting that the system would eat all the repetitions of > a word in one go. Your proposal to use (\1)? instead at the first glance > seems to require more effort, because the same word and it's matching > successor will need to be matched again and again and again. Roughly, 2N > matches are to be done instead of just N. > > However, you are perfectly right: When I use (\1)? instead of (\1)+, the > expression is evaluates quickly! > > Thank you very much for looking into this and for proposing the > alternative approach which is working fine. > > Regards > Ingolf > > > > On Fri, Aug 20, 2021 at 12:42 AM Tom Lane wrote: > >> "Markhof, Ingolf" writes: >> > BRIEF: >> > regexp_replace(source,pattern,replacement,flags) needs very (!) long to >> > complete or does not complete at all (?!) for big input strings (a few k >> > characters). (Oracle SQL completes the same in a few ms) >> >> Regexps containing backrefs are inherently hard --- every engine has >> strengths and weaknesses. I doubt it'd be hard to find cases where >> our engine is orders of magnitude faster than Oracle's; but you've >> hit on a case where the opposite is true. >> >> The core of the problem is that it's hard to tell how much of the >> string could be matched by the (,\1)* subpattern. In principle, *all* >> of the remaining string could be, if it were N repetitions of the >> initial word. Or it could be N-1 repetitions followed by one other >> word, and so on. The difficulty is that since our engine guarantees >> to find the longest feasible match, it tries these options from >> longest to shortest. Usually the actual match (if any) will be pretty >> short, so that you have O(N) wasted work per word, making the runtime >> at least O(N^2). >> >> I think your best bet is to not try to eliminate multiple duplicates >> at a time. Get rid of one dup at a time, say by >> str := regexp_replace(str, '([^,]+)(,\1)?($|,)', '\1\3', 'g'); >> and repeat till the string doesn't get any shorter. >> >> I did come across a performance bug [1] while poking at this, but >> alas fixing it doesn't move the needle very much for this example. >> >> regards, tom lane >> >> [1] >> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.postgresql.org_message-2Did_1808998.1629412269-2540sss.pgh.pa.us&d=DwIBAg&c=udBTRvFvXC5Dhqg7UHpJlPps3mZ3LRxpb6__0PomBTQ&r=ivZWA-ECVj3XrXBe0obDwKY7Ui7K5Nj9oD2KKWLm0Bw&m=q11bVTHCxVx8BQu2pjn6-3nOY8aN8hORXofVK38HqF8&s=hJxrzmTT6G7AUomoeFgh0IGDO3NcUP4gB9kvYHnt3m0&e= >> >
Re: [E] Re: Regexp_replace bug / does not terminate on long strings
You are right, I also found the same behaviour when using e.g the UNIX sed command. Ingolf On Mon, Aug 23, 2021 at 4:24 PM Francisco Olarte wrote: > Ingolf: > > On Mon, Aug 23, 2021 at 2:39 PM Markhof, Ingolf > wrote: > > Yes, When I use (\1)? instead of (\1)+, the expression is evaluated > quickly, but it doesn't return what I want. Once a word is written, it is > not subject to matching again. i.e. > > select regexp_replace( --> remove double entries > > 'one,one,one,two,two,three,three', > > '([^,]+)(,\1)?($|,)', > > '\1\3', > > 'g' > >) as res; > > > ... > > Honestly, this behaviour seems to be incorrect for me. Once the system > replaces the first two 'one,one,' by a single 'one,', I'd expect to match > this replaced one 'one,' with the next 'one,' following, replacing these > two by another, single 'one,', again... > > I think your expectation is misguided. All the regexp engines I've > used do it this way, when asked to match "g"lobally they do > non-overlapping matches, they do not substitute and recurse with the > modified string. > > Also, your way opens the door to run-away or infinite loops ( > rr('a','a','aa','g') or rr('a','a','a','g'), not to speak of > r('x','','','g') ). Even a misguided r(str, '_+','_','g'), used > sometimes to normalize space runs and similar things, can go into a > loop. > > Francisco Olarte. > == Verizon Deutschland GmbH - Sebrathweg 20, 44149 Dortmund, Germany - Amtsgericht Dortmund, HRB 14952 - Geschäftsführer: Detlef Eppig - Vorsitzender des Aufsichtsrats: Francesco de Maio