Re: Please implement a catch-all error handler per row, for COPY

2025-03-02 Thread me nefcanto
Thank you all for responding. With respect, I don't understand why COPY
keeps getting tied to ETL. None of the requirements I mentioned above has
anything to do with ETL. We have a table of categories, a hierarchical
table, in a multitenant app, with hundreds of thousands of records in it.
We want to calculate the hierarchical properties (nested-set values, node
depth, leaf detection, materialized paths, etc.) and then update the
entire table. What does that have to do with ETL? Or, as I said, we create
millions of records for products, product attributes, and product pricing
and media to load-test our system. Again, what does that have to do with ETL?

The point is that there is already an `on_error ignore` clause. That means
that somewhere there is already a per-row try/catch; if I'm wrong, please
let me know. But if `on_error ignore` can catch problem X for each row,
then it can catch all problems for each row without a performance penalty.
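
For reference, this is the clause I mean; a minimal sketch in PostgreSQL 17
syntax, with a made-up table and file name:

-- on_error ignore skips rows whose values cannot be converted to the
-- target column types, instead of aborting the whole load; constraint
-- violations (for example a unique constraint) still abort the COPY
copy products (sku, title, price)
from '/tmp/products.csv'
with (format csv, on_error ignore);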

Let me give you an example in C#:

try
{
}
catch (SomeException ex)
{
}

becomes:

try
{
}
catch (Exception ex)
{
    if (ex is SomeException)
    {
    }
}

The catch clause in the second version catches everything. How does that
affect performance? Running a simple type check for hundreds of millions
of iterations adds negligible time.

As I noted in the bug thread, 7 of the 11 RDBMSs I surveyed support this,
so it's not an uncommon or weird request.

Regards
Saeed

On Sat, Mar 1, 2025 at 8:45 PM Greg Sabino Mullane 
wrote:

> FYI the -bugs thread in question:
>
>
> https://www.postgresql.org/message-id/flat/CAEHBEOBCweDWGNHDaUk4%3D10HG0QXXJJAGXbEnFLMB30M%2BQw%2Bdg%40mail.gmail.com
>
> seems to imply the primary blocker was a unique constraint.
>
> Cheers,
> Greg
>
> --
> Crunchy Data - https://www.crunchydata.com
> Enterprise Postgres Software Products & Tech Support
>
>


Question about querying distributed databases

2025-03-04 Thread me nefcanto
Hello

Consider this scenario:

   - 3 servers, 3 databases, each on a separate server:
      - *Products database*: contains the *Products* table (with over 100,000 records)
      - *Taxonomy database*: contains the *Categories* and *ItemCategories (EAV)* tables
      - *Attributes database*: contains the *Attributes* and *ItemAttributes (EAV)* tables

How do you find products based on the following criteria?

   1. A search in the title (e.g., "awesome shirts").
   2. Selected categories (e.g., "casual" and "sports").
   3. Selected attributes (e.g., "color: blue" and "size: large")


Regards
Saeed


Re: Question about querying distributed databases

2025-03-04 Thread me nefcanto
Adrian Klaver, thank you for the link. I asked the AI to create a query for
me using FDW.

This is the sample query:

with filtered_products as (
    select p.product_id
    from products.product p
    where p.title ilike '%search_term%'
), category_filtered as (
    select ic.product_id
    from taxonomy.item_categories ic
    where ic.category_id = any(array['category_id_1', 'category_id_2'])
), attribute_filtered as (
    select ia.product_id
    from attributes.item_attributes ia
    where ia.attribute_id = any(array['attribute_id_1', 'attribute_id_2'])
), final_products as (
    select f.product_id
    from filtered_products f
    join category_filtered c on f.product_id = c.product_id
    join attribute_filtered a on f.product_id = a.product_id
    order by f.product_id -- replace with relevant sorting column
    limit 50 offset 0
)
select p.*
from products.product p
join final_products fp on p.product_id = fp.product_id;

The problem here is that it collects all of the product_id values from the
ItemCategories table. Let's say each product is put in one category only.
This means that we have 100 thousand records in the ItemCategories table.
Thus, to show a list of 20 products on the website, this query first
fetches 100 thousand product_id values from the remote server.

That's not scalable. Is there a workaround for this?

Thank you
Saeed

On Wed, Mar 5, 2025 at 8:12 AM Adrian Klaver 
wrote:

> On 3/4/25 20:40, me nefcanto wrote:
> > Hello
> >
> > Consider this scenario:
> >
> >   * 3 servers, 3 databases, each on a separate server:
> >   o *Products database*: Contains the *Products* table (with over
> > 100,000 records).
> >   o *Taxonomy database*: Contains the *Categories* and
> > *ItemCategories (EAV)* tables.
> >   o *Attributes database*: Contains the *Attributes* and
> > *ItemAttributes (EAV)* tables.
> >
> > How do you find products based on the following criteria?
>
> https://www.postgresql.org/docs/current/postgres-fdw.html
>
> >
> >  1. A search in the title (e.g., "awesome shirts").
> >  2. Selected categories (e.g., "casual" and "sports").
> >  3. Selected attributes (e.g., "color: blue" and "size: large")
> >
> >
> > Regards
> > Saeed
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>
>


Re: Question about querying distributed databases

2025-03-05 Thread me nefcanto
I appreciate your time guys. Thank you very much.

> Having 1 table per database per server is too ugly.

Our databases are not one table per database. They are mapped to DDD's
bounded contexts, usually with one table per domain entity.
For example, we have these databases:


   - Contacts
   - Courses
   - Seo
   - Payment
   - Forms
   - Geo
   - Sales
   - Media
   - Taxonomy
   - ...

These are the tables we have in the Contacts database:


   - Addresses
   - AddressTypes
   - Attributes
   - BankAccounts
   - ContactContents
   - Contacts
   - Emails
   - Genders
   - JobTitles
   - JuridicalPersons
   - NaturalPersonRelations
   - NaturalPersons
   - Persons
   - Phones
   - PhoneTypes
   - Relations
   - RelationTypes
   - SocialNetworks
   - SocialProfiles
   - Titles

And, these are the tables we have in the Geo database:


   - AdministrativeDivisions
   - AdministrativeDivisionTypes
   - Cities
   - CityDivisions
   - Countries
   - Locations
   - SpatialDataItems
   - TelephonePrefixes
   - TimeZones

But we also do have databases that contain only one table. The number of
tables is not our criterion for splitting them; the business semantics is.

> Cross-database on MSSQL is identical to the cross schema on Postgres.

A cross-database query in SQL Server is not equivalent to a cross-schema
query in Postgres, because SQL Server also has the concept of schemas. In
other words, both SQL Server and Postgres let you create databases, create
schemas inside them, and create tables inside those schemas. So it is SQL
Server's cross-schema query that equals Postgres's cross-schema query,
while SQL Server's cross-database query has no direct Postgres counterpart.
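
To illustrate with a minimal sketch (schema and table names are made up):
inside a single Postgres database, a cross-schema join uses two-part names,
while SQL Server's three-part database.schema.table form has no Postgres
counterpart without FDW:

-- a cross-schema join inside one Postgres database (two-part names)
select p.product_id, ic.category_id
from products.products p
join taxonomy.item_categories ic on ic.product_id = p.product_id;

-- SQL Server additionally allows three-part names such as
-- Taxonomy.dbo.ItemCategories across databases on the same instance;
-- Postgres has no equivalent without a foreign data wrapper.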

> If you truly need cross server support (versus say beefier hardware) how
did you come to choose postgres?

We chose Postgres for these reasons that we did R&D about:


   - Native per-column array support
   - A single storage engine, rather than MariaDB's multiple engines to choose between
   - Support for expressions in unique indexes
   - It's usually considered one of the best when it comes to performance, especially for GIS, which we intend to build on further
   - As its website claims, it's the most advanced open-source database engine (though, to be honest, we saw serious drawbacks to that claim)

But here's the deal: we don't have only one project, and we don't need
*cross-server queries* for all of our projects. But we try to keep our
architecture the same across projects as much as we can. We chose Postgres
because we had experience with SQL Server and MariaDB and assumed that
cross-database queries on the same server were a given; both of them
support that, and both are very performant at it. On MariaDB all you have
to do is use `db_name.table_name`, and on SQL Server all you have to do is
use `database_name.schema_name.table_name`. So we thought that for projects
that do not need more than one server, we would keep the databases on the
same server, and when a project needed more resources, we would start by
moving heavy databases onto their own servers and implementing table
partitioning on them.
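
To be concrete about what we mean by table partitioning on the heavy
databases, here is a minimal sketch of Postgres declarative partitioning;
the table and column names are hypothetical, not our real schema:

-- range-partition a large table by creation date;
-- the primary key must include the partition key
create table orders (
    order_id   bigint not null,
    tenant_id  bigint not null,
    created_at timestamptz not null,
    primary key (order_id, created_at)
) partition by range (created_at);

create table orders_2025 partition of orders
    for values from ('2025-01-01') to ('2026-01-01');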

But we have also seen some amazing improvements in our initial tests. For
example, creating all databases, tables, and other database objects takes
more than 400 seconds on MariaDB, while the same took 80 seconds on
Postgres. So the DDL performance is amazing.
Also, bulk-inserting 1 million records on Postgres takes roughly one-sixth
to one-fourth of the time it takes on MariaDB. These are valuable numbers,
and they encouraged us to keep digging to see whether we can make this
migration work.

Regards
Saeed

On Thu, Mar 6, 2025 at 7:14 AM Rob Sargent  wrote:

>
>
> On Mar 5, 2025, at 8:03 PM, Igor Korot wrote: [...]
>
> ... it worked great for SQL Server. If you're small, we host them all on
> one server. If you get bigger, we can put heavy databases on separate
> machines.
>
>
>> However, I don't have experience working with other types of database
>> scaling. I have used table partitioning, but I have never used sharding.
>>
>> Anyway, that's why I asked you guys. However, encouraging me to go back
>> to monolith without giving solutions on how to scale, is not helping. To be
>> honest, I'm somehow disappointed by how the most advanced open source
>> database does not support cross-database querying just like how SQL Server
>> does. But if it doesn't, it doesn't. Our team should either drop it as a
>> choice or find a way (by asking the experts who built it or use it) how to
>> design based on its features. That's why I'm asking.
>>
>>
> Cross-database on MSSQL is identical to cross schema on postgres. If you
> truly need cross server support (versus say beefier hardware) how did you
> come to choose postgres?  The numbers you present are impressive but not
> unheard of on this list.
>
>


Re: Question about querying distributed databases

2025-03-05 Thread me nefcanto
I once worked with a monolithic SQL Server database with more than 10
billion records and about 8 Terabytes of data. A single backup took us more
than 21 days. It was a nightmare. Almost everybody knows that scaling up
has a ceiling, but scaling out has no boundaries.

Therefore I will never choose a monolithic database design unless it's a
small project. But my examples are just examples. We predict 100 million
records per year, so we have to design accordingly. And it's not just
sales records; many applications have data that is cheap per row but vast
in volume. Consider a language-learning app that wants to store the known
words of every learner: 10 thousand learners each knowing 2 thousand words
means 20 million records. Grow that to 100 thousand learners each knowing
7 thousand words and you are approaching a billion records. Cheap, but
necessary. And that's before we even get into telemetry or time-series data.

We initially chose to break the database into smaller databases because it
seemed natural for our modularized monolith architecture, and it worked
great for SQL Server. If a deployment is small, we host them all on one
server; if it grows, we can put the heavy databases on separate machines.

However, I don't have experience working with other types of database
scaling. I have used table partitioning, but I have never used sharding.

Anyway, that's why I asked you guys. However, encouraging me to go back to
a monolith without offering solutions for how to scale is not helping. To
be honest, I'm somewhat disappointed that the most advanced open-source
database does not support cross-database querying the way SQL Server does.
But if it doesn't, it doesn't. Our team should either drop it as a choice
or find a way (by asking the experts who built it or use it) to design
based on its features. That's why I'm asking.

One thing that comes to my mind is to use custom types: instead of storing
data in ItemCategories and ItemAttributes, store it as arrays in the
relevant tables in the same database. But then it seems to me that in this
case Mongo would become a better choice, because I lose the relational
nature and normalization to some degree. What drawbacks have you
experienced in that regard?

Regards
Saeed

On Wed, Mar 5, 2025 at 7:38 PM Adrian Klaver 
wrote:

> On 3/5/25 04:15, me nefcanto wrote:
> > Dear Laurenz, the point is that I think if we put all databases into one
> > database, then we have blocked our growth in the future.
>
> How?
>
> > A monolith database can be scaled only vertically. We have had huge
> > headaches in the past with SQL Server on Windows and a single database.
> > But when you divide bounded contexts into different databases, then you
> > have the chance to deploy each database on a separate physical machine.
> > That means a lot in terms of performance. Please correct me if I am
> wrong.
>
> And you add the complexity of talking across machines, as well as
> maintaining separate machines.
>
> >
> > Let's put this physical restriction on ourselves that we have different
> > databases. What options do we have? One option that comes to my mind, is
> > to store the ID of the categories in the Products table. This means that
> > I don't need FDW anymore. And databases can be on separate machines. I
> > first query the categories database first, get the category IDs, and
> > then add a where clause to limit the product search. That could be an
> > option. Array data type in Postgres is something that I think other
> > RDBMSs do not have. Will that work? And how about attributes? Because
> > attributes are more than a single ID. I should store the attribute key,
> > alongside its value. It's a key-value pair. What can I do for that?
>
> You seem to be going out of the way to make your life more complicated.
>
> The only way you are going to find an answer is set up test cases and
> experiment. My bet is a single server with a single database and
> multiple schemas is where you end up, after all that is where you are
> starting from.
>
>
> >
> > Thank you for sharing your time. I really appreciate it.
> > Saeed
> >
> >
> >
> >
> >
> > On Wed, Mar 5, 2025 at 3:18 PM Laurenz Albe wrote:
> >
> > On Wed, 2025-03-05 at 14:18 +0330, me nefcanto wrote:
> >  > That means a solid monolith database. We lose many goodies with
> that.
> >  > As a real-world example, right now we can import a single database
> >  > from the production to the development to test and troubleshoot
> data.
> >
> > Well, can't you import a single schema then?
> >
> >  > What if we h

Re: Question about querying distributed databases

2025-03-11 Thread me nefcanto
Dear Laurenz

> I hear you, and I agree with that.

Thank you. Such a relief.

> If that worked well, then it should also work well with PostgreSQL and
foreign data wrappers.

You're right. We had problems with cross-server queries on SQL Server and
MariaDB too; it seems that no engine really solves cross-server queries.
But we had no problem with cross-database queries. That's where it worked
well, both on SQL Server and MariaDB. It seems that for cross-database
queries, Postgres returns the entire result set from the other database to
the local database and then performs the joins locally, and that for
Postgres it makes no difference whether the foreign database is on the
same machine or on another machine. I only say so from looking at the
queries and asking questions about them; I have not run a test yet.

> Well, if you split the data into several databases, that *was* sharding.

The way I understood it, sharding is when you split the database by rows,
not by tables. The examples I have seen choose a column like tenant, user,
or date as the sharding key. I have never seen an example that stores
Orders in one database and Customers in another database and calls it
sharding. I don't know what to call that; perhaps distributed databases.

 > Consider using other, better databases than PostgreSQL (if you can find
them).

That's the point here. If we can't come up with a good design on Postgres,
then we will fall back to MariaDB. That's why we're researching and
testing. As I mentioned above, Postgres is amazing in some areas but lacks
some simple things that other engines offer out of the box.

> Perhaps you should get a consultant; the mailing list does not seem to be
the right format for that request.

We have done that over the last decade, first for SQL Server and then for
MariaDB, and we have come up with some very practical and useful designs:
separating CLOBs from the main tables, storing only UUID file names to
match the cloud storage, storing date-times as UTC, using bigint
everywhere (even for small tables) for consistency, denormalizing enum
storage (storing text instead of a numeric value) even in large tables,
and so on.

But we do have enough literacy and experience to choose a technology; it
comes down to some simple questions and answers. If I learn that FDW
behaves differently for same-server databases, then I will know that we
will migrate.

> Don't ever store arrays in the database.  It will be a nightmare.

This is a very interesting claim. May I ask you to share the problems you
ran into and your experience?


On Thu, Mar 6, 2025 at 11:34 AM Laurenz Albe 
wrote:

> On Thu, 2025-03-06 at 06:13 +0330, me nefcanto wrote:
> > I once worked with a monolithic SQL Server database with more than 10
> billion
> > records and about 8 Terabytes of data. A single backup took us more than
> 21 days.
> > It was a nightmare. Almost everybody knows that scaling up has a
> ceiling, but
> > scaling out has no boundaries.
>
> I hear you, and I agree with that.
>
>
> > We initially chose to break the database into smaller databases, because
> it
> > seemed natural for our modularized monolith architecture. And it worked
> great
> > for SQL Server. If you're small, we host them all on one server. If you
> get
> > bigger, we can put heavy databases on separate machines.
>
> So you mean that you had those databases on different servers?
> How would a cross-database query work in that case?  It must be something
> akin to foreign data in PostgreSQL.
>
> If that worked well, then it should also work well with PostgreSQL and
> foreign data wrappers.  Look at the execution plan you got on SQL Server
> and see where PostgreSQL chooses a different plan.  Then try to improve
> that.
> We can try to help if we see actual plans.
>
> > However, I don't have experience working with other types of database
> > scaling. I have used table partitioning, but I have never used sharding.
>
> Well, if you split the data into several databases, that *was* sharding.
>
> > Anyway, that's why I asked you guys. However, encouraging me to go back
> to
> > monolith without giving solutions on how to scale, is not helping. To be
> > honest, I'm somehow disappointed by how the most advanced open source
> > database does not support cross-database querying just like how SQL
> Server
> > does. But if it doesn't, it doesn't. Our team should either drop it as a
> > choice or find a way (by asking the experts who built it or use it) how
> > to design based on its features. That's why I'm asking.
>
> Excluding options from the start is limiting yourself.  Consider using
> other, better databases than PostgreSQL (if you can find them).
>
> It is difficult to come up with a concrete design based o

Please implement a catch-all error handler per row, for COPY

2025-03-01 Thread me nefcanto
Hello

Please consider these scenarios:

- I want to create a million fake products, sometimes even 100 million
(we're on MariaDB now and we plan to migrate to Postgres). My team uses
fake data for performance tests and other use cases.
- Another scenario is translations. Even in production, we have translation
files for more than 20 languages and more than 2 thousand keys. That means
we need to insert 40 thousand translation records in production.
- Another scenario is updating nested-set values for a large hierarchical
table, for example the categories table. Any time a user changes a record
in that table, we need to recalculate the nested-set model for all
categories and bulk-update the results (see the sketch after this list).
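
To make the third scenario concrete, here is roughly how the bulk update
looks; a sketch with hypothetical table and column names, not our real
schema:

-- load the recalculated hierarchy values into a staging table,
-- then update the real table in a single pass
create temp table category_tree_staging (
    category_id bigint primary key,
    lft         integer,
    rgt         integer,
    depth       integer,
    is_leaf     boolean
);

copy category_tree_staging from '/tmp/category_tree.csv' with (format csv);

update categories c
set    lft = s.lft,
       rgt = s.rgt,
       depth = s.depth,
       is_leaf = s.is_leaf
from   category_tree_staging s
where  c.category_id = s.category_id;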

All of these scenarios are such that data sanitation is difficult, if not
impossible, before doing the bulk operation (COPY).

I realized that when we specify `on_error ignore`, it only handles a
handful of errors. I thought this was a bug and sent an email to the
pgsql-bugs mailing list, but they said it's the intended behavior.

Can you please provide a row-level catch-all handler for the copy command?

Regards
Saeed


Re: Question about querying distributed databases

2025-03-05 Thread me nefcanto
Dear Laurenz,

That means a solid monolith database. We lose many goodies with that. As a
real-world example, right now we can import a single database from
production to development to test and troubleshoot data.

What if we host all databases on the same server and use FDW? What happens
in that case? Does it return 100 thousand records and join them in memory?

Because in SQL Server, when you perform a cross-database query (not
cross-server), the performance is extremely good, which suggests that it
does not return 100 thousand ItemId values from Taxonomy.ItemCategories to
join with ProductId.

Is that also the case with Postgres if the databases are located on one
server?

Regards
Saeed


Re: Question about querying distributed databases

2025-03-05 Thread me nefcanto
Laurenz Albe, thanks for your answer.

Right now this data is in MariaDB, in separate databases (schemas) but on
one server, and the solution in that situation is a cross-database query.
(This is the status quo of our application.)

Now our team has decided to migrate to Postgres. However, we realized that
Postgres does not support cross-database queries, and if we want them we
have to use FDW. So we thought that if we have to write more code anyway,
we might as well put the databases on separate servers for scalability.
That's the reason behind this question.
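
For context, this is roughly the postgres_fdw setup we are evaluating; a
minimal sketch with illustrative server, database, and credential names,
not our real configuration:

-- run on the Products database to expose the Taxonomy database
create extension if not exists postgres_fdw;

create server taxonomy_server
    foreign data wrapper postgres_fdw
    options (host 'taxonomy-host', dbname 'taxonomy', port '5432');

create user mapping for current_user
    server taxonomy_server
    options (user 'app_user', password 'app_password');

-- import the remote tables into a local schema also named taxonomy
create schema if not exists taxonomy;
import foreign schema public
    from server taxonomy_server
    into taxonomy;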

But we're stuck on performance. In SQL Server and MariaDB, cross-database
queries allow for neat separation of data while delivering good
performance in the orchestration layer. You have separate databases, which
allows for fine-grained management (different backup schedules, index
maintenance, deployment, etc.), but at the same time you can write a query
in your app, or in an orchestrator database (let's call it "All"), that is
fast enough for millions of records.

However, we're stuck on this in Postgres. What solutions exist for this
problem?

Regards
Saeed


On Wed, Mar 5, 2025 at 11:09 AM Laurenz Albe 
wrote:

> On Wed, 2025-03-05 at 10:12 +0330, me nefcanto wrote:
> > Adrian Klaver, thank you for the link. I asked the AI to create a query
> for me using FDW.
> >
> > The problem here is that it collects all of the product_id values from
> the ItemCategories table [...]
> >
> > That's not scalable. Is there a workaround for this?
>
> Without having scrutinized the case in detail: if your data are organized
> in an
> entity-attribute-value design and distributed across three databases, you
> cannot
> expect to end up with efficient queries.
>
> Perhaps you can extract the data and load them into a reasonably organized
> single database.  Such an ETL process might make the task much easier.
>
> Yours,
> Laurenz Albe
>


Re: Question about querying distributed databases

2025-03-05 Thread me nefcanto
Dear Laurenz, the point is that I think if we put all databases into one
database, then we will have blocked our future growth.
A monolith database can be scaled only vertically. We have had huge
headaches in the past with SQL Server on Windows and a single database.
But when you divide bounded contexts into different databases, you have
the chance to deploy each database on a separate physical machine. That
means a lot in terms of performance. Please correct me if I am wrong.

Let's put this physical restriction on ourselves: we have different
databases. What options do we have? One option that comes to my mind is to
store the IDs of the categories in the Products table. This means that I
don't need FDW anymore, and the databases can be on separate machines. I
query the categories database first, get the category IDs, and then add a
where clause to limit the product search. That could be an option. The
array data type in Postgres is something that I think other RDBMSs do not
have. Will that work? And how about attributes? Attributes are more than a
single ID; I have to store the attribute key alongside its value. It's a
key-value pair. What can I do for that?
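
Here is a rough, untested sketch of what I mean, with hypothetical column
names, to make the idea concrete:

-- hypothetical columns added to the Products table
alter table products
    add column category_ids bigint[] not null default '{}',
    add column attributes   jsonb    not null default '{}';

-- GIN indexes so containment searches stay fast
create index idx_products_category_ids on products using gin (category_ids);
create index idx_products_attributes   on products using gin (attributes);

-- products matching the title search, any of the selected categories,
-- and all of the selected attribute key-value pairs
select product_id
from products
where title ilike '%awesome shirt%'
  and category_ids && array[12, 34]::bigint[]
  and attributes @> '{"color": "blue", "size": "large"}'::jsonb
limit 20;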

Thank you for sharing your time. I really appreciate it.
Saeed





On Wed, Mar 5, 2025 at 3:18 PM Laurenz Albe 
wrote:

> On Wed, 2025-03-05 at 14:18 +0330, me nefcanto wrote:
> > That means a solid monolith database. We lose many goodies with that.
> > As a real-world example, right now we can import a single database
> > from the production to the development to test and troubleshoot data.
>
> Well, can't you import a single schema then?
>
> > What if we host all databases on the same server and use FDW. What
> > happens in that case? Does it return 100 thousand records and join
> > in the memory?
>
> It will do just the same thing.  The performance could be better
> because of the reduced latency.
>
> > Because in SQL Server, when you perform a cross-database query
> > (not cross-server) the performance is extremely good, proving that
> > it does not return 100 thousand ItemId from Taxonomy.ItemCategories
> > to join with ProductId.
> >
> > Is that the same case with Postgres too, If databases are located
> > on one server?
>
> No, you cannot perform cross-database queries without a foreign
> data wrapper.  I don't see a reason why the statement shouldn't
> perform as well as in SQL Server if you use schemas instead of
> databases.
>
> Yours,
> Laurenz Albe
>