Hello, I am trying to determine the suitability of Postgres for a significant climate risk modelling project.
We are batch processing a large (500 million) collection of geographical points. For each point, we store ~6 dimensions of various risks (total cardinality of several millions of floats per geographical point). We need to perform various ad-hoc aggregations on geographical subsets of the values associated with these points. These aggregations could require median/percentiles, so they won't be as simple as mean/sum, and we expect we may have to write custom aggregations for some cases. Because we may want to run computations that would use PostGIS features (certainly polygon containment; potentially others), and because our existing applications already use Postgres, we have some degree of preference to do this in Postgres. I'd like to know if anyone here has successfully built a system to run this sort of computation at this scale in Postgres. If so, what sort of schema design did you use? Columnar stores referencing spatially indexed row stores that contain the spatial references, sharded by geographical region? What sort of throughput did you achieve? I'm also interested in any general observations folks may have about this project. Perhaps we should use Clickhouse (for the main data) together with Postgres (for the GIS computations)? Perhaps our float dataset should live outside any kind of oltp/olap database at all? Something else? And finally, if you have developed a system like this, are you available to assist us with building this system on a consulting basis? Thanks in advance, —Sohum
