On Thu, Jul 10, 2025 at 9:03 PM Merlin Moncure <mmonc...@gmail.com> wrote:
> On Thu, Jul 10, 2025 at 5:48 AM Dominique Devienne <ddevie...@gmail.com> wrote:
>> We store scientific information in PostgreSQL, and some of that is
>> bytea and large, thus we must "chunk it" both for performance, and not
>> be limited to 1GB (we do exceed that, on rare occasions).
>
> Does it have to match the md5 of the 'unchunked' variant exactly?
Well, yeah, ideally.

> If not, maybe you can rig a custom aggregate that would just "hash amend"

Sure, there are various ways to get a deterministic hash of such a
chunked-across-rows (logical) value. But they all involve hashing the
hashes, which statistically might not be ideal. More importantly, it
departs from the normal "serial" hashing of the full logical value.
With the full hash, I can use various FS/OS tools, SQLite aggregate
UDFs, and (TBD) PostgreSQL aggregate digests to compare those hashes in
natural and consistent ways. Working around the lack of aggregate
digests in PostgreSQL forces me to replicate that workaround at the FS
and SQLite levels too, just for the comparisons. I.e. not good.

> I had to do something very similar with compression, I had a need to be able
> to compress bytea values with lz4 at SQL level due to limited support for
> extensions in the cloud. it works great...if a bit slow, and requires the
> database to handle the decompression.

The slow part is what I want to avoid. I explored various ways to work
around the lack of true server-side aggregate hashing, and none is
satisfactory or performant enough. This is really the kind of primitive
that must be built in to be efficient enough to be "useful".

(And I'd argue compression, deduplication, etc. are similar issues.
BLOBs / BYTEAs are too often "limited" in RDBMSs, with people saying to
store them outside the DB, which I always find strange, and a bit of a
cop-out to avoid making them performant and/or full-featured enough. A
bit of a rant here, sorry ;))
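
For concreteness, the kind of workarounds I mean look roughly like the
untested sketch below, assuming pgcrypto and a hypothetical
chunks(blob_id, chunk_no, data bytea) layout (names made up for the
example):

  -- (1) Hash-of-hashes: deterministic, but does NOT match the digest of
  --     the unchunked value, so it can't be compared to hashes computed
  --     by FS tools or SQLite on the same logical content.
  SELECT blob_id,
         digest(string_agg(digest(data, 'sha256'), ''::bytea
                           ORDER BY chunk_no),
                'sha256') AS chunked_digest
    FROM chunks
   GROUP BY blob_id;

  -- (2) Re-concatenate and hash: matches the unchunked digest, but
  --     re-materializes the whole value server-side and is subject to
  --     the 1GB bytea limit, which defeats the point of chunking.
  SELECT blob_id,
         digest(string_agg(data, ''::bytea ORDER BY chunk_no),
                'sha256') AS full_digest
    FROM chunks
   GROUP BY blob_id;

Neither gives what a built-in, incrementally-fed aggregate digest would:
the digest of the full logical value, computed serially, without ever
assembling it in one piece.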