Hi Islamiat,

This is not an area where I have domain knowledge, but a few overall points:
1. It sounds like this will be an ongoing pipeline. I'd suggest you look into
a table format to help manage changes (e.g. Delta Lake, Iceberg, Hudi or
DuckLake).

2. Be careful not to over-partition: this will hurt performance by creating a
lot of small Parquet files.

3. It isn't entirely clear from the description, but actual data sizes in the
10,000s of rows are not a big deal for Parquet to handle (it starts to push
the limits in terms of the number of columns, and we are working on fixing
issues there).

Hope this helps.

-Micah

On Mon, Feb 23, 2026 at 12:18 AM Islamiat Shittu <[email protected]> wrote:

> Hello all,
>
> I am unsure of where to post my questions, so I apologise for the long
> message; if anyone can point me to the right forum, that would be great.
> My questions are largely about schema, storage, and partitioning for
> Parquet files and DuckDB.
>
> I am building an institute-wide omics data visualisation platform (similar
> in spirit to CellxGene) but designed to support independent research
> groups, and am considering a DuckLake-like approach.
>
> Core stack:
>
> * Parquet for long-term immutable data storage
> * DuckDB as the query engine; also used to store project metadata
> * Arrow interop for zero-copy where possible (to be fed into R Shiny)
>
> Datasets are contributed by many groups in mixed formats (e.g. .csv, .rds,
> .h5ad etc.), but I have a few constraints:
>
> * inconsistent schemas
> * large datasets (1,000s – 10,000s of genes, cells)
> * interactive queries that require filtering, sorting, subsetting etc.
>   without loading full data into memory
>
> I plan on developing a conversion-to-Parquet/validation pipeline and have
> considered the following tables:
>
> * genes.parquet            # layered w/ proteins.parquet?
> * counts.parquet           # counts matrices
> * expression.parquet       # analysis (DEG) results
> * cells.parquet            # chunked (e.g. cells 0–4999)
> * embeddings.parquet       # multilayered with PCA, UMAP, t-SNE
> * dataset_metadata.parquet # unstructured, string format, or store in DuckDB?
> * qc.parquet               # data summaries and aggregates?
> * gene_sets.parquet        # optional?
> * QUERY_HASH.parquet       # query results cache?
>
> Questions:
>
> 1. In this context, do users typically normalise data towards a specific
>    structure, or allow arbitrary schemas and then map via DuckDB views or
>    lookup configs/semantic mapping layers?
>
> 2. Since I will be aggregating/precomputing data (created during data
>    ingestion, e.g. QC summaries), is it best to store them as Parquet or
>    as database tables for Shiny to access?
>
> 3. Is it also best to store gene sets as cached queries, within the DB, or
>    as separate Parquet?
>
> 4. Given Parquet's columnar nature:
>
>    * Is it generally better to model expression matrices in long format
>      rather than wide format?
>    * Are there recommended hybrid approaches (e.g. chunked blocks per gene
>      or per cell)?
>    * How are multiple layers of the same data typically stored? (i.e.
>      separate files, columns or tables)
>
> Given the interactive workloads, I am considering DuckDB's hive-style
> partitioning to improve performance by partitioning the top-level folder
> to encode query filters for dataset_id, modality/omic_type, contrast and
> organism. Does this make logical sense?
> For example:
>
> ```
> …/{DATASET_ID}/parquet_files/
> └── organism=human/
>     └── modality=transcriptomics/
>         └── dataset_id=dataset_001/
>             └── contrast=treated_vs_control/
>                 ├── results/
>                 │   ├── expression.parquet
>                 │   ├── genes.parquet
> ```
>
> Many thanks,
>
> Dammy Shittu
>
> Data Research Assistant
> Core Informatics Team @ UK Dementia Research Institute (UK DRI)
> ---------------------------------------------------------------------
> e: [email protected] | [email protected]
> a: UK DRI, UCL Queen Square Institute of Neurology, Queen Square, London
>    WC1N 3BG
> w: https://www.ukdri.ac.uk/centres/ucl
>
> As I work flexible hours, I do not expect replies outside of your own
> normal working hours.
