kszucs opened a new issue, #45750:
URL: https://github.com/apache/arrow/issues/45750

   ### Describe the enhancement requested
   
   ## Rationale
   
   Unlike the traditional approach, where a page is closed once its size
   reaches the default limit (typically 1 MB), this implementation splits pages
   at the boundaries identified by a content-defined chunker (CDC). Consequently,
   the resulting chunks have variable sizes but are more resilient to data
   modifications such as updates, inserts, and deletes. This makes data storage
   and retrieval with Apache Parquet more robust and efficient, because identical
   data segments are consistently chunked in the same way regardless of their
   position within the dataset.
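   
   To illustrate the idea, here is a minimal sketch of a gear-hash based
   content-defined chunker over raw bytes. The names, parameters, and table
   seeding are illustrative assumptions and not taken from the actual
   Arrow/Parquet implementation, which may define boundaries differently (for
   example over logical column values rather than raw bytes):
   
   ```cpp
   #include <array>
   #include <cstddef>
   #include <cstdint>
   #include <vector>
   
   // Fixed pseudo-random per-byte values for the rolling "gear" hash,
   // generated here with splitmix64 so the table is deterministic.
   static std::array<uint64_t, 256> MakeGearTable() {
     std::array<uint64_t, 256> table{};
     uint64_t state = 0;
     for (auto& entry : table) {
       state += 0x9E3779B97F4A7C15ULL;
       uint64_t z = state;
       z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
       z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
       entry = z ^ (z >> 31);
     }
     return table;
   }
   
   // Returns the offsets at which chunks end. A boundary is declared whenever
   // the low bits of the rolling hash are all zero, so identical byte runs
   // produce identical cut points regardless of where they sit in the stream.
   std::vector<size_t> FindChunkBoundaries(const uint8_t* data, size_t size,
                                           uint64_t mask = (1ULL << 20) - 1) {
     static const auto kGearTable = MakeGearTable();
     std::vector<size_t> boundaries;
     uint64_t hash = 0;
     for (size_t i = 0; i < size; ++i) {
       hash = (hash << 1) + kGearTable[data[i]];
       if ((hash & mask) == 0) {  // expected chunk size is roughly mask + 1 bytes
         boundaries.push_back(i + 1);
         hash = 0;
       }
     }
     boundaries.push_back(size);  // final partial chunk
     return boundaries;
   }
   ```
   
   When records are inserted, only the chunks overlapping the modified region
   change; cut points further away re-synchronize because they depend only on
   the local byte content, not on absolute offsets.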
   
   ## Parquet Deduplication
   
   The space savings can be significant; below are test results generated on
   test data containing a series of snapshots of a database:
   
   | Title      | Total Size | Chunk Size | Compressed Chunk Size | Dedup Ratio | Compressed Dedup Ratio | Transmitted XTool Bytes |
   |------------|-----------:|-----------:|----------------------:|------------:|-----------------------:|------------------------:|
   | JSONLines  |   93.0 GiB |   64.9 GiB |              12.4 GiB |         70% |                    13% |                13.5 GiB |
   | Parquet    |   16.2 GiB |   15.0 GiB |              13.4 GiB |         93% |                    83% |                13.5 GiB |
   | CDC ZSTD   |    8.8 GiB |    5.6 GiB |               5.6 GiB |         64% |                    64% |                 6.1 GiB |
   | CDC Snappy |   16.2 GiB |    8.6 GiB |               8.1 GiB |         53% |                    50% |                 9.4 GiB |
   
   The results are calculated by simulating a content-addressable storage system
   like [Hugging Face Hub](https://xethub.com/blog/from-files-to-chunks-improving-hf-storage-efficiency)
   or [restic](https://restic.net/).
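   
   A hypothetical sketch of how such a simulation can be set up (this is not the
   actual tooling used to produce the numbers above): each content-defined chunk
   is keyed by a strong digest, and only chunks whose digest is not yet in the
   store contribute new bytes, which gives one way to measure deduplication.
   
   ```cpp
   #include <cstddef>
   #include <string>
   #include <unordered_set>
   #include <vector>
   
   struct Chunk {
     std::string digest;  // e.g. SHA-256 of the chunk's bytes
     size_t length;       // chunk size in bytes
   };
   
   // Inserts the chunks of a new file version into the store and returns the
   // fraction of its bytes that were already present (i.e. deduplicated away).
   double DedupRatio(const std::vector<Chunk>& chunks,
                     std::unordered_set<std::string>& store) {
     size_t total = 0, reused = 0;
     for (const auto& chunk : chunks) {
       total += chunk.length;
       if (!store.insert(chunk.digest).second) {
         reused += chunk.length;  // already stored, no new bytes to transmit
       }
     }
     return total == 0 ? 0.0 : static_cast<double>(reused) / total;
   }
   ```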
   
   
   
![Image](https://github.com/user-attachments/assets/17de78c1-66aa-4a73-ad6a-6fffc4ef4d4c)
   
   ## Example of inserting records into a Parquet file
   
   The following heatmaps show the common byte blocks of a Parquet file before
   and after inserting some records. The green parts are common, whereas the red
   parts are different and must therefore be stored twice. With CDC chunking,
   CAS systems can achieve much higher deduplication ratios.
   
   <img width="856" alt="Image" src="https://github.com/user-attachments/assets/d43fc774-6cdb-447c-bd82-3fade8a3911f" />
   
   
   
   
   
   ### Component(s)
   
   C++, Parquet

