[ 
https://issues.apache.org/jira/browse/AVRO-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated AVRO-4240:
---------------------------------
    Labels: pull-request-available  (was: )

> Size DataFileWriter output buffer to fit entire block frame to reduce write 
> syscalls
> ------------------------------------------------------------------------------------
>
>                 Key: AVRO-4240
>                 URL: https://issues.apache.org/jira/browse/AVRO-4240
>             Project: Apache Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.10.2, 1.11.5, 1.12.1
>            Reporter: David Mollitor
>            Assignee: David Mollitor
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.13.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> DataFileStream.DataBlock#writeBlockTo writes four pieces sequentially through 
> a DirectBinaryEncoder into a BufferedFileOutputStream that uses the default 
> 8KB BufferedOutputStream buffer:
>  # Entry count (varint-encoded long, 1-10 bytes)
>  # Block size (varint-encoded long, 1-10 bytes)
>  # Compressed block data (~64KB at the default sync interval)
>  # Sync marker (16 bytes)
> The default sync interval was increased from 16KB to 64KB in AVRO-1398, but 
> the BufferedFileOutputStream buffer size was never adjusted. Since the block 
> data far exceeds the 8KB buffer, BufferedOutputStream first flushes the 
> buffered entry count and block size bytes, then writes the block data 
> directly to the underlying stream. The sync marker then goes into the buffer 
> and is flushed again at the end, resulting in at least 3 write syscalls per 
> block instead of 1.
> This change sizes the BufferedFileOutputStream buffer to maxBlockSize() + 20 
> + sync.length, where the 20 bytes cover the two varint-encoded longs at up to 
> 10 bytes each. A complete block frame then fits in the buffer, all four 
> writes accumulate there, and the frame is flushed once.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
