[
https://issues.apache.org/jira/browse/AVRO-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated AVRO-4240:
---------------------------------
Labels: pull-request-available (was: )
> Size DataFileWriter output buffer to fit entire block frame to reduce write
> syscalls
> ------------------------------------------------------------------------------------
>
> Key: AVRO-4240
> URL: https://issues.apache.org/jira/browse/AVRO-4240
> Project: Apache Avro
> Issue Type: Improvement
> Components: java
> Affects Versions: 1.10.2, 1.11.5, 1.12.1
> Reporter: David Mollitor
> Assignee: David Mollitor
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.13.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> DataFileStream.DataBlock#writeBlockTo writes four pieces sequentially through
> a DirectBinaryEncoder into a BufferedFileOutputStream that uses the default
> 8KB BufferedOutputStream buffer:
> # Entry count (varint-encoded long, 1-10 bytes)
> # Block size (varint-encoded long, 1-10 bytes)
> # Compressed block data (~64KB at the default sync interval)
> # Sync marker (16 bytes)
> The default sync interval was increased from 16KB to 64KB in AVRO-1398, but
> the BufferedFileOutputStream buffer size was never adjusted. Because the block
> data far exceeds the 8KB buffer, BufferedOutputStream first flushes the
> buffered entry count and block size bytes, then writes the block data
> directly to the underlying stream. The sync marker then goes into the buffer
> and is flushed again at the end. The result is at least 3 write syscalls per
> block instead of 1.
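> This change sizes the BufferedFileOutputStream buffer to maxBlockSize() + 20
> + sync.length, where the 20 bytes cover the two varint-encoded longs at up to
> 10 bytes each, so that a complete block frame fits in the buffer. The buffer
> then accumulates all four writes and flushes them once.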
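
The flush behavior described above can be reproduced outside Avro with a plain java.io.BufferedOutputStream. The following is a minimal sketch, not the Avro implementation: BlockFrameWriteCount, CountingSink, encodeLong, and writesPerBlock are hypothetical names introduced for illustration, the sink merely counts write calls in place of real syscalls, and the sketch assumes an uncompressed 64KB payload and one flush per block frame.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Arrays;

public class BlockFrameWriteCount {

    /** Hypothetical stand-in for a file stream: counts each write call that reaches it. */
    static final class CountingSink extends OutputStream {
        int writes;
        long bytes;

        @Override
        public void write(int b) {
            writes++;
            bytes++;
        }

        @Override
        public void write(byte[] b, int off, int len) {
            writes++;
            bytes += len;
        }
    }

    /** Zigzag varint encoding of a long (the Avro binary encoding for long); yields 1 to 10 bytes. */
    static byte[] encodeLong(long n) {
        long z = (n << 1) ^ (n >> 63);          // zigzag: small magnitudes become small unsigned values
        byte[] tmp = new byte[10];
        int i = 0;
        while ((z & ~0x7FL) != 0) {
            tmp[i++] = (byte) ((z & 0x7F) | 0x80); // low 7 bits plus continuation bit
            z >>>= 7;
        }
        tmp[i++] = (byte) z;
        return Arrays.copyOf(tmp, i);
    }

    /** Writes one block frame through a buffer of the given size and returns the sink write count. */
    static int writesPerBlock(int bufferSize, int blockSize) throws IOException {
        CountingSink sink = new CountingSink();
        BufferedOutputStream out = new BufferedOutputStream(sink, bufferSize);
        byte[] block = new byte[blockSize];      // placeholder for the compressed block data
        byte[] sync = new byte[16];              // placeholder for the 16-byte sync marker
        out.write(encodeLong(100));              // entry count (arbitrary value for the sketch)
        out.write(encodeLong(blockSize));        // block size
        out.write(block);
        out.write(sync);
        out.flush();                             // assumption: one flush per block frame
        return sink.writes;
    }

    public static void main(String[] args) throws IOException {
        int blockSize = 64 * 1024;
        int syncLength = 16;
        System.out.println("8KB buffer: "
            + writesPerBlock(8192, blockSize) + " writes");
        System.out.println("frame-sized buffer: "
            + writesPerBlock(blockSize + 20 + syncLength, blockSize) + " writes");
    }
}
```

With the 8KB buffer the sink should see three writes per frame (header flush, direct block write, sync marker flush), and with the buffer sized to blockSize + 20 + 16 it should see one. Real syscall counts also depend on the underlying file stream, so this only illustrates the buffering logic.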