mpoeter opened a new issue, #44917: URL: https://github.com/apache/arrow/issues/44917
### Describe the enhancement requested

I am currently working on a project where we want to use Parquet for storing data, but unfortunately our initial performance tests showed that reading from Parquet files is quite slow. Profiling showed that a lot of time is spent in (potentially unnecessary) `memcpy` calls. The test file is really small, just ~17 MB: a single row group and a number of columns (all compressed with LZ4; dictionary encoding off).

My first attempt was to use `ColumnReader::NextBatch`. I intentionally passed a large batch size to ensure that everything is loaded into a single batch, but that results in reallocations to shrink the buffers down to the loaded data (because `shrink_to_fit` in `BufferBuilder::Finish` defaults to true), thus memcpying everything around:

```
libc.so.6 ! __memmove_avx512_unaligned_erms - memmove-vec-unaligned-erms.S
table_benchmark ! memcpy + 0x16 - string_fortified.h:29
table_benchmark ! ReallocateAligned + 0x186 - memory_pool.cc:354
table_benchmark ! arrow::BaseMemoryPoolImpl<arrow::(anonymous namespace)::SystemAllocator>::Reallocate + 0x2e - memory_pool.cc:485
table_benchmark ! arrow::PoolBuffer::Resize + 0x116 - memory_pool.cc:884
table_benchmark ! arrow::BufferBuilder::Resize + 0x26 - buffer_builder.h:81
table_benchmark ! arrow::BufferBuilder::Finish + 0x24 - buffer_builder.h:166
table_benchmark ! arrow::BaseBinaryBuilder<arrow::BinaryType>::FinishInternal + 0xb1 - builder_binary.h:299
table_benchmark ! arrow::ArrayBuilder::Finish + 0x3f - builder_base.cc:363
table_benchmark ! parquet::internal::(anonymous namespace)::ByteArrayChunkedRecordReader::GetBuilderChunks + 0x135 - column_reader.cc:2061
table_benchmark ! parquet::arrow::(anonymous namespace)::TransferBinary + 0x125 - reader_internal.cc:555
table_benchmark ! parquet::arrow::TransferColumnData + 0xb5 - reader_internal.cc:868
table_benchmark ! parquet::arrow::(anonymous namespace)::LeafReader::LoadBatch + 0x143 - reader.cc:488
table_benchmark ! parquet::arrow::ColumnReaderImpl::NextBatch + 0x3b - reader.cc:109
```
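For context, here is a minimal sketch of how this first read path is exercised. This is my reconstruction rather than the actual benchmark code; the file name, column index, and batch size are placeholders:

```cpp
#include <limits>
#include <memory>
#include <string>

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/reader.h>

// Hypothetical repro of the first read path: load an entire column in a
// single batch via ColumnReader::NextBatch.
arrow::Status ReadColumnInOneBatch(const std::string& path, int column_index) {
  ARROW_ASSIGN_OR_RAISE(auto infile, arrow::io::ReadableFile::Open(path));

  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(
      infile, arrow::default_memory_pool(), &reader));

  std::unique_ptr<parquet::arrow::ColumnReader> column;
  ARROW_RETURN_NOT_OK(reader->GetColumn(column_index, &column));

  // A deliberately oversized batch so that all rows land in one chunk.
  // The builders then right-size their buffers in Finish() (whose
  // shrink_to_fit parameter defaults to true), which triggers the
  // Reallocate/memcpy visible in the stack trace above.
  std::shared_ptr<arrow::ChunkedArray> out;
  ARROW_RETURN_NOT_OK(
      column->NextBatch(std::numeric_limits<int64_t>::max(), &out));
  return arrow::Status::OK();
}
```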
To avoid this I then tried to use `ColumnChunkReader::Read`, in the hope that it is smart enough to perform a single allocation of the appropriate size (after all, the exact uncompressed size of the column data in the file is known), but unfortunately this is not the case and we end up in a similar situation (see the sketch at the end of this issue):

```
libc.so.6 ! __memmove_avx512_unaligned_erms - memmove-vec-unaligned-erms.S
table_benchmark ! memcpy + 0x16 - string_fortified.h:29
table_benchmark ! ReallocateAligned + 0x186 - memory_pool.cc:354
table_benchmark ! arrow::BaseMemoryPoolImpl<arrow::(anonymous namespace)::SystemAllocator>::Reallocate + 0x2e - memory_pool.cc:485
table_benchmark ! arrow::PoolBuffer::Reserve + 0xc6 - memory_pool.cc:863
table_benchmark ! arrow::PoolBuffer::Resize + 0x75 - memory_pool.cc:889
table_benchmark ! arrow::BufferBuilder::Resize + 0x26 - buffer_builder.h:81
table_benchmark ! arrow::BufferBuilder::Reserve + 0x418 - buffer_builder.h:98
table_benchmark ! arrow::TypedBufferBuilder<unsigned char, void>::Reserve - buffer_builder.h:291
table_benchmark ! arrow::BaseBinaryBuilder<arrow::BinaryType>::ReserveData + 0x2e - builder_binary.h:333
table_benchmark ! Prepare + 0x62 - decoder.cc:441
table_benchmark ! DecodeArrowDense + 0x46 - decoder.cc:660
table_benchmark ! parquet::(anonymous namespace)::PlainByteArrayDecoder::DecodeArrow + 0x10 - decoder.cc:647
table_benchmark ! parquet::internal::(anonymous namespace)::ByteArrayChunkedRecordReader::ReadValuesSpaced + 0x4b - column_reader.cc:2076
table_benchmark ! ReadSpacedForOptionalOrRepeated + 0xcc - column_reader.cc:1840
table_benchmark ! ReadOptionalRecords + 0x2d - column_reader.cc:1796
table_benchmark ! parquet::internal::(anonymous namespace)::TypedRecordReader<parquet::PhysicalType<(parquet::Type::type)6>>::ReadRecordData + 0x100 - column_reader.cc:1866
table_benchmark ! parquet::internal::(anonymous namespace)::TypedRecordReader<parquet::PhysicalType<(parquet::Type::type)6>>::ReadRecords + 0x1f0 - column_reader.cc:1335
table_benchmark ! parquet::arrow::(anonymous namespace)::LeafReader::LoadBatch + 0x8a - reader.cc:482
table_benchmark ! parquet::arrow::ColumnReaderImpl::NextBatch + 0x3b - reader.cc:109
table_benchmark ! parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadColumn + 0x113 - reader.cc:284
table_benchmark ! parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadColumn + 0x138 - reader.cc:292
table_benchmark ! parquet::arrow::(anonymous namespace)::ColumnChunkReaderImpl::Read + 0x7c - reader.cc:408
```

According to my profile results, in both cases we spend about as much time memcpying data around as we spend in LZ4 decompression. (And on top of that we also spend quite a bit of time in memset.)

I wasn't sure whether to post this as an "Enhancement" or a "Question". After all, there might be a different way to read the column data that avoids all this reallocation, but if there is, I didn't find it.

### Component(s)

C++, Parquet
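For completeness, a matching sketch of the second read path via `ColumnChunkReader::Read`, again my reconstruction with placeholder file name and indices:

```cpp
#include <memory>
#include <string>

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/reader.h>

// Hypothetical repro of the second read path: read one column chunk of a
// single row group via ColumnChunkReader::Read.
arrow::Status ReadSingleColumnChunk(const std::string& path, int row_group,
                                    int column_index) {
  ARROW_ASSIGN_OR_RAISE(auto infile, arrow::io::ReadableFile::Open(path));

  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(
      infile, arrow::default_memory_pool(), &reader));

  // The file metadata already records the uncompressed size of the chunk,
  // so in principle a single allocation of the right size should suffice;
  // the second stack trace above shows that this is not what happens.
  std::shared_ptr<arrow::ChunkedArray> out;
  ARROW_RETURN_NOT_OK(
      reader->RowGroup(row_group)->Column(column_index)->Read(&out));
  return arrow::Status::OK();
}
```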