mpoeter opened a new issue, #44917:
URL: https://github.com/apache/arrow/issues/44917

   ### Describe the enhancement requested
   
   I am currently working on a project where we want to use Parquet for storing data, but unfortunately our initial performance tests showed that reading from Parquet files is quite slow. Profiling revealed that a lot of time is spent in (potentially unnecessary) memcpy calls. The test file is quite small (~17 MB): a single RowGroup and a number of columns (all compressed; dictionary encoding off).
   
   My first attempt was to use `ColumnReader::NextBatch`. I intentionally passed a large batch size to ensure that everything is loaded in a single batch, but this results in memory reallocations that shrink the buffers back down to the size of the loaded data (because the `shrink_to_fit` parameter of `BufferBuilder::Finish` defaults to true), memcpying the data around in the process:
   ```
   libc.so.6 ! __memmove_avx512_unaligned_erms - memmove-vec-unaligned-erms.S
   table_benchmark ! memcpy + 0x16 - string_fortified.h:29
   table_benchmark ! ReallocateAligned + 0x186 - memory_pool.cc:354
   table_benchmark ! arrow::BaseMemoryPoolImpl<arrow::(anonymous namespace)::SystemAllocator>::Reallocate + 0x2e - memory_pool.cc:485
   table_benchmark ! arrow::PoolBuffer::Resize + 0x116 - memory_pool.cc:884
   table_benchmark ! arrow::BufferBuilder::Resize + 0x26 - buffer_builder.h:81
   table_benchmark ! arrow::BufferBuilder::Finish + 0x24 - buffer_builder.h:166
   table_benchmark ! arrow::BaseBinaryBuilder<arrow::BinaryType>::FinishInternal + 0xb1 - builder_binary.h:299
   table_benchmark ! arrow::ArrayBuilder::Finish + 0x3f - builder_base.cc:363
   table_benchmark ! parquet::internal::(anonymous namespace)::ByteArrayChunkedRecordReader::GetBuilderChunks + 0x135 - column_reader.cc:2061
   table_benchmark ! parquet::arrow::(anonymous namespace)::TransferBinary + 0x125 - reader_internal.cc:555
   table_benchmark ! parquet::arrow::TransferColumnData + 0xb5 - reader_internal.cc:868
   table_benchmark ! parquet::arrow::(anonymous namespace)::LeafReader::LoadBatch + 0x143 - reader.cc:488
   table_benchmark ! parquet::arrow::ColumnReaderImpl::NextBatch + 0x3b - reader.cc:109
   ```
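   For reference, this is roughly the read path I am using; a minimal sketch, with the file name, column index, and batch size as placeholders:
   ```cpp
   #include <arrow/chunked_array.h>
   #include <arrow/io/file.h>
   #include <arrow/memory_pool.h>
   #include <parquet/arrow/reader.h>
   #include <parquet/exception.h>

   int main() {
     // Open the test file (path is a placeholder).
     std::shared_ptr<arrow::io::ReadableFile> infile;
     PARQUET_ASSIGN_OR_THROW(infile, arrow::io::ReadableFile::Open("test.parquet"));

     std::unique_ptr<parquet::arrow::FileReader> reader;
     PARQUET_THROW_NOT_OK(
         parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));

     std::unique_ptr<parquet::arrow::ColumnReader> column;
     PARQUET_THROW_NOT_OK(reader->GetColumn(0, &column));

     // Deliberately oversized batch so the whole column lands in one batch.
     // The shrink-to-fit in BufferBuilder::Finish still reallocates at the end.
     std::shared_ptr<arrow::ChunkedArray> out;
     PARQUET_THROW_NOT_OK(column->NextBatch(1 << 30, &out));
     return 0;
   }
   ```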
   
   To avoid this, I then tried `ColumnChunkReader::Read`, hoping that it would be smart enough to perform a single allocation of the appropriate size (after all, the uncompressed size of the column data is recorded in the file metadata), but unfortunately this is not the case and we end up in a similar situation:
   ```
   libc.so.6 ! __memmove_avx512_unaligned_erms - memmove-vec-unaligned-erms.S
   table_benchmark ! memcpy + 0x16 - string_fortified.h:29
   table_benchmark ! ReallocateAligned + 0x186 - memory_pool.cc:354
   table_benchmark ! arrow::BaseMemoryPoolImpl<arrow::(anonymous namespace)::SystemAllocator>::Reallocate + 0x2e - memory_pool.cc:485
   table_benchmark ! arrow::PoolBuffer::Reserve + 0xc6 - memory_pool.cc:863
   table_benchmark ! arrow::PoolBuffer::Resize + 0x75 - memory_pool.cc:889
   table_benchmark ! arrow::BufferBuilder::Resize + 0x26 - buffer_builder.h:81
   table_benchmark ! arrow::BufferBuilder::Reserve + 0x418 - buffer_builder.h:98
   table_benchmark ! arrow::TypedBufferBuilder<unsigned char, void>::Reserve - buffer_builder.h:291
   table_benchmark ! arrow::BaseBinaryBuilder<arrow::BinaryType>::ReserveData + 0x2e - builder_binary.h:333
   table_benchmark ! Prepare + 0x62 - decoder.cc:441
   table_benchmark ! DecodeArrowDense + 0x46 - decoder.cc:660
   table_benchmark ! parquet::(anonymous namespace)::PlainByteArrayDecoder::DecodeArrow + 0x10 - decoder.cc:647
   table_benchmark ! parquet::internal::(anonymous namespace)::ByteArrayChunkedRecordReader::ReadValuesSpaced + 0x4b - column_reader.cc:2076
   table_benchmark ! ReadSpacedForOptionalOrRepeated + 0xcc - column_reader.cc:1840
   table_benchmark ! ReadOptionalRecords + 0x2d - column_reader.cc:1796
   table_benchmark ! parquet::internal::(anonymous namespace)::TypedRecordReader<parquet::PhysicalType<(parquet::Type::type)6>>::ReadRecordData + 0x100 - column_reader.cc:1866
   table_benchmark ! parquet::internal::(anonymous namespace)::TypedRecordReader<parquet::PhysicalType<(parquet::Type::type)6>>::ReadRecords + 0x1f0 - column_reader.cc:1335
   table_benchmark ! parquet::arrow::(anonymous namespace)::LeafReader::LoadBatch + 0x8a - reader.cc:482
   table_benchmark ! parquet::arrow::ColumnReaderImpl::NextBatch + 0x3b - reader.cc:109
   table_benchmark ! parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadColumn + 0x113 - reader.cc:284
   table_benchmark ! parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadColumn + 0x138 - reader.cc:292
   table_benchmark ! parquet::arrow::(anonymous namespace)::ColumnChunkReaderImpl::Read + 0x7c - reader.cc:408
   ```
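   For completeness, the `ColumnChunkReader` variant looks roughly like this (again a sketch, reusing the `reader` from the snippet above). The metadata lookup is only there to illustrate that the uncompressed size of each column chunk is already known before decoding starts:
   ```cpp
   #include <parquet/metadata.h>

   // The footer records the exact uncompressed size of each column chunk,
   // so in principle a single allocation of that size should suffice.
   // (uncompressed_size is shown for illustration only.)
   std::shared_ptr<parquet::FileMetaData> metadata =
       reader->parquet_reader()->metadata();
   int64_t uncompressed_size =
       metadata->RowGroup(0)->ColumnChunk(0)->total_uncompressed_size();

   // Read the entire column chunk (row group 0, column 0) in one call.
   std::shared_ptr<arrow::ChunkedArray> chunk;
   PARQUET_THROW_NOT_OK(reader->RowGroup(0)->Column(0)->Read(&chunk));
   ```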
   
   According to my profiling results, in both cases we spend about as much time in these memcpy calls as we do in LZ4 decompression. (And on top of that, we also spend quite a bit of time in memset.)
   
   I wasn't sure whether to post this as an "Enhancement" or a "Question". After all, there might be a different way to read the column data that avoids all this reallocation, but if there is, I haven't found it.
   
   ### Component(s)
   
   C++, Parquet

