zbs opened a new issue, #45638:
URL: https://github.com/apache/arrow/issues/45638

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   I created a ParquetWriter class that initializes a schema, writer, and output stream, and writes to a Parquet file one batch at a time; however, it appears to be OOMing and I cannot figure out why. I've experimented with different chunk sizes and with calling flush after each write, but to no avail, and I can't find any documentation indicating I'm doing anything seriously wrong. Valgrind also reports no memory leaks.
   
   Note this is using Arrow 9.0.0, so it's considerably behind, but I'm hoping the solution doesn't need to take that into account too much.
   ```
   // initSchema/writeBatch/poll override a writer interface base class (not shown).
   class ParquetFileWriter {
     public:
       ParquetFileWriter(
         const std::string &                  filename,
         parquet::WriterProperties::Builder & writerPropertiesBuilder );

       void initSchema( const std::shared_ptr< arrow::Schema > & schema ) override;

       void writeBatch( const arrow::RecordBatch & batch ) override;

       void poll() override;

     private:
       std::shared_ptr< arrow::io::OutputStream >    outstream_;
       std::unique_ptr< parquet::arrow::FileWriter > parquetWriter_;
       std::shared_ptr< parquet::WriterProperties >  writerProperties_;
   };

   ParquetFileWriter::ParquetFileWriter(
     const std::string &                  filename,
     parquet::WriterProperties::Builder & writerPropertiesBuilder )
     : outstream_( arrow::io::FileOutputStream::Open( filename, false ).ValueOrDie() ) {
     writerPropertiesBuilder.compression( parquet::Compression::ZSTD );
     writerProperties_ = writerPropertiesBuilder.build();
   }

   void
   ParquetFileWriter::initSchema( const std::shared_ptr< arrow::Schema > & schema ) {
     auto status = parquet::arrow::FileWriter::Open(
       *schema, arrow::default_memory_pool(), outstream_, writerProperties_, &parquetWriter_ );
     VERIFY( status.ok(), "Failed to open Parquet file writer: %s", status.ToString().c_str() );
   }

   void
   ParquetFileWriter::writeBatch( const arrow::RecordBatch & batch ) {
     check( parquetWriter_ != nullptr, "Parquet writer is not initialized." );

     std::vector< std::shared_ptr< arrow::RecordBatch > > batches = {
       arrow::RecordBatch::Make( batch.schema(), batch.num_rows(), batch.columns() ) };
     auto table = arrow::Table::FromRecordBatches( batch.schema(), batches );
     check( table.ok(), "Table creation failed: %s", table.status().ToString().c_str() );

     const arrow::Status status = parquetWriter_->WriteTable( **table, batch.num_rows() );
     check( status.ok(), "Parquet batch write failed: %s", status.ToString().c_str() );

     const arrow::Status flushStatus = outstream_->Flush();
     check( flushStatus.ok(), "Flush failed: %s", flushStatus.ToString().c_str() );
   }
   ```
   
   In the above code, `writeBatch` might be called thousands of times.
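
   My rough mental model (an assumption on my part, not verified against the Arrow internals) is that each `WriteTable` call with `chunk_size == num_rows` closes out one complete row group, and the writer must retain per-row-group metadata until `Close()`. A plain C++ sketch of that arithmetic (no Arrow involved; the batch counts are hypothetical):

   ```cpp
   #include <cstdint>
   #include <iostream>

   int main() {
       // Hypothetical numbers, just to illustrate scale: writeBatch() is
       // called thousands of times, each call emitting one row group.
       const std::int64_t num_batches    = 10000;
       const std::int64_t rows_per_batch = 1024;

       // One row group per WriteTable call; per-row-group metadata
       // (statistics, column chunk offsets, etc.) accumulates in memory
       // until the file is closed.
       const std::int64_t row_groups = num_batches;
       const std::int64_t total_rows = num_batches * rows_per_batch;

       std::cout << row_groups << " row groups, "
                 << total_rows << " rows total\n";
       return 0;
   }
   ```

   If that model is right, many tiny row groups would mean a steadily growing metadata footprint even though each batch itself is small.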
   
   ### Component(s)
   
   C++

