mroz45 opened a new issue, #45338:
URL: https://github.com/apache/arrow/issues/45338

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   When attempting to write a dataset that includes an empty RecordBatch, an 
assertion failure occurs in arrow::internal::Gather. The error message 
indicates that the assertion src && idx && out failed.
   
   /arrow/cpp/src/arrow/compute/kernels/gather_internal.h:221: 
arrow::internal::Gather<kValueWidthInBits, IndexCType, 
kWithFactor>::Gather(int64_t, const uint8_t*, int64_t, int64_t, const 
IndexCType*, uint8_t*, int64_t) [with int kValueWidthInBits = 32; IndexCType = 
unsigned int; bool kWithFactor = false; int64_t = long int; uint8_t = unsigned 
char]: Assertion `src && idx && out' failed.
   ~/arrowM/arrow/cpp/build_t/src/arrow/dataset
   
   I have no idea what the correct approach to resolving this problem should be. 
Maybe the reader should not read empty RecordBatches?
   I created a test reproducing this behavior. 
   
   ```cpp
   TEST(TestDatasetEmptyRecordBatch, EmptyRecordBatch) {
     const auto dataset_schema = ::arrow::schema({
         field("a", int32()),
         field("b", boolean()),
         field("c", int32()),
     });
   
     const auto physical_schema = SchemaFromColumnNames(dataset_schema, {"a", 
"b"});
   
     // Create a mock filesystem and populate it with data, including both 
non-empty and
     // empty record batches.
     auto mock_fs = std::make_shared<fs::internal::MockFileSystem>(fs::kNoTime);
     {  // recreate pre PR-39995 dataset 
https://github.com/apache/arrow/pull/39995
       ASSERT_OK(mock_fs->CreateDir("/my_dataset"));
       ASSERT_OK(mock_fs->CreateDir("/my_dataset/c=100"));
       ASSERT_OK_AND_ASSIGN(auto os, 
mock_fs->OpenOutputStream("/my_dataset/c=100/0.arrow"));
       ASSERT_OK_AND_ASSIGN(auto recordBatchWriter,
                            arrow::ipc::MakeFileWriter(os.get(), 
physical_schema));
   
       auto rb = RecordBatchFromJSON(physical_schema, R"([{"a": 1,    "b": 
null},
                                                  {"a": 2,    "b": true}])");
   
       ASSERT_OK(recordBatchWriter->WriteRecordBatch(*rb));
       ASSERT_OK_AND_ASSIGN(auto empty, 
arrow::RecordBatch::MakeEmpty(physical_schema));
       ASSERT_OK(recordBatchWriter->WriteRecordBatch(*empty));
       ASSERT_OK(recordBatchWriter->Close());
       ASSERT_OK(os->Close());
     }
     auto partitioning = std::make_shared<HivePartitioning>(
         schema({field("c", int32(), /*nullable=*/false)}));
     auto ipc_format = std::make_shared<dataset::IpcFileFormat>();
     auto options = std::make_shared<ScanOptions>();
   
     FileSystemFactoryOptions FSoptions;
     FSoptions.partitioning = partitioning;
   
     arrow::fs::FileSelector selector;
     selector.base_dir = "/my_dataset";
     selector.recursive = true;
   
     ASSERT_OK_AND_ASSIGN(auto factory, FileSystemDatasetFactory::Make(
                                            mock_fs, selector, ipc_format, 
FSoptions));
   
     ASSERT_OK_AND_ASSIGN(auto dataset, factory->Finish());
   
     dataset::FileSystemDatasetWriteOptions fs_write_options_;
     fs_write_options_.filesystem = mock_fs;
     fs_write_options_.partitioning = partitioning;
     fs_write_options_.base_dir = "/my_dataset2";
     fs_write_options_.basename_template = "{i}.arrow";
     fs_write_options_.file_write_options = ipc_format->DefaultWriteOptions();
     auto plan = acero::Declaration::Sequence({
         {"scan", ScanNodeOptions{dataset, options}},
         {"write", dataset::WriteNodeOptions{fs_write_options_}},
     });
     ASSERT_OK(DeclarationToStatus(plan));
   }
   ```
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to