alamb opened a new issue, #9637:
URL: https://github.com/apache/arrow-rs/issues/9637

   **Describe the bug**
   - Found while testing https://github.com/apache/arrow-rs/pull/9447 
   
   Writing nested list data with parquet::arrow::ArrowWriter and content-defined chunking enabled can panic inside the parquet column writer with an out-of-bounds slice access.
   
   This appears to be a regression from:
   
   - bad: bc74c7192a48bd36a9e79b883a3482af396a2350 (feat(parquet): add content defined chunking for arrow writer (#9450))
   - good: 39dda22517e6369d006aaac5eaac53d9cd72c29b
    
   
   
   **To Reproduce**
   This currently fails on main:
   ```shell
   nice cargo bench -p parquet --bench arrow_writer -- list_primitive/cdc
   ```
   
   Fails like this:
   ```
   Benchmarking list_primitive/cdc: Warming up for 3.0000 s
   thread 'main' (11848300) panicked at parquet/src/column/writer/mod.rs:720:39:
   range end index 59344 out of range for slice of length 58905
   note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
   ```
   
   Here is a standalone reproducer (in parquet/tests/arrow_writer.rs) that Codex made for me:
   
   ```rust
   #[test]
   fn test_arrow_writer_cdc_list_roundtrip_regression() {
       let schema = Arc::new(Schema::new(vec![
           Field::new(
               "_1",
               DataType::List(Arc::new(Field::new_list_field(DataType::Int32, true))),
               true,
           ),
           Field::new(
               "_2",
               DataType::List(Arc::new(Field::new_list_field(DataType::Boolean, true))),
               true,
           ),
           Field::new(
               "_3",
               DataType::LargeList(Arc::new(Field::new_list_field(DataType::Utf8, true))),
               true,
           ),
       ]));
       let props = WriterProperties::builder()
           .set_content_defined_chunking(Some(CdcOptions::default()))
           .build();
       let batch = create_random_batch(schema, 2, 0.25, 0.75).unwrap();
   
       let mut buffer = Vec::new();
       let mut writer = ArrowWriter::try_new(&mut buffer, batch.schema(), Some(props)).unwrap();
       writer.write(&batch).unwrap();
       writer.close().unwrap();
   
       let mut reader = ParquetRecordBatchReader::try_new(Bytes::from(buffer), 1024).unwrap();
       let read = reader.next().unwrap().unwrap();
       assert_eq!(batch, read);
   }
   ```
   
   Run like
   ```shell
   cargo test -p parquet test_arrow_writer_cdc_list_roundtrip_regression
   ```
   
   Results:
   ```
   running 1 test
   test test_arrow_writer_cdc_list_roundtrip_regression ... FAILED
   
   failures:
   
   ---- test_arrow_writer_cdc_list_roundtrip_regression stdout ----
   
   thread 'test_arrow_writer_cdc_list_roundtrip_regression' (11845398) panicked at parquet/src/column/writer/mod.rs:720:39:
   range end index 1 out of range for slice of length 0
   note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
   ```
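   
   For anyone unfamiliar with this message: it is the standard panic Rust raises when a range's end exceeds the length of the slice being indexed. The sketch below only illustrates that failure class and is not the column writer's code or a claim about the root cause; `checked_chunk`, `levels`, `chunk_start`, and `chunk_len` are hypothetical names. It assumes a caller that hands over an offset and length computed against more values than the buffer actually holds, and shows how `slice::get` surfaces the mismatch as an error instead of a panic:
   
   ```rust
   /// Hypothetical helper (not part of the parquet crate): return `chunk_len`
   /// values starting at `chunk_start`, or an error if the range does not fit.
   fn checked_chunk(levels: &[i16], chunk_start: usize, chunk_len: usize) -> Result<&[i16], String> {
       // Guard against overflow when computing the exclusive end of the range.
       let end = chunk_start
           .checked_add(chunk_len)
           .ok_or_else(|| "chunk range overflows usize".to_string())?;
       // `get` returns `None` where `&levels[chunk_start..end]` would panic.
       levels.get(chunk_start..end).ok_or_else(|| {
           format!(
               "range end index {end} out of range for slice of length {}",
               levels.len()
           )
       })
   }
   
   #[test]
   fn checked_chunk_reports_out_of_range() {
       // Same shape as the reproducer's failure: one value requested from an empty buffer.
       let empty: Vec<i16> = Vec::new();
       assert_eq!(
           checked_chunk(&empty, 0, 1).unwrap_err(),
           "range end index 1 out of range for slice of length 0"
       );
   
       // An in-bounds request returns the expected sub-slice.
       let levels = vec![0i16, 1, 1, 0];
       assert_eq!(checked_chunk(&levels, 1, 2).unwrap(), &[1i16, 1][..]);
   }
   ```
   
   A check like this would only turn the panic into an error; the offsets passed to the writer would still need to be corrected.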
   
   **Expected behavior**
   The writer should not panic, and both the benchmark and the test above should pass.
   
   **Additional context**

