alamb opened a new issue, #9637: URL: https://github.com/apache/arrow-rs/issues/9637
**Describe the bug**

- Found while testing https://github.com/apache/arrow-rs/pull/9447

Writing nested list data with `parquet::arrow::ArrowWriter` and content-defined chunking enabled can panic inside the parquet column writer with an out-of-bounds slice access.

This appears to be a regression from:
- bad: bc74c7192a48bd36a9e79b883a3482af396a2350 (feat(parquet): add content defined chunking for arrow writer (#9450))
- good: 39dda22517e6369d006aaac5eaac53d9cd72c29b

**To Reproduce**

This currently fails on main:

```shell
nice cargo bench -p parquet --bench arrow_writer -- list_primitive/cdc
```

Fails like this:

```
Benchmarking list_primitive/cdc: Warming up for 3.0000 s
thread 'main' (11848300) panicked at parquet/src/column/writer/mod.rs:720:39:
range end index 59344 out of range for slice of length 58905
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```

Here is a standalone reproducer (in `parquet/tests/arrow_writer.rs`) that codex made for me:

```rust
#[test]
fn test_arrow_writer_cdc_list_roundtrip_regression() {
    let schema = Arc::new(Schema::new(vec![
        Field::new(
            "_1",
            DataType::List(Arc::new(Field::new_list_field(DataType::Int32, true))),
            true,
        ),
        Field::new(
            "_2",
            DataType::List(Arc::new(Field::new_list_field(DataType::Boolean, true))),
            true,
        ),
        Field::new(
            "_3",
            DataType::LargeList(Arc::new(Field::new_list_field(DataType::Utf8, true))),
            true,
        ),
    ]));

    let props = WriterProperties::builder()
        .set_content_defined_chunking(Some(CdcOptions::default()))
        .build();

    let batch = create_random_batch(schema, 2, 0.25, 0.75).unwrap();

    let mut buffer = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buffer, batch.schema(), Some(props)).unwrap();
    writer.write(&batch).unwrap();
    writer.close().unwrap();

    let mut reader = ParquetRecordBatchReader::try_new(Bytes::from(buffer), 1024).unwrap();
    let read = reader.next().unwrap().unwrap();
    assert_eq!(batch, read);
}
```

Run like

```shell
cargo test -p parquet test_arrow_writer_cdc_list_roundtrip_regression
```

Results:

```
running 1 test
test test_arrow_writer_cdc_list_roundtrip_regression ... FAILED

failures:

---- test_arrow_writer_cdc_list_roundtrip_regression stdout ----

thread 'test_arrow_writer_cdc_list_roundtrip_regression' (11845398) panicked at parquet/src/column/writer/mod.rs:720:39:
range end index 1 out of range for slice of length 0
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```

**Expected behavior**

No panics, tests should pass

**Additional context**
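For readers unfamiliar with this class of panic, here is a minimal sketch of how a message like "range end index 1 out of range for slice of length 0" arises in Rust. It is an illustration only: `checked_slice` is a hypothetical helper, not code from `parquet/src/column/writer/mod.rs`, and it makes no claim about the actual root cause in the column writer.

```rust
// Hypothetical illustration only: this is not the parquet column writer,
// just the same class of out-of-bounds slice access reported above.

/// Returns `values[start..start + len]`, or `None` if that range does not
/// fit inside `values` (instead of panicking like direct indexing would).
fn checked_slice(values: &[i32], start: usize, len: usize) -> Option<&[i32]> {
    // Guard against overflow when computing the end of the range.
    let end = start.checked_add(len)?;
    // `slice::get` with a range returns `None` when the range is out of bounds.
    values.get(start..end)
}

fn main() {
    let empty: Vec<i32> = Vec::new();
    // Direct indexing, `&empty[0..1]`, would panic here with
    // "range end index 1 out of range for slice of length 0".
    assert!(checked_slice(&empty, 0, 1).is_none());

    let values = vec![10, 20, 30];
    // An in-bounds range is returned unchanged.
    assert_eq!(checked_slice(&values, 1, 2), Some(&values[1..3]));
}
```

The panics in both the benchmark and the reproducer have this shape: a slice is indexed with an end offset larger than the number of values it holds.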
