alamb commented on code in PR #9497:
URL: https://github.com/apache/arrow-rs/pull/9497#discussion_r3017751948


##########
arrow-json/src/reader/mod.rs:
##########
@@ -819,7 +819,7 @@ mod tests {
     use arrow_array::cast::AsArray;
     use arrow_array::{
         Array, BooleanArray, Float64Array, GenericListViewArray, ListArray, 
OffsetSizeTrait,
-        StringArray, StringViewArray,
+        StringArray, StringViewArray, StructArray, make_array,
     };
     use arrow_buffer::{ArrowNativeType, Buffer};
     use arrow_cast::display::{ArrayFormatter, FormatOptions};

Review Comment:
   This PR has somehow broken decoding of nested REE arrays. It appears to be 
related to the changes in `arrow-json/src/reader/run_end_array.rs`:
   
   Here is a test case (I made a PR to add it): 
https://github.com/apache/arrow-rs/pull/9634
   
   ```rust
   
       #[test]
       fn test_read_nested_run_end_encoded() {
           let buf = r#"
           {"a": "x"}
           {"a": "x"}
           {"a": "y"}
           "#;
   
           // The outer REE compresses whole rows, while the inner REE 
compresses the
           // repeated string values produced by decoding those rows.
           let inner_type = DataType::RunEndEncoded(
               Arc::new(Field::new("run_ends", DataType::Int64, false)),
               Arc::new(Field::new("values", DataType::Utf8, true)),
           );
           let outer_type = DataType::RunEndEncoded(
               Arc::new(Field::new("run_ends", DataType::Int64, false)),
               Arc::new(Field::new("values", inner_type, true)),
           );
           let schema = Arc::new(Schema::new(vec![Field::new("a", outer_type, 
true)]));
           let batches = do_read(buf, 1024, false, false, schema);
           assert_eq!(batches.len(), 1);
   
           let col = batches[0].column(0);
           let outer = col.as_run::<arrow_array::types::Int64Type>();
           // Three logical rows compress to two outer runs: ["x", "x"] and 
["y"].
           assert_eq!(outer.len(), 3);
           assert_eq!(outer.run_ends().values(), &[2, 3]);
   
           let nested = 
outer.values().as_run::<arrow_array::types::Int64Type>();
           // The physical values of the outer REE are themselves a two-element 
REE.
           assert_eq!(nested.len(), 2);
           assert_eq!(nested.run_ends().values(), &[1, 2]);
   
           let nested_values = nested.values().as_string::<i32>();
           assert_eq!(nested_values.len(), 2);
           assert_eq!(nested_values.value(0), "x");
           assert_eq!(nested_values.value(1), "y");
       }
   ```
   
   
   On main:
   
   ```
   andrewlamb@Andrews-MacBook-Pro-3:~/Software/arrow-rs$ cargo test -p 
arrow-json reader::tests::test_read_nested_run_end_encoded -- --nocapture
       Blocking waiting for file lock on build directory
       Finished `test` profile [unoptimized + debuginfo] target(s) in 0.35s
        Running unittests src/lib.rs 
(target/debug/deps/arrow_json-713c99bb8504d9f8)
   
   running 1 test
   test reader::tests::test_read_nested_run_end_encoded ... ok
   
   test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 120 filtered 
out; finished in 0.00s
   
   andrewlamb@Andrews-MacBook-Pro-3:~/Software/arrow-rs
   ```
   
   
   On this branch it panics
   ```rust
   andrewlamb@Andrews-MacBook-Pro-3:~/Software/arrow-rs$ cargo test -p 
arrow-json reader::tests::test_read_nested_run_end_encoded -- --nocapture
       Finished `test` profile [unoptimized + debuginfo] target(s) in 0.07s
        Running unittests src/lib.rs 
(target/debug/deps/arrow_json-dd7fe742fe543fef)
   
   running 1 test
   
   thread 'reader::tests::test_read_nested_run_end_encoded' (11584243) panicked 
at arrow-ord/src/cmp.rs:259:18:
   internal error: entered unreachable code
   note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
   test reader::tests::test_read_nested_run_end_encoded ... FAILED
   
   failures:
   
   failures:
       reader::tests::test_read_nested_run_end_encoded
   
   test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 120 filtered 
out; finished in 0.00s
   
   error: test failed, to rerun pass `-p arrow-json --lib`
   andrewlamb@Andrews-MacBook-Pro-3:~/Software/arrow-rs$
   ```



##########
arrow-json/src/reader/list_array.rs:
##########
@@ -15,20 +15,23 @@
 // specific language governing permissions and limitations
 // under the License.
 
+use std::marker::PhantomData;
+use std::sync::Arc;
+
+use arrow_array::builder::BooleanBufferBuilder;
+use arrow_array::{ArrayRef, GenericListArray, GenericListViewArray, 
OffsetSizeTrait};
+use arrow_buffer::buffer::NullBuffer;
+use arrow_buffer::{OffsetBuffer, ScalarBuffer};
+use arrow_schema::{ArrowError, DataType, FieldRef};
+
 use crate::reader::tape::{Tape, TapeElement};
 use crate::reader::{ArrayDecoder, DecoderContext};
-use arrow_array::OffsetSizeTrait;
-use arrow_array::builder::BooleanBufferBuilder;
-use arrow_buffer::{Buffer, buffer::NullBuffer};
-use arrow_data::{ArrayData, ArrayDataBuilder};
-use arrow_schema::{ArrowError, DataType};
-use std::marker::PhantomData;
 
 pub type ListArrayDecoder<O> = ListLikeArrayDecoder<O, false>;
 pub type ListViewArrayDecoder<O> = ListLikeArrayDecoder<O, true>;
 
 pub struct ListLikeArrayDecoder<O, const IS_VIEW: bool> {
-    data_type: DataType,
+    field: FieldRef,

Review Comment:
   this is a nice change (as this now clones an Arc rather than a DataType) 👍 



##########
arrow-json/Cargo.toml:
##########
@@ -39,8 +39,9 @@ all-features = true
 arrow-array = { workspace = true }
 arrow-buffer = { workspace = true }
 arrow-cast = { workspace = true }
-arrow-data = { workspace = true }
+arrow-ord = { workspace = true }

Review Comment:
   Can you please pull these changes (I think they are related to rEE 
optimizations) into a separate PR for review separately? 



##########
arrow-json/src/reader/list_array.rs:
##########
@@ -91,34 +94,34 @@ impl<O: OffsetSizeTrait, const IS_VIEW: bool> ArrayDecoder 
for ListLikeArrayDeco
             }
 
             let offset = O::from_usize(child_pos.len()).ok_or_else(|| {
-                ArrowError::JsonError(format!("offset overflow decoding {}", 
self.data_type))
+                ArrowError::JsonError(format!("offset overflow decoding 
{}ListArray", O::PREFIX))
             })?;
             offsets.push(offset);
         }
 
-        let child_data = self.decoder.decode(tape, &child_pos)?;
+        let values = self.decoder.decode(tape, &child_pos)?;
         let nulls = nulls.as_mut().map(|x| NullBuffer::new(x.finish()));
 
-        let mut data = ArrayDataBuilder::new(self.data_type.clone())
-            .len(pos.len())
-            .nulls(nulls)
-            .child_data(vec![child_data]);
-
         if IS_VIEW {
             let mut sizes = Vec::with_capacity(offsets.len() - 1);
             for i in 1..offsets.len() {
                 sizes.push(offsets[i] - offsets[i - 1]);
             }
             offsets.pop();
-            data = data
-                .add_buffer(Buffer::from_vec(offsets))
-                .add_buffer(Buffer::from_vec(sizes));
+            let array = GenericListViewArray::<O>::try_new(

Review Comment:
   This now does full validation for LIstViewArray I think -- which is more 
expensive than the previous creation via data.build_unchecked()
   
   
   I can see two alternatives:
   1. Add a `unsafe GenericListViewArray::new_unchecked` methods
   2. Leave this code as is and create an ArrayRef directly from the ArrayData
   
   I would selfishly suggest 2 (and then a follow on PR to add `unsafe 
GenericListViewArray::new_unchecked` / rework this code) to make reviewing this 
one eaiser on my



##########
arrow-json/src/reader/run_end_array.rs:
##########
@@ -56,64 +60,39 @@ impl<R: RunEndIndexType> RunEndEncodedArrayDecoder<R> {
 }
 
 impl<R: RunEndIndexType + Send> ArrayDecoder for RunEndEncodedArrayDecoder<R> {
-    fn decode(&mut self, tape: &Tape<'_>, pos: &[u32]) -> Result<ArrayData, 
ArrowError> {
+    fn decode(&mut self, tape: &Tape<'_>, pos: &[u32]) -> Result<ArrayRef, 
ArrowError> {
         let len = pos.len();
         if len == 0 {
-            return Ok(new_empty_array(&self.data_type).to_data());
+            return Ok(new_empty_array(&self.data_type));
         }
 
-        let flat_data = self.decoder.decode(tape, pos)?;
+        let flat_array = self.decoder.decode(tape, pos)?;

Review Comment:
   as pointed out, there is something wrong with this code -- I recommend 
reverting the extra optimization from this PR and adding it as a follow on PR



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to