scovich commented on code in PR #9598:
URL: https://github.com/apache/arrow-rs/pull/9598#discussion_r3016563038
##########
parquet-variant-compute/src/variant_get.rs:
##########
@@ -86,15 +88,14 @@ pub(crate) fn follow_shredded_path_element<'a>(
return Ok(missing_path_step());
};
- let struct_array = field.as_struct_opt().ok_or_else(|| {
- // TODO: Should we blow up? Or just end the traversal and let
the normal
- // variant pathing code sort out the mess that it must anyway
be
- // prepared to handle?
- ArrowError::InvalidArgumentError(format!(
- "Expected Struct array while following path, got {}",
- field.data_type(),
- ))
- })?;
+ // The field might be a VariantArray (StructArray) if shredded,
+ // or it might be a primitive array. Only proceed if it's a
StructArray.
+ let Some(struct_array) = field.as_struct_opt() else {
+ // Field exists but is not a StructArray, so it cannot be
+ // followed further. Fall back to the value column if present,
+ // otherwise the path is missing.
Review Comment:
I don't understand this comment. What value column fallback does it refer
to? The only caller of this `follow_shredded_path_element` function simply
creates an all-null array if we return `missing_path_step()`.
##########
parquet-variant-compute/src/variant_get.rs:
##########
@@ -223,26 +224,61 @@ fn shredded_get_path(
// For shredded/partially-shredded targets (`typed_value` present),
recurse into each field
// separately to take advantage of deeper shredding in child fields.
if let DataType::Struct(fields) = as_field.data_type() {
- if target.typed_value_field().is_none() {
+ let has_variant_fields = fields
+ .iter()
+ .any(|f| f.try_extension_type::<VariantType>().is_ok());
+ if target.typed_value_field().is_none() && !has_variant_fields {
return shred_basic_variant(target, VariantPath::default(),
Some(as_field));
}
- let children = fields
- .iter()
- .map(|field| {
- shredded_get_path(
- &target,
- &[VariantPathElement::from(field.name().as_str())],
- Some(field),
- cast_options,
- )
- })
- .collect::<Result<Vec<_>>>()?;
+ let mut updated_fields = Vec::with_capacity(fields.len());
+ let mut children = Vec::with_capacity(fields.len());
+ for field in fields.iter() {
+ // If the field has VariantType extension metadata, extract it as a
+ // VariantArray instead of casting to the declared data type. This
allows
+ // callers to request structs where some fields remain as variants.
+ // See test_struct_extraction_with_variant_fields for usage
example.
+ let is_variant_field =
field.try_extension_type::<VariantType>().is_ok();
+ let field_as_type = (!is_variant_field).then(|| field.as_ref());
+ let child = shredded_get_path(
+ &target,
+ &[VariantPathElement::from(field.name().as_str())],
+ field_as_type,
+ cast_options,
+ )?;
+
+ // When the field is entirely absent in the data, shredded_get_path
+ // returns a NullArray. For variant fields, construct an all-null
+ // VariantArray so the extension metadata is preserved.
+ let child = if is_variant_field && child.data_type() ==
&DataType::Null {
+ let mut builder = VariantArrayBuilder::new(child.len());
+ for _ in 0..child.len() {
+ builder.append_null();
+ }
+ let null_variant = builder.build();
+ Arc::new(null_variant.into_inner()) as ArrayRef
+ } else {
+ child
+ };
+
+ // Update field data type to match the actual child array.
+ // Preserve VariantType extension metadata for variant fields so
+ // downstream consumers can recognize them as Variant columns.
+ let mut new_field = field
+ .as_ref()
+ .clone()
+ .with_data_type(child.data_type().clone());
Review Comment:
Rescuing https://github.com/apache/arrow-rs/pull/9598#discussion_r2969786655
from github oblivion:
> > Why needed? Doesn't it already have that type?
>
> For example, a VariantArray's internal StructArray has concrete fields
like `metadata` and `value`, not empty fields. So we need `with_data_type` to
align the field's declared type with the actual child array's type, otherwise
`StructArray::try_new` will reject the mismatch.
I think there are two different scenarios here:
1. Caller requested a strongly typed struct. Then the input schema and child
data type already match. Even for missing columns, the shredding code returns a
properly typed array full of nulls, rather than a `NullArray`. No variant
extension type attached.
2. Caller requested a (possibly shredded) variant. Then the input schema and
child data type may _not_ match and we have to re-apply it and re-attach the
variant extension type.
Right now, the code treats everything as case 2/, which is extra work but
otherwise (mostly?) harmless.
Re "mostly harmless": How do we handle recursive variant shredding? For
example, if I construct a shredding schema corresponding to `x.y.z::INT`, the
resulting "happy path" is `x.typed_value.y.typed_value.z.typed_value: INT`. If
I then pass that schema to `variant_get`, I have specifically requested that
specific shredding schema, even if that means shredding or
unshred-then-reshredding the underlying data whose own shredding schema doesn't
match. But only the top-level column `x` has the variant extension type. It is
incorrect for `x.typed_value.y` to be annotated as variant. Perhaps we still
pass that annotation in order to guide the recursion, but the schema that comes
back to the user afterward must not include it.
I'm not sure if we're handling these cases correctly right now?
##########
parquet-variant-compute/src/variant_get.rs:
##########
@@ -86,15 +88,14 @@ pub(crate) fn follow_shredded_path_element<'a>(
return Ok(missing_path_step());
};
- let struct_array = field.as_struct_opt().ok_or_else(|| {
- // TODO: Should we blow up? Or just end the traversal and let
the normal
- // variant pathing code sort out the mess that it must anyway
be
- // prepared to handle?
- ArrowError::InvalidArgumentError(format!(
- "Expected Struct array while following path, got {}",
- field.data_type(),
- ))
- })?;
+ // The field might be a VariantArray (StructArray) if shredded,
+ // or it might be a primitive array. Only proceed if it's a
StructArray.
+ let Some(struct_array) = field.as_struct_opt() else {
+ // Field exists but is not a StructArray, so it cannot be
+ // followed further. Fall back to the value column if present,
+ // otherwise the path is missing.
Review Comment:
Update: I _think_ I finally figured it out... this code is attempting to
traverse through a shredded variant object. Say we're following the path
`x.y.z` and the current path step is `y`:
1. (L72) The `typed_value` we start with should ideally be a struct. If it's
any other type, then `x` was not shredded as struct. Two sub-cases:
1. `x.value` is present. Some entries may be `Variant::Object`. In fact,
it's possible that _every_ entry is `Variant::Object` (or `Variant::Null`),
e.g. if the data match the query schema but the wrong shredding schema was
applied when writing it out. So it's actually incorrect to blow up here,
regardless of JSONpath semantics, because we didn't prove there's a problem yet
(have to row-shred `x.value` and see what it contains).
2. `x.value` column is just plain missing. This means the column shredded
perfectly as some other type, and we know there are no nulls and no structs.
This is currently treated as an error case, but we should really just return
`missing_path_step()` instead (https://github.com/apache/arrow-rs/issues/9606).
3. (L86) `x.typed_value` was a struct as expected, but `x.typed_value.y`
doesn't exist => `missing_path_step()`
4. (L91) The `x.typed_value.y` exists, but it's not a struct. This is a
malformed data scenario because we're in a shredded variant object and the spec
mandates that _every_ field must be a struct (containing `value` and/or
`typed_value` fields). So the original code was correct to error out upon
discovering the problem, regardless of JSONpath semantics (it just failed to
explain why it was correct).
5. (L101) `x.typed_value.y` is a struct, so we attempt to return a shredding
state for its `value` and/or `typed_value` fields.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]