fiktor opened a new issue, #41388:
URL: https://github.com/apache/arrow/issues/41388

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   ## Example 1
   ```
   import numpy as np
   import pyarrow as pa
   np_array = np.array([b'\x00Very important data. Keep it safe!', b'Less important data. Keep it safe??'])
   table = pa.Table.from_pydict({'sample': np_array})
   print(table['sample'][0].as_py())
   ```
   Expected output: `b'\x00Very important data. Keep it safe!'`. Actual output: `b''`. Note: `type(table['sample'])` in this example is `pyarrow.lib.ChunkedArray`, and `table['sample'][1]` is `b'Less important data. Keep it safe??'`, as expected. The byte arrays in this example look like human-readable strings (except for the leading null byte), but in reality we are trying to encode binary data as fixed-length byte arrays and store it in a variable-length byte array Parquet column (variable-length for compatibility with HuggingFace).
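
As a quick sanity check on the NumPy side, the sketch below (plain NumPy, no pyarrow involved) shows that the array from Example 1 has a fixed-width bytes dtype and that the leading null byte is still present there. This suggests the data is lost during conversion to Arrow rather than in NumPy.

```python
import numpy as np

np_array = np.array([
    b'\x00Very important data. Keep it safe!',
    b'Less important data. Keep it safe??',
])

# Both elements are 35 bytes long, so NumPy stores them with a
# fixed-width bytes dtype (kind 'S', itemsize 35).
print(np_array.dtype.kind, np_array.itemsize)  # S 35

# NumPy itself keeps the leading null byte of the first element.
first = bytes(np_array[0])
print(first[:5])  # b'\x00Very'
```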
   
   ## Example 2
   With the same `np_array`:
   ```
   pa_array = pa.array(np_array, type=pa.binary())
   print(pa_array[0])
   ```
   Expected output `b'\x00Very important data. Keep it safe!'`, actual output 
`b''`. Note: `type(pa_array)` is `pyarrow.lib.BinaryArray`.
   
   ## Possible workaround
   Replacing `table = pa.Table.from_pydict({'sample': np_array})` with `table = pa.Table.from_pydict({'sample': pa.array(np_array, type=pa.binary(np_array.itemsize)).cast(target_type=pa.binary())})` seems to produce the intended result.
   
   Note that in our use case we want the resulting dtype to be `pa.binary()` (BYTE_ARRAY in Parquet), not `pa.binary(35)` (FIXED_LEN_BYTE_ARRAY in Parquet), because the latter does not seem to be supported by the HuggingFace datasets library.
   
   ## System info
   The above outputs are on Ubuntu 22.04.4 LTS with Python 3.10.11, pyarrow 
version 12.0.1 (using parquet-cpp-arrow version 12.0.1).
   
   ## Related issues
   This seems to be similar to, but distinct from, #36308, where the input contained byte arrays of different lengths and a conversion to a `pa.string()` column was desired.
   
   ### Component(s)
   
   Python

