antecede opened a new issue, #45110:
URL: https://github.com/apache/arrow/issues/45110

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   The error was originally reported at:
   https://github.com/huggingface/datasets/issues/7346

   The original report follows:
   
   ### Describe the bug
   
   When `load_dataset` loads a file that holds a large number (2,000 in this case) of large 2D arrays (1000 × 1152 each), it fails with `OSError: Invalid flatbuffers message`.
   
   When a file stores only 300 arrays of this size (1000 × 1152), it loads correctly.
   
   Storing 2,000 2D arrays per file produces about 100 files of roughly 5-6 GB each. Storing 300 2D arrays per file instead produces **about 600 files, which is too many**.
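
For scale, the raw payload per file can be estimated from the numbers above. This is only a rough sketch: the report does not state the element type used for the real tensors, so the three dtypes below are assumptions, and `raw_bytes` is a hypothetical helper that ignores list offsets, metadata, and compression.

```python
# Rough size estimate for one file. The dtypes are assumptions: the report
# does not say how the real tensors are stored.
ROWS, COLS = 1000, 1152


def raw_bytes(num_arrays: int, bytes_per_value: int) -> int:
    """Uncompressed payload size, ignoring offsets, metadata and compression."""
    return num_arrays * ROWS * COLS * bytes_per_value


for name, width in [("float16", 2), ("float32", 4), ("float64", 8)]:
    small = raw_bytes(300, width) / 1e9
    large = raw_bytes(2000, width) / 1e9
    print(f"{name}: 300 arrays = {small:.2f} GB, 2000 arrays = {large:.2f} GB")
```

Under any of these assumptions, a 2,000-array file holds several gigabytes of raw values, while a 300-array file is roughly one seventh of that.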
   
   ### Steps to reproduce the bug
   
   error:
   ```python
   ---------------------------------------------------------------------------
   OSError                                   Traceback (most recent call last)
   Cell In[2], line 4
         1 from datasets import Dataset
         2 from datasets import load_dataset
   ----> 4 real_dataset = load_dataset("arrow", data_files='tensorData/real_ResidueTensor/*', split="train")#.with_format("torch") # , split="train"
         5 # sim_dataset = load_dataset("arrow", data_files='tensorData/sim_ResidueTensor/*', split="train").with_format("torch")
         6 real_dataset
   
   File ~/miniforge3/envs/esmIne3/lib/python3.12/site-packages/datasets/load.py:2151, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, keep_in_memory, save_infos, revision, token, streaming, num_proc, storage_options, trust_remote_code, **config_kwargs)
      2148     return builder_instance.as_streaming_dataset(split=split)
      2150 # Download and prepare data
   -> 2151 builder_instance.download_and_prepare(
      2152     download_config=download_config,
      2153     download_mode=download_mode,
      2154     verification_mode=verification_mode,
      2155     num_proc=num_proc,
      2156     storage_options=storage_options,
      2157 )
      2159 # Build dataset for splits
      2160 keep_in_memory = (
      2161     keep_in_memory if keep_in_memory is not None else is_small_dataset(builder_instance.info.dataset_size)
      2162 )
   
   File ~/miniforge3/envs/esmIne3/lib/python3.12/site-packages/datasets/builder.py:924, in DatasetBuilder.download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, dl_manager, base_path, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
       922 if num_proc is not None:
       923     prepare_split_kwargs["num_proc"] = num_proc
   --> 924 self._download_and_prepare(
       925     dl_manager=dl_manager,
       926     verification_mode=verification_mode,
       927     **prepare_split_kwargs,
       928     **download_and_prepare_kwargs,
       929 )
       930 # Sync info
       931 self.info.dataset_size = sum(split.num_bytes for split in self.info.splits.values())
   
   File ~/miniforge3/envs/esmIne3/lib/python3.12/site-packages/datasets/builder.py:978, in DatasetBuilder._download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
       976 split_dict = SplitDict(dataset_name=self.dataset_name)
       977 split_generators_kwargs = self._make_split_generators_kwargs(prepare_split_kwargs)
   --> 978 split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
       980 # Checksums verification
       981 if verification_mode == VerificationMode.ALL_CHECKS and dl_manager.record_checksums:
   
   File ~/miniforge3/envs/esmIne3/lib/python3.12/site-packages/datasets/packaged_modules/arrow/arrow.py:47, in Arrow._split_generators(self, dl_manager)
        45 with open(file, "rb") as f:
        46     try:
   ---> 47         reader = pa.ipc.open_stream(f)
        48     except pa.lib.ArrowInvalid:
        49         reader = pa.ipc.open_file(f)
   
   File ~/miniforge3/envs/esmIne3/lib/python3.12/site-packages/pyarrow/ipc.py:190, in open_stream(source, options, memory_pool)
       171 def open_stream(source, *, options=None, memory_pool=None):
       172     """
       173     Create reader for Arrow streaming format.
       174 
      (...)
       188         A reader for the given source
       189     """
   --> 190     return RecordBatchStreamReader(source, options=options,
       191                                    memory_pool=memory_pool)
   
   File ~/miniforge3/envs/esmIne3/lib/python3.12/site-packages/pyarrow/ipc.py:52, in RecordBatchStreamReader.__init__(self, source, options, memory_pool)
        50 def __init__(self, source, *, options=None, memory_pool=None):
        51     options = _ensure_default_ipc_read_options(options)
   ---> 52     self._open(source, options=options, memory_pool=memory_pool)
   
   File ~/miniforge3/envs/esmIne3/lib/python3.12/site-packages/pyarrow/ipc.pxi:1006, in pyarrow.lib._RecordBatchStreamReader._open()
   
   File ~/miniforge3/envs/esmIne3/lib/python3.12/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()
   
   File ~/miniforge3/envs/esmIne3/lib/python3.12/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()
   
   OSError: Invalid flatbuffers message.
   ```
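
The traceback shows `datasets` calling `pa.ipc.open_stream` first and falling back to `pa.ipc.open_file` only when `pa.lib.ArrowInvalid` is raised. Here the failure surfaces as `OSError`, so that fallback never runs. Feather V2 files use the Arrow IPC file format, which begins with the magic bytes `ARROW1`, so one way to avoid relying on the exception type is to inspect the first bytes before choosing a reader. The sketch below uses only the standard library; `looks_like_ipc_file` is a hypothetical helper, not part of `datasets` or `pyarrow`. The last lines illustrate an unconfirmed assumption about the cause: if a stream reader takes the first four bytes of such a file as a little-endian length prefix, it sees a length of about 1.33 GB, which would explain why only files larger than that reach flatbuffers validation.

```python
import os
import tempfile

# Leading bytes of the Arrow IPC file (Feather V2) format.
ARROW_FILE_MAGIC = b"ARROW1"


def looks_like_ipc_file(path: str) -> bool:
    """Return True if `path` starts with the Arrow IPC file magic."""
    with open(path, "rb") as f:
        return f.read(len(ARROW_FILE_MAGIC)) == ARROW_FILE_MAGIC


# Assumption (not confirmed by the report): a stream reader that meets a
# file-format file may interpret its first four bytes as a little-endian
# message length.
bogus_length = int.from_bytes(ARROW_FILE_MAGIC[:4], "little")
print(bogus_length)  # 1330795073, roughly 1.33 GB

# Small self-check on a temporary file that carries only the magic bytes.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.arrow")
    with open(sample, "wb") as f:
        f.write(ARROW_FILE_MAGIC + b"\x00\x00")
    print(looks_like_ipc_file(sample))  # True
```

A caller could use the check to pick `pa.ipc.open_file` when it returns `True` and `pa.ipc.open_stream` otherwise.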
   
   Reproduce: this is only a synthetic example. The real 2D matrices are outputs of the large ESM model, and the matrix size used here is approximate.
   ```python
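   import numpy as np
   import pyarrow as pa
   import pyarrow.feather as feather
   from datasets import load_dataset
   
   # 2,000 random 2D arrays of shape (1000, 1152), standing in for the real ESM outputs
   random_arrays_list = [np.random.rand(1000, 1152) for _ in range(2000)]
   
   # Store each array as a nested list in a single "tensor" column
   table = pa.Table.from_pydict({
       'tensor': [tensor.tolist() for tensor in random_arrays_list]
   })
   
   # Write the table as a Feather file
   feather.write_feather(table, 'test.arrow')
   
   # Loading the file through the "arrow" builder is where the OSError is reported
   dataset = load_dataset("arrow", data_files='test.arrow', split="train")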
   
   ### Expected behavior
   
   `load_dataset` should load the dataset normally, just as `feather.read_feather` does:
   ```python
   import pyarrow.feather as feather
   feather.read_feather('tensorData/real_ResidueTensor/real_tensor_1.arrow')
   ```
   
   In addition, `load_dataset("parquet", data_files='test.arrow', split="train")` works fine.
   
   ### Environment info
   
   - `datasets` version: 3.2.0
   - Platform: Linux-6.8.0-49-generic-x86_64-with-glibc2.39
   - Python version: 3.12.3
   - `huggingface_hub` version: 0.26.5
   - PyArrow version: 18.1.0
   - Pandas version: 2.2.3
   - `fsspec` version: 2024.9.0
   
   
   ### Component(s)
   
   Python


