adampinky85 opened a new issue, #43682:
URL: https://github.com/apache/arrow/issues/43682

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   Hi team,
   
   We make extensive use of Arrow / Parquet files for data analysis with Pandas, and it's excellent! We are now attempting to use the Parquet stream writer to build Parquet files that are then consumed in Python for research.
   
   The code below works except for dictionary field types. These are important to our work: we have billions of rows containing a large number of repeating short strings. With a dictionary field type, Pandas loads the column as the `category` dtype rather than `object`; loading it as `object` results in unmanageable memory requirements.
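   
   To illustrate the memory difference (a minimal sketch; sizes are indicative only):
   
   ```
   import pandas as pd
   
   # Three distinct short strings repeated one million times each.
   s = pd.Series(["alpha", "beta", "gamma"] * 1_000_000)
   
   # object dtype stores one Python string object per row.
   print(s.memory_usage(deep=True))
   
   # category stores small integer codes plus one copy of each distinct string.
   print(s.astype("category").memory_usage(deep=True))
   ```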
   
   Our original approach first builds Arrow tables against an Arrow schema using `StringDictionaryBuilder` and then writes the Parquet files. This method handles dictionary fields correctly but does not offer the streaming approach we require; a simplified sketch follows.
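   
   Roughly (simplified, with placeholder values):
   
   ```
   #include <arrow/api.h>
   #include <arrow/io/file.h>
   #include <parquet/arrow/writer.h>
   #include <parquet/exception.h>
   
   // Build a dictionary-encoded string column.
   arrow::StringDictionaryBuilder builder;
   for (auto value : {"alpha", "beta", "alpha"}) {
       PARQUET_THROW_NOT_OK(builder.Append(value));
   }
   std::shared_ptr<arrow::Array> baz_array;
   PARQUET_THROW_NOT_OK(builder.Finish(&baz_array));
   
   // The Arrow schema carries the dictionary type explicitly.
   auto arrow_schema = arrow::schema({arrow::field(
           "baz", arrow::dictionary(arrow::int32(), arrow::utf8()), /*nullable=*/false)});
   auto table = arrow::Table::Make(arrow_schema, {baz_array});
   
   // Write the whole table at once (not streaming).
   auto outfile = arrow::io::FileOutputStream::Open("original.parquet").ValueOrDie();
   PARQUET_THROW_NOT_OK(parquet::arrow::WriteTable(
           *table, arrow::default_memory_pool(), outfile, /*chunk_size=*/65536));
   ```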
   
   We would really appreciate any guidance on how to declare the `baz` column as a dictionary field with the stream writer. Many thanks!
   
   Versions:
   ```
   C++:
   libarrow 16.1.0-1
   libparquet 16.1.0-1
   
   Python:
   pyarrow 16.1.0
   Python 3.12.3
   ```
   
   **Parquet Stream Writer**
   ```
   // parquet writer options
   auto parquet_properties = parquet::WriterProperties::Builder()
           .compression(arrow::Compression::SNAPPY)
           ->data_page_version(parquet::ParquetDataPageVersion::V2)
           ->encoding(parquet::Encoding::DELTA_BINARY_PACKED)
           ->enable_dictionary()
           ->enable_statistics()
           ->version(parquet::ParquetVersion::PARQUET_2_6)
           ->build();
   
   
   // parquet schema
   auto fields = parquet::schema::NodeVector{};
   fields.push_back(parquet::schema::PrimitiveNode::Make(
           "foo",
           parquet::Repetition::REQUIRED,
           parquet::Type::INT64,
           parquet::ConvertedType::INT_64)
   );
   fields.push_back(parquet::schema::PrimitiveNode::Make(
           "bar",
           parquet::Repetition::REQUIRED,
           parquet::LogicalType::Timestamp(false, parquet::LogicalType::TimeUnit::MILLIS, false, true),
           parquet::Type::INT64)
   );
   fields.push_back(parquet::schema::PrimitiveNode::Make(
           "baz",
           parquet::Repetition::REQUIRED,
           parquet::LogicalType::String(),
           parquet::Type::BYTE_ARRAY)
   );
   auto schema = std::static_pointer_cast<parquet::schema::GroupNode>(
           parquet::schema::GroupNode::Make("schema", parquet::Repetition::REQUIRED, fields));
   
   // open filestream
   auto file_system = arrow::fs::LocalFileSystem{};
   auto outfile = file_system.OpenOutputStream("new.parquet").ValueOrDie();
   
   // open parquet stream writer
   auto parquet_file_writer =
           parquet::ParquetFileWriter::Open(outfile, schema, parquet_properties);
   auto parquet_stream = parquet::StreamWriter{std::move(parquet_file_writer)};
   ```
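   
   Rows are then written with the stream API, e.g. (placeholder values; `std::chrono::milliseconds` matches the millisecond timestamp column):
   
   ```
   #include <chrono>
   
   // One value per column, in schema order; EndRow completes the row.
   parquet_stream << int64_t{42}
                  << std::chrono::milliseconds{1723000000000}
                  << std::string{"alpha"}  // the column we want dictionary-typed
                  << parquet::EndRow;
   ```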
   
   **Pandas Reading New parquet::StreamWriter** 
   ```
   import pandas as pd
   import pyarrow.parquet as pq
   
   df = pd.read_parquet("new.parquet")
   df.dtypes
   
   foo             int64
   bar    datetime64[ms]
   baz            object <-------- should be category
   dtype: object
   
   schema = pq.read_schema("new.parquet")
   schema
   
   foo: int64 not null
   bar: timestamp[ms] not null
   baz: string not null <-------- string not dictionary
   ```
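   
   We are aware dictionary decoding can be forced at read time (sketch below, using pyarrow's `read_dictionary` option), but we would prefer the file's own schema to carry the dictionary type so that a plain `pd.read_parquet` works for every consumer:
   
   ```
   import pyarrow.parquet as pq
   
   # Ask the reader to decode "baz" as a dictionary column.
   table = pq.read_table("new.parquet", read_dictionary=["baz"])
   df = table.to_pandas()  # baz arrives as category dtype
   ```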
   
   **Pandas Reading Original parquet::arrow::WriteTable**
   ```
   import pandas as pd
   import pyarrow.parquet as pq
   
   df = pd.read_parquet("original.parquet")
   df.dtypes
   
   foo             int64
   bar    datetime64[ms]
   baz            category <-------- correct
   dtype: object
   
   schema = pq.read_schema("original.parquet")
   schema
   
   foo: int64 not null
   bar: timestamp[ms] not null
   baz: dictionary<values=string, indices=int32, ordered=0> <-------- correct
   ```
   
   ### Component(s)
   
   C++

