adampinky85 opened a new issue, #43682: URL: https://github.com/apache/arrow/issues/43682
### Describe the usage question you have. Please include as many useful details as possible.

Hi team,

We use Arrow/Parquet files extensively for data analysis with Pandas — it's excellent!

We are attempting to use the Parquet stream writer to build Parquet files which are then consumed in Python for research. The code below works, except for dictionary field types. These are important to our work because we have billions of rows containing a large number of repeated short strings. With a dictionary field type, Pandas loads the column as the `category` dtype rather than `object`; the `object` dtype results in unmanageable memory requirements.

Our original approach first builds Arrow tables against an Arrow schema using `StringDictionaryBuilder` and then writes the Parquet files. That method produces dictionary fields correctly, but it does not offer the streaming approach we require. (A minimal sketch of this original approach is included at the end of this issue for reference.)

We would really appreciate any guidance on how to make the `baz` column a dictionary field. Many thanks!

Versions:
```
C++:
libarrow 16.1.0-1
libparquet 16.1.0-1

Python:
pyarrow 16.1.0
Python 3.12.3
```

**Parquet Stream Writer**
```cpp
#include <memory>
#include <utility>

#include <arrow/filesystem/localfs.h>
#include <parquet/file_writer.h>
#include <parquet/properties.h>
#include <parquet/schema.h>
#include <parquet/stream_writer.h>

// parquet writer options
auto parquet_properties = parquet::WriterProperties::Builder()
    .compression(arrow::Compression::SNAPPY)
    ->data_page_version(parquet::ParquetDataPageVersion::V2)
    ->encoding(parquet::Encoding::DELTA_BINARY_PACKED)
    ->enable_dictionary()
    ->enable_statistics()
    ->version(parquet::ParquetVersion::PARQUET_2_6)
    ->build();

// parquet schema
auto fields = parquet::schema::NodeVector{};
fields.push_back(parquet::schema::PrimitiveNode::Make(
    "foo", parquet::Repetition::REQUIRED,
    parquet::Type::INT64, parquet::ConvertedType::INT_64));
fields.push_back(parquet::schema::PrimitiveNode::Make(
    "bar", parquet::Repetition::REQUIRED,
    parquet::LogicalType::Timestamp(false, parquet::LogicalType::TimeUnit::MILLIS, false, true),
    parquet::Type::INT64));
fields.push_back(parquet::schema::PrimitiveNode::Make(
    "baz", parquet::Repetition::REQUIRED,
    parquet::LogicalType::String(), parquet::Type::BYTE_ARRAY));
auto schema = std::static_pointer_cast<parquet::schema::GroupNode>(
    parquet::schema::GroupNode::Make("schema", parquet::Repetition::REQUIRED, fields));

// open filestream
auto file_system = arrow::fs::LocalFileSystem{};
auto outfile = file_system.OpenOutputStream("new.parquet").ValueOrDie();

// open parquet stream writer
auto parquet_file_writer = parquet::ParquetFileWriter::Open(outfile, schema, parquet_properties);
auto parquet_stream = parquet::StreamWriter{std::move(parquet_file_writer)};
```

**Pandas Reading New parquet::StreamWriter**
```python
import pandas as pd
import pyarrow.parquet as pq

df = pd.read_parquet("new.parquet")
df.dtypes
foo             int64
bar    datetime64[ms]
baz            object   <-------- should be category
dtype: object

schema = pq.read_schema("new.parquet")
schema
foo: int64 not null
bar: timestamp[ms] not null
baz: string not null   <-------- string, not dictionary
```

**Pandas Reading Original parquet::arrow::WriteTable**
```python
import pandas as pd
import pyarrow.parquet as pq

df = pd.read_parquet("original.parquet")
df.dtypes
foo             int64
bar    datetime64[ms]
baz          category   <-------- correct
dtype: object

schema = pq.read_schema("original.parquet")
schema
foo: int64 not null
bar: timestamp[ms] not null
baz: dictionary<values=string, indices=int32, ordered=0>   <-------- correct
```

### Component(s)

C++
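**For reference: original (non-streaming) approach**

Below is a minimal sketch of the `StringDictionaryBuilder` + `parquet::arrow::WriteTable` path described above, reduced to the `baz` column. The sample values, file name, and chunk size are placeholders, not our production code:

```cpp
#include <memory>

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/writer.h>

arrow::Status WriteOriginal() {
  // Build the baz column with a dictionary builder, so the Arrow type is
  // dictionary<values=string, indices=...> rather than plain string.
  arrow::StringDictionaryBuilder builder;
  ARROW_RETURN_NOT_OK(builder.Append("red"));
  ARROW_RETURN_NOT_OK(builder.Append("blue"));
  ARROW_RETURN_NOT_OK(builder.Append("red"));  // repeated values share dictionary entries
  std::shared_ptr<arrow::Array> baz;
  ARROW_RETURN_NOT_OK(builder.Finish(&baz));

  // Take the field type from the built array so the schema matches the
  // index width the builder chose.
  auto schema = arrow::schema({arrow::field("baz", baz->type(), /*nullable=*/false)});
  auto table = arrow::Table::Make(schema, {baz});

  ARROW_ASSIGN_OR_RAISE(auto outfile,
                        arrow::io::FileOutputStream::Open("original.parquet"));
  // WriteTable stores the Arrow schema in the file metadata, which is why
  // pyarrow restores the dictionary type (and Pandas the category dtype).
  return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(), outfile,
                                    /*chunk_size=*/65536);
}
```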
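**Read-side note**

For completeness, a read-side workaround we are aware of (assuming the default pyarrow engine) is pyarrow's `read_dictionary` option, which decodes named columns as `DictionaryArray` even when the file schema declares them as plain strings. We would still prefer the file itself to carry the dictionary type so downstream readers need no special options:

```python
import pandas as pd
import pyarrow.parquet as pq

# Decode baz as dictionary while reading, even though the file schema
# declares it as a plain string column.
table = pq.read_table("new.parquet", read_dictionary=["baz"])
df = table.to_pandas()  # baz comes back as the category dtype

# pd.read_parquet forwards keyword arguments to the pyarrow engine:
df = pd.read_parquet("new.parquet", read_dictionary=["baz"])
```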