adampinky85 opened a new issue, #43682:
URL: https://github.com/apache/arrow/issues/43682
### Describe the usage question you have. Please include as many useful
details as possible.
Hi team,
We use Arrow / Parquet files extensively for data analysis with pandas; it's
excellent! We are attempting to use the Parquet stream writer to build Parquet
files that are then consumed in Python for research.
The code below works except for dictionary field types. These are important
to our work: we have billions of rows and a large number of repeating short
strings. With a dictionary field type, pandas loads the column as the
`category` dtype rather than `object`; loading as `object` results in
unmanageable memory requirements.
Our original approach first builds Arrow tables against an Arrow schema using
`StringDictionaryBuilder` and then writes the Parquet files. That method handles
dictionary fields correctly but does not offer the streaming approach we require.
We would really appreciate any guidance on how to declare the `baz` column as a
dictionary field. Many thanks!
Versions:
```
C++:
libarrow 16.1.0-1
libparquet 16.1.0-1
Python:
pyarrow 16.1.0
Python 3.12.3
```
**Parquet Stream Writer**
```cpp
// parquet writer options
auto parquet_properties = parquet::WriterProperties::Builder()
.compression(arrow::Compression::SNAPPY)
->data_page_version(parquet::ParquetDataPageVersion::V2)
->encoding(parquet::Encoding::DELTA_BINARY_PACKED)
->enable_dictionary()
->enable_statistics()
->version(parquet::ParquetVersion::PARQUET_2_6)
->build();
// parquet schema
auto fields = parquet::schema::NodeVector{};
fields.push_back(parquet::schema::PrimitiveNode::Make(
"foo",
parquet::Repetition::REQUIRED,
parquet::Type::INT64,
parquet::ConvertedType::INT_64)
);
fields.push_back(parquet::schema::PrimitiveNode::Make(
"bar",
parquet::Repetition::REQUIRED,
parquet::LogicalType::Timestamp(false,
parquet::LogicalType::TimeUnit::MILLIS, false, true),
parquet::Type::INT64)
);
fields.push_back(parquet::schema::PrimitiveNode::Make(
"baz",
parquet::Repetition::REQUIRED,
parquet::LogicalType::String(),
parquet::Type::BYTE_ARRAY)
);
auto schema =
std::static_pointer_cast<parquet::schema::GroupNode>(parquet::schema::GroupNode::Make("schema",
parquet::Repetition::REQUIRED, fields));
// open filestream
auto file_system = arrow::fs::LocalFileSystem{};
auto outfile = file_system.OpenOutputStream("new.parquet").ValueOrDie();
// open parquet stream writer
auto parquet_file_writer = parquet::ParquetFileWriter::Open(outfile, schema,
parquet_properties);
auto parquet_stream = parquet::StreamWriter{std::move(parquet_file_writer)};
```
**Pandas Reading New parquet::StreamWriter**
```python
import pandas as pd
import pyarrow.parquet as pq
df = pd.read_parquet("new.parquet")
df.dtypes
foo int64
bar datetime64[ms]
baz object <-------- should be category
dtype: object
schema = pq.read_schema("new.parquet")
schema
foo: int64 not null
bar: timestamp[ms] not null
baz: string not null <-------- string not dictionary
```
**Pandas Reading Original parquet::arrow::WriteTable**
```python
import pandas as pd
import pyarrow.parquet as pq
df = pd.read_parquet("original.parquet")
df.dtypes
foo int64
bar datetime64[ms]
baz category <-------- correct
dtype: object
schema = pq.read_schema("original.parquet")
schema
foo: int64 not null
bar: timestamp[ms] not null
baz: dictionary<values=string, indices=int32, ordered=0> <-------- correct
```
### Component(s)
C++