harrylee2015 opened a new issue, #742: URL: https://github.com/apache/arrow-go/issues/742
### Describe the bug, including details regarding any error messages, version, and platform.

## Issue Title

parquet-go: ClickHouse fails to read generated Parquet files (No more data to read)

## Description

When using `github.com/xitongsys/parquet-go` (v1.6.2) to write Parquet files, ClickHouse cannot read them and throws:

`DB::Exception: apache::thrift::transport::TTransportException: No more data to read: read stage: ColumnData`

## What works vs what doesn't

- ✅ `parquet-tools cat` can read all data
- ✅ Python PyArrow / pandas can read all data
- ❌ ClickHouse fails with the above error
- ❌ Trino/Athena v3 also fails (reported in #547)

## Root cause analysis

The issue is that `parquet-go` does not properly write **ColumnIndex and OffsetIndex** metadata. ClickHouse relies on these indexes for predicate pushdown. When indexes are missing/corrupted, ClickHouse throws "No more data to read" because it expects index entries that don't exist.

`parquet-tools inspect test.parquet` shows **no column_index or offset_index information** in the output, confirming the indexes are missing.

## Related issues

- #547 - Same root cause, affects Trino/Athena
- rudderlabs/parquet-go#4 - Fix implemented in fork
- thomaspoignant/go-feature-flag#4570 - Real-world impact example

## Proposed fix

Merge the fix from #547 or incorporate changes from the RudderLabs fork that:

1. Correctly calculate `CompressedPageSize` (include header size)
2. Exclude dictionary pages from ColumnIndex arrays
3. Properly write OffsetIndex metadata

## Workaround for users

1. In ClickHouse: `SET input_format_parquet_use_native_reader = 0;`
2. Or replace dependency: `replace github.com/xitongsys/parquet-go => github.com/rudderlabs/parquet-go v0.0.3`

## Environment

- parquet-go version: v1.6.2
- ClickHouse version: 23.x / 24.x

### Component(s)

Parquet
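To illustrate items 1 and 2 of the proposed fix, here is a minimal Go sketch of how offset-index entries could be derived from the pages of one column chunk. The `page` and `pageLocation` types and the `buildOffsetIndex` function are hypothetical stand-ins for illustration, not the parquet-go API or the code in the fork. The sketch assumes only what the list above states: the recorded compressed page size includes the page header, and dictionary pages get no index entry.

```go
package main

import "fmt"

// pageKind distinguishes dictionary pages from data pages.
type pageKind int

const (
	dictionaryPage pageKind = iota
	dataPage
)

// page is a hypothetical description of one page already written to the file.
type page struct {
	kind               pageKind
	headerSize         int32 // bytes of the serialized page header
	compressedBodySize int32 // bytes of the compressed page body
	numRows            int64 // rows in the page (0 for a dictionary page)
}

// pageLocation mirrors the three fields of an offset-index entry.
type pageLocation struct {
	Offset             int64
	CompressedPageSize int32
	FirstRowIndex      int64
}

// buildOffsetIndex walks the pages of one column chunk in file order.
// Every page advances the file offset, but only data pages get an entry,
// and each entry's size is header plus compressed body.
func buildOffsetIndex(chunkStart int64, pages []page) []pageLocation {
	locs := make([]pageLocation, 0, len(pages))
	offset := chunkStart
	var firstRow int64
	for _, p := range pages {
		total := p.headerSize + p.compressedBodySize
		if p.kind == dataPage {
			locs = append(locs, pageLocation{
				Offset:             offset,
				CompressedPageSize: total,
				FirstRowIndex:      firstRow,
			})
			firstRow += p.numRows
		}
		offset += int64(total)
	}
	return locs
}

func main() {
	pages := []page{
		{kind: dictionaryPage, headerSize: 20, compressedBodySize: 100},
		{kind: dataPage, headerSize: 30, compressedBodySize: 500, numRows: 1000},
		{kind: dataPage, headerSize: 30, compressedBodySize: 400, numRows: 800},
	}
	for _, loc := range buildOffsetIndex(4, pages) {
		fmt.Printf("%+v\n", loc)
	}
}
```

With the sample pages above, the dictionary page still shifts the offset of the first data page but produces no entry, so two locations are returned for three pages.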
