harrylee2015 opened a new issue, #742:
URL: https://github.com/apache/arrow-go/issues/742

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   ## Issue Title
   parquet-go: ClickHouse fails to read generated Parquet files (No more data to read)
   
   ## Description
   When using `github.com/xitongsys/parquet-go` (v1.6.2) to write Parquet 
files, ClickHouse cannot read them and throws:
   
   DB::Exception: apache::thrift::transport::TTransportException: No more data to read: read stage: ColumnData
   
   
   ## What works vs what doesn't
   - ✅ `parquet-tools cat` can read all data
   - ✅ Python PyArrow / pandas can read all data  
   - ❌ ClickHouse fails with the above error
   - ❌ Trino/Athena v3 also fails (reported in #547)
   
   ## Root cause analysis
   `parquet-go` does not correctly write the **ColumnIndex and OffsetIndex** 
metadata. ClickHouse relies on these indexes for predicate pushdown. When they 
are missing or corrupted, ClickHouse throws "No more data to read" because it 
expects index entries that do not exist.
   
   `parquet-tools inspect test.parquet` shows **no column_index or offset_index 
information** in the output, confirming the indexes are missing.
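
To make "missing indexes" concrete: in the Parquet Thrift definition, each ColumnChunk carries optional `column_index_offset`, `column_index_length`, `offset_index_offset` and `offset_index_length` fields. The sketch below is only an illustration of that check. `ColumnChunkMeta`, `HasPageIndexes` and `MissingPageIndexes` are hypothetical names, not part of `parquet-go` or any other library, and the code works on already-decoded metadata rather than reading a file.

```go
package main

import "fmt"

// ColumnChunkMeta is a hypothetical, simplified view of a Parquet ColumnChunk.
// The four index fields mirror the optional fields of the same names in the
// Parquet Thrift definition; nil means the writer did not set them.
type ColumnChunkMeta struct {
	Path              string
	ColumnIndexOffset *int64
	ColumnIndexLength *int32
	OffsetIndexOffset *int64
	OffsetIndexLength *int32
}

// HasPageIndexes reports whether both the ColumnIndex and the OffsetIndex
// are referenced by the chunk with a non-zero length.
func HasPageIndexes(c ColumnChunkMeta) bool {
	hasColumnIndex := c.ColumnIndexOffset != nil &&
		c.ColumnIndexLength != nil && *c.ColumnIndexLength > 0
	hasOffsetIndex := c.OffsetIndexOffset != nil &&
		c.OffsetIndexLength != nil && *c.OffsetIndexLength > 0
	return hasColumnIndex && hasOffsetIndex
}

// MissingPageIndexes returns the paths of all chunks that lack either index.
func MissingPageIndexes(chunks []ColumnChunkMeta) []string {
	var missing []string
	for _, c := range chunks {
		if !HasPageIndexes(c) {
			missing = append(missing, c.Path)
		}
	}
	return missing
}

func main() {
	off, n := int64(4096), int32(64)
	chunks := []ColumnChunkMeta{
		// Chunk with both indexes referenced.
		{Path: "id", ColumnIndexOffset: &off, ColumnIndexLength: &n,
			OffsetIndexOffset: &off, OffsetIndexLength: &n},
		// Chunk written without any index fields.
		{Path: "name"},
	}
	fmt.Println(MissingPageIndexes(chunks))
}
```

A reader that assumes the indexes exist would need a check of this kind before seeking to the index offsets.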
   
   ## Related issues
   - #547 - Same root cause, affects Trino/Athena
   - rudderlabs/parquet-go#4 - Fix implemented in fork
   - thomaspoignant/go-feature-flag#4570 - Real-world impact example
   
   ## Proposed fix
   Merge the fix from #547 or incorporate changes from the RudderLabs fork that:
   1. Correctly calculate `CompressedPageSize` (include header size)
   2. Exclude dictionary pages from ColumnIndex arrays
   3. Properly write OffsetIndex metadata
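
The first two points can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from #547 or the RudderLabs fork: `WrittenPage`, `PageLocation` and `BuildOffsetIndex` are hypothetical names. `PageLocation` mirrors the three fields of the Parquet struct of the same name, whose `compressed_page_size` is specified as the page size including the header.

```go
package main

import "fmt"

// PageKind distinguishes dictionary pages from data pages (hypothetical type).
type PageKind int

const (
	DictionaryPage PageKind = iota
	DataPage
)

// WrittenPage is a hypothetical record of one page as laid out in the file.
type WrittenPage struct {
	Kind           PageKind
	Offset         int64 // file offset where the page header starts
	HeaderSize     int32 // serialized page header length in bytes
	CompressedSize int32 // compressed payload length, excluding the header
	NumRows        int64 // rows in the page (0 for dictionary pages)
}

// PageLocation mirrors the fields of the Parquet PageLocation struct.
type PageLocation struct {
	Offset             int64
	CompressedPageSize int32
	FirstRowIndex      int64
}

// BuildOffsetIndex produces one PageLocation per data page. Dictionary pages
// are skipped, and the recorded size covers header plus compressed payload.
func BuildOffsetIndex(pages []WrittenPage) []PageLocation {
	locs := make([]PageLocation, 0, len(pages))
	var firstRow int64
	for _, p := range pages {
		if p.Kind == DictionaryPage {
			continue // dictionary pages get no index entry
		}
		locs = append(locs, PageLocation{
			Offset:             p.Offset,
			CompressedPageSize: p.HeaderSize + p.CompressedSize,
			FirstRowIndex:      firstRow,
		})
		firstRow += p.NumRows
	}
	return locs
}

func main() {
	pages := []WrittenPage{
		{Kind: DictionaryPage, Offset: 4, HeaderSize: 20, CompressedSize: 100},
		{Kind: DataPage, Offset: 124, HeaderSize: 30, CompressedSize: 500, NumRows: 1000},
		{Kind: DataPage, Offset: 654, HeaderSize: 30, CompressedSize: 480, NumRows: 800},
	}
	for _, loc := range BuildOffsetIndex(pages) {
		fmt.Printf("%+v\n", loc)
	}
}
```

With sizes recorded this way, each entry's `Offset + CompressedPageSize` equals the next page's offset, which is what a reader seeking by OffsetIndex relies on.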
   
   ## Workaround for users
   1. In ClickHouse: `SET input_format_parquet_use_native_reader = 0;`
   2. Or replace the dependency in `go.mod`: `replace github.com/xitongsys/parquet-go => github.com/rudderlabs/parquet-go v0.0.3`
   
   ## Environment
   - parquet-go version: v1.6.2
   - ClickHouse version: 23.x / 24.x
   
   
   
   ### Component(s)
   
   Parquet

