mengw15 opened a new pull request, #5282:
URL: https://github.com/apache/texera/pull/5282

   ### What changes were proposed in this PR?
   
   A CSV with an empty column header (e.g. a trailing comma `id,name,,age`) 
produced an `Attribute` with an empty name. After #3295 every output port 
writes via `IcebergTableWriter → Parquet`, where Avro rejects empty names with 
`IllegalArgumentException: Empty name` at flush time — losing the operator 
port's entire result.
   
   Rename blank header positions to `column-<index>` (the convention used by 
pandas, Spark, R, and DuckDB) in all three CSV scan operators: 
`CSVScanSourceOpDesc`, `ParallelCSVScanSourceOpDesc`, and 
`CSVOldScanSourceOpDesc`. The issue named the first two; the legacy `csvOld` 
variant is still registered in `LogicalOp` and had the same latent bug.
   
   Note: `ParallelCSVScanSourceOpDesc` is currently commented out of 
`LogicalOp`'s operator registry (so it is not reachable from the UI), but it is 
fixed here for consistency and so the bug does not resurface if it is 
re-enabled for experiments.
   
   ### Any related issues, documentation, discussions?
   
   Closes #5279.
   
   ### How was this PR tested?
   
   Added three cases to `CSVScanSourceOpDescSpec` — one per operator — feeding 
a CSV with an empty third header (`id,name,,age`) and asserting the inferred 
schema is `["id", "name", "column-3", "age"]`. `WorkflowOperator` suite is 
green (13 passed); scalafmt clean.
   
   ### Was this PR authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (claude-opus-4-7)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to