James Baugh created SPARK-50616:
-----------------------------------

             Summary: Add File Extension Option to CSV DataSource Writer
                 Key: SPARK-50616
                 URL: https://issues.apache.org/jira/browse/SPARK-50616
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.5.3
            Reporter: James Baugh
             Fix For: 3.5.4


h3. What changes were proposed in this pull request?

The existing CSV DataSource allows one to set the delimiter/separator but does 
not allow changing the file extension. This means that a file can have values 
separated by tabs but still be marked as a ".csv" file. This change allows one 
to change the file extension to match the delimiter/separator (e.g. ".tsv" for 
a tab-separated value file).

PR: [https://github.com/apache/spark/pull/49233]
h3. Why are the changes needed?

This PR adds an option, fileExtension, that sets the file extension of the 
output files. As a result, when the separator is not a comma, the output file 
can be given an extension that matches the separator (e.g. file.tsv, file.psv).
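
As a rough illustration of how the option might be used from PySpark, here is a minimal sketch. The helper name write_delimited is hypothetical, "sep" is the existing CSV option, and "fileExtension" is the option name as proposed in this issue; the name in a released Spark version may differ.

```python
def write_delimited(df, path, sep="\t", file_extension="tsv"):
    """Write `df` as delimited text and request a matching file extension.

    `df` is expected to be a PySpark DataFrame. "sep" is the existing CSV
    writer option. "fileExtension" is the option name proposed in this
    issue and is an assumption here, not a confirmed released API.
    """
    (
        df.write
        .option("sep", sep)                       # existing option: field separator
        .option("fileExtension", file_extension)  # proposed option (assumed name)
        .csv(path)
    )
```

With a real DataFrame, write_delimited(df, "/tmp/out", sep="|", file_extension="psv") would be expected to produce part files ending in ".psv" instead of ".csv", assuming the option behaves as described above.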

Notes on Previous Pull Request 
[#17973|https://github.com/apache/spark/pull/17973]
A pull request adding this option was discussed 7 years ago. One reason it 
wasn't added was:
"I would like to suggest to leave this out if there is no better reason for 
now. Downside of this is, it looks this allows arbitrary name and it does not 
guarantee the extension is, say, tsv when the delimiter is a tab. It is purely up 
to the user."

I don't believe this is a good reason not to let the user set the extension. If 
we let them set the delimiter/separator to an arbitrary string/char, then why 
not also let them set the file extension to indicate the separator that the 
file uses (e.g. tsv, psv)? This addition keeps "csv" as the default file 
extension and lets the extension match the separator when a different one is 
used.
h3. Does this PR introduce _any_ user-facing change?

Yes. This PR adds one row to the options table for the CSV DataSource 
documentation to include the "fileExtension" option.
h3. How was this patch tested?

One unit test was added to validate that a file is written with the new extension.
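
The kind of check such a test performs can be sketched as follows. This is not the test from the pull request; the helper name part_files_with_extension is hypothetical, and it relies only on the fact that Spark names its output data files with a "part-" prefix.

```python
from pathlib import Path


def part_files_with_extension(output_dir, extension):
    """Return the names of part files in `output_dir` ending in `.extension`.

    Marker files such as _SUCCESS and hidden checksum files (which start
    with a dot) are skipped because they do not start with "part-".
    """
    suffix = "." + extension.lstrip(".")
    return sorted(
        p.name
        for p in Path(output_dir).iterdir()
        if p.is_file() and p.name.startswith("part-") and p.name.endswith(suffix)
    )
```

After writing a DataFrame with the extension set to "tsv", a test could assert that part_files_with_extension(output_dir, "tsv") is non-empty and that part_files_with_extension(output_dir, "csv") is empty.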
h3. Was this patch authored or co-authored using generative AI tooling?

No


