angela-li opened a new issue, #37908:
URL: https://github.com/apache/arrow/issues/37908
### Describe the usage question you have. Please include as many useful details as possible.
I've started using {arrow} to read in data in R, and I noticed that it
handles messy (aka human-collected!) tabular data slightly worse than
{data.table}'s `fread` function. (But I want to work with arrow rather than
data.table, as the partitioning feature is going to be useful for me in the
future!)
One issue I came across was how `open_dataset` handles incorrectly quoted
data, or data where the default `quote_char` of `"` is included accidentally in
a column.
Here's how the behavior is different between data.table and arrow. (Here's
the [test_data.txt
file](https://gist.github.com/angela-li/fd7c244299a8015d03cfbadf5820539b) for
the below code. It's a .txt file because the original humongous data file is
delivered as a .txt.)
```r
library(arrow)
library(data.table)

# Does not work
test <- open_dataset("test_data.txt",
                     format = "text",
                     delim = "|")

# Works - emits a warning about improper quoting, but reads the data correctly
test <- fread("test_data.txt")
```
The data.table documentation explains reasonably well how `fread` handles this
situation; see the [NEWS file
here](https://github.com/Rdatatable/data.table/blob/88039186915028ab3c93ccfd8e22c0d1c3534b1a/NEWS.md?plain=1#L1814).
For now, I think I can change `parse_options` in `open_dataset()` to handle
this, but it was quite fiddly: it was hard to track down in the docs how to do
it. Changing this option is also not good for the rest of the data, where I
_do_ want the `quote_char` to be `"`.
```r
# Works, but is frustrating to figure out - and not ideal for the rest of
# the data
test <- open_dataset("test_data.txt",
                     format = "text",
                     parse_options = CsvParseOptions$create(
                       delimiter = "|",
                       quoting = TRUE,
                       # changing quote_char to blank, instead of '"',
                       # solves the problem
                       quote_char = '',
                       double_quote = TRUE,
                       escaping = FALSE,
                       escape_char = "\\",
                       newlines_in_values = FALSE,
                       ignore_empty_lines = TRUE))
```
I don't know if improper quoting happens elsewhere in the data, so ideally
there would be some way to detect and fix this type of improper quoting
systematically (rather than skipping rows manually, or changing the
`quote_char` to blank, which could cause issues for other columns).
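
To make "detect" a bit more concrete, here is a minimal base-R sketch of the kind of check I have in mind. It is only a heuristic, not something arrow or data.table provides: `find_suspect_quotes()` is a hypothetical helper name, and splitting on the delimiter assumes that no properly quoted field contains a `|` itself (such a field would be flagged as a false positive).

```r
# Hypothetical helper (not part of arrow or data.table): return the indices of
# lines in which some field contains quote characters but is not a cleanly
# quoted field (i.e. an even number of quotes, starting and ending with one).
find_suspect_quotes <- function(lines, delim = "|", quote = "\"") {
  # Assumption: the delimiter never appears inside a quoted field.
  fields_per_line <- strsplit(lines, delim, fixed = TRUE)
  suspect <- vapply(fields_per_line, function(fields) {
    # Count quote characters per field by comparing lengths before and after
    # removing them (quote is assumed to be a single character).
    n_quotes <- nchar(fields) - nchar(gsub(quote, "", fields, fixed = TRUE))
    well_formed <- n_quotes == 0 |
      (n_quotes %% 2 == 0 & startsWith(fields, quote) & endsWith(fields, quote))
    any(!well_formed)
  }, logical(1))
  which(suspect)
}
```

Something like `find_suspect_quotes(readLines("test_data.txt"))` would then give the line numbers to inspect, although for a very large file the lines would probably need to be read in chunks rather than all at once.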
Two qs:
1. (immediate q) Are there more effective usage strategies for handling this
type of data?
2. (longer-term improvement) Would automatically handling this type of messy
data be something that the arrow team would consider building into
`open_dataset()`?
Thanks for your help!
### Component(s)
R