angela-li opened a new issue, #37908:
URL: https://github.com/apache/arrow/issues/37908

   ### Describe the usage question you have. Please include as many useful 
details as  possible.
   
   
   I've started using {arrow} to read in data in R, and I noticed that it 
handles messy (aka human-collected!) tabular data slightly worse than 
{data.table}'s `fread` function. (But I want to work with arrow rather than 
data.table, as its partitioning support is going to be useful for me in the 
future!)
   
   One issue I came across was how `open_dataset` handles incorrectly quoted 
data, or data where the default `quote_char` of `"` is included accidentally in 
a column.
   
   Here's how the behavior is different between data.table and arrow. (Here's 
the [test_data.txt 
file](https://gist.github.com/angela-li/fd7c244299a8015d03cfbadf5820539b) for 
the below code. It's a .txt file because the original humongous data file is 
delivered as a .txt.)
   
   ```r
   # Does not work
   library(arrow)
   test <- open_dataset("test_data.txt",
                format = "text",
                delim = "|")
   
   # Works - generates a warning about improper quoting, but reads it correctly
   library(data.table)
   test <- fread("test_data.txt")
   ```
   
   The data.table documentation describes how `fread` handles this situation 
reasonably well; see the [NEWS file 
here](https://github.com/Rdatatable/data.table/blob/88039186915028ab3c93ccfd8e22c0d1c3534b1a/NEWS.md?plain=1#L1814).
   
   For now, I think I can change `parse_options` in the open_dataset() function 
to handle this, but it was quite fiddly: it was hard to track down in the docs 
how to do it. Changing this option is also not good for the rest of the data, 
where I _do_ want the `quote_char` to be `"`.
   
   ```r
   # Works, but is frustrating to figure out - and not ideal for the rest of
   # the data
   test <- open_dataset("test_data.txt",
                format = "text",
                parse_options = CsvParseOptions$create(
                                      delimiter = "|",
                                      quoting = TRUE,
                                         # changing this to blank, instead of
                                         # '"', solves the problem
                                         quote_char = '',
                                      double_quote = TRUE,
                                      escaping = FALSE,
                                      escape_char = "\\",
                                      newlines_in_values = FALSE,
                                      ignore_empty_lines = TRUE))
   ```
   
   I don't know if improper quoting happens elsewhere in the data, so ideally 
there would be some way to detect and fix this type of improper quoting 
systematically (rather than skipping rows manually, or changing the 
`quote_char` to blank, which could cause issues for other columns).
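   
   To make "detect systematically" concrete, here is a minimal base-R sketch 
of one possible check: count the `"` characters on each line and flag lines 
where the count is odd, since a properly quoted line should contain quotes in 
pairs. The helper name `find_unbalanced_quotes` is hypothetical (it is not part 
of arrow or data.table), the sample rows are made up rather than taken from 
test_data.txt, and the check assumes values never contain newlines.
   
   ```r
   # Hypothetical helper, not part of arrow or data.table: return the line
   # numbers whose count of quote characters is odd.
   find_unbalanced_quotes <- function(path, quote_char = "\"") {
     lines <- readLines(path, warn = FALSE)
     # Remove every quote character, then compare lengths to count the
     # quotes on each line
     stripped <- gsub(quote_char, "", lines, fixed = TRUE)
     n_quotes <- nchar(lines) - nchar(stripped)
     which(n_quotes %% 2 == 1)
   }
   
   # Made-up sample data, not the contents of test_data.txt
   tmp <- tempfile(fileext = ".txt")
   writeLines(c(
     "id|name|note",
     "1|Alice|ok",
     "2|Bob \"Bobby|stray quote",
     "3|\"Carol\"|properly quoted"
   ), tmp)
   
   bad_lines <- find_unbalanced_quotes(tmp)
   print(bad_lines)
   ```
   
   This only catches lines with an odd number of quotes; a line with two 
misplaced quotes would pass unnoticed. It also reads the whole file into 
memory with `readLines()`, so a very large file would probably need to be 
scanned in chunks instead.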
   
   Two qs:
   
   1. (immediate q) Are there more effective usage strategies for handling this 
type of data?
   2. (longer-term improvement) Would automatically handling this type of messy 
data be something that the arrow team would consider building into 
`open_dataset()`?
   
   Thanks for your help!
   
   ### Component(s)
   
   R

