TysonStanley opened a new issue, #43742:
URL: https://github.com/apache/arrow/issues/43742
### Describe the bug, including details regarding any error messages, version, and platform.
When a data.table object carries an index (e.g., one created with
`setindex()`), writing it with `write_parquet()` can cause two problems:
the Parquet file cannot be read back in (`Error: IOError: Couldn't
deserialize thrift: TProtocolException: Exceeded size limit`), and the file
size balloons (e.g., from 400MB to 2GB, with the index being the only
change). See the reprex below:
```r
library(data.table)
library(arrow)

dt <- data.table(x = 1:1e8, y = round(runif(n = 1e8, min = 1, max = 5)))
# Looking at rows where y == 3
dt[y == 3, ]
# Creating a new variable, which is done uniformly across all rows
# (suggesting the previous row index isn't applicable?)
dt[, z := 1]
# Save the dt
write_parquet(dt, "example.parquet")
gc()
# Cannot open the dt
dt_open <- read_parquet("example.parquet")
# Removing the index that was created when looking at the y == 3 subset
# before saving allows the file to be opened after re-saving.
setindex(dt, NULL)
write_parquet(dt, "example2.parquet")
dt_open <- read_parquet("example2.parquet")
```
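
To check whether a table carries an index before writing it, something like the following may help. It is a minimal sketch on a much smaller table (1e5 rows rather than 1e8) and uses only data.table's `indices()` and `setindex()`. It creates the index explicitly with `setindex(dt, y)`, whereas in the reprex above the index comes from the `dt[y == 3, ]` subset. The variable names (`idx_names`, `idx_size`, `idx_after`) are illustrative only. My assumption, not confirmed here, is that the index attribute ends up in the Parquet file metadata and that this is what inflates the file.

```r
library(data.table)

# Small stand-in for the table in the report (1e5 rows instead of 1e8).
set.seed(1)
dt <- data.table(x = 1:1e5, y = round(runif(n = 1e5, min = 1, max = 5)))

# Create an index on y explicitly (the reprex gets one from dt[y == 3, ]).
setindex(dt, y)

idx_names <- indices(dt)                     # names of the indexed columns
idx_size  <- object.size(attr(dt, "index"))  # memory held by the index attribute

print(idx_names)
print(idx_size, units = "Kb")

# Drop every index, as in the workaround above, and confirm it is gone.
setindex(dt, NULL)
idx_after <- indices(dt)
print(idx_after)
```

If `indices(dt)` returns anything other than `NULL` right before `write_parquet()`, calling `setindex(dt, NULL)` first is the workaround that let the file be read back in the reprex.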
### Component(s)
Parquet, R