TysonStanley opened a new issue, #43742:
URL: https://github.com/apache/arrow/issues/43742

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   When a data.table object carries an index (e.g., one created with
   `setindex()` or by data.table's automatic indexing), writing it to parquet
   inflates the file size (e.g., from 400MB to 2GB, with the index being the
   only change), and reading the file back in fails with
   `Error: IOError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit`.
   See reprex below:
   
   ```r
   library(data.table)
   library(arrow)
   
   dt <- data.table(x = 1:1e8, y = round(runif(n = 1e8, min = 1, max = 5)))
   
   #Looking at rows where y == 3
   dt[y == 3,]
   
   #Creating a new variable, which is done uniformly across all rows
   #(suggesting the previous row index isn't applicable?)
   dt[, z := 1]
   
   #Save the dt
   write_parquet(dt, "example.parquet")
   gc()
   
   #Cannot open the dt
   dt_open<-read_parquet("example.parquet")
   
   #Removing the index that was created when looking at the y==3 subset before
   #saving allows the file to be opened after re-saving.
   setindex(dt, NULL)
   
   write_parquet(dt, "example2.parquet")
   dt_open<-read_parquet("example2.parquet")
   ```
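
   A smaller check along the same lines, offered as a sketch rather than a confirmed diagnosis: `indices()` from data.table lists any indices attached to a table, so it can show whether the `y == 3` subset left one behind before the file is written. The table name `dt_small`, its size (1e5 rows), and the explicit `datatable.auto.index` setting are illustrative assumptions, not part of the original reprex.

   ```r
   library(data.table)

   # Small stand-in for the table above (1e5 rows is an arbitrary, illustrative size)
   dt_small <- data.table(x = 1:1e5, y = round(runif(n = 1e5, min = 1, max = 5)))

   # Automatic indexing is the data.table default; set explicitly here for clarity
   options(datatable.auto.index = TRUE)

   # Inspect indices before any subsetting
   indices(dt_small)

   # An equality subset can create an index on y as a side effect
   invisible(dt_small[y == 3, ])
   indices(dt_small)

   # Drop all indices, as in the workaround above, and inspect again
   setindex(dt_small, NULL)
   indices(dt_small)
   ```

   If the middle call reports an index on `y`, setting `options(datatable.auto.index = FALSE)` before subsetting may be another way to keep the index out of the written file.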
   
   ### Component(s)
   
   Parquet, R

