alexisdondon opened a new issue, #42173:
URL: https://github.com/apache/arrow/issues/42173

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Given a dataset, if I try to write it as a partitioned Parquet dataset to an on-premise S3-compatible store such as MinIO, at a path like s3/mybucket/data/mydataset:
   
   ```r
   # Set the number of rows
   n <- 100000
   
   # Generate random numeric values for two columns
   set.seed(123)  # for reproducibility
   num_col1 <- runif(n, min = 0, max = 100)  # numeric values between 0 and 100
   num_col2 <- rnorm(n, mean = 50, sd = 10)  # normally distributed values with mean 50 and sd 10
   
   # Generate random character strings for one column
   char_col <- replicate(n, paste0(sample(LETTERS, 5, replace = TRUE), collapse = ""))
   
   # Generate categorical values for one column
   qual_col <- sample(c("A", "B", "C", "D"), n, replace = TRUE)
   
   # Build the data.frame
   df <- data.frame(
     numeric1 = num_col1,
     numeric2 = num_col2,
     character = char_col,
     qualitative = qual_col,
     stringsAsFactors = FALSE
   )
   
   # Show the first rows of the data.frame
   head(df)
   
   # Configure S3 access
   minio <- arrow::S3FileSystem$create(
     endpoint_override = Sys.getenv("S3_ENDPOINT"),
     access_key = Sys.getenv("AWS_ACCESS_KEY_ID"),
     secret_key = Sys.getenv("AWS_SECRET_ACCESS_KEY"),
     session_token = Sys.getenv("AWS_SESSION_TOKEN")
   )
   
   df |> arrow::write_dataset(
     minio$path("mybucket/data/mydataset"),
     partitioning = "qualitative",
     format = "parquet"
   )
   ```
   
   Then I get a HEAD request on S3 that is denied. Granting the user ```s3:ListBucket``` on mybucket resolves the problem, but granting ListBucket is not without security impact:
   
   ```
   Error: IOError: When testing for existence of bucket 'mybucket': AWS Error ACCESS_DENIED during HeadBucket operation: No response body.
   ```
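   
   As a possible workaround sketch (not a fix), one can write one Parquet file per partition value with `write_parquet()` instead of `write_dataset()`. This assumes that opening an output stream on the bucket does not trigger the same HeadBucket probe as the dataset writer's directory creation step; I have not verified that, and the partition path layout below is only an illustration:
   
   ```r
   # Manual "Hive-style" partitioned write, sketched as a workaround.
   # Assumption (unverified): write_parquet() does not issue the HeadBucket
   # request that fails above.
   for (part in unique(df$qualitative)) {
     subset <- df[df$qualitative == part, ]
     # e.g. mybucket/data/mydataset/qualitative=A/part-0.parquet
     key <- sprintf("mybucket/data/mydataset/qualitative=%s/part-0.parquet", part)
     arrow::write_parquet(subset, minio$path(key))
   }
   ```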
   
   There has been some discussion/issues about arrow having a mode to not check for the bucket's existence, or to not create the bucket if it does not exist.
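   
   For reference, the closest options I can find in the R bindings are the `allow_bucket_creation` and `allow_bucket_deletion` arguments of `S3FileSystem$create()` (assuming a reasonably recent arrow version). As far as I can tell they only control whether the bucket may be created or deleted, and do not suppress the existence check itself:
   
   ```r
   # Assumption: these arguments exist in the installed arrow version; they do not
   # appear to skip the HeadBucket existence check reported above.
   minio <- arrow::S3FileSystem$create(
     endpoint_override = Sys.getenv("S3_ENDPOINT"),
     access_key = Sys.getenv("AWS_ACCESS_KEY_ID"),
     secret_key = Sys.getenv("AWS_SECRET_ACCESS_KEY"),
     session_token = Sys.getenv("AWS_SESSION_TOKEN"),
     allow_bucket_creation = FALSE,
     allow_bucket_deletion = FALSE
   )
   ```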
   
   With pyarrow and the same ACL on S3, I can write the dataset; the wrapper does not check for existence, or at least it checks without a HEAD request.
   
   
   ### Component(s)
   
   R

