davlee1972 opened a new issue, #2094:
URL: https://github.com/apache/arrow-adbc/issues/2094

   ### What happened?
   
To work around the memory-limitation bug 
(https://github.com/apache/arrow-adbc/issues/1997) with adbc_ingest (version 
1.1.0),
   
   I started calling adbc_ingest() with a RecordBatchReader for one parquet file 
at a time, instead of with a single reader over the whole dataset of parquet files.
   
   ```
   import pyarrow.dataset as ds

   # Before: one adbc_ingest() call over the whole dataset
   adbc_ingest(data=my_dataset.scanner().to_reader())

   # After: one adbc_ingest() call per parquet file
   for file in my_dataset.files:
       file_dataset = ds.dataset(file)
       adbc_ingest(data=file_dataset.scanner().to_reader())
   ```
   
   The final row counts are coming up about 5% short. I suspect there is some 
sort of issue where each call starts a fresh temporary staging area and runs PUTs 
whose file names clash with those from the prior adbc_ingest() operation.
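   
   To quantify the shortfall, here is a minimal sketch (my own, not from the 
report) that compares the row counts recorded in the parquet footers against a 
COUNT(*) on the target table; `cursor` is assumed to be an ADBC DBAPI cursor and 
`my_table` is a hypothetical table name:
   
   ```
   import pyarrow.parquet as pq

   # Expected rows: sum num_rows from each parquet file's footer metadata
   expected = sum(pq.ParquetFile(f).metadata.num_rows for f in my_dataset.files)

   # Actual rows: count what actually landed in the target table
   cursor.execute("SELECT COUNT(*) FROM my_table")
   actual = cursor.fetchone()[0]

   print(f"expected={expected} actual={actual} missing={expected - actual}")
   ```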
   
   I'm going to do some further testing by adding one-minute delays between 
calls to adbc_ingest().
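   
   A rough sketch of that test, assuming the DBAPI form cursor.adbc_ingest() 
with a hypothetical table name `my_table` and mode="append":
   
   ```
   import time

   import pyarrow.dataset as ds

   for file in my_dataset.files:
       file_dataset = ds.dataset(file)
       cursor.adbc_ingest("my_table", data=file_dataset.scanner().to_reader(),
                          mode="append")
       # Pause so the prior ingest's staging files are gone before the next PUT
       time.sleep(60)
   ```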
   
   
![image](https://github.com/user-attachments/assets/f1369f5d-5883-415e-8850-082445424929)
   
   I've got 120 GB worth of parquet files organized in partitioned 
directories, with file sizes of roughly 3 GB each and 10 row groups per file:
   
   ```
   drwxrwsr-x 2 4096 Aug 20 20:10 risk_date_yyyymmdd=20240806
   drwxrwsr-x 2 4096 Aug 20 20:10 risk_date_yyyymmdd=20240807
   drwxrwsr-x 2 4096 Aug 20 20:10 risk_date_yyyymmdd=20240808
   etc. etc. etc..
   ```
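   
   For context, a dataset over hive-style partitioned directories like these 
would typically be opened along these lines (the root path `/data/risk` is a 
placeholder):
   
   ```
   import pyarrow.dataset as ds

   # partitioning="hive" derives the risk_date_yyyymmdd column from the
   # directory names, so it is not read from the files themselves
   my_dataset = ds.dataset("/data/risk", format="parquet", partitioning="hive")
   print(my_dataset.files[:3])  # individual parquet file paths under the partitions
   ```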
   
   
   
   ### Stack Trace
   
   _No response_
   
   ### How can we reproduce the bug?
   
   _No response_
   
   ### Environment/Setup
   
   _No response_

