omkenge commented on issue #1200:
URL: 
https://github.com/apache/iceberg-python/issues/1200#issuecomment-2640451331
   Hello @Fokko 
   Here is a small implementation:
   1. List Data Files in S3
   We use PyArrow’s S3FileSystem to retrieve file paths from the given table location:
         
         
           from pyarrow import fs


           def list_data_files_from_table(table_location: str) -> set:
               if not table_location.startswith("s3://"):
                   raise ValueError("Table location must start with 's3://'")

               # Scan only the table's data directory.
               base = table_location.rstrip("/")
               data_location = base if base.endswith("/data") else f"{base}/data"

               # Settings for a local S3-compatible endpoint; adjust as needed.
               s3 = fs.S3FileSystem(
                   region="eu-central-1",
                   endpoint_override="127.0.0.1:9000",
                   access_key="admin",
                   secret_key="password",
                   scheme="http"
               )

               # PyArrow expects "bucket/key" paths without the "s3://" prefix.
               bucket, prefix = data_location[5:].split("/", 1)
               selector = fs.FileSelector(f"{bucket}/{prefix}", recursive=True)

               file_infos = s3.get_file_info(selector)
               return {
                   f"s3://{info.path}"
                   for info in file_infos
                   if info.type == fs.FileType.File
               }
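The string slicing in step 1 assumes an `s3://` prefix and at least one path segment after the bucket. A slightly more defensive way to derive the `bucket/prefix` string could look like the sketch below. The helper names `split_s3_uri` and `data_prefix` are hypothetical, and accepting `s3a://` and `s3n://` is an assumption on my side, not something the snippet above does:

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an S3-style URI into (bucket, key prefix without slashes at the ends)."""
    parsed = urlparse(uri)
    # Assumption: s3a:// and s3n:// are treated as aliases of s3://.
    if parsed.scheme not in {"s3", "s3a", "s3n"}:
        raise ValueError(f"Unsupported scheme in {uri!r}")
    if not parsed.netloc:
        raise ValueError(f"Missing bucket in {uri!r}")
    return parsed.netloc, parsed.path.strip("/")


def data_prefix(table_location: str) -> str:
    """Return 'bucket/prefix' for the table's data directory."""
    bucket, prefix = split_s3_uri(table_location)
    # Append "data" unless the location already points at the data directory.
    if prefix != "data" and not prefix.endswith("/data"):
        prefix = f"{prefix}/data" if prefix else "data"
    return f"{bucket}/{prefix}"
```

The result of `data_prefix(...)` could then be passed to `fs.FileSelector(..., recursive=True)` in place of the manual `[5:].split("/", 1)` step.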
   2. Extract Metadata-Tracked Files
   Using PyIceberg, we retrieve file paths stored in the table metadata:
   ```
   def extract_metadata_files(table) -> set:
       metadata_table = table.inspect.files()
       return set(metadata_table.column("file_path").to_pylist())
   ```
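One caveat: as far as I understand, `table.inspect.files()` without arguments reports the files of the current snapshot only, so files that are still referenced by older snapshots could be flagged as orphans. A possible variant that unions the files over all snapshots is sketched below. It assumes that `table.snapshots()` and the `snapshot_id` argument of `inspect.files()` behave this way in the installed PyIceberg version, so please verify before relying on it; the function name `extract_all_tracked_files` is made up for this example:

```python
def extract_all_tracked_files(table) -> set:
    """Collect file paths referenced by any snapshot of the table.

    Assumption: table.snapshots() yields objects with a snapshot_id attribute,
    and table.inspect.files(snapshot_id=...) returns a table with a
    "file_path" column for that snapshot.
    """
    tracked = set()
    for snapshot in table.snapshots():
        files = table.inspect.files(snapshot_id=snapshot.snapshot_id)
        tracked.update(files.column("file_path").to_pylist())
    return tracked
```

This reads the manifests once per snapshot, so it may be slow on tables with a long history.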
   3. Identify Orphan Files
   ```
   def find_orphan_files(table_location, table):
       s3_files = list_data_files_from_table(table_location)
       metadata_files = extract_metadata_files(table)
       
       # Files present in S3 but not referenced in the table metadata
       orphan_files = s3_files - metadata_files
       return orphan_files
   ```
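A plain set difference has two weak spots: paths may differ only in the scheme (for example `s3a://` in the metadata versus `s3://` from the listing), and a file written by an in-flight commit is not in the metadata yet. A rough sketch of a more conservative comparison follows. It assumes the listing step also records each file's modification time (PyArrow's `FileInfo` exposes `mtime`), and the names `normalize` and `orphan_candidates` as well as the three-day default are placeholders, not part of the code above:

```python
from datetime import datetime, timedelta, timezone


def normalize(path: str) -> str:
    """Map s3a:// and s3n:// onto s3:// so both sides compare equal."""
    for alias in ("s3a://", "s3n://"):
        if path.startswith(alias):
            return "s3://" + path[len(alias):]
    return path


def orphan_candidates(listed, tracked, min_age=timedelta(days=3), now=None):
    """Return listed paths that are untracked and at least min_age old.

    listed:  dict mapping file path -> timezone-aware modification time
    tracked: set of file paths taken from the table metadata
    """
    now = now or datetime.now(timezone.utc)
    tracked_normalized = {normalize(p) for p in tracked}
    return {
        path
        for path, mtime in listed.items()
        if normalize(path) not in tracked_normalized and now - mtime >= min_age
    }
```

The age threshold only reduces the risk of touching files from concurrent writers; it does not remove it.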
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

