omkenge commented on issue #1200: URL: https://github.com/apache/iceberg-python/issues/1200#issuecomment-2640451331
Hello @Fokko,

Here is a small implementation.

1. List Data Files in S3

We use PyArrow’s S3FileSystem to list the file paths under the given table location:

```
from pyarrow import fs


def list_data_files_from_table(table_location: str) -> set:
    if not table_location.startswith("s3://"):
        raise ValueError("Table location must start with 's3://'")

    # Data files live under <table_location>/data
    base = table_location.rstrip("/")
    data_location = f"{base}/data" if not base.endswith("/data") else base

    # Hard-coded settings for a local S3-compatible endpoint
    s3 = fs.S3FileSystem(
        region="eu-central-1",
        endpoint_override="127.0.0.1:9000",
        access_key="admin",
        secret_key="password",
        scheme="http",
    )

    # PyArrow expects "bucket/key" paths without the "s3://" scheme
    bucket, prefix = data_location[5:].split("/", 1)
    selector = fs.FileSelector(f"{bucket}/{prefix}", recursive=True)
    file_infos = s3.get_file_info(selector)
    return {f"s3://{info.path}" for info in file_infos if info.type == fs.FileType.File}
```

2. Extract Metadata-Tracked Files

Using PyIceberg, we retrieve the file paths recorded in the table metadata:

```
def extract_metadata_files(table) -> set:
    metadata_table = table.inspect.files()
    return set(metadata_table.column("file_path").to_pylist())
```

3. Identify Orphan Files

```
def find_orphan_files(table_location, table):
    s3_files = list_data_files_from_table(table_location)
    metadata_files = extract_metadata_files(table)
    orphan_files = s3_files - metadata_files  # Files in S3 but not in metadata
    return orphan_files
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
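For reference, the set-difference step from "Identify Orphan Files" can be tried in isolation, without a bucket or a table. This is only a minimal sketch: the function name `orphans` and every path below are hypothetical, made up to stand in for the S3 listing and the `file_path` column.

```python
def orphans(storage_paths: set, tracked_paths: set) -> set:
    """Return paths present in storage but absent from the table metadata."""
    return storage_paths - tracked_paths


# Hypothetical listing of the table's data directory
storage = {
    "s3://warehouse/db/tbl/data/00000-0-a.parquet",
    "s3://warehouse/db/tbl/data/00001-0-b.parquet",
    "s3://warehouse/db/tbl/data/stale-c.parquet",
}

# Hypothetical values of the "file_path" column from the metadata
tracked = {
    "s3://warehouse/db/tbl/data/00000-0-a.parquet",
    "s3://warehouse/db/tbl/data/00001-0-b.parquet",
}

result = orphans(storage, tracked)
print(sorted(result))
```

The comparison is an exact string match, so both sets need to use the same URI form. If one side used a different scheme prefix or a trailing-slash variant, every file would be reported as an orphan.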