LuigiCerone commented on PR #7997: URL: https://github.com/apache/iceberg/pull/7997#issuecomment-1650473011
I tested this locally in a Docker environment. The metadata JSON file is the one generated by the [docker-spark-iceberg quickstart](https://github.com/tabular-io/docker-spark-iceberg/blob/main/spark/notebooks/Iceberg%20-%20Getting%20Started.ipynb). The HDFS setup was created [with this repo](https://github.com/big-data-europe/docker-hadoop):

```bash
root@579a35817003:/# hdfs dfs -ls /user/luigi
Found 1 items
-rw-r--r-- 3 root supergroup 5153 2023-07-25 09:58 /user/luigi/00000-3ee52bcd-ba94-4a7e-a0c0-60ce14d5397b.metadata.json
```

In another container (after setting up the env vars [according to the pyarrow docs](https://arrow.apache.org/docs/python/filesystems.html#hadoop-distributed-file-system-hdfs)):

```bash
root@341c4b757a72:/opt/spark/work-dir# pip install "git+https://github.com/LuigiCerone/iceberg.git@feat/hdfs_support#subdirectory=python&egg=pyiceberg[pyarrow]"
root@341c4b757a72:/opt/spark/work-dir# export HADOOP_HOME=/usr/local/hadoop
root@341c4b757a72:/opt/spark/work-dir# export CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob`
root@341c4b757a72:/opt/spark/work-dir# python
Python 3.8.13 (default, Mar 28 2022, 11:38:47)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyiceberg.table import StaticTable
>>> table = StaticTable.from_metadata("hdfs:///user/luigi/00000-3ee52bcd-ba94-4a7e-a0c0-60ce14d5397b.metadata.json", properties={"hdfs.host": "namenode", "hdfs.port": 9000, "hdfs.user": "luigi"})
>>> table.metadata
TableMetadataV1(location='s3://warehouse/nyc/taxis', table_uuid=UUID('a4f97eac-3793-4f17-99ad-7079b6ce408a'), last_updated_ms=1690277912736, last_column_id=19, schemas=[Schema(NestedField(field_id=1, name='VendorID', field_type=LongType(), required=False), NestedField(field_id=2, name='tpep_pickup_datetime', field_type=TimestamptzType(), required=False), NestedField(field_id=3, name='tpep_dropoff_datetime', field_type=TimestamptzType(), required=False), NestedField(field_id=4, name='passenger_count', field_type=DoubleType(), required=False), NestedField(field_id=5, name='trip_distance', field_type=DoubleType(), required=False), NestedField(field_id=6, name='RatecodeID', field_type=DoubleType(), required=False), NestedField(field_id=7, name='store_and_fwd_flag', field_type=StringType(), required=False), NestedField(field_id=8, name='PULocationID', field_type=LongType(), required=False), NestedField(field_id=9, name='DOLocationID', field_type=LongType(), required=False), NestedField(field_id=10, name='payment_type', field_type=LongType(), required=False), NestedField(field_id=11, name='fare_amount', field_type=DoubleType(), required=False), NestedField(field_id=12, name='extra', field_type=DoubleType(), required=False), NestedField(field_id=13, name='mta_tax', field_type=DoubleType(), required=False), NestedField(field_id=14, name='tip_amount', field_type=DoubleType(), required=False), NestedField(field_id=15, name='tolls_amount', field_type=DoubleType(), required=False), NestedField(field_id=16, name='improvement_surcharge', field_type=DoubleType(), required=False), NestedField(field_id=17, name='total_amount', field_type=DoubleType(), required=False), NestedField(field_id=18, name='congestion_surcharge', field_type=DoubleType(), required=False), NestedField(field_id=19, name='airport_fee', field_type=DoubleType(), required=False), schema_id=0, identifier_field_ids=[])], current_schema_id=0, partition_specs=[PartitionSpec(PartitionField(source_id=2, field_id=1000, transform=DayTransform(), name='tpep_pickup_datetime_day'), spec_id=0)], default_spec_id=0, last_partition_id=1000, properties={'owner': 'root'}, current_snapshot_id=None, snapshots=[], snapshot_log=[], metadata_log=[], sort_orders=[SortOrder(order_id=0)], default_sort_order_id=0, refs={}, format_version=1, schema_=Schema(NestedField(field_id=1, name='VendorID', field_type=LongType(), required=False), NestedField(field_id=2, name='tpep_pickup_datetime', field_type=TimestamptzType(), required=False), NestedField(field_id=3, name='tpep_dropoff_datetime', field_type=TimestamptzType(), required=False), NestedField(field_id=4, name='passenger_count', field_type=DoubleType(), required=False), NestedField(field_id=5, name='trip_distance', field_type=DoubleType(), required=False), NestedField(field_id=6, name='RatecodeID', field_type=DoubleType(), required=False), NestedField(field_id=7, name='store_and_fwd_flag', field_type=StringType(), required=False), NestedField(field_id=8, name='PULocationID', field_type=LongType(), required=False), NestedField(field_id=9, name='DOLocationID', field_type=LongType(), required=False), NestedField(field_id=10, name='payment_type', field_type=LongType(), required=False), NestedField(field_id=11, name='fare_amount', field_type=DoubleType(), required=False), NestedField(field_id=12, name='extra', field_type=DoubleType(), required=False), NestedField(field_id=13, name='mta_tax', field_type=DoubleType(), required=False), NestedField(field_id=14, name='tip_amount', field_type=DoubleType(), required=False), NestedField(field_id=15, name='tolls_amount', field_type=DoubleType(), required=False), NestedField(field_id=16, name='improvement_surcharge', field_type=DoubleType(), required=False), NestedField(field_id=17, name='total_amount', field_type=DoubleType(), required=False), NestedField(field_id=18, name='congestion_surcharge', field_type=DoubleType(), required=False), NestedField(field_id=19, name='airport_fee', field_type=DoubleType(), required=False), schema_id=0, identifier_field_ids=[]), partition_spec=[{'name': 'tpep_pickup_datetime_day', 'transform': 'day', 'source-id': 2, 'field-id': 1000}])
>>> table.metadata_location
'hdfs:///user/luigi/00000-3ee52bcd-ba94-4a7e-a0c0-60ce14d5397b.metadata.json'
```
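
In the session above the `hdfs.host`, `hdfs.port` and `hdfs.user` properties are typed by hand because the `hdfs:///` URI carries no authority. As a rough sketch only, the same dictionary could be derived from a fully qualified URI. `hdfs_properties` is a hypothetical helper (it is not part of pyiceberg or of this PR), and the property keys are simply the ones used in the test above:

```python
from typing import Dict, Optional, Union
from urllib.parse import urlparse


def hdfs_properties(
    uri: str,
    user: str,
    default_host: Optional[str] = None,
    default_port: Optional[int] = None,
) -> Dict[str, Union[str, int]]:
    """Hypothetical helper: build the properties dict used in the test above.

    Host and port are read from the URI authority when present
    (hdfs://host:port/path); otherwise the supplied defaults are used.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URI, got: {uri}")

    # urlparse returns None for hostname/port when the authority is empty,
    # as in "hdfs:///user/luigi/...".
    host = parsed.hostname or default_host
    port = parsed.port or default_port
    if host is None or port is None:
        raise ValueError("HDFS host and port must come from the URI or the defaults")

    return {"hdfs.host": host, "hdfs.port": port, "hdfs.user": user}


props = hdfs_properties(
    "hdfs:///user/luigi/00000-3ee52bcd-ba94-4a7e-a0c0-60ce14d5397b.metadata.json",
    user="luigi",
    default_host="namenode",
    default_port=9000,
)
```

The resulting `props` could then be passed as `properties=` to `StaticTable.from_metadata`, as in the session above. The port is kept as an integer only because that is what the test passes; whether a string is also accepted is not something this sketch checks.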