LuigiCerone commented on PR #7997:
URL: https://github.com/apache/iceberg/pull/7997#issuecomment-1650473011

   I tested this locally in a Docker environment. The metadata JSON file was 
generated by the [docker-spark-iceberg 
quickstart](https://github.com/tabular-io/docker-spark-iceberg/blob/main/spark/notebooks/Iceberg%20-%20Getting%20Started.ipynb).
   
   The HDFS setup was created [with this 
repo](https://github.com/big-data-europe/docker-hadoop):
   ```bash
   root@579a35817003:/# hdfs dfs -ls /user/luigi
   Found 1 items
   -rw-r--r--   3 root supergroup       5153 2023-07-25 09:58 /user/luigi/00000-3ee52bcd-ba94-4a7e-a0c0-60ce14d5397b.metadata.json
   ```
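
The same check can also be made from Python instead of the `hdfs` CLI. The following is only a sketch: `metadata_file_size` and `connect_hdfs` are hypothetical helper names, not part of the PR, and the connection step assumes pyarrow can find libhdfs through `HADOOP_HOME` and `CLASSPATH` as described in the pyarrow docs. The host, port, and user values are the ones from this test environment.

```python
def metadata_file_size(filesystem, path):
    """Return the size in bytes of `path`, or raise if it is not a regular file.

    `filesystem` is expected to behave like a pyarrow.fs.FileSystem
    (only get_file_info() is used).
    """
    info = filesystem.get_file_info(path)
    if not info.is_file:
        raise FileNotFoundError(f"{path} is not a regular file on this filesystem")
    return info.size


def connect_hdfs(host="namenode", port=9000, user="luigi"):
    """Open an HDFS connection; defaults are assumptions taken from this test setup."""
    # Imported lazily because pyarrow loads libhdfs (and needs CLASSPATH) on connect.
    from pyarrow import fs

    return fs.HadoopFileSystem(host, port=port, user=user)


# Example usage (needs a reachable namenode):
#   path = "/user/luigi/00000-3ee52bcd-ba94-4a7e-a0c0-60ce14d5397b.metadata.json"
#   print(metadata_file_size(connect_hdfs(), path))
```

If the environment matches the one above, the reported size should agree with the 5153 bytes shown by `hdfs dfs -ls`.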
   
   In another container, after setting up the environment variables [according to the pyarrow 
docs](https://arrow.apache.org/docs/python/filesystems.html#hadoop-distributed-file-system-hdfs):
   
   ```bash
   root@341c4b757a72:/opt/spark/work-dir# pip install "git+https://github.com/LuigiCerone/iceberg.git@feat/hdfs_support#subdirectory=python&egg=pyiceberg[pyarrow]";
   root@341c4b757a72:/opt/spark/work-dir# export HADOOP_HOME=/usr/local/hadoop
   root@341c4b757a72:/opt/spark/work-dir# export CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob`
   root@341c4b757a72:/opt/spark/work-dir# python
   Python 3.8.13 (default, Mar 28 2022, 11:38:47)
   [GCC 7.5.0] :: Anaconda, Inc. on linux
   Type "help", "copyright", "credits" or "license" for more information.
   >>> from pyiceberg.table import StaticTable
   >>> table = StaticTable.from_metadata("hdfs:///user/luigi/00000-3ee52bcd-ba94-4a7e-a0c0-60ce14d5397b.metadata.json", properties={"hdfs.host": "namenode", "hdfs.port": 9000, "hdfs.user": "luigi"})
   >>> table.metadata
   TableMetadataV1(location='s3://warehouse/nyc/taxis', 
table_uuid=UUID('a4f97eac-3793-4f17-99ad-7079b6ce408a'), 
last_updated_ms=1690277912736, last_column_id=19, 
schemas=[Schema(NestedField(field_id=1, name='VendorID', field_type=LongType(), 
required=False), NestedField(field_id=2, name='tpep_pickup_datetime', 
field_type=TimestamptzType(), required=False), NestedField(field_id=3, 
name='tpep_dropoff_datetime', field_type=TimestamptzType(), required=False), 
NestedField(field_id=4, name='passenger_count', field_type=DoubleType(), 
required=False), NestedField(field_id=5, name='trip_distance', 
field_type=DoubleType(), required=False), NestedField(field_id=6, 
name='RatecodeID', field_type=DoubleType(), required=False), 
NestedField(field_id=7, name='store_and_fwd_flag', field_type=StringType(), 
required=False), NestedField(field_id=8, name='PULocationID', 
field_type=LongType(), required=False), NestedField(field_id=9, 
name='DOLocationID', field_type=LongType(), required=False), 
NestedField(field_id=10, name='payment_type', field_type=LongType(), required=False), 
NestedField(field_id=11, name='fare_amount', field_type=DoubleType(), 
required=False), NestedField(field_id=12, name='extra', 
field_type=DoubleType(), required=False), NestedField(field_id=13, 
name='mta_tax', field_type=DoubleType(), required=False), 
NestedField(field_id=14, name='tip_amount', field_type=DoubleType(), 
required=False), NestedField(field_id=15, name='tolls_amount', 
field_type=DoubleType(), required=False), NestedField(field_id=16, 
name='improvement_surcharge', field_type=DoubleType(), required=False), 
NestedField(field_id=17, name='total_amount', field_type=DoubleType(), 
required=False), NestedField(field_id=18, name='congestion_surcharge', 
field_type=DoubleType(), required=False), NestedField(field_id=19, 
name='airport_fee', field_type=DoubleType(), required=False), schema_id=0, 
identifier_field_ids=[])], current_schema_id=0, 
partition_specs=[PartitionSpec(PartitionField(source_id=2, field_id=1000, 
transform=DayTransform(), name='tpep_pickup_datetime_day'), spec_id=0)], 
default_spec_id=0, last_partition_id=1000, properties={'owner': 'root'}, 
current_snapshot_id=None, snapshots=[], snapshot_log=[], metadata_log=[], 
sort_orders=[SortOrder(order_id=0)], default_sort_order_id=0, refs={}, 
format_version=1, schema_=Schema(NestedField(field_id=1, name='VendorID', 
field_type=LongType(), required=False), NestedField(field_id=2, 
name='tpep_pickup_datetime', field_type=TimestamptzType(), required=False), 
NestedField(field_id=3, name='tpep_dropoff_datetime', 
field_type=TimestamptzType(), required=False), NestedField(field_id=4, 
name='passenger_count', field_type=DoubleType(), required=False), 
NestedField(field_id=5, name='trip_distance', field_type=DoubleType(), 
required=False), NestedField(field_id=6, name='RatecodeID', 
field_type=DoubleType(), required=False), NestedField(field_id=7, 
name='store_and_fwd_flag', field_type=StringType(), required=False), 
NestedField(field_id=8, name='PULocationID', field_type=LongType(), 
required=False), 
NestedField(field_id=9, name='DOLocationID', field_type=LongType(), 
required=False), NestedField(field_id=10, name='payment_type', 
field_type=LongType(), required=False), NestedField(field_id=11, 
name='fare_amount', field_type=DoubleType(), required=False), 
NestedField(field_id=12, name='extra', field_type=DoubleType(), 
required=False), NestedField(field_id=13, name='mta_tax', 
field_type=DoubleType(), required=False), NestedField(field_id=14, 
name='tip_amount', field_type=DoubleType(), required=False), 
NestedField(field_id=15, name='tolls_amount', field_type=DoubleType(), 
required=False), NestedField(field_id=16, name='improvement_surcharge', 
field_type=DoubleType(), required=False), NestedField(field_id=17, 
name='total_amount', field_type=DoubleType(), required=False), 
NestedField(field_id=18, name='congestion_surcharge', field_type=DoubleType(), 
required=False), NestedField(field_id=19, name='airport_fee', 
field_type=DoubleType(), required=False), schema_id=0, 
identifier_field_ids=[]), 
partition_spec=[{'name': 'tpep_pickup_datetime_day', 'transform': 'day', 
'source-id': 2, 'field-id': 1000}])
   >>> table.metadata_location
   'hdfs:///user/luigi/00000-3ee52bcd-ba94-4a7e-a0c0-60ce14d5397b.metadata.json'
   ```
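
For anyone who wants to repeat the test without retyping the session, the steps above can be sketched as a small script. The function names (`configure_hadoop_env`, `hdfs_properties`, `load_static_table`) are hypothetical helpers introduced here for illustration. The only pyiceberg call used is `StaticTable.from_metadata`, exactly as in the session, and the property values are the ones from this environment. Treat it as a sketch that assumes the `feat/hdfs_support` branch is installed.

```python
import os
import subprocess


def configure_hadoop_env(hadoop_home="/usr/local/hadoop"):
    """Set HADOOP_HOME and CLASSPATH, mirroring the two shell exports above."""
    os.environ["HADOOP_HOME"] = hadoop_home
    hadoop_bin = os.path.join(hadoop_home, "bin", "hadoop")
    # Equivalent of: export CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob`
    result = subprocess.run(
        [hadoop_bin, "classpath", "--glob"],
        capture_output=True,
        text=True,
        check=True,
    )
    os.environ["CLASSPATH"] = result.stdout.strip()


def hdfs_properties(host, port, user):
    """Build the properties dict with the hdfs.* keys used in the session."""
    return {"hdfs.host": host, "hdfs.port": port, "hdfs.user": user}


def load_static_table(metadata_location, properties):
    """Load a table straight from a metadata JSON file, without a catalog."""
    # Imported lazily so the helpers above stay usable without pyiceberg installed.
    from pyiceberg.table import StaticTable

    return StaticTable.from_metadata(metadata_location, properties=properties)


# Example usage (needs Hadoop, pyarrow, and a reachable namenode):
#   configure_hadoop_env()
#   table = load_static_table(
#       "hdfs:///user/luigi/00000-3ee52bcd-ba94-4a7e-a0c0-60ce14d5397b.metadata.json",
#       hdfs_properties("namenode", 9000, "luigi"),
#   )
#   print(table.metadata_location)
```

The environment variables have to be set before the first HDFS connection is opened, which is why `configure_hadoop_env` comes first in the usage comment.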


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

