rdblue commented on code in PR #5672:
URL: https://github.com/apache/iceberg/pull/5672#discussion_r961879991
##########
docs/python-api-intro.md:
##########
@@ -27,158 +27,152 @@ menu:
 
 # Iceberg Python API
 
-Much of the python api conforms to the java api. You can get more info about the java api [here](../api).
+Much of the Python API conforms to the Java API. You can get more info about the Java API [here](../api).
 
-## Catalog
-
-The Catalog interface, like java provides search and management operations for tables.
-
-To create a catalog:
+## Install
 
-``` python
-from iceberg.hive import HiveTables
+You can install the latest release version from PyPI:
 
-# instantiate Hive Tables
-conf = {"hive.metastore.uris": 'thrift://{hms_host}:{hms_port}',
-        "hive.metastore.warehouse.dir": {tmpdir}}
-tables = HiveTables(conf)
+```sh
+pip3 install "pyiceberg[s3fs,hive]"
 ```
 
-and to create a table from a catalog:
-
-``` python
-from iceberg.api.schema import Schema
-from iceberg.api.types import TimestampType, DoubleType, StringType, NestedField
-from iceberg.api.partition_spec import PartitionSpecBuilder
-
-schema = Schema(NestedField.optional(1, "DateTime", TimestampType.with_timezone()),
-                NestedField.optional(2, "Bid", DoubleType.get()),
-                NestedField.optional(3, "Ask", DoubleType.get()),
-                NestedField.optional(4, "symbol", StringType.get()))
-partition_spec = PartitionSpecBuilder(schema).add(1, 1000, "DateTime_day", "day").build()
+Or install the latest development version locally:
 
-tables.create(schema, "test.test_123", partition_spec)
 ```
-
-
-## Tables
-
-The Table interface provides access to table metadata
-
-+ schema returns the current table `Schema`
-+ spec returns the current table `PartitonSpec`
-+ properties returns a map of key-value `TableProperties`
-+ currentSnapshot returns the current table `Snapshot`
-+ snapshots returns all valid snapshots for the table
-+ snapshot(id) returns a specific snapshot by ID
-+ location returns the table’s base location
-
-Tables also provide refresh to update the table to the latest version.
-
-### Scanning
-Iceberg table scans start by creating a `TableScan` object with `newScan`.
-
-``` python
-scan = table.new_scan();
+pip3 install poetry --upgrade
+pip3 install -e ".[s3fs,hive]"
 ```
 
-To configure a scan, call filter and select on the `TableScan` to get a new `TableScan` with those changes.
-
-``` python
-filtered_scan = scan.filter(Expressions.equal("id", 5))
-```
+With optional dependencies:
 
-String expressions can also be passed to the filter method.
+| Key       | Description                                                           |
+|-----------|-----------------------------------------------------------------------|
+| hive      | Support for the Hive metastore                                        |
+| pyarrow   | PyArrow as a FileIO implementation to interact with the object store  |
+| s3fs      | S3FS as a FileIO implementation to interact with the object store     |
+| zstandard | Support for zstandard Avro compression                                |
+| snappy    | Support for snappy Avro compression                                   |
 
-``` python
-filtered_scan = scan.filter("id=5")
-```
+## Catalog
 
-`Schema` projections can be applied against a `TableScan` by passing a list of column names.
+To instantiate a catalog:
 
 ``` python
+>>> from pyiceberg.catalog.hive import HiveCatalog
+>>> catalog = HiveCatalog(name='prod', uri='thrift://localhost:9083/')
 
-filtered_scan = scan.select(["col_1", "col_2", "col_3"])
-```
-
-Because some data types cannot be read using the python library, a convenience method for excluding columns from projection is provided.
-
-``` python
-filtered_scan = scan.select_except(["unsupported_col_1", "unsupported_col_2"])
-```
+>>> catalog.list_namespaces()
+[('default',), ('nyc',)]
+>>> catalog.list_tables('nyc')
+[('nyc', 'taxis')]
 
-Calls to configuration methods create a new `TableScan` so that each `TableScan` is immutable.
+>>> catalog.load_table(('nyc', 'taxis'))
+Table(identifier=('nyc', 'taxis'), ...)
+```
 
-When a scan is configured, `planFiles`, `planTasks`, and `Schema` are used to return files, tasks, and the read projection.
+And to create a table from a catalog:
 
 ``` python
-scan = table.new_scan() \
-    .filter("id=5") \
-    .select(["id", "data"])
-
-projection = scan.schema
-for task in scan.plan_tasks():
-    print(task)
+from pyiceberg.schema import Schema
+from pyiceberg.types import TimestampType, DoubleType, StringType, NestedField
+
+schema = Schema(
+    NestedField(field_id=1, name="datetime", field_type=TimestampType(), required=False),
+    NestedField(field_id=2, name="bid", field_type=DoubleType(), required=False),
+    NestedField(field_id=3, name="ask", field_type=DoubleType(), required=False),
+    NestedField(field_id=4, name="symbol", field_type=StringType(), required=False),
+)
+
+from pyiceberg.table.partitioning import PartitionSpec, PartitionField
+from pyiceberg.transforms import DayTransform
+
+partition_spec = PartitionSpec(
+    PartitionField(source_id=1, field_id=1000, transform=DayTransform(), name="datetime_day")
+)
+
+from pyiceberg.table.sorting import SortOrder, SortField
+from pyiceberg.transforms import IdentityTransform
+
+sort_order = SortOrder(
+    SortField(source_id=4, transform=IdentityTransform())
+)
+
+from pyiceberg.catalog.hive import HiveCatalog
+catalog = HiveCatalog(name='prod', uri='thrift://localhost:9083/')
+
+catalog.create_table(
+    identifier='default.bids',
+    location='/Users/fokkodriesprong/Desktop/docker-spark-iceberg/wh/bids/',
+    schema=schema,
+    partition_spec=partition_spec,
+    sort_order=sort_order
+)
+
+Table(

Review Comment:
   I usually put result output in a separate pre box, but up to you.
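
To illustrate the suggestion, here is one way the tail of the example could be formatted: the `create_table` call stays in the Python block, and its return value goes in a separate plain pre box. This is a hedged sketch — the diff truncates at `Table(`, so the output line below is assumed, patterned on the `load_table` repr shown earlier in the diff.

``` python
# Create the table; create_table returns the new Table instance.
catalog.create_table(
    identifier='default.bids',
    location='/Users/fokkodriesprong/Desktop/docker-spark-iceberg/wh/bids/',
    schema=schema,
    partition_spec=partition_spec,
    sort_order=sort_order
)
```

```
Table(identifier=('default', 'bids'), ...)
```

Keeping the call and its output in separate boxes leaves the Python block directly copy-pasteable into a session.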