Fokko opened a new pull request, #6258:
URL: https://github.com/apache/iceberg/pull/6258

   This converts an Iceberg expression to a PyArrow expression and filters the table before materializing it.
   
   This should also make things more efficient. From the `pyarrow.dataset.dataset` documentation (https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html#pyarrow.dataset.dataset):
   
   - Optimized reading with predicate pushdown (filtering rows), projection (selecting columns), parallel reading or fine-grained managing of tasks.
   
   ```python
   In [3]: from pyiceberg.catalog import load_catalog
      ...:
      ...: catalog = load_catalog('local')
      ...:
      ...: table = catalog.load_table(('nyc', 'taxis'))
      ...:
      ...: from pyiceberg.expressions import EqualTo
      ...:
      ...: table.scan().select("VendorID", "tpep_pickup_datetime").filter_rows(EqualTo("VendorID", 1)).to_arrow()
      ...:
   Out[3]:
   pyarrow.Table
   VendorID: int64
   tpep_pickup_datetime: timestamp[us, tz=+00:00]
   ----
   VendorID: [[1,1,1,1,1,...,1,1,1,1,1],[1,1,1,1,1,...,1,1,1,1,1],...,[1,1,1,1,1,...,1,1,1,1,1],[1,1,1,1,1,...,1,1,1,1,1]]
   tpep_pickup_datetime: [[2021-04-01 00:00:18.000000,2021-04-01 00:42:37.000000,2021-04-01 00:57:56.000000,2021-04-01 00:01:58.000000,2021-04-01 00:27:53.000000,...,2021-04-02 22:10:18.000000,2021-04-02 22:41:52.000000,2021-04-02 22:57:26.000000,2021-04-02 22:16:03.000000,2021-04-02 22:34:19.000000],[2021-04-02 22:06:46.000000,2021-04-02 22:47:17.000000,2021-04-02 22:02:10.000000,2021-04-02 22:02:31.000000,2021-04-02 22:15:02.000000,...,2021-04-05 13:15:36.000000,2021-04-05 13:30:38.000000,2021-04-05 13:19:12.000000,2021-04-05 13:34:42.000000,2021-04-05 13:41:51.000000],...,[2021-04-30 08:21:52.000000,2021-04-30 08:54:38.000000,2021-04-30 08:23:12.000000,2021-04-30 08:14:43.000000,2021-04-30 08:30:01.000000,...,2021-04-14 09:44:14.000000,2021-04-14 09:45:33.000000,2021-04-14 09:17:11.000000,2021-04-14 09:21:46.000000,2021-04-14 09:24:27.000000],[2021-04-14 09:57:34.000000,2021-04-14 09:57:11.000000,2021-04-14 09:17:21.000000,2021-04-14 09:27:45.000000,2021-04-14 09:16:00.000000,...,2021-04-30 23:33:05.000000,2021-04-30 23:33:17.000000,2021-04-30 23:06:50.000000,2021-04-30 23:20:32.000000,2021-04-30 23:05:21.000000]]
   ```
   
   For comparison, the same scan without the row filter returns all vendors:
   
   ```python
   In [4]: from pyiceberg.catalog import load_catalog
      ...:
      ...: catalog = load_catalog('local')
      ...:
      ...: table = catalog.load_table(('nyc', 'taxis'))
      ...:
      ...: table.scan().select("VendorID", "tpep_pickup_datetime").to_arrow()
      ...:
      ...:
   Out[4]:
   pyarrow.Table
   VendorID: int64
   tpep_pickup_datetime: timestamp[us, tz=+00:00]
   ----
   VendorID: [[1,1,1,1,2,...,1,1,2,2,2],[2,2,2,1,1,...,2,2,2,2,1],...,[1,1,1,2,2,...,1,6,1,2,2],[2,2,2,1,2,...,2,1,2,2,1]]
   tpep_pickup_datetime: [[2021-04-01 00:00:18.000000,2021-04-01 00:42:37.000000,2021-04-01 00:57:56.000000,2021-04-01 00:01:58.000000,2021-04-01 00:24:55.000000,...,2021-04-02 22:16:03.000000,2021-04-02 22:34:19.000000,2021-04-02 22:12:43.000000,2021-04-02 22:41:39.000000,2021-04-02 22:09:28.000000],[2021-04-02 22:20:04.000000,2021-04-02 22:34:37.000000,2021-04-02 22:54:15.000000,2021-04-02 22:06:46.000000,2021-04-02 22:47:17.000000,...,2021-04-05 13:25:45.000000,2021-04-05 13:35:34.000000,2021-04-05 13:18:25.000000,2021-04-05 13:37:43.000000,2021-04-05 13:41:51.000000],...,[2021-04-30 08:21:52.000000,2021-04-30 08:54:38.000000,2021-04-30 08:23:12.000000,2021-04-30 07:44:56.000000,2021-04-30 08:08:37.000000,...,2021-04-14 09:21:46.000000,2021-04-14 09:04:55.000000,2021-04-14 09:24:27.000000,2021-04-14 09:31:27.000000,2021-04-14 09:36:01.000000],[2021-04-14 09:04:52.000000,2021-04-14 09:04:00.000000,2021-04-14 09:52:59.000000,2021-04-14 09:57:34.000000,2021-04-14 09:29:00.000000,...,2021-04-30 23:39:00.000000,2021-04-30 23:20:32.000000,2021-04-30 23:33:00.000000,2021-04-30 23:31:38.000000,2021-04-30 23:05:21.000000]]
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

