Fokko opened a new pull request, #6501: URL: https://github.com/apache/iceberg/pull/6501
I noticed that PyArrow is making two calls for each Avro file, while one should be sufficient (150 KB):
```
2022-12-27T08:45:32.822 [206 Partial Content] s3.GetObject minio:9000/warehouse/wh/nyc/taxis/metadata/00009-3dd7ff4a-a81f-4780-bee1-de9335490c20.metadata.json 172.18.0.3 1.142ms ↑ 169 B ↓ 14 KiB
2022-12-27T08:45:32.913 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/snap-6682082212753545990-1-9f900c62-59ca-4ea1-ae75-e373d113b036.avro 172.18.0.1 867µs ↑ 153 B ↓ 412 B
2022-12-27T08:45:32.925 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/snap-6682082212753545990-1-9f900c62-59ca-4ea1-ae75-e373d113b036.avro 172.18.0.1 1.626ms ↑ 159 B ↓ 4.6 KiB
2022-12-27T08:45:32.973 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro 172.18.0.1 1.216ms ↑ 153 B ↓ 413 B
2022-12-27T08:45:32.989 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro 172.18.0.1 3.719ms ↑ 159 B ↓ 8.5 KiB
2022-12-27T08:45:33.020 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro 172.18.0.1 3.904ms ↑ 159 B ↓ 8.5 KiB
2022-12-27T08:45:33.042 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro 172.18.0.1 1.903ms ↑ 159 B ↓ 1.7 KiB
2022-12-27T08:45:33.104 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro 172.18.0.1 1.232ms ↑ 153 B ↓ 413 B
2022-12-27T08:45:33.113 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro 172.18.0.1 683µs ↑ 159 B ↓ 8.5 KiB
2022-12-27T08:45:33.120 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro 172.18.0.1 975µs ↑ 159 B ↓ 7.0 KiB
2022-12-27T08:45:33.141 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/3fb110ad-0b08-40d7-8e02-ba86192cfc8d-m0.avro 172.18.0.1 383µs ↑ 153 B ↓ 413 B
2022-12-27T08:45:33.144 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/3fb110ad-0b08-40d7-8e02-ba86192cfc8d-m0.avro 172.18.0.1 774µs ↑ 159 B ↓ 8.5 KiB
2022-12-27T08:45:33.148 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/3fb110ad-0b08-40d7-8e02-ba86192cfc8d-m0.avro 172.18.0.1 833µs ↑ 159 B ↓ 7.4 KiB
2022-12-27T08:45:33.170 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/6300efd8-fa3a-40ea-bc77-8c128e04d83d-m0.avro 172.18.0.1 432µs ↑ 153 B ↓ 413 B
2022-12-27T08:45:33.173 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/6300efd8-fa3a-40ea-bc77-8c128e04d83d-m0.avro 172.18.0.1 1.208ms ↑ 159 B ↓ 8.5 KiB
2022-12-27T08:45:33.178 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/6300efd8-fa3a-40ea-bc77-8c128e04d83d-m0.avro 172.18.0.1 814µs ↑ 159 B ↓ 8.2 KiB
2022-12-27T08:45:33.202 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/27187379-514d-42f4-aada-ade2b239b41a-m0.avro 172.18.0.1 427µs ↑ 153 B ↓ 413 B
2022-12-27T08:45:33.205 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/27187379-514d-42f4-aada-ade2b239b41a-m0.avro 172.18.0.1 671µs ↑ 159 B ↓ 8.5 KiB
2022-12-27T08:45:33.209 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/27187379-514d-42f4-aada-ade2b239b41a-m0.avro 172.18.0.1 502µs ↑ 159 B ↓ 7.9 KiB
2022-12-27T08:45:33.233 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/29b8cfd9-38a8-48ac-aabd-ede8b4118d3d-m0.avro 172.18.0.1 616µs ↑ 153 B ↓ 413 B
2022-12-27T08:45:33.236 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/29b8cfd9-38a8-48ac-aabd-ede8b4118d3d-m0.avro 172.18.0.1 955µs ↑ 159 B ↓ 8.5 KiB
2022-12-27T08:45:33.240 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/29b8cfd9-38a8-48ac-aabd-ede8b4118d3d-m0.avro 172.18.0.1 934µs ↑ 159 B ↓ 7.4 KiB
2022-12-27T08:45:33.262 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/37f92724-8248-43c2-b8c7-cd40e2da21cf-m0.avro 172.18.0.1 308µs ↑ 153 B ↓ 413 B
2022-12-27T08:45:33.265 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/37f92724-8248-43c2-b8c7-cd40e2da21cf-m0.avro 172.18.0.1 641µs ↑ 159 B ↓ 8.5 KiB
2022-12-27T08:45:33.269 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/37f92724-8248-43c2-b8c7-cd40e2da21cf-m0.avro 172.18.0.1 831µs ↑ 159 B ↓ 7.6 KiB
2022-12-27T08:45:33.295 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/02096f9b-1056-43ed-85af-871a47ad87f8-m0.avro 172.18.0.1 625µs ↑ 153 B ↓ 413 B
2022-12-27T08:45:33.298 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/02096f9b-1056-43ed-85af-871a47ad87f8-m0.avro 172.18.0.1 828µs ↑ 159 B ↓ 8.5 KiB
2022-12-27T08:45:33.302 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/02096f9b-1056-43ed-85af-871a47ad87f8-m0.avro 172.18.0.1 897µs ↑ 159 B ↓ 7.8 KiB
2022-12-27T08:45:33.324 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/5a44ffb0-64ac-4b4d-9896-a53c52d6c30b-m0.avro 172.18.0.1 474µs ↑ 153 B ↓ 413 B
2022-12-27T08:45:33.326 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/5a44ffb0-64ac-4b4d-9896-a53c52d6c30b-m0.avro 172.18.0.1 644µs ↑ 159 B ↓ 8.5 KiB
2022-12-27T08:45:33.330 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/5a44ffb0-64ac-4b4d-9896-a53c52d6c30b-m0.avro 172.18.0.1 904µs ↑ 159 B ↓ 7.1 KiB
```
Instead, I've changed it to use PyArrow's `open_input_stream`, which is meant for sequential reading, whereas `open_input_file` is meant for random access.
After this change we can see that the file is requested just once:
```
2022-12-28T13:31:46.815 [206 Partial Content] s3.GetObject minio:9000/warehouse/wh/nyc/taxis/metadata/00009-3dd7ff4a-a81f-4780-bee1-de9335490c20.metadata.json 172.18.0.3 983µs ↑ 169 B ↓ 14 KiB
2022-12-28T13:31:46.912 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/snap-6682082212753545990-1-9f900c62-59ca-4ea1-ae75-e373d113b036.avro 172.18.0.1 3.509ms ↑ 153 B ↓ 412 B
2022-12-28T13:31:46.923 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/snap-6682082212753545990-1-9f900c62-59ca-4ea1-ae75-e373d113b036.avro 172.18.0.1 2.779ms ↑ 159 B ↓ 4.6 KiB
2022-12-28T13:31:46.967 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro 172.18.0.1 2.011ms ↑ 153 B ↓ 413 B
2022-12-28T13:31:46.988 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro 172.18.0.1 8.244ms ↑ 159 B ↓ 18 KiB
2022-12-28T13:31:47.107 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro 172.18.0.1 404µs ↑ 153 B ↓ 413 B
2022-12-28T13:31:47.110 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro 172.18.0.1 830µs ↑ 159 B ↓ 15 KiB
2022-12-28T13:31:47.132 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/3fb110ad-0b08-40d7-8e02-ba86192cfc8d-m0.avro 172.18.0.1 286µs ↑ 153 B ↓ 413 B
2022-12-28T13:31:47.135 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/3fb110ad-0b08-40d7-8e02-ba86192cfc8d-m0.avro 172.18.0.1 760µs ↑ 159 B ↓ 15 KiB
2022-12-28T13:31:47.157 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/6300efd8-fa3a-40ea-bc77-8c128e04d83d-m0.avro 172.18.0.1 303µs ↑ 153 B ↓ 413 B
2022-12-28T13:31:47.159 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/6300efd8-fa3a-40ea-bc77-8c128e04d83d-m0.avro 172.18.0.1 821µs ↑ 159 B ↓ 16 KiB
2022-12-28T13:31:47.187 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/27187379-514d-42f4-aada-ade2b239b41a-m0.avro 172.18.0.1 323µs ↑ 153 B ↓ 413 B
2022-12-28T13:31:47.191 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/27187379-514d-42f4-aada-ade2b239b41a-m0.avro 172.18.0.1 863µs ↑ 159 B ↓ 16 KiB
2022-12-28T13:31:47.213 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/29b8cfd9-38a8-48ac-aabd-ede8b4118d3d-m0.avro 172.18.0.1 602µs ↑ 153 B ↓ 413 B
2022-12-28T13:31:47.216 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/29b8cfd9-38a8-48ac-aabd-ede8b4118d3d-m0.avro 172.18.0.1 839µs ↑ 159 B ↓ 15 KiB
2022-12-28T13:31:47.238 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/37f92724-8248-43c2-b8c7-cd40e2da21cf-m0.avro 172.18.0.1 293µs ↑ 153 B ↓ 413 B
2022-12-28T13:31:47.242 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/37f92724-8248-43c2-b8c7-cd40e2da21cf-m0.avro 172.18.0.1 898µs ↑ 159 B ↓ 16 KiB
2022-12-28T13:31:47.267 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/02096f9b-1056-43ed-85af-871a47ad87f8-m0.avro 172.18.0.1 316µs ↑ 153 B ↓ 413 B
2022-12-28T13:31:47.270 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/02096f9b-1056-43ed-85af-871a47ad87f8-m0.avro 172.18.0.1 655µs ↑ 159 B ↓ 16 KiB
2022-12-28T13:31:47.295 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/5a44ffb0-64ac-4b4d-9896-a53c52d6c30b-m0.avro 172.18.0.1 315µs ↑ 153 B ↓ 413 B
2022-12-28T13:31:47.298 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/5a44ffb0-64ac-4b4d-9896-a53c52d6c30b-m0.avro 172.18.0.1 999µs ↑ 159 B ↓ 15 KiB
```
This also includes the manifest.
I think this makes more sense since we always want to read the whole file. The only thing is that the 1 MB is a bit arbitrary. I've verified that s3fs also works as expected:
```
2022-12-28T13:52:52.647 [206 Partial Content] s3.GetObject minio:9000/warehouse/wh/nyc/taxis/metadata/00009-3dd7ff4a-a81f-4780-bee1-de9335490c20.metadata.json 172.18.0.3 991µs ↑ 169 B ↓ 14 KiB
2022-12-28T13:53:03.335 [206 Partial Content] s3.GetObject minio:9000/warehouse/wh/nyc/taxis/metadata/00009-3dd7ff4a-a81f-4780-bee1-de9335490c20.metadata.json 172.18.0.3 954µs ↑ 169 B ↓ 14 KiB
2022-12-28T13:53:03.845 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/snap-6682082212753545990-1-9f900c62-59ca-4ea1-ae75-e373d113b036.avro 172.18.0.1 1.357ms ↑ 138 B ↓ 412 B
2022-12-28T13:53:03.857 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/snap-6682082212753545990-1-9f900c62-59ca-4ea1-ae75-e373d113b036.avro 172.18.0.1 1.355ms ↑ 153 B ↓ 4.6 KiB
2022-12-28T13:53:03.864 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro 172.18.0.1 422µs ↑ 138 B ↓ 413 B
2022-12-28T13:53:03.868 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro 172.18.0.1 707µs ↑ 153 B ↓ 18 KiB
2022-12-28T13:53:03.897 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro 172.18.0.1 333µs ↑ 138 B ↓ 413 B
2022-12-28T13:53:03.901 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro 172.18.0.1 793µs ↑ 153 B ↓ 15 KiB
2022-12-28T13:53:03.921 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/3fb110ad-0b08-40d7-8e02-ba86192cfc8d-m0.avro 172.18.0.1 371µs ↑ 138 B ↓ 413 B
2022-12-28T13:53:03.924 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/3fb110ad-0b08-40d7-8e02-ba86192cfc8d-m0.avro 172.18.0.1 794µs ↑ 153 B ↓ 15 KiB
2022-12-28T13:53:03.945 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/6300efd8-fa3a-40ea-bc77-8c128e04d83d-m0.avro 172.18.0.1 332µs ↑ 138 B ↓ 413 B
2022-12-28T13:53:03.948 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/6300efd8-fa3a-40ea-bc77-8c128e04d83d-m0.avro 172.18.0.1 570µs ↑ 153 B ↓ 16 KiB
2022-12-28T13:53:03.971 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/27187379-514d-42f4-aada-ade2b239b41a-m0.avro 172.18.0.1 496µs ↑ 138 B ↓ 413 B
2022-12-28T13:53:03.974 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/27187379-514d-42f4-aada-ade2b239b41a-m0.avro 172.18.0.1 1.266ms ↑ 153 B ↓ 16 KiB
2022-12-28T13:53:03.998 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/29b8cfd9-38a8-48ac-aabd-ede8b4118d3d-m0.avro 172.18.0.1 389µs ↑ 138 B ↓ 413 B
2022-12-28T13:53:04.001 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/29b8cfd9-38a8-48ac-aabd-ede8b4118d3d-m0.avro 172.18.0.1 717µs ↑ 153 B ↓ 15 KiB
2022-12-28T13:53:04.023 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/37f92724-8248-43c2-b8c7-cd40e2da21cf-m0.avro 172.18.0.1 306µs ↑ 138 B ↓ 413 B
2022-12-28T13:53:04.026 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/37f92724-8248-43c2-b8c7-cd40e2da21cf-m0.avro 172.18.0.1 920µs ↑ 153 B ↓ 16 KiB
2022-12-28T13:53:04.049 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/02096f9b-1056-43ed-85af-871a47ad87f8-m0.avro 172.18.0.1 397µs ↑ 138 B ↓ 413 B
2022-12-28T13:53:04.070 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/02096f9b-1056-43ed-85af-871a47ad87f8-m0.avro 172.18.0.1 708µs ↑ 153 B ↓ 16 KiB
2022-12-28T13:53:04.092 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/5a44ffb0-64ac-4b4d-9896-a53c52d6c30b-m0.avro 172.18.0.1 289µs ↑ 138 B ↓ 413 B
2022-12-28T13:53:04.094 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/5a44ffb0-64ac-4b4d-9896-a53c52d6c30b-m0.avro 172.18.0.1 733µs ↑ 153 B ↓ 15 KiB
```