Fokko opened a new pull request, #6501:
URL: https://github.com/apache/iceberg/pull/6501

   I noticed that PyArrow makes two calls to the Avro file, while one should 
be sufficient (150kb):
   
   ```
   2022-12-27T08:45:32.822 [206 Partial Content] s3.GetObject 
minio:9000/warehouse/wh/nyc/taxis/metadata/00009-3dd7ff4a-a81f-4780-bee1-de9335490c20.metadata.json
 172.18.0.3        1.142ms      ↑ 169 B ↓ 14 KiB
   2022-12-27T08:45:32.913 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/snap-6682082212753545990-1-9f900c62-59ca-4ea1-ae75-e373d113b036.avro
 172.18.0.1        867µs       ↑ 153 B ↓ 412 B
   2022-12-27T08:45:32.925 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/snap-6682082212753545990-1-9f900c62-59ca-4ea1-ae75-e373d113b036.avro
 172.18.0.1        1.626ms      ↑ 159 B ↓ 4.6 KiB
   2022-12-27T08:45:32.973 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro
 172.18.0.1        1.216ms      ↑ 153 B ↓ 413 B
   2022-12-27T08:45:32.989 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro
 172.18.0.1        3.719ms      ↑ 159 B ↓ 8.5 KiB
   2022-12-27T08:45:33.020 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro
 172.18.0.1        3.904ms      ↑ 159 B ↓ 8.5 KiB
   2022-12-27T08:45:33.042 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro
 172.18.0.1        1.903ms      ↑ 159 B ↓ 1.7 KiB
   2022-12-27T08:45:33.104 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro
 172.18.0.1        1.232ms      ↑ 153 B ↓ 413 B
   2022-12-27T08:45:33.113 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro
 172.18.0.1        683µs       ↑ 159 B ↓ 8.5 KiB
   2022-12-27T08:45:33.120 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro
 172.18.0.1        975µs       ↑ 159 B ↓ 7.0 KiB
   2022-12-27T08:45:33.141 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/3fb110ad-0b08-40d7-8e02-ba86192cfc8d-m0.avro
 172.18.0.1        383µs       ↑ 153 B ↓ 413 B
   2022-12-27T08:45:33.144 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/3fb110ad-0b08-40d7-8e02-ba86192cfc8d-m0.avro
 172.18.0.1        774µs       ↑ 159 B ↓ 8.5 KiB
   2022-12-27T08:45:33.148 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/3fb110ad-0b08-40d7-8e02-ba86192cfc8d-m0.avro
 172.18.0.1        833µs       ↑ 159 B ↓ 7.4 KiB
   2022-12-27T08:45:33.170 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/6300efd8-fa3a-40ea-bc77-8c128e04d83d-m0.avro
 172.18.0.1        432µs       ↑ 153 B ↓ 413 B
   2022-12-27T08:45:33.173 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/6300efd8-fa3a-40ea-bc77-8c128e04d83d-m0.avro
 172.18.0.1        1.208ms      ↑ 159 B ↓ 8.5 KiB
   2022-12-27T08:45:33.178 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/6300efd8-fa3a-40ea-bc77-8c128e04d83d-m0.avro
 172.18.0.1        814µs       ↑ 159 B ↓ 8.2 KiB
   2022-12-27T08:45:33.202 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/27187379-514d-42f4-aada-ade2b239b41a-m0.avro
 172.18.0.1        427µs       ↑ 153 B ↓ 413 B
   2022-12-27T08:45:33.205 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/27187379-514d-42f4-aada-ade2b239b41a-m0.avro
 172.18.0.1        671µs       ↑ 159 B ↓ 8.5 KiB
   2022-12-27T08:45:33.209 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/27187379-514d-42f4-aada-ade2b239b41a-m0.avro
 172.18.0.1        502µs       ↑ 159 B ↓ 7.9 KiB
   2022-12-27T08:45:33.233 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/29b8cfd9-38a8-48ac-aabd-ede8b4118d3d-m0.avro
 172.18.0.1        616µs       ↑ 153 B ↓ 413 B
   2022-12-27T08:45:33.236 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/29b8cfd9-38a8-48ac-aabd-ede8b4118d3d-m0.avro
 172.18.0.1        955µs       ↑ 159 B ↓ 8.5 KiB
   2022-12-27T08:45:33.240 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/29b8cfd9-38a8-48ac-aabd-ede8b4118d3d-m0.avro
 172.18.0.1        934µs       ↑ 159 B ↓ 7.4 KiB
   2022-12-27T08:45:33.262 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/37f92724-8248-43c2-b8c7-cd40e2da21cf-m0.avro
 172.18.0.1        308µs       ↑ 153 B ↓ 413 B
   2022-12-27T08:45:33.265 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/37f92724-8248-43c2-b8c7-cd40e2da21cf-m0.avro
 172.18.0.1        641µs       ↑ 159 B ↓ 8.5 KiB
   2022-12-27T08:45:33.269 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/37f92724-8248-43c2-b8c7-cd40e2da21cf-m0.avro
 172.18.0.1        831µs       ↑ 159 B ↓ 7.6 KiB
   2022-12-27T08:45:33.295 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/02096f9b-1056-43ed-85af-871a47ad87f8-m0.avro
 172.18.0.1        625µs       ↑ 153 B ↓ 413 B
   2022-12-27T08:45:33.298 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/02096f9b-1056-43ed-85af-871a47ad87f8-m0.avro
 172.18.0.1        828µs       ↑ 159 B ↓ 8.5 KiB
   2022-12-27T08:45:33.302 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/02096f9b-1056-43ed-85af-871a47ad87f8-m0.avro
 172.18.0.1        897µs       ↑ 159 B ↓ 7.8 KiB
   2022-12-27T08:45:33.324 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/5a44ffb0-64ac-4b4d-9896-a53c52d6c30b-m0.avro
 172.18.0.1        474µs       ↑ 153 B ↓ 413 B
   2022-12-27T08:45:33.326 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/5a44ffb0-64ac-4b4d-9896-a53c52d6c30b-m0.avro
 172.18.0.1        644µs       ↑ 159 B ↓ 8.5 KiB
   2022-12-27T08:45:33.330 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/5a44ffb0-64ac-4b4d-9896-a53c52d6c30b-m0.avro
 172.18.0.1        904µs       ↑ 159 B ↓ 7.1 KiB
   ```
   
   Instead, I've changed it to use PyArrow's `open_input_stream`, which is 
meant for sequential reading, whereas `open_input_file` is meant for random 
access.
   
   After this change we can see that the file is requested just once:
   ```
   2022-12-28T13:31:46.815 [206 Partial Content] s3.GetObject 
minio:9000/warehouse/wh/nyc/taxis/metadata/00009-3dd7ff4a-a81f-4780-bee1-de9335490c20.metadata.json
 172.18.0.3        983µs       ↑ 169 B ↓ 14 KiB
   2022-12-28T13:31:46.912 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/snap-6682082212753545990-1-9f900c62-59ca-4ea1-ae75-e373d113b036.avro
 172.18.0.1        3.509ms      ↑ 153 B ↓ 412 B
   2022-12-28T13:31:46.923 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/snap-6682082212753545990-1-9f900c62-59ca-4ea1-ae75-e373d113b036.avro
 172.18.0.1        2.779ms      ↑ 159 B ↓ 4.6 KiB
   2022-12-28T13:31:46.967 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro
 172.18.0.1        2.011ms      ↑ 153 B ↓ 413 B
   2022-12-28T13:31:46.988 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro
 172.18.0.1        8.244ms      ↑ 159 B ↓ 18 KiB
   2022-12-28T13:31:47.107 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro
 172.18.0.1        404µs       ↑ 153 B ↓ 413 B
   2022-12-28T13:31:47.110 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro
 172.18.0.1        830µs       ↑ 159 B ↓ 15 KiB
   2022-12-28T13:31:47.132 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/3fb110ad-0b08-40d7-8e02-ba86192cfc8d-m0.avro
 172.18.0.1        286µs       ↑ 153 B ↓ 413 B
   2022-12-28T13:31:47.135 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/3fb110ad-0b08-40d7-8e02-ba86192cfc8d-m0.avro
 172.18.0.1        760µs       ↑ 159 B ↓ 15 KiB
   2022-12-28T13:31:47.157 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/6300efd8-fa3a-40ea-bc77-8c128e04d83d-m0.avro
 172.18.0.1        303µs       ↑ 153 B ↓ 413 B
   2022-12-28T13:31:47.159 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/6300efd8-fa3a-40ea-bc77-8c128e04d83d-m0.avro
 172.18.0.1        821µs       ↑ 159 B ↓ 16 KiB
   2022-12-28T13:31:47.187 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/27187379-514d-42f4-aada-ade2b239b41a-m0.avro
 172.18.0.1        323µs       ↑ 153 B ↓ 413 B
   2022-12-28T13:31:47.191 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/27187379-514d-42f4-aada-ade2b239b41a-m0.avro
 172.18.0.1        863µs       ↑ 159 B ↓ 16 KiB
   2022-12-28T13:31:47.213 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/29b8cfd9-38a8-48ac-aabd-ede8b4118d3d-m0.avro
 172.18.0.1        602µs       ↑ 153 B ↓ 413 B
   2022-12-28T13:31:47.216 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/29b8cfd9-38a8-48ac-aabd-ede8b4118d3d-m0.avro
 172.18.0.1        839µs       ↑ 159 B ↓ 15 KiB
   2022-12-28T13:31:47.238 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/37f92724-8248-43c2-b8c7-cd40e2da21cf-m0.avro
 172.18.0.1        293µs       ↑ 153 B ↓ 413 B
   2022-12-28T13:31:47.242 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/37f92724-8248-43c2-b8c7-cd40e2da21cf-m0.avro
 172.18.0.1        898µs       ↑ 159 B ↓ 16 KiB
   2022-12-28T13:31:47.267 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/02096f9b-1056-43ed-85af-871a47ad87f8-m0.avro
 172.18.0.1        316µs       ↑ 153 B ↓ 413 B
   2022-12-28T13:31:47.270 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/02096f9b-1056-43ed-85af-871a47ad87f8-m0.avro
 172.18.0.1        655µs       ↑ 159 B ↓ 16 KiB
   2022-12-28T13:31:47.295 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/5a44ffb0-64ac-4b4d-9896-a53c52d6c30b-m0.avro
 172.18.0.1        315µs       ↑ 153 B ↓ 413 B
   2022-12-28T13:31:47.298 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/5a44ffb0-64ac-4b4d-9896-a53c52d6c30b-m0.avro
 172.18.0.1        999µs       ↑ 159 B ↓ 15 KiB
   ```
   
   This also includes the manifest. I think this makes more sense since we 
always want to read the whole file. The only caveat is that the 1 MB value is 
a bit arbitrary.
   
   I've verified that s3fs also works as expected:
   ```
   2022-12-28T13:52:52.647 [206 Partial Content] s3.GetObject 
minio:9000/warehouse/wh/nyc/taxis/metadata/00009-3dd7ff4a-a81f-4780-bee1-de9335490c20.metadata.json
 172.18.0.3        991µs       ↑ 169 B ↓ 14 KiB
   2022-12-28T13:53:03.335 [206 Partial Content] s3.GetObject 
minio:9000/warehouse/wh/nyc/taxis/metadata/00009-3dd7ff4a-a81f-4780-bee1-de9335490c20.metadata.json
 172.18.0.3        954µs       ↑ 169 B ↓ 14 KiB
   2022-12-28T13:53:03.845 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/snap-6682082212753545990-1-9f900c62-59ca-4ea1-ae75-e373d113b036.avro
 172.18.0.1        1.357ms      ↑ 138 B ↓ 412 B
   2022-12-28T13:53:03.857 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/snap-6682082212753545990-1-9f900c62-59ca-4ea1-ae75-e373d113b036.avro
 172.18.0.1        1.355ms      ↑ 153 B ↓ 4.6 KiB
   2022-12-28T13:53:03.864 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro
 172.18.0.1        422µs       ↑ 138 B ↓ 413 B
   2022-12-28T13:53:03.868 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro
 172.18.0.1        707µs       ↑ 153 B ↓ 18 KiB
   2022-12-28T13:53:03.897 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro
 172.18.0.1        333µs       ↑ 138 B ↓ 413 B
   2022-12-28T13:53:03.901 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro
 172.18.0.1        793µs       ↑ 153 B ↓ 15 KiB
   2022-12-28T13:53:03.921 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/3fb110ad-0b08-40d7-8e02-ba86192cfc8d-m0.avro
 172.18.0.1        371µs       ↑ 138 B ↓ 413 B
   2022-12-28T13:53:03.924 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/3fb110ad-0b08-40d7-8e02-ba86192cfc8d-m0.avro
 172.18.0.1        794µs       ↑ 153 B ↓ 15 KiB
   2022-12-28T13:53:03.945 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/6300efd8-fa3a-40ea-bc77-8c128e04d83d-m0.avro
 172.18.0.1        332µs       ↑ 138 B ↓ 413 B
   2022-12-28T13:53:03.948 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/6300efd8-fa3a-40ea-bc77-8c128e04d83d-m0.avro
 172.18.0.1        570µs       ↑ 153 B ↓ 16 KiB
   2022-12-28T13:53:03.971 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/27187379-514d-42f4-aada-ade2b239b41a-m0.avro
 172.18.0.1        496µs       ↑ 138 B ↓ 413 B
   2022-12-28T13:53:03.974 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/27187379-514d-42f4-aada-ade2b239b41a-m0.avro
 172.18.0.1        1.266ms      ↑ 153 B ↓ 16 KiB
   2022-12-28T13:53:03.998 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/29b8cfd9-38a8-48ac-aabd-ede8b4118d3d-m0.avro
 172.18.0.1        389µs       ↑ 138 B ↓ 413 B
   2022-12-28T13:53:04.001 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/29b8cfd9-38a8-48ac-aabd-ede8b4118d3d-m0.avro
 172.18.0.1        717µs       ↑ 153 B ↓ 15 KiB
   2022-12-28T13:53:04.023 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/37f92724-8248-43c2-b8c7-cd40e2da21cf-m0.avro
 172.18.0.1        306µs       ↑ 138 B ↓ 413 B
   2022-12-28T13:53:04.026 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/37f92724-8248-43c2-b8c7-cd40e2da21cf-m0.avro
 172.18.0.1        920µs       ↑ 153 B ↓ 16 KiB
   2022-12-28T13:53:04.049 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/02096f9b-1056-43ed-85af-871a47ad87f8-m0.avro
 172.18.0.1        397µs       ↑ 138 B ↓ 413 B
   2022-12-28T13:53:04.070 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/02096f9b-1056-43ed-85af-871a47ad87f8-m0.avro
 172.18.0.1        708µs       ↑ 153 B ↓ 16 KiB
   2022-12-28T13:53:04.092 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/5a44ffb0-64ac-4b4d-9896-a53c52d6c30b-m0.avro
 172.18.0.1        289µs       ↑ 138 B ↓ 413 B
   2022-12-28T13:53:04.094 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/5a44ffb0-64ac-4b4d-9896-a53c52d6c30b-m0.avro
 172.18.0.1        733µs       ↑ 153 B ↓ 15 KiB
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

