Fokko commented on code in PR #8084:
URL: https://github.com/apache/iceberg/pull/8084#discussion_r1271394826
##########
python/pyiceberg/avro/file.py:
##########
@@ -172,8 +165,7 @@ def __enter__(self) -> AvroFile[D]:
Returns:
A generator returning the AvroStructs.
"""
- self.input_stream = self.input_file.open(seekable=False)
- self.decoder = BinaryDecoder(self.input_stream)
+ self.decoder = InMemoryBinaryDecoder(io.BytesIO(self.input_file.open(seekable=False).read()))
Review Comment:
```suggestion
self.decoder = InMemoryBinaryDecoder(io.BytesIO(self.input_file.open().read()))
```
With `seekable=False`:
```
2023-07-23T07:01:03.636 [206 Partial Content] s3.GetObject
warehouse.minio:9000/nyc/taxis/metadata/00002-ccecc6b7-78b9-4b73-abcf-eb7c7bb03646.metadata.json
172.31.0.2 1.177ms ↑ 141 B ↓ 6.9 KiB
2023-07-23T07:01:03.927 [200 OK] s3.HeadObject
127.0.0.1:9000/warehouse/nyc/taxis/metadata/snap-7156108617712533857-1-22b607a2-01b1-4448-abb1-a8eb63c4d60f.avro
172.31.0.1 993µs ↑ 153 B ↓ 0 B
2023-07-23T07:01:03.943 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/nyc/taxis/metadata/snap-7156108617712533857-1-22b607a2-01b1-4448-abb1-a8eb63c4d60f.avro
172.31.0.1 1.246ms ↑ 159 B ↓ 3.7 KiB
2023-07-23T07:01:03.952 [200 OK] s3.HeadObject
127.0.0.1:9000/warehouse/nyc/taxis/metadata/22b607a2-01b1-4448-abb1-a8eb63c4d60f-m0.avro
172.31.0.1 556µs ↑ 153 B ↓ 0 B
2023-07-23T07:01:03.955 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/nyc/taxis/metadata/22b607a2-01b1-4448-abb1-a8eb63c4d60f-m0.avro
172.31.0.1 2.292ms ↑ 159 B ↓ 1.0 MiB
2023-07-23T07:01:03.986 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/nyc/taxis/metadata/22b607a2-01b1-4448-abb1-a8eb63c4d60f-m0.avro
172.31.0.1 3.226ms ↑ 159 B ↓ 951 KiB
```
And `seekable=True`:
```
2023-07-23T07:02:03.314 [206 Partial Content] s3.GetObject
warehouse.minio:9000/nyc/taxis/metadata/00002-ccecc6b7-78b9-4b73-abcf-eb7c7bb03646.metadata.json
172.31.0.2 864µs ↑ 141 B ↓ 6.9 KiB
2023-07-23T07:02:03.599 [200 OK] s3.HeadObject
127.0.0.1:9000/warehouse/nyc/taxis/metadata/snap-7156108617712533857-1-22b607a2-01b1-4448-abb1-a8eb63c4d60f.avro
172.31.0.1 845µs ↑ 153 B ↓ 0 B
2023-07-23T07:02:03.616 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/nyc/taxis/metadata/snap-7156108617712533857-1-22b607a2-01b1-4448-abb1-a8eb63c4d60f.avro
172.31.0.1 780µs ↑ 159 B ↓ 3.7 KiB
2023-07-23T07:02:03.625 [200 OK] s3.HeadObject
127.0.0.1:9000/warehouse/nyc/taxis/metadata/22b607a2-01b1-4448-abb1-a8eb63c4d60f-m0.avro
172.31.0.1 277µs ↑ 153 B ↓ 0 B
2023-07-23T07:02:03.628 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/nyc/taxis/metadata/22b607a2-01b1-4448-abb1-a8eb63c4d60f-m0.avro
172.31.0.1 3.134ms ↑ 159 B ↓ 1.9 MiB
```
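It [looks like](https://github.com/apache/arrow/blob/a2a3b9580d66cf40c5df8358f260fdaf5ccc301f/python/pyarrow/io.pxi#L362-L395)
with `seekable=False`, PyArrow just reads one buffer at a time until it reaches
the end of the stream, which shows up above as two `GetObject` requests for the
manifest (1.0 MiB and 951 KiB). With a seekable file it knows how big the file
is, so it reads everything in a single call (one `GetObject` of 1.9 MiB).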
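
To illustrate the difference, here is a minimal sketch in plain Python. It is not PyArrow's implementation: `CountingReader`, `read_until_eof`, and `read_known_size` are hypothetical names, and each `read()` call stands in for one remote request. The buffer size and payload are arbitrary.

```python
import io


class CountingReader:
    """Wraps an in-memory stream and counts read() calls (a stand-in for remote requests)."""

    def __init__(self, data: bytes) -> None:
        self._stream = io.BytesIO(data)
        self.size = len(data)
        self.calls = 0

    def read(self, nbytes: int) -> bytes:
        self.calls += 1
        return self._stream.read(nbytes)


def read_until_eof(reader: CountingReader, bufsize: int) -> bytes:
    """Size unknown: request bufsize bytes repeatedly until an empty read signals EOF."""
    chunks = []
    while True:
        chunk = reader.read(bufsize)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)


def read_known_size(reader: CountingReader) -> bytes:
    """Size known up front: request the whole payload in one call."""
    return reader.read(reader.size)


payload = bytes(2500)
chunked = CountingReader(payload)
single = CountingReader(payload)

assert read_until_eof(chunked, bufsize=1000) == payload
assert read_known_size(single) == payload
print(chunked.calls, single.calls)  # 4 1
```

The chunked reader needs one extra call to discover the end of the stream, while the sized reader gets everything at once. Since we read the whole file into memory anyway, the single-call path seems preferable here.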
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.