rustyconover commented on PR #8084:
URL: https://github.com/apache/iceberg/pull/8084#issuecomment-1637312031
Some benchmark results
```
# Main branch
$ hyperfine "python3 measure.py"
Benchmark 1: python3 measure.py
Time (mean ± σ): 8.340 s ± 0.113 s [User: 8.598 s, System: 0.563
s]
Range (min … max): 8.161 s … 8.484 s 10 runs
# With this PR
$ hyperfine "python3 measure.py"
Benchmark 1: python3 measure.py
Time (mean ± σ): 4.565 s ± 0.032 s [User: 4.854 s, System: 0.536
s]
Range (min … max): 4.520 s … 4.607 s 10 runs
```
Seems to be about a 46% decrease in run time to parse my test Avro file.
This is still pure-python.
After this PR applied here is the current profiler output for my test file.
The profiler does slow things down.
```
9888710 function calls (9858577 primitive calls) in 10.129 seconds
Ordered by: internal time
List reduced from 256 to 100 due to restriction <100>
ncalls tottime percall cumtime percall filename:lineno(function)
8125697 6.257 0.000 6.257 0.000
/Users/rusty/Development/iceberg/python/pyiceberg/avro/decoder.py:287(read_int)
36357 1.672 0.000 6.652 0.000
/Users/rusty/Development/iceberg/python/pyiceberg/avro/decoder.py:77(read_int_int_dict)
24238 0.744 0.000 1.645 0.000
/Users/rusty/Development/iceberg/python/pyiceberg/avro/decoder.py:303(read_int_bytes_dict)
42934/14312 0.268 0.000 9.851 0.001
/Users/rusty/Development/iceberg/python/pyiceberg/avro/reader.py:310(read)
343467 0.148 0.000 0.290 0.000
/Users/rusty/Development/iceberg/python/pyiceberg/typedef.py:155(__setitem__)
214665 0.147 0.000 8.957 0.000
/Users/rusty/Development/iceberg/python/pyiceberg/avro/reader.py:250(read)
257599 0.143 0.000 0.151 0.000
/Users/rusty/Development/iceberg/python/pyiceberg/manifest.py:204(__setattr__)
1 0.101 0.101 10.129 10.129 <string>:1(<module>)
551 0.093 0.000 0.093 0.000 {built-in method
zlib.decompress}
85867 0.092 0.000 8.468 0.000
/Users/rusty/Development/iceberg/python/pyiceberg/avro/reader.py:392(read)
42938 0.066 0.000 0.115 0.000
/Users/rusty/Development/iceberg/python/pyiceberg/typedef.py:137(__init__)
57244 0.049 0.000 0.083 0.000
/Users/rusty/Development/iceberg/python/pyiceberg/avro/decoder.py:315(read_bytes)
14312 0.046 0.000 0.083 0.000
/Users/rusty/Development/iceberg/python/pyiceberg/manifest.py:245(__init__)
57260 0.042 0.000 0.138 0.000
/Users/rusty/Development/iceberg/python/pyiceberg/avro/decoder.py:133(read_utf8)
114488 0.041 0.000 0.185 0.000
/Users/rusty/Development/iceberg/python/pyiceberg/avro/reader.py:132(read)
57260 0.022 0.000 0.160 0.000
/Users/rusty/Development/iceberg/python/pyiceberg/avro/reader.py:188(read)
43007 0.019 0.000 0.026 0.000 {built-in method builtins.hash}
14311 0.017 0.000 0.051 0.000
/Users/rusty/Development/iceberg/python/pyiceberg/avro/reader.py:353(read)
14312 0.015 0.000 0.062 0.000
/Users/rusty/Development/iceberg/python/pyiceberg/manifest.py:211(__init__)
14311 0.015 0.000 9.868 0.001
/Users/rusty/Development/iceberg/python/pyiceberg/avro/file.py:123(__next__)
12119 0.014 0.000 0.022 0.000
/Users/rusty/Development/iceberg/python/pyiceberg/avro/decoder.py:72(read_ints)
```
I have a single table that has about uses 100 Avro files of this complexity,
so more improvements are still desired.
A few things about this test Avro file, contains references to 14,000
MetadataFiles and the table has >200 columns.
There are few ideas to pursue:
1. Implement Cython for parsing integers. Replace `decoder::read_int` with
this the Cython implementation for the InMemoryBinaryDecoder.
2. Defer parsing maps and lists until they are accessed, keep their byte
representation in memory and when the actual values are requested deserialize
them. This would be helpful for common cases where not all min/max values are
required. There is currently a problem with skipping maps and lists, the
Iceberg serialization code does not use the proper Avro encoder it uses
[directBinaryEncoder](https://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/io/DirectBinaryEncoder.html)
rather than
[blockingBinaryEncoder](https://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/io/BlockingBinaryEncoder.html)
which buffers data and writes byte sizes after the Avro block size. See
[IcebergEncoder.java](https://github.com/apache/iceberg/blame/cdbaf5754e0cd2ee7439a1d94cb4cfc92fc29731/core/src/main/java/org/apache/iceberg/data/avro/IcebergEncoder.java#L101),
[KeyMetadataEncoder.java](https://github.com/apache/iceberg/blob/cdbaf5754e0cd2ee7439a1d94cb4cfc92fc29731/core/src/main/java/org/apache/iceberg/encryption/KeyMetadataEncoder.java#L93)
and
[AvroEncoderUtil.java](https://github.com/apache/iceberg/blob/cdbaf5754e0cd2ee7439a1d94cb4cfc92fc29731/core/src/main/java/org/apache/iceberg/avro/AvroEncoderUtil.java#L57)
for examples where the Avro encoder is specified.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]