[GitHub] [iceberg] rustyconover commented on pull request #8084: Python: Improve AVRO reading speed

via GitHub Sun, 16 Jul 2023 20:03:21 -0700


rustyconover commented on PR #8084:
URL: https://github.com/apache/iceberg/pull/8084#issuecomment-1637312031


   Some benchmark results
   
   ```
   # Main branch
   $ hyperfine "python3 measure.py"
   Benchmark 1: python3 measure.py
     Time (mean ± σ):      8.340 s ±  0.113 s    [User: 8.598 s, System: 0.563 
s]
     Range (min … max):    8.161 s …  8.484 s    10 runs
    
   # With this PR
   $ hyperfine "python3 measure.py"
   Benchmark 1: python3 measure.py
     Time (mean ± σ):      4.565 s ±  0.032 s    [User: 4.854 s, System: 0.536 
s]
     Range (min … max):    4.520 s …  4.607 s    10 runs
   ```
   
   Seems to be about a 46% decrease in run time to parse my test Avro file.
   
   This is still pure-python.
   
   After this PR applied here is the current profiler output for my test file.  
The profiler does slow things down.
   
   ```
            9888710 function calls (9858577 primitive calls) in 10.129 seconds
   
      Ordered by: internal time
      List reduced from 256 to 100 due to restriction <100>
   
      ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     8125697    6.257    0.000    6.257    0.000 
/Users/rusty/Development/iceberg/python/pyiceberg/avro/decoder.py:287(read_int)
       36357    1.672    0.000    6.652    0.000 
/Users/rusty/Development/iceberg/python/pyiceberg/avro/decoder.py:77(read_int_int_dict)
       24238    0.744    0.000    1.645    0.000 
/Users/rusty/Development/iceberg/python/pyiceberg/avro/decoder.py:303(read_int_bytes_dict)
   42934/14312    0.268    0.000    9.851    0.001 
/Users/rusty/Development/iceberg/python/pyiceberg/avro/reader.py:310(read)
      343467    0.148    0.000    0.290    0.000 
/Users/rusty/Development/iceberg/python/pyiceberg/typedef.py:155(__setitem__)
      214665    0.147    0.000    8.957    0.000 
/Users/rusty/Development/iceberg/python/pyiceberg/avro/reader.py:250(read)
      257599    0.143    0.000    0.151    0.000 
/Users/rusty/Development/iceberg/python/pyiceberg/manifest.py:204(__setattr__)
           1    0.101    0.101   10.129   10.129 <string>:1(<module>)
         551    0.093    0.000    0.093    0.000 {built-in method 
zlib.decompress}
       85867    0.092    0.000    8.468    0.000 
/Users/rusty/Development/iceberg/python/pyiceberg/avro/reader.py:392(read)
       42938    0.066    0.000    0.115    0.000 
/Users/rusty/Development/iceberg/python/pyiceberg/typedef.py:137(__init__)
       57244    0.049    0.000    0.083    0.000 
/Users/rusty/Development/iceberg/python/pyiceberg/avro/decoder.py:315(read_bytes)
       14312    0.046    0.000    0.083    0.000 
/Users/rusty/Development/iceberg/python/pyiceberg/manifest.py:245(__init__)
       57260    0.042    0.000    0.138    0.000 
/Users/rusty/Development/iceberg/python/pyiceberg/avro/decoder.py:133(read_utf8)
      114488    0.041    0.000    0.185    0.000 
/Users/rusty/Development/iceberg/python/pyiceberg/avro/reader.py:132(read)
       57260    0.022    0.000    0.160    0.000 
/Users/rusty/Development/iceberg/python/pyiceberg/avro/reader.py:188(read)
       43007    0.019    0.000    0.026    0.000 {built-in method builtins.hash}
       14311    0.017    0.000    0.051    0.000 
/Users/rusty/Development/iceberg/python/pyiceberg/avro/reader.py:353(read)
       14312    0.015    0.000    0.062    0.000 
/Users/rusty/Development/iceberg/python/pyiceberg/manifest.py:211(__init__)
       14311    0.015    0.000    9.868    0.001 
/Users/rusty/Development/iceberg/python/pyiceberg/avro/file.py:123(__next__)
       12119    0.014    0.000    0.022    0.000 
/Users/rusty/Development/iceberg/python/pyiceberg/avro/decoder.py:72(read_ints)
   ```
   
   I have a single table that has about uses 100 Avro files of this complexity, 
so more improvements are still desired.
   
   A few things about this test Avro file, contains references to 14,000 
MetadataFiles and the table has >200 columns.
   
   There are few ideas to pursue:
   
   1. Implement Cython for parsing integers.  Replace `decoder::read_int` with 
this the Cython implementation for the InMemoryBinaryDecoder.
   2. Defer parsing maps and lists until they are accessed, keep their byte 
representation in memory and when the actual values are requested deserialize 
them.  This would be helpful for common cases where not all min/max values are 
required.  There is currently a problem with skipping maps and lists, the 
Iceberg serialization code does not use the proper Avro encoder it uses 
[directBinaryEncoder](https://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/io/DirectBinaryEncoder.html)
 rather than 
[blockingBinaryEncoder](https://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/io/BlockingBinaryEncoder.html)
 which buffers data and writes byte sizes after the Avro block size.  See 
   
[IcebergEncoder.java](https://github.com/apache/iceberg/blame/cdbaf5754e0cd2ee7439a1d94cb4cfc92fc29731/core/src/main/java/org/apache/iceberg/data/avro/IcebergEncoder.java#L101),
 
[KeyMetadataEncoder.java](https://github.com/apache/iceberg/blob/cdbaf5754e0cd2ee7439a1d94cb4cfc92fc29731/core/src/main/java/org/apache/iceberg/encryption/KeyMetadataEncoder.java#L93)
 and 
[AvroEncoderUtil.java](https://github.com/apache/iceberg/blob/cdbaf5754e0cd2ee7439a1d94cb4cfc92fc29731/core/src/main/java/org/apache/iceberg/avro/AvroEncoderUtil.java#L57)
 for examples where the Avro encoder is specified.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] rustyconover commented on pull request #8084: Python: Improve AVRO reading speed

Reply via email to