[GitHub] [iceberg] rbalamohan opened a new issue, #6320: ArrowBuf boundary checks causing CPU burn and slowness in vectorized parq reading

GitBox Wed, 30 Nov 2022 01:36:06 -0800


rbalamohan opened a new issue, #6320:
URL: https://github.com/apache/iceberg/issues/6320

### Apache Iceberg version

1.1.0 (latest release)

### Query engine

Spark

### Please describe the bug 🐞

When running queries such as Q27 on Iceberg V2 with vectorized Parquet reading,
I observed that they are slower than traditional Spark with vectorized Parquet
reading. Further profiling revealed that cache allocation was putting pressure
on the JVM, which is covered in
https://github.com/apache/iceberg/issues/6319.

I applied a local patch to disable the cache and then profiled CPU usage, in
order to get past that issue and look for other bottlenecks. The profile showed
that a good amount of CPU time was spent on ArrowBuf boundary checks. These
checks can be disabled by adding "-Darrow.enable_unsafe_memory_access=true" to
the JVM options. With both issues addressed, I observed a 25% improvement in
Q27 runtime with vectorized processing. We need to check whether this option
can be enabled in Iceberg directly, or whether it should be documented so that
users can include it in the executor and driver JVM options.
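
For anyone who wants to try the flag before a decision is made, here is a minimal sketch of passing it to both the driver and the executors through `spark-submit`. It assumes the standard Spark settings `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions`; the helper names and the `my_job.py` application are hypothetical and only for illustration. Note that the flag turns off Arrow's bounds checking, so it trades a safety check for speed.

```python
import shlex

# Flag quoted in this issue; it has to be present when the JVM starts.
ARROW_UNSAFE_FLAG = "-Darrow.enable_unsafe_memory_access=true"


def with_arrow_unsafe_access(existing_opts=""):
    """Append the Arrow flag to a JVM options string unless it is already there."""
    opts = existing_opts.split()
    if ARROW_UNSAFE_FLAG not in opts:
        opts.append(ARROW_UNSAFE_FLAG)
    return " ".join(opts)


def spark_submit_args(app, driver_opts="", executor_opts=""):
    """Build a spark-submit argument list (hypothetical helper, not an Iceberg API)."""
    return [
        "spark-submit",
        "--conf",
        "spark.driver.extraJavaOptions=" + with_arrow_unsafe_access(driver_opts),
        "--conf",
        "spark.executor.extraJavaOptions=" + with_arrow_unsafe_access(executor_opts),
        app,
    ]


if __name__ == "__main__":
    # Print a shell-quoted command line that keeps any options already in use.
    print(shlex.join(spark_submit_args("my_job.py", executor_opts="-XX:+UseG1GC")))
```

The helper appends to the existing options instead of replacing them, since these Spark settings usually already carry GC or logging flags.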

![q27_ice_alloc_cpu_v2](https://user-images.githubusercontent.com/7969713/204759995-feaea6f6-1fd8-45a3-915c-9a7294863ff3.png)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

