liurenjie1024 commented on PR #373: URL: https://github.com/apache/iceberg-rust/pull/373#issuecomment-2264974929
> Hey @ZENOTME! Sure. If you check out my `perf-suite` branch from [my other PR](https://github.com/apache/iceberg-rust/pull/497), which is branched off main, you can use it to get a baseline of `plan_files` performance without the changes in this PR. You'll need to run `cargo install just` if you don't have `just` installed already, and then `just perf-run`. This should create the docker setup for the performance-testing environment, download some test data from the NYC Taxi dataset, create an Iceberg table inside the perf-testing env, insert the test data into the table, and then run the perf tests.
>
> The perf tests themselves consist of four scenarios, each of which executes `plan_files`:
>
> * a file plan for a scan that reads **all** rows from **all** data files in the current snapshot (i.e. no filter);
> * a file plan for a scan that reads **some** rows from **all** data files in the current snapshot (filter: `passenger_count` = 1);
> * a file plan for a scan that reads **all** rows from **one** data file in the current snapshot (filter: `tpep_pickup_datetime` = '2024-02-01');
> * a file plan for a scan that reads **some** rows from **one** data file in the current snapshot (filter: `tpep_pickup_datetime` = '2024-02-01' AND `passenger_count` = 1).
>
> The benchmarks are run using Criterion. Here's an example of the test output from a run just now on my M1 Pro MacBook:
>
> *(screenshot: Criterion output for the baseline run on main, roughly 1.5 s per scenario)*
>
> Now, if you cherry-pick the commit from this concurrent table scan branch into the `perf-suite` branch:
>
> ```shell
> git cherry-pick 8eef484094dbb7c55ac3181bbd552090308591c9
> ```
>
> and re-run the performance tests, you'll see something similar to the following:
>
> *(screenshot: Criterion output with the concurrent table scan commit applied)*
>
> As you can see, the times come down from around 1.5 s to around **0.5 s**, a big improvement!
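For anyone reading along who wants to see what those four planning scenarios boil down to in code, here is a minimal sketch of driving a filtered `plan_files` call from a Criterion async benchmark via the public iceberg-rust API. This is **not** the actual code from the `perf-suite` branch: the REST catalog URI, the `nyc.taxis` table identifier, and the assumption that `passenger_count` is a long column are placeholders for whatever the perf environment really uses, and constructor/builder method names may differ slightly between crate versions.

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use futures::TryStreamExt;
use iceberg::expr::{Predicate, Reference};
use iceberg::spec::Datum;
use iceberg::table::Table;
use iceberg::{Catalog, TableIdent};
use iceberg_catalog_rest::{RestCatalog, RestCatalogConfig};

/// Plan the files for a scan of `table`, optionally pushing a row filter
/// down into `plan_files`, and return how many file scan tasks come back.
async fn plan(table: &Table, filter: Option<Predicate>) -> iceberg::Result<usize> {
    let mut builder = table.scan();
    if let Some(filter) = filter {
        builder = builder.with_filter(filter);
    }
    let tasks: Vec<_> = builder.build()?.plan_files().await?.try_collect().await?;
    Ok(tasks.len())
}

fn bench_plan_files(c: &mut Criterion) {
    let rt = tokio::runtime::Runtime::new().unwrap();

    // Placeholder coordinates for the dockerised perf environment; the real
    // suite may construct its catalog and table identifier differently.
    let table = rt.block_on(async {
        let catalog = RestCatalog::new(
            RestCatalogConfig::builder()
                .uri("http://localhost:8181".to_string())
                .build(),
        );
        catalog
            .load_table(&TableIdent::from_strs(["nyc", "taxis"]).unwrap())
            .await
            .unwrap()
    });

    // Scenario: all rows from all data files (no filter).
    c.bench_function("plan_all_files_all_rows", |b| {
        b.to_async(&rt)
            .iter(|| async { plan(&table, None).await.unwrap() })
    });

    // Scenario: some rows from all data files (`passenger_count` = 1).
    // Assumes `passenger_count` is a long column in the table schema.
    c.bench_function("plan_all_files_some_rows", |b| {
        b.to_async(&rt).iter(|| async {
            let filter = Reference::new("passenger_count").equal_to(Datum::long(1));
            plan(&table, Some(filter)).await.unwrap()
        })
    });
}

criterion_group!(benches, bench_plan_files);
criterion_main!(benches);
```

The `to_async(&rt)` call needs Criterion's `async_tokio` feature; the two single-data-file scenarios from the list above would follow the same shape with the `tpep_pickup_datetime` predicate swapped in.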
> In fact, if you then cherry-pick the commit from my [follow-on PR with the object cache](https://github.com/apache/iceberg-rust/pull/512):
>
> ```shell
> git cherry-pick c696a3f9a7dacf822e6d5bc76f13d24e0b50ee31
> ```
>
> and re-run the performance tests again, you'll see something even more remarkable:
>
> *(screenshot: Criterion output with the object cache commit applied)*
>
> Now that the retrieval and parsing of the `Manifest`s and `ManifestList`s are cached, the first run in each performance test takes the same time as above, but all of the subsequent runs that get averaged into each test's mean time are much, much faster, bringing the average down from around 500 ms to about 1.5 ms. 🤯

Good work!
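As a rough illustration of why the warm runs collapse to a few milliseconds: the object cache memoises fetched-and-parsed manifest metadata keyed by file path, so only the first plan pays the I/O and Avro-decoding cost. The sketch below uses the `moka` crate and made-up names (`fetch_and_parse`, the S3 path); it shows the caching pattern only and is not the actual `ObjectCache` implementation from #512.

```rust
use std::sync::Arc;
use std::time::{Duration, Instant};

use moka::future::Cache;
use tokio::time::sleep;

// Stand-in for "read a manifest-list file via FileIO and decode the Avro":
// the expensive step the cache is meant to skip on warm runs.
async fn fetch_and_parse(path: &str) -> Vec<String> {
    sleep(Duration::from_millis(100)).await; // simulate I/O + decode cost
    vec![format!("manifest entries parsed from {path}")]
}

#[tokio::main]
async fn main() {
    // Parsed metadata keyed by file path; `Arc` keeps cached clones cheap.
    let cache: Cache<String, Arc<Vec<String>>> =
        Cache::builder().max_capacity(64).build();

    let path = "s3://bucket/metadata/snap-123-manifest-list.avro";

    for run in 1..=3 {
        let started = Instant::now();
        // `get_with` runs the init future only on a cache miss; concurrent
        // callers for the same key are coalesced onto a single load.
        let entries = cache
            .get_with(path.to_string(), async {
                Arc::new(fetch_and_parse(path).await)
            })
            .await;
        println!(
            "run {run}: {} entries in {:?}",
            entries.len(),
            started.elapsed()
        );
    }
}
```

Only the first iteration pays the simulated 100 ms load; the later ones return from memory, which mirrors how the cached benchmark iterations drop from hundreds of milliseconds to single-digit milliseconds.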