liurenjie1024 commented on PR #373: URL: https://github.com/apache/iceberg-rust/pull/373#issuecomment-2264974929
> Hey @ZENOTME! Sure. If you check out my `perf-suite` branch from [my other PR](https://github.com/apache/iceberg-rust/pull/497), which is branched off main, you can use it to get a baseline of `plan_files` performance without the changes in this PR. You'll need to run `cargo install just` if you don't have `just` installed already, and then `just perf-run`. This should create the docker setup for the performance-testing environment, download some test data from the NYC Taxi dataset, create an Iceberg table inside the perf-testing env, insert the test data into the table, and then run the perf tests.
>
> The perf tests themselves consist of four scenarios, each of which executes `plan_files`:
>
> * a file plan for a scan that reads **all** rows from **all** data files in the current snapshot (i.e. no filter);
> * a file plan for a scan that reads **some** rows from **all** data files in the current snapshot (filter: `passenger_count` = 1);
> * a file plan for a scan that reads **all** rows from **one** data file in the current snapshot (filter: `tpep_pickup_datetime` = '2024-02-01');
> * a file plan for a scan that reads **some** rows from **one** data file in the current snapshot (filter: `tpep_pickup_datetime` = '2024-02-01' AND `passenger_count` = 1).
>
> The benchmarks are run using Criterion. Here's an example of the test output from a run just now on my M1 Pro MacBook:
>
> *(screenshot: Criterion output for the baseline run on main, roughly 1.5 s per scenario)*
>
> Now, if you cherry-pick the commit from this concurrent table scan branch into the `perf-suite` branch:
>
> ```shell
> git cherry-pick 8eef484094dbb7c55ac3181bbd552090308591c9
> ```
>
> and re-run the performance tests, you'll see something similar to the following:
>
> *(screenshot: Criterion output with the concurrent table scan commit applied)*
>
> As you can see, the times come down from around 1.5 s to around **0.5 s**, a big improvement!
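For anyone reading along who wants to see what those four planning scenarios boil down to in code, here is a minimal sketch of driving a filtered `plan_files` call from a Criterion async benchmark via the public iceberg-rust API. This is **not** the actual code from the `perf-suite` branch: the REST catalog URI, the `nyc.taxis` table identifier, and the assumption that `passenger_count` is a long column are placeholders for whatever the perf environment really uses, and constructor/builder method names may differ slightly between crate versions.

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use futures::TryStreamExt;
use iceberg::expr::{Predicate, Reference};
use iceberg::spec::Datum;
use iceberg::table::Table;
use iceberg::{Catalog, TableIdent};
use iceberg_catalog_rest::{RestCatalog, RestCatalogConfig};

/// Plan the files for a scan of `table`, optionally pushing a row filter
/// down into `plan_files`, and return how many file scan tasks come back.
async fn plan(table: &Table, filter: Option<Predicate>) -> iceberg::Result<usize> {
    let mut builder = table.scan();
    if let Some(filter) = filter {
        builder = builder.with_filter(filter);
    }
    let tasks: Vec<_> = builder.build()?.plan_files().await?.try_collect().await?;
    Ok(tasks.len())
}

fn bench_plan_files(c: &mut Criterion) {
    let rt = tokio::runtime::Runtime::new().unwrap();

    // Placeholder coordinates for the dockerised perf environment; the real
    // suite may construct its catalog and table identifier differently.
    let table = rt.block_on(async {
        let catalog = RestCatalog::new(
            RestCatalogConfig::builder()
                .uri("http://localhost:8181".to_string())
                .build(),
        );
        catalog
            .load_table(&TableIdent::from_strs(["nyc", "taxis"]).unwrap())
            .await
            .unwrap()
    });

    // Scenario: all rows from all data files (no filter).
    c.bench_function("plan_all_files_all_rows", |b| {
        b.to_async(&rt)
            .iter(|| async { plan(&table, None).await.unwrap() })
    });

    // Scenario: some rows from all data files (`passenger_count` = 1).
    // Assumes `passenger_count` is a long column in the table schema.
    c.bench_function("plan_all_files_some_rows", |b| {
        b.to_async(&rt).iter(|| async {
            let filter = Reference::new("passenger_count").equal_to(Datum::long(1));
            plan(&table, Some(filter)).await.unwrap()
        })
    });
}

criterion_group!(benches, bench_plan_files);
criterion_main!(benches);
```

The `to_async(&rt)` call needs Criterion's `async_tokio` feature; the two single-data-file scenarios from the list above would follow the same shape with the `tpep_pickup_datetime` predicate swapped in.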
> In fact, if you then cherry-pick the commit from my [follow-on PR with the object cache](https://github.com/apache/iceberg-rust/pull/512):
>
> ```shell
> git cherry-pick c696a3f9a7dacf822e6d5bc76f13d24e0b50ee31
> ```
>
> and re-run the performance tests again, you'll see something even more remarkable:
>
> *(screenshot: Criterion output with the object cache commit applied)*
>
> Now that the retrieval and parsing of the `Manifest`s and `ManifestList`s are cached, the first run in each performance test takes the same time as above, but all of the subsequent runs that get averaged into each test's mean time are much, much faster, bringing the average down from around 500 ms to about 1.5 ms. 🤯

Good work!
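As a rough illustration of why the warm runs collapse to a few milliseconds: the object cache memoises fetched-and-parsed manifest metadata keyed by file path, so only the first plan pays the I/O and Avro-decoding cost. The sketch below uses the `moka` crate and made-up names (`fetch_and_parse`, the S3 path); it shows the caching pattern only and is not the actual `ObjectCache` implementation from #512.

```rust
use std::sync::Arc;
use std::time::{Duration, Instant};

use moka::future::Cache;
use tokio::time::sleep;

// Stand-in for "read a manifest-list file via FileIO and decode the Avro":
// the expensive step the cache is meant to skip on warm runs.
async fn fetch_and_parse(path: &str) -> Vec<String> {
    sleep(Duration::from_millis(100)).await; // simulate I/O + decode cost
    vec![format!("manifest entries parsed from {path}")]
}

#[tokio::main]
async fn main() {
    // Parsed metadata keyed by file path; `Arc` keeps cached clones cheap.
    let cache: Cache<String, Arc<Vec<String>>> =
        Cache::builder().max_capacity(64).build();

    let path = "s3://bucket/metadata/snap-123-manifest-list.avro";

    for run in 1..=3 {
        let started = Instant::now();
        // `get_with` runs the init future only on a cache miss; concurrent
        // callers for the same key are coalesced onto a single load.
        let entries = cache
            .get_with(path.to_string(), async {
                Arc::new(fetch_and_parse(path).await)
            })
            .await;
        println!(
            "run {run}: {} entries in {:?}",
            entries.len(),
            started.elapsed()
        );
    }
}
```

Only the first iteration pays the simulated 100 ms load; the later ones return from memory, which mirrors how the cached benchmark iterations drop from hundreds of milliseconds to single-digit milliseconds.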