Re: [PR] Concurrent table scans [iceberg-rust]

via GitHub Fri, 02 Aug 2024 00:09:44 -0700


sdd commented on PR #373:
URL: https://github.com/apache/iceberg-rust/pull/373#issuecomment-2264717700


   Hey @ZENOTME! Sure. If you check out my `perf-suite` branch from [my other 
PR](https://github.com/apache/iceberg-rust/pull/497), this is branched off main 
and can be used to get a baseline of `plan_files` performance without the 
changes in this PR. You'll need to run `cargo install just` if you don't have 
`just` already installed, and then `just perf-run`. This should create the 
docker setup for the performance testing environment, download some test data 
from the NYC Taxi dataset, create an Iceberg table inside the perf testing env, 
insert the test data into the table, and then run the perf tests.
   
   The perf tests themselves consist of four scenarios, each of which executes 
`plan_files` with a different filter:
   * a file plan for a scan that reads **all** rows from **all** data files in 
the current snapshot (i.e. no filter);
   * a file plan for a scan that reads **some** rows from **all** data files in 
the current snapshot (filter: `passenger_count` = 1);
   * a file plan for a scan that reads **all** rows from **one** data file in 
the current snapshot (filter: `tpep_pickup_datetime` = '2024-02-01');
   * a file plan for a scan that reads **some** rows from **one** data file in 
the current snapshot (filter: `tpep_pickup_datetime` = '2024-02-01' AND 
`passenger_count` = 1).
   
   The benchmarks are run using Criterion. Here's an example of the test output 
when I ran it just now on my M1 Pro MacBook:
   
   <img width="681" alt="image" 
src="https://github.com/user-attachments/assets/3d521b80-b3ab-4421-8889-e39cc1a0ec91">
   
   Now, if you cherry-pick the commit from this concurrent table scan branch 
into the `perf-suite` branch:
   
   ```bash
   git cherry-pick 8eef484094dbb7c55ac3181bbd552090308591c9
   ```
   
   And re-run the performance tests, you'll see something similar to the 
following:
   
   <img width="689" alt="image" 
src="https://github.com/user-attachments/assets/28e41e05-b1e2-48a9-a26c-20ca940e688a">
   
   As you can see, the times come down from around 1.5s to around **0.5s**, a 
big improvement!
   
   In fact, if you then cherry-pick the commit from my [follow-on PR with the 
object cache](https://github.com/apache/iceberg-rust/pull/512):
   
   ```bash
   git cherry-pick c696a3f9a7dacf822e6d5bc76f13d24e0b50ee31
   ```
   
   
   and re-run the performance tests again, you'll see something even more 
remarkable:
   
   <img width="578" alt="image" 
src="https://github.com/user-attachments/assets/12e91179-2ea3-4dad-9581-d7698b80b653">
   
   Now that the retrieved and parsed `Manifest`s and `ManifestList`s are 
cached, the first run in the performance test takes the same time as above, 
but all of the subsequent runs that are averaged to get the time for each test 
are much, much faster, reducing the average time from about 500ms to about 
1.5ms. 🤯 
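
   To illustrate why only the first run pays the full cost, here is a minimal sketch of a memoizing object cache using only the standard library. It is not the code from the follow-on PR: `ParsedManifest`, `ObjectCache`, `get_or_load`, `count_loads`, and the file names are all hypothetical names made up for this example, and the real implementation may well be structured differently.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a parsed manifest; NOT the iceberg-rust type.
#[derive(Debug)]
struct ParsedManifest {
    entries: Vec<String>,
}

/// Minimal memoizing cache keyed by manifest path (assumed design).
struct ObjectCache {
    loaded: Mutex<HashMap<String, Arc<ParsedManifest>>>,
}

impl ObjectCache {
    fn new() -> Self {
        Self { loaded: Mutex::new(HashMap::new()) }
    }

    /// Returns the cached value for `path`, calling `load` only on a miss.
    /// The lock is held while loading, which keeps the sketch simple but
    /// serializes loads; a real cache would likely avoid that.
    fn get_or_load<F>(&self, path: &str, load: F) -> Arc<ParsedManifest>
    where
        F: FnOnce() -> ParsedManifest,
    {
        let mut map = self.loaded.lock().unwrap();
        if let Some(hit) = map.get(path) {
            return Arc::clone(hit);
        }
        let value = Arc::new(load());
        map.insert(path.to_string(), Arc::clone(&value));
        value
    }
}

/// Simulates `runs` repeated planning runs against one manifest and returns
/// how many times the expensive fetch-and-parse step actually executed.
fn count_loads(runs: usize) -> usize {
    let cache = ObjectCache::new();
    let mut loads = 0;
    for _ in 0..runs {
        let manifest = cache.get_or_load("manifest-0.avro", || {
            loads += 1; // stands in for the object-store read plus parsing
            ParsedManifest { entries: vec!["data-0.parquet".to_string()] }
        });
        assert_eq!(manifest.entries.len(), 1);
    }
    loads
}
```

   With this sketch, `count_loads(10)` returns 1: the fetch-and-parse closure runs once and the other nine runs are served from the map, which is the same shape as the benchmark behaviour described above.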


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


