eric-maynard opened a new pull request, #21:
URL: https://github.com/apache/polaris-tools/pull/21
This implements a new scenario, `WeightedWorkloadOnTreeDataset`, that
supports the configuration of multiple **distributions** over which to weight
reads & writes against the catalog.
Compared with `ReadUpdateTreeDataset`, this allows us to understand how
performance changes when reads or writes frequently hit the same tables.
### Sampling
The distributions are defined in the config file like so:
```
# Distributions for readers
# Each distribution will have `count` threads assigned to it
# mean / variance describe the properties of the normal distribution
# Readers will read a random table in the table space based on sampling
# Default: [{ count = 8, mean = 0.3, variance = 0.0278 }]
readers = [
{ count = 8, mean = 0.3, variance = 0.0278 }
]
```
`count` is the number of threads that sample from the distribution, while
`mean` and `variance` describe the Gaussian distribution to sample from.
Sampled values are expected to fall between 0 and 1.0; when a sample falls
outside that range, it is discarded and the distribution is **resampled**
until an in-range value is obtained.
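The resampling behaviour can be sketched as a simple rejection loop. This is an illustrative Python sketch, not the code in this PR: the function name `sample_in_unit_interval` and the `max_attempts` guard are assumptions added for the example.

```python
import math
import random


def sample_in_unit_interval(mean, variance, rng, max_attempts=100_000):
    """Draw from N(mean, variance), resampling until the value is in [0, 1].

    Hypothetical helper: the name and the max_attempts guard are assumptions
    for illustration and are not taken from the PR.
    """
    std_dev = math.sqrt(variance)  # the config gives variance, gauss() wants sigma
    for _ in range(max_attempts):
        value = rng.gauss(mean, std_dev)
        if 0.0 <= value <= 1.0:
            return value
    raise RuntimeError("no in-range sample found; check mean and variance")


if __name__ == "__main__":
    rng = random.Random(42)
    # Default reader distribution from the config above.
    print([round(sample_in_unit_interval(0.3, 0.0278, rng), 3) for _ in range(5)])
```

The guard on the number of attempts is a design choice for the sketch: a distribution whose mass lies almost entirely outside 0 to 1 would otherwise loop for a very long time.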
For an extreme example, refer to the following:
<img width="400" alt="Screenshot 2025-04-30 at 1 27 43 AM"
src="https://github.com/user-attachments/assets/d77e98f1-7a94-463d-be82-0c47bbda92a1"
/>
In this case, about 50% of samples should fall below 0.0 and therefore be
resampled.
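To estimate how often a given configuration triggers resampling, the rejected fraction can be computed from the normal CDF. The sketch below assumes, for illustration only, a distribution centered at 0.0 with the default variance; the exact parameters in the screenshot are not reproduced here, and the function names are hypothetical.

```python
import math


def normal_cdf(x, mean, variance):
    """CDF of a normal distribution with the given mean and variance."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * variance)))


def rejection_probability(mean, variance):
    """Probability that a single draw falls outside [0, 1] and is resampled."""
    below_zero = normal_cdf(0.0, mean, variance)
    above_one = 1.0 - normal_cdf(1.0, mean, variance)
    return below_zero + above_one


if __name__ == "__main__":
    # Assumed example: mean = 0.0 puts half of the mass below 0.0.
    print(rejection_probability(0.0, 0.0278))
    # Default reader distribution from the config.
    print(rejection_probability(0.3, 0.0278))
```

If `p` is the rejection probability, each accepted sample needs `1 / (1 - p)` draws on average, so a distribution that rejects about half of its draws costs roughly two draws per sample.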
Once a value between 0 and 1 is obtained, it is mapped to a table: 0.0
corresponds to T_0 and 1.0 to the highest table in the tree dataset
(e.g. T_2048).
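One way this mapping could work is to scale the value by the number of tables and clamp at the last index. This is a sketch under assumptions: the PR text does not state the exact rounding rule, and `table_for_sample` and `num_tables` are hypothetical names.

```python
def table_for_sample(value, num_tables):
    """Map a value in [0, 1] to a table name between T_0 and T_{num_tables - 1}.

    Assumption: even-width buckets with the top edge clamped, so that 1.0
    lands on the highest table. The actual rounding rule may differ.
    """
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must be between 0 and 1")
    index = min(int(value * num_tables), num_tables - 1)
    return f"T_{index}"


if __name__ == "__main__":
    # A dataset whose highest table is T_2048 has 2049 tables (T_0..T_2048).
    for v in (0.0, 0.3, 1.0):
        print(v, table_for_sample(v, 2049))
```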