This is an automated email from the ASF dual-hosted git repository. xudong963 pushed a commit to branch site/limit-pruning in repository https://gitbox.apache.org/repos/asf/datafusion-site.git
commit fac1645538a6ef7914a44a1e19380a850ca32f39 Author: xudong.w <[email protected]> AuthorDate: Fri Mar 20 22:42:41 2026 +0800 address comments --- content/blog/2026-03-10-limit-pruning.md | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/content/blog/2026-03-10-limit-pruning.md b/content/blog/2026-03-10-limit-pruning.md index 10f0db0..3770a45 100644 --- a/content/blog/2026-03-10-limit-pruning.md +++ b/content/blog/2026-03-10-limit-pruning.md @@ -53,6 +53,16 @@ Reading data efficiently means touching as little data as possible. The fastest In this post, we describe a new optimization called **limit pruning** that makes this pipeline aware of SQL `LIMIT` clauses. By identifying row groups where *every* row is guaranteed to match the predicate, DataFusion can satisfy a `LIMIT` query without ever touching partially matching row groups — eliminating wasted I/O entirely. +For example, given a query like: + +```sql +SELECT * FROM tracking_data +WHERE species LIKE 'Alpine%' AND s >= 50 +LIMIT 3 +``` + +If the pruning pipeline already knows that certain row groups fully satisfy the `WHERE` clause, those groups alone may contain enough rows to fill the `LIMIT` — making it unnecessary to scan anything else. + This work was inspired by the "Pruning for LIMIT Queries" section of Snowflake's paper [*Pruning in Snowflake: Working Smarter, Not Harder*](https://arxiv.org/pdf/2504.11540). ## DataFusion's Pruning Pipeline @@ -103,7 +113,7 @@ Even when fully matched row groups alone contain enough rows to satisfy the `LIM <figure> <img src="/blog/images/limit-pruning/wasted-io.svg" width="80%" alt="Traditional pruning decodes partially matching groups with no LIMIT awareness"/> -<figcaption>Figure 3: Without limit awareness, partially matching groups are scanned and filtered even when fully matched groups already have enough rows.</figcaption> +<figcaption>Figure 3: Without limit awareness, partially matching groups are scanned and filtered even when fully matched groups already have enough rows. The left section shows 5 fully matched rows (enough to satisfy <code>LIMIT 5</code>), while the right section with the dashed red border represents a partially matching group that is still decoded — wasting CPU and I/O on rows that may not match at all.</figcaption> </figure> If five fully matched rows in a fully matched group already satisfy `LIMIT 5`, why bother decoding groups where we're not even sure any rows qualify? @@ -272,12 +282,12 @@ DataFusion exposes limit pruning activity through query metrics. When running a ``` row_groups_pruned_statistics=4 total → 3 matched -> 1 fully matched -limit_pruned_row_groups=2 total → 0 matched +limit_pruned_row_groups=3 total → 1 matched ``` This tells us: - 4 row groups were evaluated, 3 survived statistics pruning, 1 was identified as fully matching -- 2 additional row groups were pruned by the limit optimization +- Of the 3 row groups that entered limit pruning, only 1 survived — 2 were pruned by the limit optimization ## Future Directions --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
