This is an automated email from the ASF dual-hosted git repository.

chaokunyang pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/fory.git


The following commit(s) were added to refs/heads/main by this push:
     new c41182c68 perf: add fory performance optimization skill (#3463)
c41182c68 is described below

commit c41182c68e041ba4e86bf5eb0b02de274a64ad4e
Author: Shawn Yang <[email protected]>
AuthorDate: Tue Mar 10 17:37:06 2026 +0800

    perf: add fory performance optimization skill (#3463)
    
    ## Why?
    
    
    
    ## What does this PR do?
    
    add fory perf optimization skill for ai agent
    
    ## Related issues
    
    #3397 #3355 #3012 #1993 #2982
    #1017
    
    ## Does this PR introduce any user-facing change?
    
    
    
    - [ ] Does this PR introduce any public API change?
    - [ ] Does this PR introduce any binary protocol compatibility change?
    
    ## Benchmark
---
 .agents/skills                                     |   1 +
 .../skills/fory-performance-optimization/SKILL.md  | 119 +++++++++++++++++
 .../agents/openai.yaml                             |  21 +++
 .../references/bottleneck-playbook.md              | 142 +++++++++++++++++++++
 .../references/language-command-matrix.md          |  89 +++++++++++++
 .../references/round-template.md                   |  83 ++++++++++++
 .../references/workflow-checklist.md               |  63 +++++++++
 7 files changed, 518 insertions(+)

diff --git a/.agents/skills b/.agents/skills
new file mode 120000
index 000000000..454b8427c
--- /dev/null
+++ b/.agents/skills
@@ -0,0 +1 @@
+../.claude/skills
\ No newline at end of file
diff --git a/.claude/skills/fory-performance-optimization/SKILL.md 
b/.claude/skills/fory-performance-optimization/SKILL.md
new file mode 100644
index 000000000..879bb39ed
--- /dev/null
+++ b/.claude/skills/fory-performance-optimization/SKILL.md
@@ -0,0 +1,119 @@
+---
+name: fory-performance-optimization
+description: Run profile-driven bottleneck optimization across Apache Fory 
implementations (Java, C++, Python/Cython, Go, Rust, Swift, C#, 
JavaScript/TypeScript, Dart, Kotlin, Scala). Use when improving 
serialize/deserialize throughput or latency, recovering regressions against a 
reference commit, diagnosing flamegraphs, fixing perf-related CI failures, or 
porting proven optimizations across languages without protocol or API 
regressions.
+---
+
+# Fory Performance Optimization
+
+## Mission
+
+Deliver measurable performance improvements in Apache Fory without protocol 
drift, correctness regressions, benchmark-shape tricks, or accidental API 
rollback.
+
+## Operating Principles
+
+- Start from data, not intuition.
+- Profile before changing hot code.
+- Change one bottleneck at a time.
+- Benchmark sequentially on the same machine state (one benchmark process at a 
time).
+- Keep only measured wins or explicitly requested architecture cleanups.
+- Revert speculative changes that do not pay off.
+- Align with reference runtimes (usually C++ first, then Rust/Java) when 
behavior and ownership models differ.
+
+## Enforce Hard Constraints
+
+- Preserve wire protocol unless explicitly requested.
+- Preserve cross-language semantics and xlang compatibility.
+- Never run two benchmarks at the same time on one host; run exactly one 
benchmark command at a time.
+- Do not optimize by changing benchmark payload definitions, field encodings, 
or benchmark methodology.
+- Do not add payload-identity or repeated-input caches that depend on 
benchmark shape.
+- Do not restore removed APIs/legacy wrappers when the user forbids it.
+- Do not preserve legacy/dead code or stale docs in optimization rounds; 
remove them when touched.
+- Keep API surface minimal: do not add new API unless required by 
protocol/correctness or explicitly requested.
+- Never add public hacky API for performance shortcuts; keep optimization 
helpers internal/private and conceptually clean.
+- Do not hide regressions behind unsafe compiler flags or benchmark-only code 
paths.
+- Keep optimization surfaces nested-safe; avoid root-only shortcuts unless 
they are architecturally valid and requested.
+
+## Execute Workflow
+
+1. Read context and constraints.
+
+- Read `tasks/perf_optimization_rounds.md` and `tasks/lessons.md`.
+- Read the relevant spec in `docs/specification/` for any path that may affect 
wire behavior.
+- Record explicit user constraints (forbidden APIs, naming, architecture, 
protocol rules).
+
+2. Define target and baseline.
+
+- Identify one primary KPI (for example `Struct Serialize ns/op` or ops/sec).
+- Benchmark current `HEAD`.
+- If a reference commit is provided, benchmark it once and persist the result 
in a file (for example `tasks/perf_baselines/<id>.md`) to avoid repeated reruns.
+
+3. Profile the hotspot.
+
+- Capture a flamegraph or sampled stacks on the exact benchmark command.
+- Quantify top costs by bucket (runtime bookkeeping, dispatch, 
allocation/copy, map/cache operations, buffer growth, metadata 
parse/validation).
+- Tie each bucket to concrete file/line ownership before proposing changes.
+
+4. Form one round hypothesis.
+
+- State one bottleneck and one expected effect.
+- Prefer structural fixes over micro-tweaks.
+- If another runtime already solved the same bottleneck, port its design shape 
first.
+
+5. Implement minimal change.
+
+- Touch the smallest surface that can validate the hypothesis.
+- Keep invariants explicit: protocol bytes, ownership, cache lifetime, 
reference semantics, nullability, schema-compatible behavior.
+
+6. Verify correctness.
+
+- Run language-local build/test/lint for the touched implementation.
+- Run cross-language checks when runtime/type/protocol behavior can affect 
xlang.
+- Confirm serialized sizes and compatibility expectations where applicable.
+
+7. Benchmark and compare.
+
+- Run targeted benchmark at least twice sequentially.
+- Use longer duration when signal is noisy.
+- Run one short full-suite sanity benchmark to catch collateral regressions.
+
+8. Decide keep or revert.
+
+- Keep only if gain is repeatable or cleanup is explicitly requested and 
accepted with measured tradeoff.
+- Revert if performance regresses or gain is within noise and complexity 
increases.
+- If a required cleanup regresses, redesign inside the new architecture 
instead of restoring banned patterns.
+
+9. Log every round.
+
+- Append one round entry to `tasks/perf_optimization_rounds.md` before 
starting the next round.
+- Include hypothesis, code change, exact commands, before/after numbers, and 
keep/revert decision.
+- Commit retained non-trivial rounds immediately.
+
+10. Re-plan on instability.
+
+- Stop and re-plan when benchmark runs conflict, machine contention is 
suspected, or profile does not match hypothesis.
+- Re-ground on current `HEAD` after reset/rebase/checkout events before making 
further changes.
+
+## Apply Decision Rules
+
+- Treat movement below 1-2% as noise unless it repeats across controlled sequential runs.
+- Require explicit proof for complexity-increasing optimizations.
+- Prefer deleting dead APIs and dead state quickly after refactors.
+- Keep naming/API cleanup only if performance remains in band.
+- Never run before/after comparisons in parallel.
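
The noise-band rule above can be made mechanical with a small helper. This is an illustrative sketch only (the 2% default and the function name are assumptions, not project policy):

```python
def classify_delta(before_ns: float, after_ns: float, noise_band: float = 0.02) -> str:
    """Classify a benchmark delta against a relative noise band.

    before_ns / after_ns: ns/op from sequential runs on the same machine.
    Returns "improvement", "regression", or "noise".
    """
    delta = (before_ns - after_ns) / before_ns  # positive = faster
    if abs(delta) < noise_band:
        return "noise"
    return "improvement" if delta > 0 else "regression"


print(classify_delta(100.0, 97.5))  # 2.5% faster -> "improvement"
print(classify_delta(100.0, 99.0))  # 1% movement -> "noise"
```

A result inside the band should trigger repeated runs (or a longer duration) before a keep/revert decision, per the rules above.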
+
+## Use References
+
+- Use [`references/workflow-checklist.md`](references/workflow-checklist.md) 
for execution checklists and stop conditions.
+- Use 
[`references/language-command-matrix.md`](references/language-command-matrix.md)
 for per-language build/test/benchmark/profile commands.
+- Use [`references/bottleneck-playbook.md`](references/bottleneck-playbook.md) 
for hotspot-to-fix mapping.
+- Use [`references/round-template.md`](references/round-template.md) to log 
each optimization round consistently.
+
+## Produce Output
+
+When finishing an optimization task, report:
+
+- Baseline command and numbers.
+- Final command and numbers.
+- Net delta on primary KPI.
+- Correctness and compatibility verification run.
+- Kept vs reverted rounds and rationale.
diff --git a/.claude/skills/fory-performance-optimization/agents/openai.yaml 
b/.claude/skills/fory-performance-optimization/agents/openai.yaml
new file mode 100644
index 000000000..df9c84a1e
--- /dev/null
+++ b/.claude/skills/fory-performance-optimization/agents/openai.yaml
@@ -0,0 +1,21 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+interface:
+  display_name: "Fory Perf Optimizer"
+  short_description: "Profile-first Apache Fory perf optimization playbook"
+  default_prompt: "Optimize Apache Fory bottlenecks with a profile-driven, 
benchmark-verified, cross-language workflow that preserves protocol 
correctness."
diff --git 
a/.claude/skills/fory-performance-optimization/references/bottleneck-playbook.md
 
b/.claude/skills/fory-performance-optimization/references/bottleneck-playbook.md
new file mode 100644
index 000000000..a81b5c3fd
--- /dev/null
+++ 
b/.claude/skills/fory-performance-optimization/references/bottleneck-playbook.md
@@ -0,0 +1,142 @@
+# Bottleneck Playbook
+
+## 1) Dispatch And Runtime Bookkeeping
+
+Symptoms:
+
+- High samples in runtime access/exclusivity or witness dispatch bookkeeping.
+
+Actions:
+
+- Reduce repeated mutable accesses in tight loops.
+- Collapse helper layering on hot paths.
+- Move costly work from per-field/per-element paths to one-time setup.
+- Prefer concrete/local cursor mutation in critical loops.
+
+Avoid:
+
+- API splits that add extra existential/cross-protocol dispatch in hottest 
generic paths.
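
The "reduce repeated accesses" action can be illustrated in miniature (Python for brevity; `Codec` is a hypothetical stand-in, and in Swift or C++ the analogous move is hoisting witness/virtual dispatch or exclusivity checks out of the loop):

```python
class Codec:
    def encoded_length(self, value: int) -> int:
        # Stand-in for a per-element cost computation.
        return 1 if value < 128 else 2


def total_length(codec: Codec, values: list) -> int:
    # Hoist the bound method once instead of re-resolving
    # `codec.encoded_length` on every iteration of the hot loop.
    encoded_length = codec.encoded_length
    total = 0
    for v in values:
        total += encoded_length(v)
    return total


print(total_length(Codec(), [1, 200, 3]))  # 4
```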
+
+## 2) Buffer Growth And Materialization
+
+Symptoms:
+
+- High time in allocation, copy, or final materialization to output buffers.
+
+Actions:
+
+- Grow once for max possible bytes when encoding variable-width fields.
+- Use local write cursor and commit once.
+- Keep copy boundaries explicit and minimize conversion churn.
+
+Avoid:
+
+- Rewrites that increase allocation count or add copy steps despite 
lower-level pointer usage.
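
The grow-once-plus-local-cursor shape can be sketched as follows (a toy `WriteBuffer`, not Fory's actual buffer API; the 4-byte-per-value width is illustrative):

```python
class WriteBuffer:
    """Toy output buffer: grow once per block, commit the cursor once."""

    def __init__(self):
        self.data = bytearray(16)
        self.writer_index = 0

    def reserve(self, n: int) -> None:
        # Grow once for the maximum bytes the block may need,
        # instead of checking capacity per field.
        needed = self.writer_index + n
        if needed > len(self.data):
            new_cap = max(needed, len(self.data) * 2)
            self.data.extend(bytearray(new_cap - len(self.data)))

    def write_block(self, values) -> None:
        # Worst case for this toy encoding: 4 bytes per value.
        self.reserve(4 * len(values))
        cursor = self.writer_index  # local cursor mutated in the loop
        for v in values:
            self.data[cursor:cursor + 4] = v.to_bytes(4, "little")
            cursor += 4
        self.writer_index = cursor  # single commit after the block


buf = WriteBuffer()
buf.write_block([1, 2, 300])
print(buf.writer_index)  # 12
```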
+
+## 3) Varint Encode/Decode Overhead
+
+Symptoms:
+
+- Repeated size prepass plus repeated encode work for the same value.
+- Slow varint branches dominating primitive-heavy structs.
+
+Actions:
+
+- Remove value-dependent prepass when safe by reserving maximum bytes.
+- Use packed/loop-based slow paths where appropriate.
+- Keep exact writer-index commit after block write.
+
+Avoid:
+
+- Double-checking varint widths per field when one max-size reservation can 
cover the block.
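
The no-prepass idea can be sketched with a LEB128-style encoder (Python for illustration; this is not Fory's actual varint code): emit bytes in one pass instead of computing the width first, and let the caller reserve the 5-byte maximum per u32 up front.

```python
def write_varuint32(buf: bytearray, value: int) -> None:
    """Append a base-128 varint without a size prepass.

    Rather than computing the encoded width and then encoding (two
    value-dependent passes), emit bytes until the value is exhausted;
    a real writer would reserve the 5-byte maximum beforehand and
    commit the exact writer index after the block.
    """
    while True:
        b = value & 0x7F
        value >>= 7
        if value:
            buf.append(b | 0x80)  # set continuation bit
        else:
            buf.append(b)
            return


buf = bytearray()
write_varuint32(buf, 300)
print(buf.hex())  # ac02
```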
+
+## 4) Type Resolver And Metadata Path
+
+Symptoms:
+
+- Heavy cost in compatible type-info lookup, parsing, or temporary wrappers.
+
+Actions:
+
+- Keep canonical type info ownership in resolver/context aligned with 
reference runtimes.
+- Cache by stable protocol keys (for example, headers), not benchmark payload 
identity.
+- Reduce redundant wrappers and duplicated metadata ownership.
+
+Avoid:
+
+- Side caches that leak abstractions to callsites (`push/pop/clear` 
bookkeeping in user-facing flow).
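
Keying a cache on stable protocol bytes rather than payload identity can be sketched like this (`parse_type_info` is a hypothetical stand-in for the expensive metadata parse; this is not the real resolver API):

```python
def parse_type_info(header: bytes) -> dict:
    # Stand-in for expensive metadata parsing/validation.
    return {"type_id": header[0], "field_count": header[1]}


_type_info_cache: dict = {}


def get_type_info(header: bytes) -> dict:
    """Cache by the stable protocol key (header bytes), never by payload identity."""
    info = _type_info_cache.get(header)
    if info is None:
        info = parse_type_info(header)  # parsed once per distinct key
        _type_info_cache[header] = info
    return info
```

Because the key is the protocol header, identical payloads that happen to repeat in a benchmark get no special treatment, while genuinely repeated type metadata is parsed only once.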
+
+## 5) Context Reset And Map/Array Maintenance
+
+Symptoms:
+
+- Noticeable time in context reset, map clear, array churn, or cache 
maintenance.
+
+Actions:
+
+- Use O(1) reset for reusable containers.
+- Keep data structures cache-local and simple for hot-path operations.
+- Remove dead fields/methods quickly after refactors.
+
+Avoid:
+
+- Over-engineered multi-path caches unless proven necessary and mirrored by 
reference runtimes.
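
One common shape for O(1) reset is a generation (epoch) counter, sketched here for illustration only (Fory's actual reusable containers may differ):

```python
class EpochMap:
    """Reusable map with O(1) reset via a generation counter.

    Instead of clearing every slot between serialization calls, each
    entry records the epoch it was written in; bumping the epoch
    invalidates all entries at once.
    """

    def __init__(self):
        self.epoch = 0
        self.slots = {}  # key -> (epoch, value)

    def put(self, key, value):
        self.slots[key] = (self.epoch, value)

    def get(self, key):
        entry = self.slots.get(key)
        if entry is not None and entry[0] == self.epoch:
            return entry[1]
        return None  # stale entry from a previous round

    def reset(self):
        self.epoch += 1  # O(1): no per-entry clearing
```

The tradeoff is that stale entries occupy memory until overwritten, which is usually acceptable for per-call reusable contexts.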
+
+## 6) Compatible Schema Read/Write Flow
+
+Symptoms:
+
+- Large compatible-path overhead or regressions after cleanup.
+
+Actions:
+
+- Keep flow aligned with C++/Rust ownership and dispatch model.
+- Move expensive matching/validation to type-info parse stage when possible.
+- Keep typed scoping of pending compatible metadata to avoid nested decode 
corruption.
+
+Avoid:
+
+- Untyped global compatible slots.
+- Broad helper-shaped replacement paths that bypass established protocol flow.
+
+## 7) Cleanup-Driven Regressions
+
+Symptoms:
+
+- API/abstraction cleanup causes throughput drop.
+
+Actions:
+
+- Keep cleanup only if in benchmark noise band or user explicitly accepts 
tradeoff.
+- Redesign inside the cleaned architecture to recover performance.
+
+Avoid:
+
+- Reverting to banned legacy shapes.
+- Preserving cleanup that harms hot paths without follow-up recovery plan.
+
+## 8) Cross-Language Porting
+
+Actions:
+
+- Identify the exact structure in the reference runtime (owner, cache key, 
lifetime, loop shape).
+- Port behavior and data-flow model, not language syntax.
+- Verify xlang semantics after porting.
+
+Avoid:
+
+- Language-specific shortcuts that diverge from shared protocol/runtime 
concepts.
+
+## Keep/Revert Rubric
+
+Keep when:
+
+- Improvement is repeatable and non-trivial.
+- Correctness/lint/tests remain green.
+- Complexity increase is justified by measured gain.
+
+Revert when:
+
+- Regression is clear or gain is noise.
+- Change introduces benchmark-only behavior.
+- Change violates explicit user constraints.
diff --git 
a/.claude/skills/fory-performance-optimization/references/language-command-matrix.md
 
b/.claude/skills/fory-performance-optimization/references/language-command-matrix.md
new file mode 100644
index 000000000..6b5926f9b
--- /dev/null
+++ 
b/.claude/skills/fory-performance-optimization/references/language-command-matrix.md
@@ -0,0 +1,89 @@
+# Language Command Matrix
+
+Use this as the default verification matrix after performance changes. Run 
commands from the language directory unless noted.
+
+## Swift
+
+- Build: `swift build`
+- Tests: `swift test`
+- Lint: `swiftlint lint --config .swiftlint.yml`
+- Benchmark: `cd benchmarks/swift && swift build -c release && 
./.build/release/swift-benchmark --duration <N>`
+- Profile (macOS sample): run benchmark with long duration, then `sample <pid> 
10 1 -mayDie -file /tmp/<name>.sample.txt`
+
+## C++
+
+- Build: `bazel build //cpp/...`
+- Tests: `bazel test $(bazel query //cpp/...)`
+- Perf tests: `bazel test $(bazel query //cpp/fory/serialization/...)`
+- Profile: use repository-approved sampling tooling from `CONTRIBUTING.md` and 
`docs/cpp_debug.md`
+
+## Java
+
+- Build: `mvn -T16 package`
+- Tests: `mvn -T16 test`
+- Format/style checks as needed: `spotless:check`, `checkstyle:check`
+- Profile: JFR or async-profiler on the exact benchmark/test workload
+
+## Python/Cython
+
+- Install: `pip install -v -e .`
+- Tests (python mode): `ENABLE_FORY_CYTHON_SERIALIZATION=0 pytest -v -s .`
+- Tests (cython mode): `ENABLE_FORY_CYTHON_SERIALIZATION=1 pytest -v -s .`
+- Format/lint: `ruff format . && ruff check --fix .`
+- Profile: `py-spy`, `cProfile`, and Cython annotations as needed
+
+## Rust
+
+- Build: `cargo build`
+- Check: `cargo check`
+- Lint: `cargo clippy --all-targets --all-features -- -D warnings`
+- Tests: `cargo test --features tests`
+- Profile: flamegraph/perf tooling on benchmark or targeted test
+
+## Go
+
+- Build: `go build`
+- Tests: `go test -v ./...`
+- Format: `go fmt ./...`
+- Profile: `pprof` (`go test -bench` + cpu/mem profiles)
+
+## C#
+
+- Build: `dotnet build Fory.sln -c Release --no-restore`
+- Tests: `dotnet test Fory.sln -c Release`
+- Format check: `dotnet format Fory.sln --verify-no-changes`
+- Profile: `dotnet-trace` / `dotnet-counters` on benchmark/test runs
+
+## JavaScript/TypeScript
+
+- Install: `npm install`
+- Tests: `node ./node_modules/.bin/jest --ci --reporters=default 
--reporters=jest-junit`
+- Lint: `git ls-files -- '*.ts' | xargs -P 5 node ./node_modules/.bin/eslint`
+
+## Dart
+
+- Generate: `dart run build_runner build`
+- Tests: `dart test`
+- Analyze/fix: `dart analyze && dart fix --dry-run`
+
+## Kotlin
+
+- Build: `mvn clean package`
+- Tests: `mvn test`
+
+## Scala
+
+- Build: `sbt compile`
+- Tests: `sbt test`
+- Format: `sbt scalafmt`
+
+## Cross-Language Xlang Verification
+
+When changing xlang/runtime semantics, run relevant Java-driven xlang tests 
from `java/fory-core` with debug output enabled, for impacted languages:
+
+- `CPPXlangTest`
+- `CSharpXlangTest`
+- `RustXlangTest`
+- `GoXlangTest`
+- `PythonXlangTest`
+- `SwiftXlangTest`
diff --git 
a/.claude/skills/fory-performance-optimization/references/round-template.md 
b/.claude/skills/fory-performance-optimization/references/round-template.md
new file mode 100644
index 000000000..170834b20
--- /dev/null
+++ b/.claude/skills/fory-performance-optimization/references/round-template.md
@@ -0,0 +1,83 @@
+# Performance Round Template
+
+Use this template for every optimization round in 
`tasks/perf_optimization_rounds.md`.
+
+````markdown
+## <Date> Round <N> - <Short Title>
+
+- Goal:
+  - <Single bottleneck target and KPI>
+
+- Hypothesis:
+  - <Why this change should improve measured cost>
+
+- Code change:
+  - <Files and structural changes>
+  - <Ownership/cache/dispatch impact>
+
+- Verification commands:
+  ```bash
+  <build/test/lint commands>
+  <targeted benchmark command>
+  <repeat targeted benchmark command>
+  <optional short full-suite sanity command>
+  ```
+
+- Before:
+  - <KPI metrics and command context>
+
+- After:
+  - <KPI metrics and command context>
+
+- Result:
+  - <Delta with interpretation: gain/loss/noise>
+  - <Correctness and compatibility status>
+
+- Decision:
+  - <Kept/Reverted>
+  - <Reason tied to data and constraints>
+
+````
+
+## Baseline Snapshot Template
+
+````markdown
+## Baseline <commit-sha>
+
+- Command:
+  ```bash
+  <exact command>
+  ```
+
+- Environment:
+  - <machine, OS, build mode, duration>
+- Metrics:
+  - <ops/sec or ns/op>
+- Notes:
+  - <why this baseline is the comparison anchor>
+
+````
+
+## Final Summary Template
+
+````markdown
+## Final Summary
+
+- Primary KPI:
+  - before: <value>
+  - after: <value>
+  - delta: <value>
+
+- Retained rounds:
+  - <list>
+
+- Reverted rounds:
+  - <list>
+
+- Correctness checks:
+  - <tests/lint/xlang>
+
+- Remaining risk:
+  - <open bottlenecks or noise caveats>
+````
diff --git 
a/.claude/skills/fory-performance-optimization/references/workflow-checklist.md 
b/.claude/skills/fory-performance-optimization/references/workflow-checklist.md
new file mode 100644
index 000000000..1a129411b
--- /dev/null
+++ 
b/.claude/skills/fory-performance-optimization/references/workflow-checklist.md
@@ -0,0 +1,63 @@
+# Workflow Checklist
+
+## 1) Intake
+
+- Capture exact user objective and KPI.
+- Capture explicit bans and constraints.
+- Capture reference commit(s), if provided.
+- Confirm target implementation language and benchmark command.
+
+## 2) Context Loading
+
+- Read `tasks/perf_optimization_rounds.md` for prior attempts and measured 
outcomes.
+- Read `tasks/lessons.md` for repeated failure patterns and guardrails.
+- Read relevant spec docs under `docs/specification/` before touching 
protocol-adjacent code.
+
+## 3) Baseline
+
+- Benchmark current `HEAD` first.
+- Benchmark the requested reference commit once and persist numbers in a 
baseline file.
+- Use identical command, duration, and machine state for comparisons.
+- Run only one benchmark process at a time; never overlap benchmark commands.
+
+## 4) Profiling
+
+- Profile the exact bottleneck benchmark (not a proxy test).
+- Attribute top cost buckets to concrete code paths.
+- Write one hypothesis tied to one measurable bottleneck.
+
+## 5) Round Execution
+
+- Implement one focused change.
+- Keep protocol bytes and semantics unchanged unless explicitly requested.
+- Keep API surface minimal and internal-first; avoid adding new public APIs 
unless explicitly required.
+- Remove touched legacy/dead code and stale docs instead of preserving 
compatibility scaffolding in perf rounds.
+- Run local build/test/lint for the touched language.
+- Run targeted benchmark sequentially (at least 2 runs).
+- Run one short full-suite sanity benchmark.
+- Keep or revert based on measured data.
+- Append full round entry to `tasks/perf_optimization_rounds.md`.
+
+## 6) Finalization
+
+- Summarize before/after with exact commands.
+- State kept/reverted rounds and rationale.
+- List residual risks and follow-up rounds if target is not met.
+
+## Stop And Re-Plan Triggers
+
+- Results are non-deterministic across repeated sequential runs.
+- Profile findings do not match expected bottleneck after a code change.
+- Proposed fix requires violating user constraints or protocol semantics.
+- Workspace state changed unexpectedly (reset/rebase/checkout); re-check 
`HEAD`.
+- Improvement is within noise but complexity increased.
+
+## Anti-Patterns To Reject
+
+- Benchmark-only hacks (payload identity cache, intern-bytes tricks, hardcoded 
fixture paths).
+- Protocol changes to manufacture benchmark wins.
+- Reintroducing removed API surface against explicit direction.
+- Adding new public "performance" APIs that expose benchmark-driven shortcuts.
+- Preserving dead/legacy code/docs after optimization refactors.
+- Parallel before/after benchmarking on one machine.
+- Keeping speculative complexity without repeatable gains.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
