---
name: cpp-performance
description: "Use when optimizing C++ runtime performance, hot paths, allocations, cache efficiency, vectorization, branch reduction, parallelization, data-oriented design, and concurrency overhead."
---
C++ Performance Skill
Purpose
Optimize C++ code for runtime performance, memory efficiency, and predictable behavior without weakening correctness.
When To Use
Use for hot-path optimization, allocator pressure reduction, cache efficiency, vectorization opportunities, branch reduction, routine parallelization, concurrency overhead analysis, and data-oriented redesign.
Priorities
- Preserve correctness.
- Measure before optimizing.
- Optimize hot paths before cold code.
- Prefer algorithmic and data-layout wins before micro-optimizations.
- Prefer the smallest change that is likely to produce measurable impact.
Workflow
- Identify the hot path from profiling data or benchmarks.
- Confirm the performance goal: latency, throughput, memory footprint, or tail behavior.
- Inspect algorithmic complexity before tuning low-level details.
- Check allocations, copies, temporary objects, and ownership patterns.
- Check data layout, cache locality, branch behavior, access patterns, and whether a data-oriented representation would better match the workload.
- Check whether unpredictable branches can be removed, hoisted out of loops, or replaced with table lookups, masks, conditional moves, or other branchless forms without harming correctness.
- Check whether the routine can be parallelized safely across data, tasks, or pipeline stages, and estimate whether the workload size justifies threading overhead.
- Check whether abstractions prevent inlining, vectorization, or efficient code generation.
- Check synchronization, contention, false sharing, and scheduling overhead in concurrent code.
- Propose the smallest high-confidence change first.
- Re-measure under the same conditions and report absolute and relative impact.
Review Checklist
- Is this code on a measured hot path?
- Is the current algorithm appropriate for the workload?
- Are unnecessary allocations, copies, or temporaries present?
- Is data layout aligned with access patterns?
- Would a data-oriented layout improve locality, SIMD use, or traversal efficiency?
- Are containers appropriate for the usage pattern?
- Are unpredictable branches hurting throughput on the hot path?
- Can branch-heavy logic be rewritten in a branchless form without making the code unsafe or opaque?
- Can the routine be parallelized across independent work items?
- Is the problem size large enough to justify parallel overhead?
- Is the compiler likely blocked from inlining or vectorizing?
- Are branches predictable on the common path?
- Is synchronization heavier than necessary?
- Could false sharing or cache contention be present?
- Is the benchmark representative of production behavior?
Constraints
- Do not claim performance improvements without measurement.
- Do not trade clarity for low-level tricks unless the code is proven hot.
- Do not force data-oriented rewrites unless the measured bottleneck is demonstrably driven by layout or traversal costs.
- Do not force branchless rewrites where the branches are already well predicted, or where the transformed code becomes less maintainable without measurable gain.
- Do not parallelize routines unless independence, granularity, and memory behavior make the speedup plausible.
- Do not optimize microbenchmarks while ignoring end-to-end impact.
- State tradeoffs in readability, portability, compile time, and maintenance cost.
- Call out numerical or semantic risks if optimizations may change behavior.
Output
Provide:
- the likely bottleneck
- the reason it is expensive
- the proposed optimization
- why it is expected to help
- the main risks or tradeoffs
- the validation plan