Benchmark Deep Dive: What the Numbers Actually Say
Performance claims are easy to overstate. This article keeps the interpretation narrow and tied to the benchmark source and log that exist in the repository.
The benchmark suite is in:
benches/comprehensive_benchmarks.rs
The raw benchmark log reviewed here is:
benches/run.log
The goal is not to claim that memscope-rs is always fast. The goal is to identify which operations are currently cheap, which operations scale linearly, and where the log already shows regressions.
1. Benchmark Coverage
The benchmark source includes groups such as:
- tracker creation;
- single variable tracking;
- multiple variable tracking;
- analysis;
- stats;
- backend alloc/dealloc/realloc/move;
- type classification;
- concurrent tracking;
- parallel tracking;
- shared tracker concurrent tracking;
- allocation patterns;
- analysis operations;
- tracking stats;
- IO operations.
This is broad coverage, but not every feature has an isolated benchmark. For example, there is no standalone benchmark specifically isolating StackOwner grouping or async attribution overhead.
2. Tracker Creation
The log reports:
tracker_creation time: [867.65 ns 873.02 ns 879.91 ns]
change: [+551.47% +559.50% +568.23%]
Performance has regressed.
Interpretation:
- absolute time is still sub-microsecond;
- relative performance regressed significantly against the previous baseline;
- this should not be described as an unqualified win.
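The size of this regression is easier to judge once the implied previous baseline is backed out of the two logged figures. The following is a minimal sketch, not code from the repository: the function name `implied_baseline_ns` is hypothetical, and it assumes the logged change is defined as (new - old) / old.

```rust
/// Back out the previous baseline implied by a new median and a relative change.
/// Assumption: `change_pct` means (new - old) / old * 100.
/// The function name is hypothetical and not part of memscope-rs.
fn implied_baseline_ns(new_median_ns: f64, change_pct: f64) -> f64 {
    new_median_ns / (1.0 + change_pct / 100.0)
}
```

Under that assumption, `implied_baseline_ns(873.02, 559.50)` comes out at roughly 132 ns, so tracker creation moved from about 0.13 µs to about 0.87 µs: a large relative change on a small absolute cost.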
3. Single Variable Tracking
Representative results:
| Benchmark | Approximate Median |
|---|---|
| track_single/vec/64 | 653.02 ns |
| track_single/vec/256 | 662.81 ns |
| track_single/vec/1024 | 666.22 ns |
| track_single/vec/4096 | 725.49 ns |
| track_single/vec/65536 | 1.1087 µs |
| track_single/vec/1048576 | 4.9307 µs |
The small-object tracking path is sub-microsecond, but the log marks several of these as regressions relative to earlier runs.
Interpretation:
- single tracking overhead stays below 1 µs for sizes up to 4096 and reaches about 4.9 µs at 1048576, which is practically usable for profiling;
- overhead is not zero;
- recent changes appear to have increased latency for smaller payloads.
4. Multiple Variable Tracking
Representative results:
| Variables | Approximate Median |
|---|---|
| 10 | 6.5364 µs |
| 50 | 33.240 µs |
| 100 | 66.963 µs |
| 1000 | 669.67 µs |
| 5000 | 3.3126 ms |
| 10000 | 6.5949 ms |
This is close to linear scaling.
Interpretation:
- batch tracking cost grows predictably;
- the per-item cost is roughly stable;
- the log marks these paths as regressions against the prior baseline.
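The "close to linear" reading can be checked by dividing each median in the table above by its variable count. This is a small sketch over the table values only; the helper names `per_item_ns` and `spread` and the constant `MULTI_TRACK` are hypothetical, not taken from the benchmark source.

```rust
/// Medians from the multiple-variable table, as (variable_count, total_ns).
const MULTI_TRACK: [(u64, f64); 6] = [
    (10, 6_536.4),
    (50, 33_240.0),
    (100, 66_963.0),
    (1_000, 669_670.0),
    (5_000, 3_312_600.0),
    (10_000, 6_594_900.0),
];

/// Per-item cost in nanoseconds for each (item_count, total_ns) sample.
fn per_item_ns(samples: &[(u64, f64)]) -> Vec<f64> {
    samples.iter().map(|&(n, total)| total / n as f64).collect()
}

/// Ratio of the largest to the smallest per-item cost.
/// A value near 1.0 indicates close-to-linear scaling.
fn spread(costs: &[f64]) -> f64 {
    let max = costs.iter().copied().fold(f64::MIN, f64::max);
    let min = costs.iter().copied().fold(f64::MAX, f64::min);
    max / min
}
```

For these values the per-item cost stays between roughly 654 ns and 670 ns, and the spread is under 1.05, which supports describing the batch path as linear over the measured range.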
5. Analysis Cost
Representative analysis results:
| Records | Approximate Median |
|---|---|
| 10 | 5.2939 µs |
| 50 | 16.028 µs |
| 100 | 29.759 µs |
| 1000 | 285.64 µs |
| 5000 | 1.5684 ms |
| 10000 | 4.1880 ms |
| 50000 | 33.887 ms |
Interpretation:
- small and medium analysis is fast enough for interactive use;
- large analysis becomes millisecond-scale;
- 50,000 records take about 34 ms;
- several larger analysis cases show significant relative regressions.
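One way to describe how analysis cost grows is an empirical scaling exponent between two rows of the table above. The sketch below assumes a power-law model, time proportional to n^k, which is a modelling assumption and not something the log states; the function name `scaling_exponent` is hypothetical.

```rust
/// Empirical scaling exponent k between two samples, assuming time ~ n^k.
/// k near 1.0 means linear growth; k above 1.0 means superlinear growth.
/// Both times must be in the same unit.
fn scaling_exponent(n1: f64, t1: f64, n2: f64, t2: f64) -> f64 {
    (t2 / t1).ln() / (n2 / n1).ln()
}
```

Applied to the table, `scaling_exponent(100.0, 29.759, 1000.0, 285.64)` is close to 1.0, while `scaling_exponent(1000.0, 285.64, 50000.0, 33887.0)` is about 1.2. Under this model, analysis looks roughly linear in the mid range and mildly superlinear at the largest record counts.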
6. Backend Event Construction
Representative backend allocation times:
| Backend | Alloc Median |
|---|---|
| Core | 23.022 ns |
| Lockfree | 39.265 ns |
| Async | 23.008 ns |
| Unified | 39.512 ns |
Representative deallocation times:
| Backend | Dealloc Median |
|---|---|
| Core | 22.859 ns |
| Lockfree | 38.586 ns |
| Async | 22.632 ns |
| Unified | 39.147 ns |
Interpretation:
- backend event construction is nanosecond-scale;
- Core and Async are similar in this benchmark;
- Lockfree and Unified are slower at about 39 ns, roughly 1.7 times the Core median;
- this benchmark measures event construction, not full end-to-end application overhead.
7. Concurrent Tracking
Representative concurrent tracking results:
| Threads | Approximate Median |
|---|---|
| 1 | 19.174 µs |
| 2 | 40.599 µs |
| 4 | 55.303 µs |
| 8 | 134.74 µs |
| 16 | 372.96 µs |
| 32 | 961.58 µs |
| 64 | 1.8646 ms |
| 128 | 4.6714 ms |
Interpretation:
- concurrency works;
- scaling is not linear: the 128-thread median is about 244 times the 1-thread median;
- thread scheduling and shared state costs are visible;
- 48-thread results show regression, while some larger thread-count cases show improvement against the previous baseline.
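Because each row of the concurrent table doubles the thread count, the ratio between consecutive medians shows where scaling departs from linear. This is a sketch over the table values; `doubling_ratios` and `CONCURRENT_US` are hypothetical names, and reading a ratio of 2.0 as "cost proportional to thread count" assumes each thread performs a fixed amount of work, which the log excerpt does not confirm.

```rust
/// Medians from the concurrent tracking table, in microseconds,
/// for 1, 2, 4, 8, 16, 32, 64, and 128 threads.
const CONCURRENT_US: [f64; 8] = [
    19.174, 40.599, 55.303, 134.74, 372.96, 961.58, 1864.6, 4671.4,
];

/// Ratio of each median to the previous one.
/// With doubling thread counts, 2.0 would mean time grows in step with threads.
fn doubling_ratios(medians_us: &[f64]) -> Vec<f64> {
    medians_us.windows(2).map(|w| w[1] / w[0]).collect()
}
```

For these medians the ratios range from about 1.4 (2 to 4 threads) to about 2.8 (8 to 16 threads), so some doublings are cheaper than linear and several are more expensive.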
8. Shared Tracker Concurrent Tracking
Representative shared tracker results:
| Threads | Approximate Median |
|---|---|
| 1 | 98.300 µs |
| 2 | 231.96 µs |
| 4 | 363.82 µs |
| 8 | 924.19 µs |
| 16 | 1.8448 ms |
| 32 | 3.5680 ms |
| 64 | 7.0581 ms |
Interpretation:
- sharing one tracker across many threads is a stress scenario;
- costs grow clearly with thread count;
- the log shows improvements against previous runs, but absolute shared-state cost remains visible.
9. Allocation Patterns
Representative results:
| Pattern | Approximate Median | Log Status |
|---|---|---|
| many small allocations | 809.84 µs | regressed |
| few large allocations | 96.308 µs | improved |
| mixed size allocations | 111.44 µs | regressed |
| burst allocations | 789.08 µs | no significant change |
Interpretation:
- many small allocations remain expensive, at about 8 times the median of few large allocations;
- allocation pattern matters;
- performance should be discussed by workload shape, not one global number.
10. Tracking Stats
Some statistics operations are extremely cheap:
| Operation | Approximate Median |
|---|---|
| stats_record_attempt | 1.8200 ns |
| stats_record_success | 1.8211 ns |
| stats_record_miss | 3.2607 ns |
| stats_get_completeness | 548.58 ps |
| stats_get_detailed_stats | 1.6463 ns |
These are tiny operations, but they should not be confused with full tracking or analysis cost.
11. What the Benchmark Does Not Prove
The benchmark log does not prove:
- production overhead under all workloads;
- async attribution overhead in isolation;
- StackOwner grouping cost in isolation;
- dashboard rendering cost under large reports;
- memory overhead under long-running services;
- correctness of relation inference.
Benchmarks measure performance of specific paths, not the whole tool in every environment.
12. Honest Summary
The accurate performance story is:
- core event construction is nanosecond-scale;
- explicit variable tracking is usually sub-microsecond for small values;
- batch tracking scales roughly linearly;
- analysis becomes millisecond-scale for large record counts;
- concurrency is supported but not free;
- shared tracker scenarios show real contention cost;
- benchmark logs include both improvements and regressions.
The most honest phrasing is:
memscope-rs has practical profiling overhead for many measured paths, but it is not zero-cost and the benchmark log shows active performance evolution.