Benchmark Deep Dive: What the Numbers Actually Say

Performance claims are easy to overstate. This article keeps the interpretation narrow and tied to the benchmark source and log that exist in the repository.

The benchmark suite is in:

benches/comprehensive_benchmarks.rs

The raw benchmark log reviewed here is:

benches/run.log

The goal is not to claim that memscope-rs is always fast. The goal is to identify which operations are currently cheap, which operations scale linearly, and where the log already shows regressions.


1. Benchmark Coverage

The benchmark source includes groups such as:

  • tracker creation;
  • single variable tracking;
  • multiple variable tracking;
  • analysis;
  • stats;
  • backend alloc/dealloc/realloc/move;
  • type classification;
  • concurrent tracking;
  • parallel tracking;
  • shared tracker concurrent tracking;
  • allocation patterns;
  • analysis operations;
  • tracking stats;
  • IO operations.

This is broad coverage, but not every feature has an isolated benchmark. For example, there is no standalone benchmark specifically isolating StackOwner grouping or async attribution overhead.


2. Tracker Creation

The log reports:

tracker_creation time: [867.65 ns 873.02 ns 879.91 ns]
change: [+551.47% +559.50% +568.23%]
Performance has regressed.

Interpretation:

  • absolute time is still sub-microsecond;
  • relative to the previous baseline, the point estimate is about 6.6 times higher (+559.50%);
  • this should not be described as an unqualified win.
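To put the percentage in absolute terms, the previous estimate can be backed out from the current one. The sketch below is only an illustration: the function name is made up for this article, and it assumes the middle change value (+559.50%) applies to the middle time estimate (873.02 ns), which the log excerpt does not state explicitly.

```rust
/// Back out the estimate implied for the previous run from the current
/// estimate and a relative change given in percent.
/// Hypothetical helper for this article, not part of memscope-rs.
fn implied_baseline(current: f64, change_percent: f64) -> f64 {
    current / (1.0 + change_percent / 100.0)
}

fn main() {
    // Middle values from the tracker_creation log excerpt above.
    let current_ns = 873.02;
    let change_percent = 559.50;

    let previous_ns = implied_baseline(current_ns, change_percent);
    println!("implied previous estimate: {:.1} ns", previous_ns);
    println!("current / previous: {:.1}x", current_ns / previous_ns);
}
```

Under that assumption the earlier run sat at roughly 132 ns, so the regression is large in relative terms even though both figures are well below a microsecond.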

3. Single Variable Tracking

Representative results:

Benchmark                    Approximate Median
track_single/vec/64          653.02 ns
track_single/vec/256         662.81 ns
track_single/vec/1024        666.22 ns
track_single/vec/4096        725.49 ns
track_single/vec/65536       1.1087 µs
track_single/vec/1048576     4.9307 µs

For sizes up to 4096, tracking stays sub-microsecond; the two largest sizes take about 1.1 µs and 4.9 µs. The log marks several of these results as regressions relative to earlier runs.

Interpretation:

  • single-variable tracking overhead is low enough to be practical for profiling;
  • overhead is not zero;
  • recent changes appear to have increased latency for smaller payloads.

4. Multiple Variable Tracking

Representative results:

Variables    Approximate Median
10           6.5364 µs
50           33.240 µs
100          66.963 µs
1000         669.67 µs
5000         3.3126 ms
10000        6.5949 ms

This is close to linear scaling: every row works out to roughly 0.65 to 0.67 µs per variable.

Interpretation:

  • batch tracking cost grows predictably;
  • the per-item cost is roughly stable;
  • the log marks these paths as regressions against the prior baseline.
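The "roughly stable per-item cost" claim can be checked by dividing each median by its variable count. The following is a minimal sketch of that arithmetic; the constant and function names are invented for illustration, and the medians are the table values converted to microseconds.

```rust
/// (variables, median in microseconds), copied from the table above.
const BATCH_MEDIANS_US: [(u32, f64); 6] = [
    (10, 6.5364),
    (50, 33.240),
    (100, 66.963),
    (1_000, 669.67),
    (5_000, 3_312.6),
    (10_000, 6_594.9),
];

/// Cost per variable in microseconds for a batch of `count` variables.
fn per_item_us(count: u32, total_us: f64) -> f64 {
    total_us / f64::from(count)
}

fn main() {
    for (count, total_us) in BATCH_MEDIANS_US {
        println!(
            "{:>6} variables: {:.3} µs per variable",
            count,
            per_item_us(count, total_us)
        );
    }
}
```

If the per-variable figure drifted upward with batch size, that would point to a superlinear component; in this log it stays inside a narrow band.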

5. Analysis Cost

Representative analysis results:

Records    Approximate Median
10         5.2939 µs
50         16.028 µs
100        29.759 µs
1000       285.64 µs
5000       1.5684 ms
10000      4.1880 ms
50000      33.887 ms

Interpretation:

  • analysis of up to 1,000 records finishes in under 300 µs, fast enough for interactive use;
  • from 5,000 records upward, analysis is millisecond-scale;
  • 50,000 records takes about 34 ms;
  • several larger analysis cases show significant relative regressions.
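Dividing each median by its record count shows how the per-record cost changes with input size. This is a sketch of that calculation only; the names are hypothetical, and the figures are the table values expressed in microseconds.

```rust
/// (records, median in microseconds), copied from the table above.
const ANALYSIS_MEDIANS_US: [(u32, f64); 7] = [
    (10, 5.2939),
    (50, 16.028),
    (100, 29.759),
    (1_000, 285.64),
    (5_000, 1_568.4),
    (10_000, 4_188.0),
    (50_000, 33_887.0),
];

/// Per-record cost in nanoseconds for each row of the table.
fn per_record_ns() -> Vec<(u32, f64)> {
    ANALYSIS_MEDIANS_US
        .iter()
        .map(|&(records, total_us)| (records, total_us * 1_000.0 / f64::from(records)))
        .collect()
}

fn main() {
    for (records, ns) in per_record_ns() {
        println!("{:>6} records: {:.0} ns per record", records, ns);
    }
}
```

In this log the per-record cost is lowest around 1,000 records (about 286 ns) and rises to about 678 ns at 50,000, so analysis grows somewhat faster than linearly at the large end. The log alone does not say why.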

6. Backend Event Construction

Representative backend allocation times:

Backend     Alloc Median
Core        23.022 ns
Lockfree    39.265 ns
Async       23.008 ns
Unified     39.512 ns

Representative deallocation times:

Backend     Dealloc Median
Core        22.859 ns
Lockfree    38.586 ns
Async       22.632 ns
Unified     39.147 ns

Interpretation:

  • backend event construction is nanosecond-scale;
  • Core and Async are similar in this benchmark;
  • Lockfree and Unified are slower at roughly 39 ns, about 1.7 times the Core and Async figures;
  • this benchmark measures event construction, not full end-to-end application overhead.

7. Concurrent Tracking

Representative concurrent tracking results:

Threads    Approximate Median
1          19.174 µs
2          40.599 µs
4          55.303 µs
8          134.74 µs
16         372.96 µs
32         961.58 µs
64         1.8646 ms
128        4.6714 ms

Interpretation:

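  • concurrent tracking completes at every measured thread count, from 1 to 128;
  • scaling is not linear: going from 1 to 128 threads raises the median by a factor of about 244;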
  • thread scheduling and shared state costs are visible;
  • the 4- and 8-thread results show regressions, while some larger thread-count cases show improvement against the previous baseline.
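One way to read the table is to compare each median with the single-thread median and set that ratio against the thread count. The sketch below does only that; the names are illustrative, and it makes no assumption about how much work each thread performs, since the log does not say.

```rust
/// (threads, median in microseconds), copied from the table above.
const CONCURRENT_MEDIANS_US: [(u32, f64); 8] = [
    (1, 19.174),
    (2, 40.599),
    (4, 55.303),
    (8, 134.74),
    (16, 372.96),
    (32, 961.58),
    (64, 1_864.6),
    (128, 4_671.4),
];

/// Ratio of the median at `threads` to the single-thread median,
/// or `None` if that thread count is not in the table.
fn time_ratio(threads: u32) -> Option<f64> {
    let base = CONCURRENT_MEDIANS_US[0].1;
    CONCURRENT_MEDIANS_US
        .iter()
        .find(|&&(t, _)| t == threads)
        .map(|&(_, us)| us / base)
}

fn main() {
    for (threads, _) in CONCURRENT_MEDIANS_US {
        let ratio = time_ratio(threads).expect("thread count is in the table");
        println!("{:>3} threads: {:.1}x the single-thread median", threads, ratio);
    }
}
```

At low thread counts the ratio stays below the thread count, while at 64 and 128 threads it exceeds it, which is what "scaling is not linear" looks like in these numbers.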

8. Shared Tracker Concurrent Tracking

Representative shared tracker results:

Threads    Approximate Median
1          98.300 µs
2          231.96 µs
4          363.82 µs
8          924.19 µs
16         1.8448 ms
32         3.5680 ms
64         7.0581 ms

Interpretation:

  • sharing one tracker across many threads is a stress scenario;
  • costs grow clearly with thread count;
  • the log shows improvements against previous runs, but absolute shared-state cost remains visible.

9. Allocation Patterns

Representative results:

Pattern                   Approximate Median    Log Status
many small allocations    809.84 µs             regressed
few large allocations     96.308 µs             improved
mixed size allocations    111.44 µs             regressed
burst allocations         789.08 µs             no significant change

Interpretation:

  • many small allocations remain expensive, at about 8.4 times the median of few large allocations;
  • allocation pattern matters;
  • performance should be discussed by workload shape, not one global number.

10. Tracking Stats

Some statistics operations are extremely cheap:

Operation                   Approximate Median
stats_record_attempt        1.8200 ns
stats_record_success        1.8211 ns
stats_record_miss           3.2607 ns
stats_get_completeness      548.58 ps
stats_get_detailed_stats    1.6463 ns

These are tiny operations, but they should not be confused with full tracking or analysis cost.
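The figures in this article span picoseconds to milliseconds, so comparing them safely means converting to one unit first. Below is a minimal sketch of such a converter; `to_nanos` is a hypothetical helper written for this article, and it only understands the "<number> <unit>" form used in the tables.

```rust
/// Convert a value such as "548.58 ps" or "4.9307 µs" to nanoseconds.
/// Returns `None` if the text is not "<number> <unit>" with a known unit.
/// Hypothetical helper for this article, not part of memscope-rs.
fn to_nanos(text: &str) -> Option<f64> {
    let mut parts = text.split_whitespace();
    let value: f64 = parts.next()?.parse().ok()?;
    let factor = match parts.next()? {
        "ps" => 1e-3,
        "ns" => 1.0,
        // Accept the micro sign, the Greek letter mu, and the ASCII fallback.
        "µs" | "μs" | "us" => 1e3,
        "ms" => 1e6,
        "s" => 1e9,
        _ => return None,
    };
    if parts.next().is_some() {
        return None; // trailing text after the unit
    }
    Some(value * factor)
}

fn main() {
    for text in ["548.58 ps", "1.8200 ns", "653.02 ns", "6.5949 ms"] {
        match to_nanos(text) {
            Some(ns) => println!("{:>10} = {} ns", text, ns),
            None => println!("{:>10} could not be parsed", text),
        }
    }
}
```

Expressed in nanoseconds, stats_get_completeness is about 0.55 ns against 653.02 ns for track_single/vec/64, a gap of roughly three orders of magnitude, which is why the stats figures say little about tracking cost.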


11. What the Benchmark Does Not Prove

The benchmark log does not prove:

  • production overhead under all workloads;
  • async attribution overhead in isolation;
  • StackOwner grouping cost in isolation;
  • dashboard rendering cost under large reports;
  • memory overhead under long-running services;
  • correctness of relation inference.

Benchmarks measure performance of specific paths, not the whole tool in every environment.


12. Honest Summary

The accurate performance story is:

  • core event construction is nanosecond-scale;
  • explicit variable tracking is usually sub-microsecond for small values;
  • batch tracking scales roughly linearly;
  • analysis becomes millisecond-scale for large record counts;
  • concurrency is supported but not free;
  • shared tracker scenarios show real contention cost;
  • benchmark logs include both improvements and regressions.

The most honest phrasing is:

memscope-rs has practical profiling overhead for many measured paths, but it is not zero-cost and the benchmark log shows active performance evolution.