profiling by facet-rs
Profile code performance using callgrind and valgrind with nextest integration for analyzing instruction counts, cache behavior, and identifying bottlenecks
Updated Jan 12, 2026, 01:07 AM
---
name: profiling
description: Profile code performance using callgrind and valgrind with nextest integration for analyzing instruction counts, cache behavior, and identifying bottlenecks
---
# Profiling with Valgrind, Callgrind, and Nextest
The facet project has pre-configured valgrind integration for debugging crashes, memory leaks, and performance profiling.
## Quick Usage
```bash
# Run test under valgrind (memory errors + leaks)
cargo nextest run --profile valgrind -p PACKAGE TEST_FILTER
# Run test under callgrind (profiling)
valgrind --tool=callgrind --callgrind-out-file=callgrind.out \
cargo nextest run --no-fail-fast -p PACKAGE TEST_FILTER
# Analyze callgrind output
callgrind_annotate callgrind.out
# or with GUI
kcachegrind callgrind.out # Linux
qcachegrind callgrind.out # macOS
```
## Nextest Valgrind Profile
The project has a pre-configured valgrind profile in `.config/nextest.toml`:
### Configuration
```toml
[scripts.wrapper.valgrind]
# Leak checking configuration
command = 'valgrind --leak-check=full --show-leak-kinds=all --errors-for-leak-kinds=definite,indirect --error-exitcode=1'
[[profile.valgrind.scripts]]
# Apply to all tests on Linux
platform = 'cfg(target_os = "linux")'
filter = 'all()'
run-wrapper = 'valgrind'
```
**What it does:**
- `--leak-check=full` - Show details for each leak
- `--show-leak-kinds=all` - Show all leak types for diagnostics
- `--errors-for-leak-kinds=definite,indirect` - Only fail on real leaks (not "still reachable")
- `--error-exitcode=1` - Exit with code 1 if errors found
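To make the effect of the wrapper concrete, the sketch below shows how a wrapper command ends up in front of a test invocation. It is an illustration only: `wrapped_command`, the binary path, and the test arguments are hypothetical, and the exact arguments nextest passes to a test binary are an assumption.

```python
import shlex

# The wrapper command from the [scripts.wrapper.valgrind] entry above.
WRAPPER = ("valgrind --leak-check=full --show-leak-kinds=all "
           "--errors-for-leak-kinds=definite,indirect --error-exitcode=1")

def wrapped_command(test_binary, test_args):
    """Prefix a test invocation with the wrapper command (hypothetical helper)."""
    return shlex.split(WRAPPER) + [test_binary, *test_args]

# Hypothetical binary path and test arguments, for illustration only.
cmd = wrapped_command("target/debug/deps/example_test",
                      ["--exact", "test_simple_struct"])
print(shlex.join(cmd))
```

Because valgrind becomes the process that runs the test, `--error-exitcode=1` is what turns a detected memory error into a failing test.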
### Usage
```bash
# Run specific test
cargo nextest run --profile valgrind -p facet-format-json test_simple_struct
# Run all tests in a file
cargo nextest run --profile valgrind -p facet-format-json --test jit_deserialize
# Run with filter
cargo nextest run --profile valgrind -p facet-json booleans
```
**Benefits:**
- ✅ Automatic configuration - no manual valgrind commands
- ✅ Consistent flags across team
- ✅ Integrated with nextest filtering
- ✅ Clean, formatted output
## Profiling with Callgrind
Callgrind is a valgrind tool for profiling instruction counts and function call graphs.
### Basic Profiling
```bash
# Profile a specific test
valgrind --tool=callgrind \
--callgrind-out-file=callgrind.out \
cargo nextest run --no-fail-fast -p PACKAGE TEST_NAME
# Analyze output
callgrind_annotate callgrind.out
```
### Advanced Options
```bash
# Collect cache simulation data (slower but more detailed)
valgrind --tool=callgrind \
--cache-sim=yes \
--branch-sim=yes \
--callgrind-out-file=callgrind.out \
cargo nextest run --no-fail-fast -p PACKAGE TEST_NAME
# Focus on specific function
valgrind --tool=callgrind \
--toggle-collect=main \
--callgrind-out-file=callgrind.out \
cargo nextest run --no-fail-fast -p PACKAGE TEST_NAME
# Name and position compression (both already default to yes; the output
# stays plain text, so keep the regular file name)
valgrind --tool=callgrind \
--compress-strings=yes \
--compress-pos=yes \
--callgrind-out-file=callgrind.out \
cargo nextest run --no-fail-fast -p PACKAGE TEST_NAME
```
### Analyzing Callgrind Output
#### Command Line (callgrind_annotate)
```bash
# Full report
callgrind_annotate callgrind.out
# Focus on specific functions (--include only adds source directories, so filter the report)
callgrind_annotate callgrind.out | grep 'facet::'
# Show only top functions
callgrind_annotate --auto=yes --threshold=1 callgrind.out
# Compare two runs (callgrind_annotate has no diff mode; diff the text reports)
diff <(callgrind_annotate callgrind.old.out) <(callgrind_annotate callgrind.new.out)
```
**Reading the output:**
```
Ir # Instruction reads (total)
I1mr # L1 instruction cache misses
ILmr # Last-level instruction cache misses
Dr # Data reads
Dw # Data writes
D1mr, D1mw # L1 data cache read/write misses
DLmr, DLmw # Last-level data cache read/write misses
--------------------------------------------------------------------------------
Ir file:function
--------------------------------------------------------------------------------
1,234,567 (45%) facet_format_json::deserialize
987,654 (35%) facet_format::parse_value
...
```
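For scripting, the per-function rows of the report can be pulled out with a small parser. The following is a minimal sketch, assuming rows shaped like the sample above (a comma-grouped count, a parenthesised percentage, then the function name). The exact layout varies between valgrind versions, and `parse_annotate_rows` is a hypothetical helper name.

```python
import re

# Matches rows such as "1,234,567 (45%) facet_format_json::deserialize".
# Assumption: count, parenthesised percentage, then the name; adjust the
# pattern if your valgrind version prints a different layout.
ROW = re.compile(r'^\s*([\d,]+)\s+\(\s*([\d.]+)%\)\s+(\S.*)$')

def parse_annotate_rows(text):
    """Return (instructions, percent, name) tuples from callgrind_annotate text."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            count, percent, name = m.groups()
            rows.append((int(count.replace(',', '')), float(percent), name.strip()))
    return rows

sample = """\
1,234,567 (45%) facet_format_json::deserialize
987,654 (35%) facet_format::parse_value
"""
for ir, pct, name in parse_annotate_rows(sample):
    print(f"{name}: {ir:,} Ir ({pct:g}%)")
```

Header and separator lines do not start with a count, so they are skipped without special handling.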
#### GUI (KCachegrind/QCachegrind)
Install:
```bash
# Linux
sudo apt install kcachegrind
# macOS
brew install qcachegrind
# Windows (WSL)
sudo apt install kcachegrind
```
Launch:
```bash
kcachegrind callgrind.out # Linux
qcachegrind callgrind.out # macOS
```
**GUI features:**
- Call graph visualization
- Flamegraph-like views
- Source code annotation (if debug symbols available)
- Caller/callee relationships
- Multiple metrics (instructions, cache misses, branches)
## Profiling Benchmarks
The generated benchmark tests (from `benchmarks.json`) can be profiled:
### 1. As Tests (Recommended for Callgrind)
```bash
# Profile a benchmark test under callgrind
valgrind --tool=callgrind \
--callgrind-out-file=callgrind_simple_struct.out \
cargo nextest run --no-fail-fast -p facet-json test_simple_struct
# Analyze
callgrind_annotate callgrind_simple_struct.out
```
**Why use tests:**
- Single iteration = cleaner callgrind output
- No benchmark harness overhead
- Easier to focus on hot path
- Faster to run
### 2. As Benchmarks (For Realistic Instruction Counts)
The benchmark harness (gungraun) already uses valgrind internally:
```bash
# Run gungraun benchmark (uses callgrind automatically)
cargo bench --bench unified_benchmarks_gungraun --features jit simple_struct
# Check output in bench-reports/gungraun-*.txt
```
**gungraun automatically collects:**
- Instructions executed
- Estimated cycles
- L1/LL cache hits
- RAM hits
- Total read/write operations
This data appears in `bench-reports/perf/RESULTS.md`.
## Common Profiling Workflows
### Debug a Crash
```bash
# 1. Run under valgrind to find memory error
cargo nextest run --profile valgrind -p PACKAGE TEST_NAME
# 2. Read valgrind output for exact error location
# Example: "Invalid read of size 8 at 0x123456"
# 3. Fix the bug
# 4. Verify fix (re-run under valgrind to confirm the error is gone)
cargo nextest run --profile valgrind -p PACKAGE TEST_NAME
```
### Find Performance Bottleneck
```bash
# 1. Profile with callgrind
valgrind --tool=callgrind \
--callgrind-out-file=profile.out \
cargo nextest run --no-fail-fast -p facet-json test_booleans
# 2. Analyze
callgrind_annotate --auto=yes profile.out | head -30
# 3. Identify hot functions (high instruction counts)
# 4. Optimize hot functions
# 5. Re-profile and compare
valgrind --tool=callgrind \
--callgrind-out-file=profile_after.out \
cargo nextest run --no-fail-fast -p facet-json test_booleans
diff <(callgrind_annotate profile.out) <(callgrind_annotate profile_after.out)
```
### Optimize Tier-2 JIT
```bash
# 1. Check RESULTS.md for slow benchmarks
grep "⚠" bench-reports/perf/RESULTS.md
# 2. Profile the slow benchmark test
valgrind --tool=callgrind \
--callgrind-out-file=jit_profile.out \
cargo nextest run --no-fail-fast -p facet-json test_long_strings --features jit
# 3. Analyze with GUI for visual call graph
kcachegrind jit_profile.out
# 4. Look for:
# - Helper function calls in tight loops
# - Redundant alignment checks
# - Allocation hot spots
# 5. Optimize based on findings
# 6. Verify with benchmarks
cargo xtask bench long_strings
```
### Compare Before/After Optimization
```bash
# Before
git checkout main
valgrind --tool=callgrind --callgrind-out-file=before.out \
cargo nextest run --no-fail-fast -p facet-json test_target
# After
git checkout my-optimization-branch
valgrind --tool=callgrind --callgrind-out-file=after.out \
cargo nextest run --no-fail-fast -p facet-json test_target
# Compare (callgrind_annotate has no diff mode; diff the text reports)
diff <(callgrind_annotate before.out) <(callgrind_annotate after.out)
```
## Interpreting Valgrind Output
### Memory Error Example
```
==12345== Invalid read of size 8
==12345== at 0x123456: facet_format_json::parse_number (parse.rs:42)
==12345== by 0x234567: facet_format_json::deserialize (lib.rs:123)
==12345== Address 0x789abc is 0 bytes after a block of size 16 alloc'd
==12345== at 0x345678: alloc (alloc.rs:88)
==12345== by 0x456789: Vec::push (vec.rs:1234)
```
**Translation:**
- Reading 8 bytes from invalid address
- Happened in `parse_number` at line 42
- Address is just past end of 16-byte allocation
- **Fix:** Check bounds before reading, or fix off-by-one error
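When a log is long, the error headline and the first frame with source information can be extracted mechanically. This is a rough sketch, assuming memcheck's usual `==PID==` prefix and frames of the form `at 0xADDR: function (file:line)` as in the example above. `first_error_location` is a hypothetical helper, and frames without `file:line` information are skipped.

```python
import re

# Assumption: an "Invalid read/write" headline followed by stack frames
# shaped like "at 0xADDR: function (file:line)".
HEADLINE = re.compile(r'^==\d+== (Invalid (?:read|write) of size \d+)')
FRAME = re.compile(r'^==\d+==\s+(?:at|by) 0x[0-9A-Fa-f]+: (.+?) \(([^():]+):(\d+)\)')

def first_error_location(log):
    """Return (headline, function, file, line) for the first invalid access, or None."""
    headline = None
    for line in log.splitlines():
        if headline is None:
            m = HEADLINE.match(line)
            if m:
                headline = m.group(1)
        else:
            m = FRAME.match(line)
            if m:
                func, path, lineno = m.groups()
                return headline, func, path, int(lineno)
    return None

log = """\
==12345== Invalid read of size 8
==12345== at 0x123456: facet_format_json::parse_number (parse.rs:42)
==12345== by 0x234567: facet_format_json::deserialize (lib.rs:123)
"""
print(first_error_location(log))
```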
### Leak Example
```
==12345== 128 bytes in 1 blocks are definitely lost in loss record 1 of 10
==12345== at 0x123456: malloc (vg_replace_malloc.c:299)
==12345== by 0x234567: alloc (alloc.rs:88)
==12345== by 0x345678: Box::new (boxed.rs:123)
==12345== by 0x456789: setup_jit (jit.rs:456)
```
**Translation:**
- 128 bytes allocated but never freed
- Allocated in `setup_jit` function
- **Fix:** Ensure cleanup/Drop implementation
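Leak records can also be totalled per kind, which helps when a run reports many loss records. The sketch below assumes the record wording shown in the example above. `summarize_leaks` is a hypothetical helper, and records that print a "(direct, indirect)" byte breakdown are not handled.

```python
import re

# Assumption: records look like
# "==PID== 128 bytes in 1 blocks are definitely lost in loss record 1 of 10".
LOSS = re.compile(
    r'^==\d+== ([\d,]+) bytes in ([\d,]+) blocks are '
    r'(definitely|indirectly|possibly) lost'
)

def summarize_leaks(log):
    """Return {kind: (total_bytes, total_blocks)} for the leak records in a log."""
    totals = {}
    for line in log.splitlines():
        m = LOSS.match(line)
        if m:
            nbytes, nblocks, kind = m.groups()
            b, n = totals.get(kind, (0, 0))
            totals[kind] = (b + int(nbytes.replace(',', '')),
                            n + int(nblocks.replace(',', '')))
    return totals

log = """\
==12345== 128 bytes in 1 blocks are definitely lost in loss record 1 of 10
==12345== at 0x123456: malloc (vg_replace_malloc.c:299)
"""
print(summarize_leaks(log))
```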
### Callgrind Cache Simulation Output Example
```
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
--------------------------------------------------------------------------------
1,234,567 123 45 456,789 234 12 123,456 67 8 facet::deserialize
987,654 98 32 345,678 189 9 98,765 43 5 - facet::parse_value
234,567 23 10 98,765 45 2 23,456 12 1 - facet::parse_string
```
**Key metrics:**
- `Ir` - Instructions executed (most important for optimization)
- `D1mr/D1mw` - L1 data cache misses (indicates poor locality)
- `DLmr/DLmw` - Last-level cache misses (very expensive)
**Optimization targets:**
1. High `Ir` count = time-consuming function
2. High `D1mr` = poor data locality, consider restructuring
3. High `DLmr` = main memory accesses, critical to optimize
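Raw miss counts are easier to judge relative to the number of accesses. As a small illustration, the sketch below turns the counters from the `facet::deserialize` row of the sample above into miss rates. `miss_rate` is a hypothetical helper, and what counts as a "high" rate depends on the workload.

```python
def miss_rate(misses, accesses):
    """Fraction of accesses that missed; 0.0 when there were no accesses."""
    return misses / accesses if accesses else 0.0

# Counters copied from the facet::deserialize row of the sample above.
row = {"Ir": 1_234_567, "Dr": 456_789, "D1mr": 234, "DLmr": 12,
       "Dw": 123_456, "D1mw": 67, "DLmw": 8}

d1_read = miss_rate(row["D1mr"], row["Dr"])   # L1 data read miss rate
ll_read = miss_rate(row["DLmr"], row["Dr"])   # last-level data read miss rate
d1_write = miss_rate(row["D1mw"], row["Dw"])  # L1 data write miss rate

print(f"D1 read miss rate:  {d1_read:.3%}")
print(f"LL read miss rate:  {ll_read:.4%}")
print(f"D1 write miss rate: {d1_write:.3%}")
```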
## Profiling Flags
### Valgrind (Memory Debugging)
```bash
--leak-check=full # Detailed leak info
--show-leak-kinds=all # Show all leak types
--track-origins=yes # Track uninitialized values (slower)
--verbose # More diagnostic info
--log-file=valgrind.log # Save output to file
```
### Callgrind (Profiling)
```bash
--callgrind-out-file=FILE # Output file (default: callgrind.out.<pid>)
--cache-sim=yes # Simulate cache behavior
--branch-sim=yes # Simulate branch prediction
--collect-jumps=yes # Collect jump information
--dump-instr=yes # Dump instruction info
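--compress-strings=yes # Compress repeated names (default: yes)
--trace-children=yes # Follow child processes (off by default; needed to reach test binaries spawned by cargo)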
```
### Cargo Nextest
```bash
--no-fail-fast # Continue running after first failure
--profile valgrind # Use valgrind profile from nextest.toml
--test-threads=1 # Run single-threaded (better for profiling)
```
## Tips and Tricks
### Speed Up Profiling
1. **Profile in release mode** (but keep debug symbols):
```toml
# Add to Cargo.toml
[profile.release]
debug = true
```
2. **Use `--no-fail-fast` to avoid stopping early**
3. **Filter to specific tests** - don't profile everything at once
4. **Disable address randomization** for reproducible runs:
```bash
setarch $(uname -m) -R valgrind --tool=callgrind ...
```
### Read Callgrind Data Programmatically
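```python
# Example: sum self cost (first event, normally Ir) per function from a raw
# callgrind.out file. Simplified: assumes the default "positions: line" layout.
import re

def parse_callgrind(filename):
    costs, names = {}, {}  # names maps compressed ids like "(12)" to function names
    func, after_call = None, False
    with open(filename) as f:
        for line in f:
            if m := re.match(r'fn=(\(\d+\))?\s*(.*)', line):
                ident, name = m.groups()
                if ident and name:
                    names[ident] = name
                func = name or names.get(ident, ident)
            elif line.startswith('calls='):
                after_call = True  # the next cost line is the callee's inclusive cost
            elif m := re.match(r'[+\-*\d]\S*\s+(\d+)', line):
                if not after_call and func is not None:
                    costs[func] = costs.get(func, 0) + int(m.group(1))
                after_call = False
    return costs

# Compare two profiles
before = parse_callgrind('before.out')
after = parse_callgrind('after.out')
for func, old in before.items():
    if func in after and old:  # skip zero-cost entries to avoid dividing by zero
        delta = after[func] - old
        percent = (delta / old) * 100
        if abs(percent) > 5:  # More than 5% change
            print(f"{func}: {percent:+.1f}% ({delta:+,} instructions)")
```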
## Don't Do This
❌ Run valgrind without nextest profile - inconsistent flags
❌ Profile debug builds - too slow and unrepresentative
❌ Ignore "still reachable" leaks in FFI code without checking them - they are only sometimes OK
❌ Profile with multiple test threads - non-deterministic results
❌ Forget to clean between profiling runs - stale data
## Do This Instead
✅ Use `--profile valgrind` for memory debugging
✅ Use callgrind for performance profiling
✅ Profile release builds with debug symbols
✅ Focus on hot paths (high `Ir` counts)
✅ Compare before/after reports with `diff`
✅ Use GUI tools (kcachegrind) for complex call graphs
## Files and Locations
```
.config/nextest.toml # Valgrind profile configuration
callgrind.out.* # Callgrind output files (gitignored)
bench-reports/gungraun-*.txt # Gungraun output (includes instruction counts)
```
## Troubleshooting
### Valgrind complains about "unrecognized instruction"
- Update valgrind: `sudo apt update && sudo apt install valgrind`
- Or rebuild without CPU-specific codegen flags such as `-C target-cpu=native`; valgrind does not implement every instruction set extension
### Callgrind output is huge
- Keep `--compress-strings=yes --compress-pos=yes` enabled (both are the default)
- Or filter to specific functions with `--toggle-collect=function_name`
### Profile doesn't match benchmark results
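- Ensure you're profiling the same code path (valgrind only follows child processes with `--trace-children=yes`, so check that the profile covers the test binary rather than `cargo`)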
- Check if JIT compilation is cached (use setup functions in gungraun)
- Profile release build, not debug
### Can't open callgrind file in GUI
- Check file permissions
- Ensure file isn't corrupted (run `callgrind_annotate` first)
- Try different viewer (kcachegrind vs qcachegrind)
## See Also
- Valgrind manual: https://valgrind.org/docs/manual/manual.html
- Callgrind manual: https://valgrind.org/docs/manual/cl-manual.html
- Nextest wrapper scripts: https://nexte.st/docs/configuration/wrapper-scripts/
- KCachegrind handbook: https://docs.kde.org/stable5/en/kcachegrind/
- Project nextest config: `.config/nextest.toml`
- Benchmark debugging: See `benchmarking.md`