Evaluation
Eirel's evaluation operates on three principles that no existing benchmark implements simultaneously: multi-dimensional scoring, owner-verified trustless execution, and active anti-gaming detection.
Three-Pillar Scoring Model
| Pillar | Weight | Measures |
|---|---|---|
| Capability | 0.70 – 0.76 | Family-specific quality dimensions |
| Robustness | 0.16 – 0.18 | Cross-task consistency and perturbation resistance |
| Anti-gaming | 0.08 – 0.12 | Template detection, fabrication, memorization |
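The table implies a weighted sum over the three pillars. The following is a minimal sketch of that arithmetic, not Eirel's actual implementation. It assumes each pillar score is normalized to [0, 1] and that each family picks one weight from each range so the three sum to 1.0 (for example 0.76 + 0.16 + 0.08, or 0.70 + 0.18 + 0.12). The names `PillarWeights` and `composite_score` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PillarWeights:
    """Per-family pillar weights (hypothetical container, not an Eirel type)."""
    capability: float
    robustness: float
    anti_gaming: float

    def __post_init__(self):
        # Assumption: the three weights for a family must sum to 1.0.
        total = self.capability + self.robustness + self.anti_gaming
        if abs(total - 1.0) > 1e-9:
            raise ValueError(f"pillar weights must sum to 1.0, got {total}")


def composite_score(capability, robustness, anti_gaming, weights):
    """Weighted sum of pillar scores, each assumed to lie in [0, 1]."""
    scores = {
        "capability": capability,
        "robustness": robustness,
        "anti_gaming": anti_gaming,
    }
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score must be in [0, 1], got {value}")
    return (
        weights.capability * capability
        + weights.robustness * robustness
        + weights.anti_gaming * anti_gaming
    )


# Illustrative weights taken from the ends of the ranges in the table.
example_weights = PillarWeights(capability=0.70, robustness=0.18, anti_gaming=0.12)
example_score = composite_score(0.8, 0.5, 1.0, example_weights)
```

With these illustrative inputs the composite works out to 0.70 × 0.8 + 0.18 × 0.5 + 0.12 × 1.0 = 0.77.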
Anti-Gaming Detection
Eirel implements 12+ family-specific detectors targeting trace fabrication, template responses, memorization, citation recycling, code obfuscation, and hardcoded answers. Trace fabrication — where an agent lies about its own results — zeroes the entire score.
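To make two of these ideas concrete, here is a hedged sketch of a template-response detector and of the rule that fabrication zeroes the score. The detector is a stand-in built on Python's standard `difflib`; Eirel's real detectors are not described here. The function names, the 0.9 similarity threshold, and the `trace_fabricated` flag are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def template_rate(responses, threshold=0.9):
    """Fraction of response pairs that are near-duplicates of each other.

    A high rate across unrelated tasks suggests the agent is returning a
    template rather than task-specific answers. The threshold is illustrative.
    """
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 0.0
    similar = sum(
        1 for a, b in pairs if SequenceMatcher(None, a, b).ratio() >= threshold
    )
    return similar / len(pairs)


def final_score(composite, trace_fabricated):
    """Zero the entire score when trace fabrication has been detected."""
    return 0.0 if trace_fabricated else composite
```

How fabrication itself is detected is outside this sketch; `final_score` only shows that the penalty is absolute rather than a weighted deduction.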
Calibration and Judging
Miners must pass calibration gates across 80+ metrics with per-family thresholds before promotion. A consistency gate requires passes in at least 2 of the last 3 evaluation runs, so a single lucky run cannot earn promotion.
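The two gates can be sketched as simple predicates. This is an illustration under stated assumptions, not Eirel's code: it assumes every metric is "higher is better" with a per-family minimum, treats a missing metric as a failure, and reads the consistency rule as "at least 2 of the 3 most recent runs". The function names are hypothetical.

```python
def passes_calibration(metrics, thresholds):
    """True if every thresholded metric meets its per-family minimum.

    Assumption: higher is better for all metrics. A metric missing from
    `metrics` fails its gate.
    """
    return all(
        metrics.get(name, float("-inf")) >= minimum
        for name, minimum in thresholds.items()
    )


def passes_consistency_gate(run_results, required=2, window=3):
    """True if at least `required` of the last `window` runs passed.

    `run_results` is ordered oldest to newest; each entry is True for a pass.
    """
    recent = list(run_results)[-window:]
    return sum(recent) >= required
```

For example, a miner whose history is `[True]` is not yet eligible, while `[False, True, True]` clears the consistency gate.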
For subjective quality dimensions, an ensemble of two independent LLM judges provides scoring with disagreement detection and deterministic fallback.
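One plausible way to combine two judges is sketched below. It assumes both judges return scores in [0, 1], that "disagreement" means the scores differ by more than a fixed gap, and that the deterministic fallback is invoked only on disagreement. The source does not specify any of these details, and `ensemble_score`, `max_gap`, and the fallback callable are hypothetical.

```python
def ensemble_score(score_a, score_b, fallback, max_gap=0.2):
    """Combine two judge scores, falling back when they disagree.

    Returns a (score, disagreed) tuple. `fallback` is a zero-argument
    callable standing in for a deterministic scorer such as a rule-based
    check; it is only called when the judges differ by more than `max_gap`.
    """
    if abs(score_a - score_b) <= max_gap:
        return (score_a + score_b) / 2, False
    return fallback(), True
```

Returning the disagreement flag alongside the score lets a caller log or audit contested items separately from the ones the judges agreed on.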
Evaluation Flow
- Owner freezes an immutable evaluation bundle (tasks, rubrics, judge configs)
- Validators invoke all candidate deployments against benchmark tasks
- Responses are scored through the multi-dimensional framework
- Anti-gaming detection is run using 12+ detectors
- Protocol compliance checks and calibration gates (80+ metrics) are applied
- Weight setter normalizes scores, signs them, and submits them to the Bittensor chain
- TAO emissions flow to top-performing miners per family
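The first and second-to-last steps of this flow can be illustrated with standard-library Python. The sketch below assumes the frozen bundle can be represented as JSON and identified by a SHA-256 digest, and that normalization means scaling non-negative scores so they sum to 1. Both are assumptions, the function names are hypothetical, and signing and on-chain submission are deliberately left out because they depend on the Bittensor tooling in use.

```python
import hashlib
import json


def bundle_digest(bundle):
    """Content hash of an evaluation bundle (tasks, rubrics, judge configs).

    Serializing with sorted keys makes the digest independent of key order,
    so validators holding the same bundle compute the same value.
    """
    canonical = json.dumps(bundle, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def normalize_weights(scores):
    """Scale per-miner scores into weights that sum to 1.

    Negative scores are clipped to 0. If every score is 0, all weights are 0.
    """
    clipped = {miner: max(0.0, score) for miner, score in scores.items()}
    total = sum(clipped.values())
    if total == 0:
        return {miner: 0.0 for miner in clipped}
    return {miner: score / total for miner, score in clipped.items()}
```

Publishing a digest like this alongside the bundle is one way any party could check that the tasks and rubrics were not changed after the freeze.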