Evaluation

Eirel's evaluation operates on three principles that no existing benchmark implements simultaneously: multi-dimensional scoring, owner-verified trustless execution, and active anti-gaming detection.

Three-Pillar Scoring Model

Pillar	Weight	Measures
Capability	0.70 – 0.76	Family-specific quality dimensions
Robustness	0.16 – 0.18	Cross-task consistency and perturbation resistance
Anti-gaming	0.08 – 0.12	Template detection, fabrication, memorization
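
One way to read the table is as a weighted sum of per-pillar scores. The sketch below is illustrative only: the function name, the [0, 1] score scale, the requirement that weights sum to 1, and the example weights are assumptions, not details confirmed by Eirel. Only the weight ranges come from the table.

```python
# Weight ranges taken from the table above. Everything else in this sketch
# (names, score scale, sum-to-1 rule) is an assumption for illustration.
WEIGHT_RANGES = {
    "capability": (0.70, 0.76),
    "robustness": (0.16, 0.18),
    "anti_gaming": (0.08, 0.12),
}


def composite_score(pillar_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-pillar scores in [0, 1] into one weighted score."""
    if set(pillar_scores) != set(WEIGHT_RANGES) or set(weights) != set(WEIGHT_RANGES):
        raise ValueError("expected exactly the three pillars")
    for name, weight in weights.items():
        low, high = WEIGHT_RANGES[name]
        if not low <= weight <= high:
            raise ValueError(f"{name} weight {weight} is outside [{low}, {high}]")
    # Assumption: a family's weights are chosen so that they sum to 1.
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * pillar_scores[name] for name in WEIGHT_RANGES)


# Hypothetical weights for one family, chosen inside the published ranges.
example_weights = {"capability": 0.74, "robustness": 0.16, "anti_gaming": 0.10}
example_scores = {"capability": 0.8, "robustness": 0.5, "anti_gaming": 1.0}
example_total = composite_score(example_scores, example_weights)
```

Validating the weights against the published ranges keeps a misconfigured family from silently shifting emphasis away from capability.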

Anti-Gaming Detection

Eirel implements 12+ family-specific detectors targeting trace fabrication, template responses, memorization, citation recycling, code obfuscation, and hardcoded answers. Trace fabrication — where an agent lies about its own results — zeroes the entire score.
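
The zeroing rule can be sketched as a set of detectors whose findings are checked before a score is released. The two toy detectors, the `Submission` fields, and all names here are hypothetical. The real detectors are family-specific and not described in this section; only the rule that trace fabrication zeroes the entire score comes from the text.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Submission:
    claimed_result: str         # what the agent reports its run produced
    recorded_result: str        # what the validator's own trace recorded
    responses: tuple[str, ...]  # final answers across distinct tasks


def detect_trace_fabrication(sub: Submission) -> bool:
    # Toy check: the agent's reported result disagrees with the recorded trace.
    return sub.claimed_result != sub.recorded_result


def detect_template_responses(sub: Submission) -> bool:
    # Toy check: the same answer was returned for every distinct task.
    return len(sub.responses) > 1 and len(set(sub.responses)) == 1


DETECTORS: dict[str, Callable[[Submission], bool]] = {
    "trace_fabrication": detect_trace_fabrication,
    "template_responses": detect_template_responses,
}


def run_detectors(sub: Submission) -> set[str]:
    """Return the names of all detectors that fired."""
    return {name for name, detect in DETECTORS.items() if detect(sub)}


def final_score(score: float, findings: set[str]) -> float:
    """Trace fabrication zeroes the entire score; other findings are left
    to the anti-gaming pillar, which this sketch does not model."""
    if "trace_fabrication" in findings:
        return 0.0
    return score
```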

Calibration and Judging

Miners must pass calibration gates across 80+ metrics with per-family thresholds before promotion. A consistency gate also requires 2 passes in the last 3 evaluation runs, so a single lucky run cannot trigger promotion.
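
The consistency gate reduces to a sliding-window count. A minimal sketch, assuming each evaluation run is summarized as a single pass/fail boolean, ordered oldest to newest. The function name is hypothetical, and the handling of miners with fewer than three runs is an assumption.

```python
def passes_consistency_gate(run_results: list[bool], window: int = 3, required: int = 2) -> bool:
    """Return True if at least `required` of the last `window` runs passed.

    `run_results` is ordered oldest to newest. With fewer than `window`
    runs, only the available runs are counted (an assumption).
    """
    recent = run_results[-window:]
    return sum(recent) >= required
```

Under this reading, a history of `[False, True, True]` clears the gate, while a lone `[True]` does not.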

For subjective quality dimensions, an ensemble of two independent LLM judges provides scoring with disagreement detection and deterministic fallback.
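
A two-judge ensemble with disagreement detection might look like the sketch below. The disagreement threshold of 0.2, the averaging rule, and the shape of the fallback are assumptions; the section states only that two independent judges are used, that disagreement is detected, and that a deterministic fallback exists.

```python
from typing import Callable


def ensemble_score(
    judge_a: float,
    judge_b: float,
    deterministic_fallback: Callable[[], float],
    max_gap: float = 0.2,  # hypothetical disagreement threshold
) -> tuple[float, bool]:
    """Return (score, disagreed) for two judge scores in [0, 1].

    If the judges differ by more than `max_gap`, their scores are discarded
    and the deterministic fallback (e.g. a rule-based metric) is used.
    """
    if abs(judge_a - judge_b) > max_gap:
        return deterministic_fallback(), True
    return (judge_a + judge_b) / 2, False
```

Passing the fallback as a callable means the deterministic scorer runs only when the judges actually disagree.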

Evaluation Flow

1. The owner freezes an immutable evaluation bundle (tasks, rubrics, judge configs).
2. Validators invoke all candidate deployments against the benchmark tasks.
3. Responses are scored through the multi-dimensional framework.
4. Anti-gaming detection runs using 12+ detectors.
5. Protocol compliance checks and calibration gates (80+ metrics) are applied.
6. The weight setter normalizes scores, signs them, and submits them to the Bittensor chain.
7. TAO emissions flow to the top-performing miners per family.
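
The normalization part of the weight-setting step can be sketched as follows. This is an assumption about how scores might become weights (proportional normalization to a sum of 1); the signing and on-chain submission are deliberately omitted because their interfaces are not described here.

```python
def normalize_weights(scores: dict[str, float]) -> dict[str, float]:
    """Turn per-miner scores for one family into weights that sum to 1.

    Negative scores are clipped to zero. If every score is zero, all
    weights are zero rather than dividing by zero.
    """
    clipped = {miner: max(score, 0.0) for miner, score in scores.items()}
    total = sum(clipped.values())
    if total == 0:
        return {miner: 0.0 for miner in clipped}
    return {miner: score / total for miner, score in clipped.items()}
```

A miner whose score was zeroed by anti-gaming detection ends up with zero weight under this scheme.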