Evaluation

Eirel's evaluation operates on three principles that no existing benchmark implements simultaneously: multi-dimensional scoring, owner-verified trustless execution, and active anti-gaming detection.

Three-Pillar Scoring Model

Pillar        Weight        Measures
Capability    0.70 – 0.76   Family-specific quality dimensions
Robustness    0.16 – 0.18   Cross-task consistency and perturbation resistance
Anti-gaming   0.08 – 0.12   Template detection, fabrication, memorization
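
To illustrate how the three pillars might combine into one number, here is a minimal sketch of a weighted sum. It assumes that each pillar score lies in [0, 1] and that the weights chosen for a family sum to 1 (the range endpoints in the table are consistent with that, but the source does not state it). The names `PillarWeights` and `composite_score` and the example weights are hypothetical, not Eirel's actual API or configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PillarWeights:
    """Hypothetical per-family weights; names are illustrative only."""
    capability: float
    robustness: float
    anti_gaming: float

    def __post_init__(self) -> None:
        total = self.capability + self.robustness + self.anti_gaming
        # Assumption: the weights for one family sum to 1.
        if abs(total - 1.0) > 1e-9:
            raise ValueError(f"weights must sum to 1, got {total}")


def composite_score(w: PillarWeights, capability: float,
                    robustness: float, anti_gaming: float) -> float:
    """Weighted sum of the three pillar scores (each assumed to be in [0, 1])."""
    for name, s in (("capability", capability),
                    ("robustness", robustness),
                    ("anti_gaming", anti_gaming)):
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"{name} score must be in [0, 1], got {s}")
    return (w.capability * capability
            + w.robustness * robustness
            + w.anti_gaming * anti_gaming)


# Example weights picked from inside the published ranges (an assumption).
example = PillarWeights(capability=0.72, robustness=0.17, anti_gaming=0.11)
print(composite_score(example, capability=0.9, robustness=0.8, anti_gaming=1.0))
```

Validating the weights at construction time keeps a misconfigured family from silently inflating or deflating every score.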

Anti-Gaming Detection

Eirel implements 12+ family-specific detectors targeting trace fabrication, template responses, memorization, citation recycling, code obfuscation, and hardcoded answers. Trace fabrication, where an agent lies about its own results, zeroes the entire score.
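
The zeroing rule for trace fabrication can be sketched as follows. This is only an illustration: the detector label `trace_fabrication` and the function name are hypothetical, and because the source does not say how the other detectors affect the score, the sketch leaves the score unchanged for them.

```python
from collections.abc import Iterable

# Hypothetical label for the trace-fabrication detector.
TRACE_FABRICATION = "trace_fabrication"


def apply_fabrication_rule(score: float, triggered: Iterable[str]) -> float:
    """Return 0.0 if trace fabrication was detected, else the score unchanged.

    `triggered` holds the labels of detectors that fired for a submission.
    Penalties for other detectors are not specified in the source, so they
    are deliberately not modeled here.
    """
    if TRACE_FABRICATION in set(triggered):
        return 0.0
    return score


print(apply_fabrication_rule(0.83, ["template_response"]))
print(apply_fabrication_rule(0.83, ["trace_fabrication", "memorization"]))
```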

Calibration and Judging

Miners must pass calibration gates across 80+ metrics with per-family thresholds before promotion. A consistency gate requires 2 passes in the last 3 evaluation runs, so a single lucky run cannot earn promotion.
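
A minimal sketch of the consistency gate, assuming each evaluation run reduces to a pass/fail boolean ordered oldest to newest, and reading "2 passes" as "at least 2". The function name and parameters are hypothetical.

```python
from collections.abc import Sequence


def passes_consistency_gate(run_results: Sequence[bool],
                            required: int = 2, window: int = 3) -> bool:
    """True if at least `required` of the last `window` runs passed.

    `run_results` is ordered oldest to newest. A miner with fewer than
    `required` runs can never pass, so one lucky run is not enough.
    """
    recent = run_results[-window:]
    return sum(1 for passed in recent if passed) >= required


print(passes_consistency_gate([True]))                      # one run only
print(passes_consistency_gate([False, True, True]))         # 2 of last 3
print(passes_consistency_gate([True, True, False, False]))  # 1 of last 3
```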

For subjective quality dimensions, an ensemble of two independent LLM judges scores each response, with disagreement detection and a deterministic fallback.
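
One way a two-judge ensemble with disagreement detection could work is sketched below. The source does not describe the actual rule, so everything here is an assumption: judge scores are taken to be in [0, 1], agreement is resolved by averaging, disagreement is defined by a hypothetical `max_gap` threshold, and the deterministic fallback is a precomputed rule-based score passed in by the caller.

```python
def ensemble_score(judge_a: float, judge_b: float,
                   fallback: float, max_gap: float = 0.2) -> tuple[float, bool]:
    """Combine two judge scores; return (score, disagreement_flag).

    If the judges differ by more than `max_gap` (hypothetical threshold),
    the deterministic `fallback` score is used and the flag is True.
    Otherwise the two scores are averaged and the flag is False.
    """
    if abs(judge_a - judge_b) > max_gap:
        return fallback, True
    return (judge_a + judge_b) / 2, False


score, disagreed = ensemble_score(0.8, 0.7, fallback=0.5)
print(score, disagreed)

score, disagreed = ensemble_score(0.9, 0.3, fallback=0.5)
print(score, disagreed)
```

Returning the flag alongside the score lets a caller log or audit disagreements rather than hiding them behind the fallback value.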

Evaluation Flow

  1. Owner freezes an immutable evaluation bundle (tasks, rubrics, judge configs)
  2. Validators invoke all candidate deployments against benchmark tasks
  3. Responses are scored through the multi-dimensional framework
  4. Anti-gaming detection runs using 12+ detectors
  5. Protocol compliance checks and calibration gates (80+ metrics) are applied
  6. Weight setter normalizes scores, signs, and submits to Bittensor chain
  7. TAO emissions flow to top-performing miners per family
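
The normalization part of step 6 can be sketched as below. The source does not specify the normalization method, so this assumes a simple scheme in which negative scores are clipped to zero and the remaining scores are scaled to sum to 1. Signing and chain submission are omitted, and the function name and miner identifiers are hypothetical.

```python
def normalize_scores(scores: dict[str, float]) -> dict[str, float]:
    """Scale non-negative miner scores so they sum to 1.

    Negative scores are clipped to 0 (an assumption). If every score is 0,
    all weights are 0 rather than dividing by zero.
    """
    clipped = {miner: max(0.0, s) for miner, s in scores.items()}
    total = sum(clipped.values())
    if total == 0:
        return {miner: 0.0 for miner in clipped}
    return {miner: s / total for miner, s in clipped.items()}


weights = normalize_scores({"miner_a": 3.0, "miner_b": 1.0, "miner_c": 0.0})
print(weights)
```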

© 2026 Eirel Network. Subnet 36 on Bittensor.
