What's attested

Every run pins 7 hashes.

Most "verified" benchmark sites just sign a number. We sign the code that produced the number. Every run.json embeds a 7-field provenance block. Most fields are SHA-256 hashes (the exceptions are the claimed score, a fixed-precision value, and the runner commit, a Git SHA), and each one can be re-derived by anyone with our public source. The on-chain ZK batch (when anchored on Aligned) commits to a single digest over all seven, so a single on-chain record pins the entire test setup.

The 7 hashes

1 · benchmark

Dataset hash · datasetHash

SHA-256 of the canonical HuggingFace sample set. If anyone changes which problems we ran against, this hash changes and the proof breaks. Verify by re-fetching the dataset and hashing it.
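
As a rough illustration of why swapping a problem changes datasetHash, here is a minimal sketch. The serialization is an assumption (one compact, key-sorted JSON object per line, in order); the real canonical form is whatever the adapter defines, and the sample records and the dataset_hash helper are hypothetical.

```python
import hashlib
import json


def dataset_hash(samples):
    """SHA-256 over an ASSUMED canonical serialization of the sample set.

    Assumption: each sample is a JSON-serializable dict, written as compact
    JSON with sorted keys, one per line, in the order given.
    """
    h = hashlib.sha256()
    for sample in samples:
        line = json.dumps(sample, sort_keys=True, separators=(",", ":"),
                          ensure_ascii=False)
        h.update(line.encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()


# Hypothetical two-problem sample set.
samples = [
    {"task_id": "demo/0", "prompt": "def add(a, b):", "answer": "return a + b"},
    {"task_id": "demo/1", "prompt": "def neg(a):", "answer": "return -a"},
]
original = dataset_hash(samples)

# Replace the second problem with a different one.
swapped = samples[:1] + [
    {"task_id": "demo/2", "prompt": "def ident(a):", "answer": "return a"}
]
tampered = dataset_hash(swapped)

print(original != tampered)  # True: a different problem set yields a different hash
```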

2 · benchmark

Methodology hash · methodologyHash

SHA-256 of the scoring rules (e.g. "exact match on stripped output," "pass@1 on Python AST equivalence"). Pinning this means we can't quietly soften the grader between runs.

3 · execution

Transcript Merkle root · transcriptMerkleRoot

Root of a Merkle tree over every (prompt, response, judge_verdict) tuple. The SP1 program in our zkVM re-derives this root from the leaves and asserts equality, so a counterfeit transcript would fail to prove.
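
The idea can be sketched in a few lines of Python. This is not the SP1 program's actual tree layout: the leaf encoding (length-prefixed fields), the 0x00/0x01 domain-separation bytes, and the duplicate-last-node rule for odd levels are all assumptions chosen for illustration.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_hash(prompt: str, response: str, verdict: str) -> bytes:
    # Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
    parts = []
    for field in (prompt, response, verdict):
        raw = field.encode("utf-8")
        parts.append(len(raw).to_bytes(8, "big") + raw)
    return _h(b"\x00" + b"".join(parts))  # 0x00 marks a leaf (assumption)


def merkle_root(leaves):
    if not leaves:
        return _h(b"")
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd levels (assumption)
        level = [_h(b"\x01" + level[i] + level[i + 1])  # 0x01 marks an inner node
                 for i in range(0, len(level), 2)]
    return level[0]


# Hypothetical three-entry transcript.
transcript = [
    ("2+2?", "4", "pass"),
    ("3*3?", "9", "pass"),
    ("5-1?", "3", "fail"),
]
root = merkle_root([leaf_hash(*t) for t in transcript])

# Flip one judge verdict from "fail" to "pass".
forged = list(transcript)
forged[2] = ("5-1?", "3", "pass")
forged_root = merkle_root([leaf_hash(*t) for t in forged])

print(root != forged_root)  # True: one changed verdict changes the root
```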

4 · execution

Claimed score · score

The value that becomes the public number. Pinned to 6 decimal places inside the proof so floating-point drift can't hide a different result.
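
A minimal sketch of fixed-precision pinning, assuming the score is a simple passed/total ratio and that rounding is half-even. Both are assumptions, and pin_score is a hypothetical helper rather than the runner's own function.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def pin_score(passed: int, total: int) -> str:
    """Return the score as a string with exactly 6 decimal places."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Decimal arithmetic avoids binary floating-point representation error.
    score = Decimal(passed) / Decimal(total)
    return str(score.quantize(Decimal("0.000001"), rounding=ROUND_HALF_EVEN))


print(pin_score(2, 3))  # 0.666667
print(pin_score(1, 2))  # 0.500000
```

Committing to the string form means two machines that disagree in the 15th digit of a float still produce the same committed value.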

5 · code state

Runner commit · runner_provenance.runner_commit

Git SHA of the runner repo commit that produced this run. git checkout $runner_commit reproduces the exact source tree.

6 · code state

Adapter hash · runner_provenance.adapter_hash

SHA-256 of the adapter source file (e.g. runner/adapters/humaneval.py) that loaded the dataset, queried the model, and graded responses. Adapter source is MIT-licensed and on GitHub.

7 · code state

Judge / lockfile / template hashes

SHA-256 of the judge function (where applicable), the requirements.txt lockfile, the system prompt, and the chat template. These are bundled, together with the runner version, runner commit, and adapter hash, into runner_provenance.digest, which becomes a public input to the SP1 proof.

The on-chain commitment

When a run is anchored via Aligned Layer, the SP1 zkVM proof commits to one final digest derived from all 7 hashes:

final_digest = SHA-256(
  dataset_hash | methodology_hash | merkle_root | claimed_score | runner_provenance.digest
)

That single digest is what Aligned's BatcherPaymentService records on Ethereum L1. To pass off a forged run as genuine, an adversary would have to either: (a) break the soundness of the SP1 zkVM (hard), (b) find a SHA-256 collision (computationally infeasible), or (c) fabricate a different runner repo that hashes to the same commit ID (infeasible without breaking Git's object hash, which is SHA-1 by default rather than SHA-256).
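
The digest formula above can be sketched in Python. This assumes the `|` in the formula is a literal separator between UTF-8 strings (the same convention the runner_provenance.digest script uses) and that the score enters as its 6-decimal string; the byte layout inside the SP1 program may differ, and the input values below are placeholders.

```python
import hashlib


def final_digest(dataset_hash: str, methodology_hash: str, merkle_root: str,
                 claimed_score: str, provenance_digest: str) -> str:
    # Assumption: fields are joined with a literal "|" and hashed as UTF-8.
    fields = [dataset_hash, methodology_hash, merkle_root,
              claimed_score, provenance_digest]
    return hashlib.sha256("|".join(fields).encode("utf-8")).hexdigest()


# Placeholder inputs, not real run values.
honest = final_digest("aa", "bb", "cc", "0.500000", "dd")
inflated = final_digest("aa", "bb", "cc", "0.500001", "dd")

print(honest != inflated)  # True: changing any one field changes the digest
```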

How to re-verify any run

git clone https://github.com/benchlist/runner
cd runner
git checkout <runner_commit_from_run.json>

# 1. Re-hash the adapter
sha256sum adapters/<benchmark_id>.py
# → must match runner_provenance.adapter_hash

# 2. Re-hash the lockfile
sha256sum requirements.txt
# → must match runner_provenance.lockfile_hash

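# 3. Re-derive the digest
# (first set RUNNER_VERSION, RUNNER_COMMIT, ADAPTER_HASH, etc. in your shell
#  from the values in run.json; the double-quoted script below expands them)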
python -c "
import hashlib
fields = ['$RUNNER_VERSION', '$RUNNER_COMMIT', '$ADAPTER_HASH', '$JUDGE_HASH',
          '$LOCKFILE_HASH', '$SYSPROMPT_HASH', '$TEMPLATE_HASH']
print('sha256:' + hashlib.sha256('|'.join(fields).encode()).hexdigest())
"
# → must match runner_provenance.digest

If any of these values doesn't match, the run is invalid. File a dispute at /disputes; upheld disputes earn a 0.02 ETH bounty.
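
The three checks above could be scripted roughly as follows. This is a sketch under assumptions: the keys runner_version, judge_hash, sysprompt_hash, and template_hash are guessed names for fields in runner_provenance, stored hashes may or may not carry a "sha256:" prefix (the sketch strips it before comparing), and verify_run is a hypothetical helper rather than part of the runner.

```python
import hashlib
from pathlib import Path


def _norm(value: str) -> str:
    # Drop an optional "sha256:" prefix so both storage styles compare equal.
    return value.lower().split(":", 1)[-1]


def sha256_file(path) -> str:
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def provenance_digest(prov: dict) -> str:
    # Same field order and "|" separator as the shell snippet above.
    keys = ["runner_version", "runner_commit", "adapter_hash", "judge_hash",
            "lockfile_hash", "sysprompt_hash", "template_hash"]
    joined = "|".join(prov[k] for k in keys)
    return hashlib.sha256(joined.encode()).hexdigest()


def verify_run(prov: dict, repo_dir, benchmark_id: str) -> list:
    """Return the names of checks that fail (an empty list means all match)."""
    repo = Path(repo_dir)
    actual = {
        "adapter_hash": sha256_file(repo / "adapters" / f"{benchmark_id}.py"),
        "lockfile_hash": sha256_file(repo / "requirements.txt"),
        "digest": provenance_digest(prov),
    }
    return [name for name, value in actual.items()
            if _norm(prov[name]) != value]
```

Run it from a checkout at runner_commit, passing the runner_provenance object from run.json as prov; any name in the returned list is a mismatch worth disputing.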

What we don't sign

We don't sign the model weights themselves. Open-weight models are reproducible from their HuggingFace commit. Closed models (Claude, GPT) are pinned only by the API model string, so the provider could quietly swap the underlying weights without us noticing. Detecting exactly that is what Provider Verified is designed for, via continuous canonical-vs-host drift attestation.