Methodology

What a Benchlist attestation actually proves.

Every score on this site is a composite of three orthogonal guarantees — cryptographic integrity, economic accountability, and social replay. This page spells out exactly what each one covers, and where each one ends. No handwaving.

01 · Three layers, composed.

A Benchlist attestation is not one claim. It’s three layered claims, each covering what the others can’t. The composition is what makes a number trustworthy.

Layer 1 · Cryptographic
ZK proof of scoring
An SP1 zkVM proof that the pinned scoring function applied to the pinned dataset over the committed transcripts produces the claimed score. Settled on Ethereum L1 via Aligned Layer.
Layer 2 · Economic
Attestor signature + stake
An Ed25519 signature from a staked attestor saying “I queried this service with this config and these are the responses I saw.” Stake is slashed on upheld disputes.
Layer 3 · Social
Community replay
Every run publishes a replay.command. Anyone can rerun bit-for-bit. Divergence > 2σ opens a dispute and risks slashing the original attestor.

No single layer is enough alone. ZK proofs can’t know whether an attestor faked transcripts. Attestor stake can’t prevent math errors in the scorer. Replay can’t happen on private test sets. Together they cover the space.

02 · What the ZK proof proves.

A Benchlist ZK proof is generated inside an SP1 (or Risc0) zkVM. The zkVM runs the scoring function — the real one, bit-for-bit — and emits a proof that the output is correct given the committed inputs.

✓ Proven cryptographically

  • The scoring function at methodologyHash was applied
  • To a dataset matching datasetHash
  • Over transcripts matching the transcriptMerkleRoot
  • Producing this specific score
  • Not some other score that could be silently substituted for it
  • Aggregation (mean, pass-rate, etc.) is arithmetic-correct

✗ Not proven by ZK alone

  • That the transcripts came from the real service (vs. fabricated)
  • That the model config matches its public claim
  • That the scoring function is semantically correct — only that it’s deterministically correct
  • That the benchmark isn’t contaminated

This is the point. ZK gives you “no math errors, no silent substitution.” It does not give you “this number is a good measure of intelligence.” The latter question is semantic; the former is cryptographic; we only promise the former.
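The transcriptMerkleRoot in the proven list above can be sketched as a plain binary Merkle tree over transcript hashes. This is illustrative only — the real leaf encoding and odd-node padding rule are defined by the runner, and the names here are assumptions:

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Binary Merkle root over transcript bytes (illustrative sketch;
    the runner's actual leaf encoding and padding rule may differ)."""
    level = [sha256(leaf) for leaf in leaves]
    if not level:
        return sha256(b"")
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last node on odd-sized levels
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

transcripts = [b"transcript-0", b"transcript-1", b"transcript-2"]
root = merkle_root(transcripts)
```

Changing a single transcript byte changes the root, which is why signing the root (Layer 2) prevents post-hoc swaps.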

03 · What the attestor signature proves.

The ZK proof assumes transcripts are real. An attestor signature + stake is what makes that assumption costly to break.

Every run is signed by a registered attestor’s Ed25519 key. The signing payload is the full Merkle root, so the attestor can’t post-hoc swap transcripts. The attestor has ≥1 ETH staked in StakeVault; a dispute upheld by community replay slashes the stake.

  • What this buys you: A known party economically bets that the transcripts are real.
  • What this does not buy you: Absolute certainty. A well-funded adversary could fake transcripts and forfeit the stake. Defense: multiple independent attestors (Quorum mode, +50%), TEE attestation for confidential evals (+$999), and the social-replay layer below.

04 · What community replay proves.

Every run publishes a replay.command. Anyone with access to the service API and the pinned dataset can rerun it. If someone’s fresh run diverges from the attested score by more than 2σ, that’s a dispute-worthy signal.

Disputes cost 0.1 ETH to file (anti-spam bond, refunded on valid disputes). An accepted dispute slashes up to 100% of the original attestor’s stake, annuls the score, and flags the service listing.
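The 2σ divergence test above can be sketched as a plain statistics check. How σ is estimated on-chain is not specified here, so the sample-stddev-of-replays choice is an assumption:

```python
import statistics

def is_dispute_worthy(attested: float, replays: list[float]) -> bool:
    """Flag when an attested score sits more than 2 sigma from fresh
    replay runs. Sigma here is the sample stddev of the replays --
    the protocol's exact estimator is an assumption."""
    mean = statistics.mean(replays)
    sigma = statistics.stdev(replays)
    return abs(attested - mean) > 2 * sigma
```

A filer would run this against their own replays before posting the 0.1 ETH bond.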

05 · Sufficiency — exactly what it takes.

For a Benchlist score to be sufficient — meaning a reasonable buyer can treat it as dispositive — the following conditions compose:

  1. Pinned dataset. datasetHash resolves to a public IPFS object. You can re-download and re-hash.
  2. Pinned methodology. methodologyHash is a specific git commit of the runner repo. The repo is MIT-licensed and forkable.
  3. ZK proof verified on-chain. The proof landed in an Aligned Layer batch whose root is committed to Ethereum mainnet (via ServiceManager.verifyBatchInclusion at 0xeF2A…606c).
  4. Attestor staked & not slashed. The signing attestor holds ≥1 ETH in StakeVault with no pending disputes.
  5. Replay window elapsed. At least 72 hours have passed since the attestation without a successful dispute. This grace window is when a cheap adversarial fake gets caught by replay.
  6. Contamination-aware. For any benchmark on the known-contaminated list (HumanEval above GPT-4-class, MBPP above Claude-3-class), the score carries a lower-bound flag rather than a rank-order flag.

Conditions 1-4 are cryptographic/economic. Condition 5 is temporal. Condition 6 is editorial. The union is what we mean when we show a Verified ⛓ badge.
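The six conditions compose as a plain conjunction. A minimal sketch (field names are illustrative, not the API's):

```python
from dataclasses import dataclass

@dataclass
class Run:
    dataset_pinned: bool             # 1. datasetHash resolves on IPFS
    methodology_pinned: bool         # 2. methodologyHash is a specific commit
    proof_verified: bool             # 3. ZK proof committed via Aligned Layer
    attestor_in_good_standing: bool  # 4. staked, no pending disputes
    hours_since_attestation: float   # 5. replay window
    contamination_handled: bool      # 6. lower-bound flag where required

def is_sufficient(run: Run) -> bool:
    """All six sufficiency conditions must hold (illustrative check)."""
    return (run.dataset_pinned and run.methodology_pinned
            and run.proof_verified and run.attestor_in_good_standing
            and run.hours_since_attestation >= 72
            and run.contamination_handled)
```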

06 · Explicit gaps we don’t cover (yet).

A truthful methodology page lists its own holes. Ours:

  • Model identity. We can prove a specific API endpoint was queried, not that the weights match the vendor’s public claim. Catching a silent model swap is vendor-cooperation or TEE-attestation territory.
  • Prompt steering. Benchmarks with few-shot or chain-of-thought prompts are sensitive to prompt phrasing. Our methodology pins the exact prompts used. Different prompts → different benchmark, not a comparable score.
  • Temperature & sampling. Pinned in the methodology. Runs that vary must annotate.
  • Judge drift. LLM-judged benchmarks pin the judge model by fingerprint. If the judge model silently changes behind an API, we’ll re-hash and fork the benchmark.

When a buyer asks “should I trust this number?” the answer is: to the extent these gaps matter for your use case. A research-grade bench comparison tolerates them. A compliance audit may not; for that we offer Private benchmarks + TEE attestation (see Enterprise).

07 · Dataset pinning — the inputs.

Upstream benchmark repos get re-versioned constantly: questions are added, judges get fixed, labels drift. We snapshot, serialize canonically (JSON-Lines, sorted keys, UTF-8 NFC, LF line endings), compute SHA-256 over the concatenation, and pin. Every run references that hash. Raw bytes live on IPFS.

Changing a single character in the canonical serialization changes the hash. Any downstream proof that references the old hash fails verification automatically.
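The canonicalization steps above can be sketched in a few lines. The separator conventions and the trailing newline are assumptions; only the rules the text names (JSON-Lines, sorted keys, UTF-8 NFC, LF, SHA-256 over the concatenation) come from the methodology:

```python
import hashlib
import json
import unicodedata

def canonical_dataset_hash(records: list[dict]) -> str:
    """Serialize records as canonical JSON-Lines (sorted keys, UTF-8 NFC,
    LF line endings) and hash the concatenation with SHA-256."""
    lines = []
    for record in records:
        line = json.dumps(record, sort_keys=True, ensure_ascii=False,
                          separators=(",", ":"))
        lines.append(unicodedata.normalize("NFC", line))
    blob = ("\n".join(lines) + "\n").encode("utf-8")
    return "sha256:" + hashlib.sha256(blob).hexdigest()
```

Because keys are sorted before serialization, two snapshots with the same content but different key order hash identically, while a one-character edit produces a different hash.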

08 · Methodology pinning — the code.

The runner is a pipx-installable Python package (benchlist-runner). Each run pins:

  • runnerRepo: URL of the source repo
  • runnerCommit: seven-character short SHA of the commit
  • runnerVersion: semver tag for human reference

The committed Merkle root includes a hash of the compiled runner binary too — so a compile-time substitution is caught.

09 · Decoding params.

Every benchmark declares canonical decoding: temperature, top-p, max tokens, stop sequences, presence/frequency penalties, system prompt. Runs that deviate carry a non-canonical flag and do not show on the default leaderboard.
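The non-canonical flag reduces to a parameter diff. The canonical values below are hypothetical placeholders — each benchmark declares its own:

```python
# Hypothetical canonical decoding for one benchmark; real values are
# declared per benchmark on its page.
CANONICAL = {
    "temperature": 0.0, "top_p": 1.0, "max_tokens": 2048,
    "stop": [], "presence_penalty": 0.0, "frequency_penalty": 0.0,
}

def decoding_flags(run_params: dict) -> list[str]:
    """Names of parameters deviating from canonical decoding; a
    non-empty result means the run carries the non-canonical flag."""
    return sorted(k for k, v in CANONICAL.items()
                  if run_params.get(k, v) != v)
```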

10 · LLM judges.

Judge-required benchmarks (LongMemEval, FRAMES) pin the judge model by exact fingerprint and the judge prompt by hash. The same judge is used by every attestor for that benchmark version; if the upstream judge model rotates (e.g. OpenAI updates gpt-4o), we fork the benchmark to a new methodologyHash rather than silently accepting drift.

11 · Training-set contamination.

Benchmarks leak into training sets over time. We flag runs as contaminated where public evidence (model release notes, third-party analysis) suggests this. HumanEval and MBPP are flagged above GPT-4-class models. Contaminated scores are shown as lower bounds, not rank-order signals.

12 · Eval gaming resistance.

Benchmark gaming includes: cherry-picking runs, tuning to the test set, prompt engineering against a specific leaderboard. Our defenses:

  • Runs are published in order of attestation, not score. No post-hoc selection.
  • Each run commits to the full transcript, not just the score. Cherry-picking shows up as truncated transcripts.
  • Disputes can cite specific problems. If a run’s score comes disproportionately from contaminated items, the dispute stands.

13 · Attestor sybil resistance.

Attestors post 1-5 ETH stake in StakeVault. At ETH ≈ $3,600 that’s a $3,600-$18,000 sybil barrier per attestor identity. Quorum mode requires 3-of-5 independent attestors, raising the bar to collusion across multiple staked identities. For compliance-grade work (regulatory, insurance), Benchlist Enterprise operates a vetted pool of KYC’d attestors with additional legal recourse.

14 · Replay command.

Every run exposes a single command that reproduces it:

benchlist run longmemeval \
  --service rem-labs \
  --model claude-opus-4-7 \
  --runs 3 \
  --dataset-hash sha256:a1b3… \
  --methodology-hash sha256:c3d5…

If your score diverges from the attested one by > 2σ, file a dispute. Your 0.1 ETH bond is refunded when the dispute is upheld.

15 · Dispute & slash.

Dispute resolution is on-chain via DisputeManager. The resolution function runs a re-scoring inside SP1 using the disputant’s fresh transcripts. If the re-score diverges, the original attestor’s stake is slashed proportionally to the deviation. The bond system is modeled on Optimism’s fault-proof design — adversarial replay as the ultimate cheap gatekeeper.
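“Slashed proportionally to the deviation” could take many shapes; one plausible curve, purely as a sketch (DisputeManager’s actual formula is not published here), is linear in the excess over the 2σ threshold, capped at the full stake:

```python
def slash_amount(stake_eth: float, deviation_sigma: float,
                 threshold_sigma: float = 2.0) -> float:
    """Illustrative slashing curve: nothing at or below the dispute
    threshold, then linear in the excess deviation, capped at the
    full stake. NOT the DisputeManager's actual formula."""
    if deviation_sigma <= threshold_sigma:
        return 0.0
    excess = deviation_sigma - threshold_sigma
    return min(stake_eth, stake_eth * excess / threshold_sigma)
```

Under this sketch a 3σ divergence costs half the stake and a 4σ divergence forfeits all of it.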

Per-benchmark specifics

Every benchmark page carries its exact datasetHash, methodologyHash, runner repo, canonical decoding, judge config, and contamination flag. Every run page also shows the client-side recomputation of the commitment — in your browser, using SubtleCrypto, so you don’t have to trust us to display the numbers.