For open-weight model labs

Ship your release with a receipt. $99.

Mistral, DeepSeek, Qwen, Liquid, Hermes, AI21, Cohere, Snowflake, Databricks. If you publish open-weight models, your launch numbers compete with frontier vendors who own their narrative. Benchlist gives you a third-party signed leaderboard receipt within 4 hours of payment: an embeddable badge for your Hugging Face model card and an OG-tagged proof page that previews in every Slack thread your model lands in.

Buy launch certificate · $99 → Or queue a release → See attested launches →

n=50 across 8 benchmarks · Ed25519-signed receipts · 4-hour turnaround · Money back if we miss the SLA

Launch certificate

$99 · pay once

Get receipts for one model release. Includes n=50 runs on 8 canonical benchmarks (GSM8K, MMLU-Pro, GPQA, ARC-Challenge, HellaSwag, Winogrande, OpenBookQA, MATH-500), one Ed25519-signed proof page per benchmark, and an embeddable SVG badge for your model card. Delivered within 4 hours of payment or your money back.

Stripe Checkout · USD · receipt emailed

What ships

n=50 GSM8K ✓

n=50 MMLU-Pro ✓

n=50 GPQA ✓

n=50 ARC-Challenge ✓

n=50 HellaSwag ✓

n=50 Winogrande ✓

n=50 OpenBookQA ✓

n=50 MATH-500 ✓

SVG badge for model card ✓

4-hour SLA or refund ✓

How it works.

Submit on release day.

Drop your Hugging Face model URL and an inference key for OpenRouter, Together, or Hugging Face Inference Endpoints. We don't run the model on your hardware and don't need GPU access.

We run n=50 each.

8 canonical benchmarks (GSM8K, MMLU-Pro, GPQA, ARC-Challenge, HellaSwag, Winogrande, OpenBookQA, MATH-500). Real Hugging Face datasets, deterministic seeded sampling.
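As a rough illustration of what "deterministic seeded sampling" can mean, here is a minimal Python sketch. The function name, the seed value, and the dataset size are hypothetical, and this is not necessarily how Benchlist draws its samples. The point is only that a fixed seed yields the same n=50 subset on every run.

```python
import random
from typing import List


def sample_indices(dataset_size: int, n: int = 50, seed: int = 0) -> List[int]:
    """Pick n distinct row indices from a dataset, reproducibly for a given seed."""
    if n > dataset_size:
        raise ValueError("cannot sample more rows than the dataset contains")
    rng = random.Random(seed)  # private generator, so global random state is untouched
    return sorted(rng.sample(range(dataset_size), n))


# Hypothetical benchmark split with 1000 rows and an arbitrary seed.
first = sample_indices(1000, n=50, seed=42)
second = sample_indices(1000, n=50, seed=42)
print(first == second)  # the same seed selects the same rows
```

Anyone who knows the dataset revision, the seed, and the sampling routine can rebuild the exact prompt set before replaying a run.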

Signed within 4 hours.

Each result is Ed25519-signed over a Merkle commitment of every transcript and stored at /verify/<id>. Browser-replayable. Reproducible offline.
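A Merkle commitment reduces all transcripts to a single root hash, so one signature covers every transcript. The sketch below is an assumption-laden illustration, not Benchlist's actual scheme: SHA-256, the leaf and node prefixes, and duplicating the last node on odd levels are all choices made here for the example. The Ed25519 signature would then be computed over the returned root; signing is left out because it needs a third-party library.

```python
import hashlib
from typing import List


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(transcripts: List[str]) -> str:
    """Return the hex Merkle root over a list of transcript strings."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    # Prefix leaves with 0x00 and inner nodes with 0x01 so a leaf can never
    # be mistaken for an inner node (assumed convention for this sketch).
    level = [_sha256(b"\x00" + t.encode("utf-8")) for t in transcripts]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd-sized levels
        level = [
            _sha256(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


# 50 placeholder transcripts standing in for one benchmark run.
transcripts = [f"prompt {i} -> completion {i}" for i in range(50)]
print(merkle_root(transcripts))
```

Changing a single character in any transcript changes the root, which would invalidate a signature made over the original root.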

Embed the badge.

Drop <img src="https://benchlist.ai/badge/<model>.svg"> into your model card. The badge shows a live attestation count and last-checked time, and links to all 8 receipts.

Why it matters.

"Self-reported" is the stigma.

Benchmarks in vendor announcements are taken with a grain of salt. Buyers wait for third-party validation. A Benchlist receipt is the third-party validation, signed and replayable.

Mistral, DeepSeek, and Qwen all face this.

Open-weight labs without a frontier-lab brand have to overcome scepticism on every release. A receipt levels the playing field. The score speaks for itself, signed.

Hugging Face downloads ≠ trust.

Download counts measure marketing, not capability. A signed attestation measures capability. Side by side on your model card, the receipt is the convincing piece.

See a real receipt.

This is what every benchmark in your launch certificate looks like once issued. Score, sample count, Wilson 95% CI, Ed25519 signature, attestor pubkey, and a 'Re-run for $0.50' button anyone can click. Hosted at /verify/<id>, OG-tagged so it previews cleanly in Slack and X.

live attested run · openbookqa

qwen3.6-27b-dense-q5km

n=50 · Wilson 95% CI ±9.6 · signed · replayable

view receipt →
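For readers who want to check the interval on a receipt, the Wilson score interval can be computed from the number of correct answers and the sample count alone. The sketch below uses the standard formula with z ≈ 1.96 for 95% confidence. The 43-out-of-50 input is a hypothetical score chosen for illustration, not a value read from any receipt.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (default z gives 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half_width, center + half_width


# Hypothetical run: 43 correct answers out of n=50.
low, high = wilson_interval(43, 50)
print(f"score 0.860, 95% CI from {low:.3f} to {high:.3f}")
```

At n=50 the half-width for this hypothetical score is roughly 9.6 percentage points. The interval is centered slightly below the raw score rather than on it, so a single ± figure is a summary of the width, not an exact symmetric bound.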

Common questions.

What if my model has a custom chat template / system prompt?

We pin the canonical chat template per benchmark. If your model needs a non-standard one, send the Jinja template at submission and we run it verbatim. The template hash ships in the signed receipt so anyone replaying the run uses the same prompts.
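A template hash lets anyone replaying a run confirm they are using the same template text. The sketch below assumes SHA-256 over the UTF-8 bytes of the Jinja source; the actual hash function and any normalization Benchlist applies are not stated here, and the template string is a made-up example.

```python
import hashlib


def template_hash(jinja_source: str) -> str:
    """Hex SHA-256 of the exact template text (assumed scheme for this sketch)."""
    return hashlib.sha256(jinja_source.encode("utf-8")).hexdigest()


# Hypothetical chat template, used only to show that any edit changes the hash.
template = "{% for m in messages %}<|{{ m['role'] }}|>{{ m['content'] }}{% endfor %}"
print(template_hash(template))
print(template_hash(template) == template_hash(template + " "))  # False: even a space matters
```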

What if benchmarks are contaminated in my training set?

We surface the contamination tier per benchmark on the receipt. GSM8K is high-contamination; GPQA is low. Buyers can calibrate accordingly: we don't suppress the score, we contextualize it. See /contamination for the full index.

Can I dispute a number that looks wrong?

Yes. File at /disputes with the run id and reason. We run a re-attestation on a fresh attestor and pay 0.02 ETH if the dispute is upheld. The replay primitive ($0.50 to re-run anyone's number) is the underlying mechanism, so we don't gatekeep.

What does $99 actually cover?

All 8 benchmarks at n=50 each (400 signed prompts), Ed25519 signing per receipt, an SVG badge for your model card, an OG-tagged proof page per benchmark, and a 4-hour delivery SLA with refund-on-miss. Inference cost runs through whichever provider key you supply (OpenRouter / Together / Hugging Face Inference Endpoints). Compute is yours; the receipt is ours.

Need extra benchmarks (BigCodeBench, SWE-bench, MTEB)?

$15 per additional benchmark, n=50 default. Request at submission or after the initial 8 ship. Coding evals (BigCodeBench, LCB, SWE-bench Lite) and embedding evals (MTEB) are wired and ready. Email dev@remlabs.ai for quotes.
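The arithmetic behind the listed prices can be sketched as follows. The function and constant names are hypothetical; the figures are the ones quoted on this page ($99 base for 8 benchmarks, $15 per extra benchmark, n=50 each). Provider inference costs are billed to your own key and are not included.

```python
BASE_PRICE_USD = 99        # launch certificate
BASE_BENCHMARKS = 8        # benchmarks included in the base package
EXTRA_BENCHMARK_USD = 15   # per additional benchmark
RUNS_PER_BENCHMARK = 50    # default n


def quote(extra_benchmarks: int = 0) -> dict:
    """Estimate the Benchlist fee and signed prompt count for a release."""
    if extra_benchmarks < 0:
        raise ValueError("extra_benchmarks must be non-negative")
    total = BASE_BENCHMARKS + extra_benchmarks
    return {
        "benchmarks": total,
        "signed_prompts": total * RUNS_PER_BENCHMARK,
        "price_usd": BASE_PRICE_USD + EXTRA_BENCHMARK_USD * extra_benchmarks,
    }


print(quote())   # base package: 8 benchmarks, 400 prompts, $99
print(quote(3))  # three extras: 11 benchmarks, 550 prompts, $144
```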

Submit your release.

Drop in your model URL and a contact email. We'll respond within an hour with a Stripe link for the $99 launch certificate. Once paid, we run the 8-benchmark package within 4 hours and email you the receipt URLs.