Citation Benchmark

A public citations benchmark built around one simple question: does the quoted passage match the cited authority? Each row accepts exactly one of four exact responses, and the grading key stays server-side.

On GitHub, we call the current release cite-bench-v1. More real-world benchmark releases are planned.

Four exact responses · Private grading key · Aggregate-only public results

cite-bench-v1 · 500 public rows · 4 exact responses

Same benchmark. Different answers.

The LawEngine reference lane and the frontier-model baseline answer the same public benchmark differently. That is the point of the benchmark.

LawEngine reference run

Deterministic verification lane

Accuracy: 100%
Macro F1: 1.0000
Weighted F1: 1.0000
Correct: 500 / 500

Deterministic verification against primary sources. No AI sits in the verification path.

Frontier AI baseline

GPT-5.4-mini on the same public pack

Accuracy: 72%
Macro F1: 0.5210
Weighted F1: 0.7320
Correct: 360 / 500

Roughly 3 out of 10 citation checks are wrong on this task (140 of 500 rows).
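The gap between the macro and weighted F1 numbers above comes from how the two averages treat rare labels: macro F1 averages per-label F1 equally, while weighted F1 weights each label by its support. A minimal pure-Python sketch of that distinction, using the benchmark's four label codes (the function name and toy data are illustrative, not part of the published scorer):

```python
from collections import Counter

LABELS = ["VERIFIED", "NOT_FOUND", "MISATTRIBUTED", "CITATION_UNRESOLVED"]

def f1_scores(y_true, y_pred):
    """Per-label F1, plus macro (unweighted) and weighted (by support) averages."""
    per_label, support = {}, Counter(y_true)
    for label in LABELS:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(p == label and t != label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_label[label] = 2 * tp / denom if denom else 0.0
    macro = sum(per_label.values()) / len(LABELS)
    weighted = sum(per_label[l] * support[l] for l in LABELS) / len(y_true)
    return per_label, macro, weighted
```

A label that is rare but badly predicted drags macro F1 down hard while barely moving weighted F1, which is one reason the baseline's macro and weighted scores diverge.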

Run locally. Score against the protected key.

01

Download the public pack

Pull the 500-row public citations benchmark and the submission template directly from the public repo.

02

Run your system locally

Emit one exact response per row from the same four-label contract used by the public scorer.

03

Upload for aggregate scoring

LawEngine scores against the protected key and returns accuracy, F1, label counts, and a confusion matrix.
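Since scoring happens server-side against the protected key, a malformed file only fails at upload time. A small local check against the two-column contract can catch that earlier; a minimal sketch (the helper name and error messages are illustrative, not part of the official tooling):

```python
import csv

VALID_LABELS = {"VERIFIED", "NOT_FOUND", "MISATTRIBUTED", "CITATION_UNRESOLVED"}

def validate_submission(path):
    """Check a submission CSV's header and labels; return (ok, problems)."""
    problems = []
    with open(path, newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader, None)
        if header != ["id", "predicted_status"]:
            problems.append(f"bad header: {header}")
        for lineno, row in enumerate(reader, start=2):
            if len(row) != 2:
                problems.append(f"line {lineno}: expected 2 columns, got {len(row)}")
            elif row[1] not in VALID_LABELS:
                problems.append(f"line {lineno}: unknown label {row[1]!r}")
    return (not problems), problems
```

Running this before upload turns a rejected submission into an actionable line-by-line report.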

Your model must choose one of four exact responses.

Each benchmark row contains a citation and a quoted passage. Your system should output a CSV with exactly two columns: id and predicted_status. The code values below are exact; the plain-English titles explain the public meaning.

VERIFIED

Verified

Use VERIFIED when the quoted passage appears in the cited authority and the citation is substantively correct.

NOT_FOUND

Not found

Use NOT_FOUND when the quoted passage cannot be found in the cited authority or in the current public benchmark corpus.

MISATTRIBUTED

Found elsewhere

Use MISATTRIBUTED when the quoted language is real but belongs to a different authority than the one provided.

CITATION_UNRESOLVED

Citation unresolved

Use CITATION_UNRESOLVED when the citation string itself cannot be tied to a live authority in the benchmark corpus.
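Putting the four codes together, a system's per-row decisions can be serialized into the two-column CSV the scorer expects. A minimal sketch, where `predictions` stands in for your own system's output (the function name is illustrative):

```python
import csv

LABELS = ("VERIFIED", "NOT_FOUND", "MISATTRIBUTED", "CITATION_UNRESOLVED")

def write_submission(predictions, path):
    """Write (row_id, label) pairs as an id,predicted_status CSV.

    Rejects any label outside the four-code contract before writing it.
    """
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["id", "predicted_status"])
        for row_id, label in predictions:
            if label not in LABELS:
                raise ValueError(f"label {label!r} is not in the four-code contract")
            writer.writerow([row_id, label])
```

Failing fast on an out-of-contract label keeps a typo like "VERIFED" from silently costing 500 scored rows.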

Score a submission

Public pack upload and aggregate scoring.
Expected header: id,predicted_status
Same-origin upload, backend-scored, aggregate results only.

Public pack essentials

Version: cite-bench-v1
Rows: 500
Formats: CSV / JSON
Scoring: Private key

Label contract

VERIFIED · NOT_FOUND · MISATTRIBUTED · CITATION_UNRESOLVED

Accepted submission formats

  • CSV upload with id,predicted_status
  • JSON for local or private tooling
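Teams that keep predictions in JSON for local tooling still upload the CSV shape. A minimal conversion sketch, assuming a JSON array of objects with "id" and "predicted_status" keys (that JSON shape is an assumption for illustration, not a published schema):

```python
import csv
import json

def json_to_csv(json_path, csv_path):
    """Convert a JSON list of {"id", "predicted_status"} objects to the CSV upload format."""
    with open(json_path) as fh:
        rows = json.load(fh)
    with open(csv_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["id", "predicted_status"])
        for row in rows:
            writer.writerow([row["id"], row["predicted_status"]])
```

The CSV side matches the expected header id,predicted_status exactly, so the converted file can go straight to the upload step.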