Citation Benchmark

A public citations benchmark built around one simple question: does the quoted passage match the cited authority? Each row accepts exactly one of four exact responses, and the grading key stays server-side.

On GitHub, we call the current release cite-bench-v1. More real-world benchmark releases are planned.

Four exact responses · Private grading key · Aggregate-only public results

cite-bench-v1 · 500 public rows · 4 exact responses

Same benchmark. Different answers.

The LawEngine reference lane and the frontier-model baseline answer the same public benchmark differently. That is the point of the benchmark.

LawEngine reference run

Deterministic verification lane

Accuracy: 100%
Macro F1: 1.0000
Weighted F1: 1.0000
Correct: 500 / 500

Deterministic verification against primary sources. No AI sits in the verification path.

Frontier AI baseline

GPT-5.4-mini on the same public pack

Accuracy: 72%
Macro F1: 0.5210
Weighted F1: 0.7320
Correct: 360 / 500

Roughly 3 out of 10 citation checks are wrong on this task (140 of 500 rows).
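The gap between the macro and weighted F1 numbers above comes from how the two averages treat rare labels: macro F1 averages per-label F1 equally, while weighted F1 weights each label by its support. A minimal pure-Python sketch of that distinction, using the benchmark's four label codes (the function name and toy data are illustrative, not part of the published scorer):

```python
from collections import Counter

LABELS = ["VERIFIED", "NOT_FOUND", "MISATTRIBUTED", "CITATION_UNRESOLVED"]

def f1_scores(y_true, y_pred):
    """Per-label F1, plus macro (unweighted) and weighted (by support) averages."""
    per_label, support = {}, Counter(y_true)
    for label in LABELS:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(p == label and t != label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_label[label] = 2 * tp / denom if denom else 0.0
    macro = sum(per_label.values()) / len(LABELS)
    weighted = sum(per_label[l] * support[l] for l in LABELS) / len(y_true)
    return per_label, macro, weighted
```

A label that is rare but badly predicted drags macro F1 down hard while barely moving weighted F1, which is one reason the baseline's macro and weighted scores diverge.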

Run locally. Score against the protected key.

01

Download the public pack

Pull the 500-row public citations benchmark and the submission template directly from the public repo.

02

Run your system locally

Emit one exact response per row from the same four-label contract used by the public scorer.

03

Upload for aggregate scoring

LawEngine scores against the protected key and returns accuracy, F1, label counts, and a confusion matrix.
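Since scoring happens server-side against the protected key, a malformed file only fails at upload time. A small local check against the two-column contract can catch that earlier; a minimal sketch (the helper name and error messages are illustrative, not part of the official tooling):

```python
import csv

VALID_LABELS = {"VERIFIED", "NOT_FOUND", "MISATTRIBUTED", "CITATION_UNRESOLVED"}

def validate_submission(path):
    """Check a submission CSV's header and labels; return (ok, problems)."""
    problems = []
    with open(path, newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader, None)
        if header != ["id", "predicted_status"]:
            problems.append(f"bad header: {header}")
        for lineno, row in enumerate(reader, start=2):
            if len(row) != 2:
                problems.append(f"line {lineno}: expected 2 columns, got {len(row)}")
            elif row[1] not in VALID_LABELS:
                problems.append(f"line {lineno}: unknown label {row[1]!r}")
    return (not problems), problems
```

Running this before upload turns a rejected submission into an actionable line-by-line report.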

Your model must choose one of four exact responses.

Each benchmark row contains a citation and a quoted passage. Your system should output a CSV with exactly two columns: id and predicted_status. The code values below are exact; the plain-English titles explain the public meaning.

VERIFIED

Verified

Use VERIFIED when the quoted passage appears in the cited authority and the citation is substantively correct.

NOT_FOUND

Not found

Use NOT_FOUND when the quoted passage cannot be found in the cited authority or in the current public benchmark corpus.

MISATTRIBUTED

Found elsewhere

Use MISATTRIBUTED when the quoted language is real but belongs to a different authority than the one provided.

CITATION_UNRESOLVED

Citation unresolved

Use CITATION_UNRESOLVED when the citation string itself cannot be tied to a live authority in the benchmark corpus.
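Putting the four codes together, a system's per-row decisions can be serialized into the two-column CSV the scorer expects. A minimal sketch, where `predictions` stands in for your own system's output (the function name is illustrative):

```python
import csv

LABELS = ("VERIFIED", "NOT_FOUND", "MISATTRIBUTED", "CITATION_UNRESOLVED")

def write_submission(predictions, path):
    """Write (row_id, label) pairs as an id,predicted_status CSV.

    Rejects any label outside the four-code contract before writing it.
    """
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["id", "predicted_status"])
        for row_id, label in predictions:
            if label not in LABELS:
                raise ValueError(f"label {label!r} is not in the four-code contract")
            writer.writerow([row_id, label])
```

Failing fast on an out-of-contract label keeps a typo like "VERIFED" from silently costing 500 scored rows.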

Score a submission

Public pack upload and aggregate scoring.
Expected header: id,predicted_status
Same-origin upload, backend-scored, aggregate results only.

Public pack essentials

Version: cite-bench-v1
Rows: 500
Formats: CSV / JSON
Scoring: Private key

Label contract

VERIFIED · NOT_FOUND · MISATTRIBUTED · CITATION_UNRESOLVED

Accepted submission formats

  • CSV upload with id,predicted_status
  • JSON for local or private tooling
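Teams that keep predictions in JSON for local tooling still upload the CSV shape. A minimal conversion sketch, assuming a JSON array of objects with "id" and "predicted_status" keys (that JSON shape is an assumption for illustration, not a published schema):

```python
import csv
import json

def json_to_csv(json_path, csv_path):
    """Convert a JSON list of {"id", "predicted_status"} objects to the CSV upload format."""
    with open(json_path) as fh:
        rows = json.load(fh)
    with open(csv_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["id", "predicted_status"])
        for row in rows:
            writer.writerow([row["id"], row["predicted_status"]])
```

The CSV side matches the expected header id,predicted_status exactly, so the converted file can go straight to the upload step.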