Evaluations.

Benchmarks for measuring AI capabilities across formal reasoning tasks.

Natural Language to Formal Specification

Autoformalization

Measuring how well models capture true semantics when translating natural language to formal statements.

Semantic preservation across mathematical domains
Type inference and constraint satisfaction
Cross-language formalization (Lean, Coq, Isabelle)
Formalize the following statement in Lean 4:
"A family of sets forms a delta-system if there exists a common kernel set K such that for any two distinct sets A, B in the family, their intersection equals K."
def IsDeltaSystem (F : Finset (Finset α)) : Prop :=
  ∃ K : Finset α, ∀ A ∈ F, ∀ B ∈ F, A ≠ B → A ∩ B = K
Proof Search and Tactic Generation

Automated Theorem Proving

Evaluating model capabilities at wielding tactics to solve mathematical problems with verified correctness.

Multi-step proof construction and verification
Tactic selection and parameter prediction
Lemma discovery and auxiliary goal generation
Complete the following Lean 4 proof:
theorem deltaSystem_pair (A B : Finset α) (hne : A ≠ B) :
    IsDeltaSystem ({A, B} : Finset (Finset α)) := by
theorem deltaSystem_pair (A B : Finset α) (hne : A ≠ B) :
    IsDeltaSystem ({A, B} : Finset (Finset α)) :=
  ⟨A ∩ B, fun X hX Y hY hXY => by
    simp only [Finset.mem_insert, Finset.mem_singleton] at hX hY
    rcases hX with rfl | rfl <;> rcases hY with rfl | rfl
    · exact (hXY rfl).elim
    · rfl
    · apply Finset.inter_comm
    · exact (hXY rfl).elim⟩
Proof Comprehension and Explanation

Interpretability

Distilling machine-generated proofs into human-readable solutions using relevant mathematical abstractions.

Proof optimization and simplification
Natural language explanation generation
Human preference alignment via proof ranking
Which proof of even_add_even is clearest?
A
simp [Even] at *; omega
B
obtain ⟨k, hk⟩ := ha
obtain ⟨l, hl⟩ := hb
use k + l; linarith
C
unfold Even at ha hb ⊢
rcases ha with ⟨k, rfl⟩
rcases hb with ⟨l, rfl⟩
refine ⟨k + l, ?_⟩; ring
Human selected:Proof B

Evaluate your models

Get access to our evaluation suite and benchmark your AI systems.