Evaluations.

Benchmarks for measuring AI capabilities across formal reasoning tasks.

Natural Language to Formal Specification

Autoformalization

Measuring how well models capture true semantics when translating natural language to formal statements.

Semantic preservation across mathematical domains

Type inference and constraint satisfaction

Cross-language formalization (Lean, Coq, Isabelle)

Formalize the following statement in Lean 4:

"A family of sets forms a delta-system if there exists a common kernel set K such that for any two distinct sets A, B in the family, their intersection equals K."

def IsDeltaSystem (F : Finset (Finset α)) : Prop :=
  ∃ K : Finset α, ∀ A ∈ F, ∀ B ∈ F, A ≠ B → A ∩ B = K

Proof Search and Tactic Generation

Automated Theorem Proving

Evaluating model capabilities at wielding tactics to solve mathematical problems with verified correctness.

Multi-step proof construction and verification

Tactic selection and parameter prediction

Lemma discovery and auxiliary goal generation

Complete the following Lean 4 proof:

theorem deltaSystem_pair (A B : Finset α) (hne : A ≠ B) :
    IsDeltaSystem ({A, B} : Finset (Finset α)) := by

theorem deltaSystem_pair (A B : Finset α) (hne : A ≠ B) :
    IsDeltaSystem ({A, B} : Finset (Finset α)) :=
  ⟨A ∩ B, fun X hX Y hY hXY => by
    simp only [Finset.mem_insert, Finset.mem_singleton] at hX hY
    rcases hX with rfl | rfl <;> rcases hY with rfl | rfl
    · exact (hXY rfl).elim
    · rfl
    · apply Finset.inter_comm
    · exact (hXY rfl).elim⟩

Proof Comprehension and Explanation

Interpretability

Distilling machine-generated proofs into human-readable solutions using relevant mathematical abstractions.

Proof optimization and simplification

Natural language explanation generation

Human preference alignment via proof ranking

Which proof of even_add_even is clearest?

simp [Even] at *; omega

obtain ⟨k, hk⟩ := ha
obtain ⟨l, hl⟩ := hb
use k + l; linarith

unfold Even at ha hb ⊢
rcases ha with ⟨k, rfl⟩
rcases hb with ⟨l, rfl⟩
refine ⟨k + l, ?_⟩; ring

Human selected:Proof B

Evaluate your models

Get access to our evaluation suite and benchmark your AI systems.

Book a Demo View Products