Benchmarks for measuring AI capabilities across formal reasoning tasks.
Measuring how well models capture true semantics when translating natural language to formal statements.
"A family of sets forms a delta-system if there exists a common kernel set K such that for any two distinct sets A, B in the family, their intersection equals K."
def IsDeltaSystem (F : Finset (Finset α)) : Prop := ∃ K : Finset α, ∀ A ∈ F, ∀ B ∈ F, A ≠ B → A ∩ B = K
Evaluating model capabilities at wielding tactics to solve mathematical problems with verified correctness.
theorem deltaSystem_pair (A B : Finset α) (hne : A ≠ B) :
IsDeltaSystem ({A, B} : Finset (Finset α)) := bytheorem deltaSystem_pair (A B : Finset α) (hne : A ≠ B) :
IsDeltaSystem ({A, B} : Finset (Finset α)) :=
⟨A ∩ B, fun X hX Y hY hXY => by
simp only [Finset.mem_insert, Finset.mem_singleton] at hX hY
rcases hX with rfl | rfl <;> rcases hY with rfl | rfl
· exact (hXY rfl).elim
· rfl
· apply Finset.inter_comm
· exact (hXY rfl).elim⟩Distilling machine-generated proofs into human-readable solutions using relevant mathematical abstractions.
simp [Even] at *; omega
obtain ⟨k, hk⟩ := ha obtain ⟨l, hl⟩ := hb use k + l; linarith
unfold Even at ha hb ⊢ rcases ha with ⟨k, rfl⟩ rcases hb with ⟨l, rfl⟩ refine ⟨k + l, ?_⟩; ring
Get access to our evaluation suite and benchmark your AI systems.