Verified Programs
The promise of theorem provers extends beyond mathematics. We can verify that software does what we claim it does. This article demonstrates verification techniques where the verified code and the production code are the same: everything lives within Lean.
Intrinsically-Typed Interpreters
The standard approach to building interpreters involves two phases. First, parse text into an untyped abstract syntax tree. Second, run a type checker that rejects malformed programs. This works, but the interpreter must still handle the case where a program passes the type checker but evaluates to nonsense. The runtime carries the burden of the type system’s failure modes. It is like a bouncer who checks IDs at the door but still has to deal with troublemakers inside.
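To make the contrast concrete, here is a minimal sketch of the extrinsic style; the names UVal and UExpr are illustrative, not part of the development that follows. Even after a type checker has approved a program, the evaluator's return type still forces it to handle failure:
-- A hypothetical sketch of the extrinsic style (UVal and UExpr are illustrative names)
inductive UVal where
| nat : Nat → UVal
| bool : Bool → UVal
inductive UExpr where
| nat : Nat → UExpr
| bool : Bool → UExpr
| add : UExpr → UExpr → UExpr
-- Evaluation can fail, so the result must be an Option
def UExpr.eval : UExpr → Option UVal
| .nat n => some (.nat n)
| .bool b => some (.bool b)
| .add e₁ e₂ =>
match e₁.eval, e₂.eval with
| some (.nat n), some (.nat m) => some (.nat (n + m))
| _, _ => none -- the case a type checker was supposed to rule out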
Intrinsically-typed interpreters refuse to play this game. The abstract syntax tree itself encodes typing judgments. An ill-typed program cannot be constructed. The type system statically excludes runtime type errors, not by checking them at runtime, but by making them unrepresentable. The bouncer is replaced by architecture: there is no door for troublemakers to enter.
Consider a small expression language with natural numbers, booleans, arithmetic, and conditionals. We start by defining the types our language supports and a denotation function that maps them to Lean types.
inductive Ty where
| nat : Ty
| bool : Ty
deriving Repr, DecidableEq
@[reducible] def Ty.denote : Ty → Type
| .nat => Nat
| .bool => Bool
The denote function is key. It interprets our object-level types (Ty) as meta-level types (Type). When our expression language says something has type nat, we mean it evaluates to a Lean Nat. When it says bool, we mean a Lean Bool. This type-level interpretation function is what makes the entire approach work.
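Because denote is marked reducible, Lean unfolds it definitionally. A quick sanity check, as examples of our own:
-- Sanity checks (ours): the denotations are ordinary Lean types
example : Ty.nat.denote = Nat := rfl
example : Ty.bool.denote = Bool := rfl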
Expressions
The expression type indexes over the result type. Each constructor precisely constrains which expressions can be built and what types they produce.
inductive Expr : Ty → Type where
| nat : Nat → Expr .nat
| bool : Bool → Expr .bool
| add : Expr .nat → Expr .nat → Expr .nat
| mul : Expr .nat → Expr .nat → Expr .nat
| lt : Expr .nat → Expr .nat → Expr .bool
| eq : Expr .nat → Expr .nat → Expr .bool
| and : Expr .bool → Expr .bool → Expr .bool
| or : Expr .bool → Expr .bool → Expr .bool
| not : Expr .bool → Expr .bool
| ite : {t : Ty} → Expr .bool → Expr t → Expr t → Expr t
Every constructor documents its typing rule. The add constructor requires both arguments to be natural number expressions and produces a natural number expression. The ite constructor requires a boolean condition and two branches of matching type.
This encoding makes ill-typed expressions unrepresentable. You cannot write add (nat 1) (bool true) because the types do not align. The Lean type checker rejects such expressions before they exist.
/-
def bad : Expr .nat := .add (.nat 1) (.bool true)
-- Error: type mismatch
-/
Evaluation
The evaluator maps expressions to their denotations. Because expressions are intrinsically typed, the evaluator is total. It never fails, never throws exceptions, never encounters impossible cases. Every pattern match is exhaustive.
def Expr.eval : {t : Ty} → Expr t → t.denote
| _, .nat n => n
| _, .bool b => b
| _, .add e₁ e₂ => e₁.eval + e₂.eval
| _, .mul e₁ e₂ => e₁.eval * e₂.eval
| _, .lt e₁ e₂ => e₁.eval < e₂.eval
| _, .eq e₁ e₂ => e₁.eval == e₂.eval
| _, .and e₁ e₂ => e₁.eval && e₂.eval
| _, .or e₁ e₂ => e₁.eval || e₂.eval
| _, .not e => !e.eval
| _, .ite c t e => if c.eval then t.eval else e.eval
The return type t.denote varies with the expression’s type index. A natural number expression evaluates to Nat. A boolean expression evaluates to Bool. This dependent return type is what makes the evaluator type-safe by construction.
def ex1 : Expr .nat := .add (.nat 2) (.nat 3)
def ex2 : Expr .bool := .lt (.nat 2) (.nat 3)
def ex3 : Expr .nat := .ite (.lt (.nat 2) (.nat 3)) (.nat 10) (.nat 20)
def ex4 : Expr .nat := .mul (.add (.nat 2) (.nat 3)) (.nat 4)
#eval ex1.eval -- 5
#eval ex2.eval -- true
#eval ex3.eval -- 10
#eval ex4.eval -- 20
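The dependent return type is visible in these examples: ex1.eval is accepted anywhere a Nat is expected, and ex2.eval anywhere a Bool is. Two checks of our own:
-- Checks (ours): eval's return type follows the expression's type index
example : Nat := ex1.eval
example : Bool := ex2.eval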
Verified Optimization
Interpreters become interesting when we transform programs. Compilers do this constantly: dead code elimination, loop unrolling, strength reduction. Each transformation promises to preserve meaning while improving performance. But how do we know the promise is kept?
A constant folder simplifies expressions by evaluating constant subexpressions at compile time. Adding two literal numbers produces a literal. Conditionals with constant conditions eliminate the untaken branch.
def Expr.constFold : {t : Ty} → Expr t → Expr t
| _, .nat n => .nat n
| _, .bool b => .bool b
| _, .add e₁ e₂ =>
match e₁.constFold, e₂.constFold with
| .nat n, .nat m => .nat (n + m)
| e₁', e₂' => .add e₁' e₂'
| _, .mul e₁ e₂ =>
match e₁.constFold, e₂.constFold with
| .nat n, .nat m => .nat (n * m)
| e₁', e₂' => .mul e₁' e₂'
| _, .lt e₁ e₂ => .lt e₁.constFold e₂.constFold
| _, .eq e₁ e₂ => .eq e₁.constFold e₂.constFold
| _, .and e₁ e₂ => .and e₁.constFold e₂.constFold
| _, .or e₁ e₂ => .or e₁.constFold e₂.constFold
| _, .not e => .not e.constFold
| _, .ite c t e =>
match c.constFold with
| .bool true => t.constFold
| .bool false => e.constFold
| c' => .ite c' t.constFold e.constFold
The optimization preserves types. If e : Expr t, then e.constFold : Expr t. The type indices flow through unchanged. The type system enforces this statically.
But type preservation is a weak property. We want semantic preservation: the optimized program computes the same result as the original. This requires a proof.
theorem constFold_correct : ∀ {t : Ty} (e : Expr t), e.constFold.eval = e.eval := by
intro t e
induction e with
| nat n => rfl
| bool b => rfl
| add e₁ e₂ ih₁ ih₂ =>
simp only [Expr.constFold, Expr.eval]
cases he₁ : e₁.constFold <;> cases he₂ : e₂.constFold <;>
simp only [Expr.eval, ← ih₁, ← ih₂, he₁, he₂]
| mul e₁ e₂ ih₁ ih₂ =>
simp only [Expr.constFold, Expr.eval]
cases he₁ : e₁.constFold <;> cases he₂ : e₂.constFold <;>
simp only [Expr.eval, ← ih₁, ← ih₂, he₁, he₂]
| lt e₁ e₂ ih₁ ih₂ => simp only [Expr.constFold, Expr.eval, ih₁, ih₂]
| eq e₁ e₂ ih₁ ih₂ => simp only [Expr.constFold, Expr.eval, ih₁, ih₂]
| and e₁ e₂ ih₁ ih₂ => simp only [Expr.constFold, Expr.eval, ih₁, ih₂]
| or e₁ e₂ ih₁ ih₂ => simp only [Expr.constFold, Expr.eval, ih₁, ih₂]
| not e ih => simp only [Expr.constFold, Expr.eval, ih]
| ite c t e ihc iht ihe =>
simp only [Expr.constFold, Expr.eval]
cases hc : c.constFold <;> simp only [Expr.eval, ← ihc, ← iht, ← ihe, hc]
case bool b => cases b <;> rfl
The theorem states that for any expression, evaluating the constant-folded expression yields the same result as evaluating the original. The proof proceeds by structural induction on the expression. Most cases follow directly from the induction hypotheses.
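Both properties can be watched in action on a concrete term. These examples of ours fold ex4 down to a literal and instantiate the theorem:
-- Examples (ours): the folder collapses ex4 to a literal,
-- and the correctness theorem specializes to it for free
example : ex4.constFold = .nat 20 := rfl
example : ex4.constFold.eval = ex4.eval := constFold_correct ex4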
A Verified Compiler
The intrinsically-typed interpreter demonstrates type safety. But real systems compile to lower-level representations. Can we verify the compiler itself? The answer is yes, and it requires remarkably little code. In roughly 40 lines, we can define a source language, a target language, compilation, and prove the compiler correct. This is CompCert in miniature.
The source language is arithmetic expressions: literals, addition, and multiplication. The target language is a stack machine with push, add, and multiply instructions. The compilation strategy is straightforward: literals become pushes; binary operations compile their arguments and then emit the operator. (Each example in this article stands alone; this one reuses the name Expr for its own source language, so treat each section as a separate file.)
-- Source: arithmetic expressions
inductive Expr where
| lit : Nat → Expr
| add : Expr → Expr → Expr
| mul : Expr → Expr → Expr
deriving Repr
-- Target: stack machine instructions
inductive Instr where
| push : Nat → Instr
| add : Instr
| mul : Instr
deriving Repr
-- What an expression means: evaluate to a number
def eval : Expr → Nat
| .lit n => n
| .add a b => eval a + eval b
| .mul a b => eval a * eval b
-- Compile expression to stack code
def compile : Expr → List Instr
| .lit n => [.push n]
| .add a b => compile a ++ compile b ++ [.add]
| .mul a b => compile a ++ compile b ++ [.mul]
-- Execute stack code
def run : List Instr → List Nat → List Nat
| [], stack => stack
| .push n :: is, stack => run is (n :: stack)
| .add :: is, b :: a :: stack => run is ((a + b) :: stack)
| .mul :: is, b :: a :: stack => run is ((a * b) :: stack)
| _ :: is, stack => run is stack -- stack underflow: skip the instruction
-- Lemma: execution distributes over concatenation
theorem run_append (is js : List Instr) (s : List Nat) :
run (is ++ js) s = run js (run is s) := by
induction is generalizing s with
| nil => rfl
| cons i is ih =>
cases i with
| push n => exact ih _
| add => cases s with
| nil => exact ih _
| cons b s => cases s with
| nil => exact ih _
| cons a s => exact ih _
| mul => cases s with
| nil => exact ih _
| cons b s => cases s with
| nil => exact ih _
| cons a s => exact ih _
-- THE THEOREM: the compiler is correct
-- Running compiled code pushes exactly the evaluated result
theorem compile_correct (e : Expr) (s : List Nat) :
run (compile e) s = eval e :: s := by
induction e generalizing s with
| lit n => rfl
| add a b iha ihb => simp [compile, run_append, iha, ihb, run, eval]
| mul a b iha ihb => simp [compile, run_append, iha, ihb, run, eval]
The key insight is the run_append lemma: executing concatenated instruction sequences is equivalent to executing them in order. This lets us prove correctness compositionally. The main theorem, compile_correct, states that running compiled code pushes exactly the evaluated result onto the stack.
The proof proceeds by structural induction on expressions. Literal compilation is trivially correct. For binary operations, we use run_append to split the execution: first we run the compiled left argument, then the compiled right argument, then the operator. The induction hypotheses tell us each subexpression evaluates correctly. The operator instruction combines them as expected.
-- Try it: (2 + 3) * 4
def expr : Expr := .mul (.add (.lit 2) (.lit 3)) (.lit 4)
#eval eval expr -- 20
#eval compile expr -- [push 2, push 3, add, push 4, mul]
#eval run (compile expr) [] -- [20]
-- The theorem guarantees these always match. Not by testing. By proof.
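The guarantee is also directly usable inside Lean: specializing the theorem to expr requires no new proof. An instantiation of our own:
-- Instantiation (ours): no computation, no testing, just the theorem
example : run (compile expr) [] = [eval expr] := compile_correct expr []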
This is verified compiler technology at its most distilled. The same principles scale to CompCert, a formally verified C compiler used in production. The gap between 40 lines and 100,000 lines is mostly the complexity of real languages and optimizations, not the verification methodology.
Proof-Carrying Parsers
The intrinsically-typed interpreter guarantees type safety. The verified compiler guarantees semantic preservation. But what about parsers? A parser takes untrusted input and produces structured data. The traditional approach is to hope the parser is correct and test extensively. The verified approach is to make the parser carry its own proof of correctness.
A proof-carrying parser returns both the parsed result and evidence that the result matches the grammar. Invalid parses become type errors rather than runtime errors. The proof is constructed during parsing and verified by the type checker.
We define a grammar as an inductive type with constructors for characters, sequencing, alternation, repetition, and the empty string:
inductive Grammar where
| char : Char → Grammar
| seq : Grammar → Grammar → Grammar
| alt : Grammar → Grammar → Grammar
| many : Grammar → Grammar
| eps : Grammar
inductive Matches : Grammar → List Char → Prop where
| char {c} : Matches (.char c) [c]
| eps : Matches .eps []
| seq {g₁ g₂ s₁ s₂} : Matches g₁ s₁ → Matches g₂ s₂ → Matches (.seq g₁ g₂) (s₁ ++ s₂)
| altL {g₁ g₂ s} : Matches g₁ s → Matches (.alt g₁ g₂) s
| altR {g₁ g₂ s} : Matches g₂ s → Matches (.alt g₁ g₂) s
| manyNil {g} : Matches (.many g) []
| manyCons {g s₁ s₂} : Matches g s₁ → Matches (.many g) s₂ → Matches (.many g) (s₁ ++ s₂)
The Matches relation defines when a string matches a grammar. Each constructor corresponds to a grammar production: a character matches itself, sequences match concatenations, alternatives match either branch, and repetition matches zero or more occurrences.
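Derivations can be written by hand. In this example of ours, the string ['a', 'b'] matches the sequence of 'a' then 'b'; the concatenation ['a'] ++ ['b'] in the seq rule reduces to ['a', 'b'] definitionally:
-- Hand-written derivation (ours): "ab" matches char 'a' followed by char 'b'
example : Matches (.seq (.char 'a') (.char 'b')) ['a', 'b'] := .seq .char .char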
A parse result bundles the consumed input, remaining input, and a proof that the consumed portion matches the grammar:
structure ParseResult (g : Grammar) where
consumed : List Char
rest : List Char
proof : Matches g consumed
abbrev Parser (g : Grammar) := List Char → Option (ParseResult g)
The parser combinators construct these proof terms as they parse. When pchar 'a' succeeds, it returns a ParseResult containing proof that 'a' matches Grammar.char 'a'. When pseq combines two parsers, it combines their proofs using the Matches.seq constructor:
def pchar (c : Char) : Parser (.char c) := fun
| x :: xs => if h : x = c then some ⟨[c], xs, h ▸ .char⟩ else none
| [] => none
variable {g₁ g₂ g : Grammar}
def pseq (p₁ : Parser g₁) (p₂ : Parser g₂) : Parser (.seq g₁ g₂) := fun input =>
p₁ input |>.bind fun ⟨s₁, r, pf₁⟩ => p₂ r |>.map fun ⟨s₂, r', pf₂⟩ => ⟨s₁ ++ s₂, r', .seq pf₁ pf₂⟩
def palt (p₁ : Parser g₁) (p₂ : Parser g₂) : Parser (.alt g₁ g₂) := fun input =>
(p₁ input |>.map fun ⟨s, r, pf⟩ => ⟨s, r, .altL pf⟩) <|>
(p₂ input |>.map fun ⟨s, r, pf⟩ => ⟨s, r, .altR pf⟩)
partial def pmany (p : Parser g) : Parser (.many g) := fun input =>
match p input with
| none => some ⟨[], input, .manyNil⟩
| some ⟨s₁, r, pf₁⟩ =>
if s₁.isEmpty then some ⟨[], input, .manyNil⟩
else (pmany p r).map fun ⟨s₂, r', pf₂⟩ => ⟨s₁ ++ s₂, r', .manyCons pf₁ pf₂⟩
infixl:60 " *> " => pseq
infixl:50 " <+> " => palt
postfix:90 "⁺" => pmany
Soundness is trivial. Every successful parse carries its proof:
theorem soundness (p : Parser g) (input : List Char) (r : ParseResult g) :
p input = some r → Matches g r.consumed := fun _ => r.proof
The theorem says: if a parser returns a result, then the consumed input matches the grammar. The proof simply projects the evidence already stored in the result; no reasoning is needed. Proof-carrying data constructs correctness alongside the computation rather than establishing it after the fact.
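Running the combinators shows the machinery end to end. In this usage sketch, pab is a name of our own for a parser of the two-character grammar; the proof field lives in Prop and is erased at runtime, so #eval prints only the data:
-- Usage sketch (ours): pab parses 'a' then 'b' and carries its proof
def pab : Parser (.seq (.char 'a') (.char 'b')) := pseq (pchar 'a') (pchar 'b')
#eval (pab "abc".toList).map fun r => (r.consumed, r.rest) -- some (['a', 'b'], ['c'])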
The Stack Machine
We continue with another Lean-only verification example: a stack machine, the fruit fly of computer science. Like the fruit fly in genetics, stack machines are simple enough to study exhaustively yet complex enough to exhibit interesting behavior. The machine has five operations: push a value, pop the top, add the top two values, multiply them, or duplicate the top.
inductive Op where
| push : Int → Op
| pop : Op
| add : Op
| mul : Op
| dup : Op
deriving Repr, DecidableEq
The run function executes a program against a stack:
def run : List Op → List Int → List Int
| [], stack => stack
| .push n :: ops, stack => run ops (n :: stack)
| .pop :: ops, _ :: stack => run ops stack
| .pop :: ops, [] => run ops [] -- pop on an empty stack: no-op
| .add :: ops, b :: a :: stack => run ops ((a + b) :: stack)
| .add :: ops, stack => run ops stack -- fewer than two values: skip
| .mul :: ops, b :: a :: stack => run ops ((a * b) :: stack)
| .mul :: ops, stack => run ops stack -- fewer than two values: skip
| .dup :: ops, x :: stack => run ops (x :: x :: stack)
| .dup :: ops, [] => run ops [] -- dup on an empty stack: no-op
#eval run [.push 3, .push 4, .add] [] -- [7]
#eval run [.push 3, .push 4, .mul] [] -- [12]
#eval run [.push 5, .dup, .mul] [] -- [25]
#eval run [.push 10, .push 3, .pop] [] -- [10]
Universal Properties
The power of theorem proving lies not in verifying specific programs but in proving properties about all programs. Consider the composition theorem: running two programs in sequence equals running their concatenation.
theorem run_append (p1 p2 : List Op) (s : List Int) :
run (p1 ++ p2) s = run p2 (run p1 s) := by
induction p1 generalizing s with
| nil => rfl
| cons op ops ih =>
cases op with
| push n => exact ih _
| pop => cases s <;> exact ih _
| add => match s with
| [] | [_] => exact ih _
| _ :: _ :: _ => exact ih _
| mul => match s with
| [] | [_] => exact ih _
| _ :: _ :: _ => exact ih _
| dup => cases s <;> exact ih _
This theorem quantifies over all programs p1 and p2 and all initial stacks s. The proof proceeds by induction on the first program, with case analysis on each operation and the stack state. The result is a guarantee that holds for the infinite space of all possible programs.
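To see the theorem at work, here is one instantiation of our own: splitting a program at any point yields the same final stack, with no computation needed in the proof.
-- Instantiation (ours): the split point is arbitrary
example (s : List Int) :
run ([.push 2] ++ [.push 3, .add]) s = run [.push 3, .add] (run [.push 2] s) :=
run_append _ _ _
#eval run ([.push 2] ++ [.push 3, .add]) [] -- [5]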
Stack Effects
Each operation has a predictable effect on stack depth. Push and dup add one element; pop, add, and mul remove one (add and mul consume two and produce one). We can compute the total effect of a program statically:
def effect : Op → Int
| .push _ => 1
| .pop => -1
| .add => -1
| .mul => -1
| .dup => 1
def totalEffect : List Op → Int
| [] => 0
| op :: ops => effect op + totalEffect ops
theorem effect_append (p1 p2 : List Op) :
totalEffect (p1 ++ p2) = totalEffect p1 + totalEffect p2 := by
induction p1 with
| nil => simp [totalEffect]
| cons op ops ih => simp [totalEffect, ih, Int.add_assoc]
The effect_append theorem proves that stack effects compose additively. If program p1 changes the stack depth by n and p2 changes it by m, then p1 ++ p2 changes it by n + m. This is another universal property, holding for all programs.
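Concrete checks of our own: the squaring sequence has net effect +1, and additivity is available as a theorem instance rather than an observation.
-- Checks (ours): computing and instantiating the effect arithmetic
#eval totalEffect [.push 5, .dup, .mul] -- 1
example : totalEffect ([.push 1, .pop] ++ [.push 2]) =
totalEffect [.push 1, .pop] + totalEffect [.push 2] := effect_append _ _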
Program Equivalence
We can also prove that certain program transformations preserve semantics. Addition and multiplication are commutative, so swapping the order of pushes does not change the result:
theorem add_comm (n m : Int) (rest : List Op) (s : List Int) :
run (.push n :: .push m :: .add :: rest) s =
run (.push m :: .push n :: .add :: rest) s := by
simp [run, Int.add_comm]
theorem mul_comm (n m : Int) (rest : List Op) (s : List Int) :
run (.push n :: .push m :: .mul :: rest) s =
run (.push m :: .push n :: .mul :: rest) s := by
simp [run, Int.mul_comm]
theorem dup_add_eq_double (n : Int) (rest : List Op) (s : List Int) :
run (.push n :: .dup :: .add :: rest) s =
run (.push (n + n) :: rest) s := by
simp [run]
theorem dup_mul_eq_square (n : Int) (rest : List Op) (s : List Int) :
run (.push n :: .dup :: .mul :: rest) s =
run (.push (n * n) :: rest) s := by
simp [run]
These theorems justify program transformations. An optimizer that swaps the two pushes feeding an add or a mul is provably correct. The dup_add_eq_double and dup_mul_eq_square theorems show that push n; dup; add computes 2n and push n; dup; mul computes n². A compiler could use these equivalences for strength reduction.
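As an illustration of our own, the equivalence rewrites a concrete squaring sequence into a single push, with the proof supplied entirely by the theorem:
-- Illustration (ours): strength reduction justified by dup_mul_eq_square
example (s : List Int) :
run [.push 6, .dup, .mul] s = run [.push (6 * 6)] s :=
dup_mul_eq_square 6 [] s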
What We Proved
The stack machine demonstrates verification of universal properties. We proved that running concatenated programs equals sequential execution (composition), that stack effects compose predictably (effect additivity), that push order does not affect addition or multiplication (commutativity), and that certain instruction sequences compute the same result (equivalences).
These theorems quantify over the entire space of programs, unlike tests of specific inputs. The composition theorem alone covers infinitely many cases that no test suite could enumerate. A passing test establishes an existential claim (“there exists an input where the program works”), while a theorem establishes a universal claim (“for all inputs, the program works”). Tests sample behavior, proofs characterize it completely.
The Verification Gap
Everything so far lives entirely within Lean. The interpreter is correct by construction. The compiler preserves semantics. The parser carries its proof. The stack machine obeys universal laws. These are real theorems about real programs. And yet they share a fundamental limitation: the verified code and the production code are the same code. There is no gap to bridge because there is no bridge to cross.
Real systems are not written in Lean. They are written in Rust, C, Go, or whatever language the team knows and the platform demands. The next article explores how to bridge the gap between a verified model and a production implementation.