Overfitting in Synthesis



Saswat Padhi$ \,^1 $

Todd Millstein$ ^1 $    Aditya Nori$ \,^{2\,\texttt{UK}} $     Rahul Sharma$ ^{2\,\texttt{IN}} $

$ ^1 $ University of California, Los Angeles, USA

$ ^2 $ Microsoft Research   $ ^\texttt{UK}\, $Cambridge, UK   $ ^\texttt{IN}\, $Bengaluru, India

Synthesis

[Figure: the CEGIS loop. The learner proposes a likely implementation from the grammar; the verifier either certifies it against the specification (yielding a verified implementation) or returns a counterexample. SyGuS framework [Alur et al, FMCAD'13]; CEGIS framework [Solar-Lezama, STTT'13]]
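A minimal sketch of this CEGIS interaction (assuming hypothetical `learner` and `verifier` callables, not the interface of any particular SyGuS tool):

```python
def cegis(learner, verifier, max_rounds=1000):
    """A minimal CEGIS loop: `learner(examples)` proposes a candidate
    consistent with the examples seen so far, and `verifier(candidate)`
    returns None on success or a counterexample (input, output) pair."""
    examples = []
    for _ in range(max_rounds):
        candidate = learner(examples)      # likely implementation
        cex = verifier(candidate)
        if cex is None:
            return candidate               # verified implementation
        examples.append(cex)               # refine; one more round
    raise RuntimeError("no verified implementation within the budget")
```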

How does the choice of the grammar
affect the performance of synthesis tools?

Theoretical results & experiments
that demonstrate overfitting in CEGIS

Practical mitigation strategies
inspired by ML techniques

State of the Art

  • Grammars : $ 6 $ commonly used arithmetic grammars
  • Benchmarks : $ 180 $ invariant-synthesis tasks over integers
  • Tools : $ 5 $ participants from SyGuS-Comp'18
    LoopInvGen [PLDI'16]   SketchAC [Jeon et al, CAV'15]   EUSolver [Alur et al, TACAS'17]   CVC4 [Reynolds et al, CAV'15]   Stoch [Alur et al, FMCAD'13]

Equalities $ \boldsymbol{(x = y)} $ $ \subset $ Inequalities $ \boldsymbol{(x \leq y)} $ $ \subset $ Octagons $ \boldsymbol{(x > y + 2)} $ $ \subset $ Polyhedra $ \boldsymbol{(x \geq 3 * y - 5)} $ $ \subset $ Polynomials $ \boldsymbol{(x \leq 7 * y * z + 11)} $ $ \subset $ Peano $ \boldsymbol{(x = y \,\%\, z - 13)} $

(Timeout = $ 30 $ mins per benchmark per tool per grammar)

With more expressive power, every tool fails on many benchmarks it could previously solve!

But is the performance degradation
simply due to larger search spaces? …

Overfitting

[Figure: the CEGIS loop, as above.]   Rounds: the number of times the learner queries the verifier

ML notion:   A function does not correctly generalize beyond the training data

On increasing expressiveness:                                       Increase      No Change
Increase in # Rounds $ \ \Rightarrow\ $ ________ in Synth. Time     $ 79 \,\% $   $ 6 \,\% $
Increase in Synth. Time $ \ \Rightarrow\ $ ________ in # Rounds     $ 27 \,\% $   $ 67 \,\% $

(For LoopInvGen on all $ 180 $ benchmarks and all $ 6 $ grammars with $ 30 $ mins timeout per benchmark per grammar)

 Synthesizers not only spend more time

  • searching for a function within a larger space,
  • but also collecting more examples from the verifier

Contributions

 Theoretical Insights

  • Formal notions of learnability and overfitting for SyGuS
  • No free lunch — overfitting is inevitable at high expressiveness

 Practical Solutions

  • PLearn, a black-box technique inspired by ensemble learning — explores multiple grammars in parallel
  • Hybrid enumeration (HE) — emulates PLearn by interleaving exploration of multiple grammars in a single thread
  • When combined with HE, the winning solver from the Inv track of SyGuS-Comp'18 is $ 5\times $ faster and solves $2$ more benchmarks

$ \boldsymbol{m} $-Learnability

 (Learning from $ m $ observations / examples)

Machine Learning

  • Learned functions only need to be approximately correct
  • Typically require learning from any set of $ m $ i.i.d. samples

CEGIS-Based SyGuS

  • Learned functions must match the specification exactly
  • Learning from any set of $ m $ samples is too strong a requirement for the CEGIS setting

A specification $ \phi $ is $ m $-learnable by a learner $ \mathcal{L} $
if there exists a set of $ m $ examples for $ \phi $ with which
$ \mathcal{L} $ can learn a correct function for $ \phi $.

(a significantly weaker notion of learnability)
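As a toy illustration (ours, not from the poster), take $ X = Y = \{0, 1\} $ and:

$$ \phi(f) \;\equiv\; \forall x.\ f(x) = x \qquad\quad \mathcal{E} \;=\; \{\, \lambda x.\, x,\;\; \lambda x.\, 0 \,\} $$

The single example $ f(1) = 1 $ rules out $ \lambda x.\, 0 $, so $ \phi $ is $ 1 $-learnable by any learner that returns an expression from $ \mathcal{E} $ consistent with its examples.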

No Free Lunch

Explicit tradeoff between grammar expressiveness and the minimum number of rounds required

Let $ X $ and $ Y $ be arbitrary domains, $ m \in \mathbb{Z} $ be s.t. $ 0 \leq m < |X| $, and $ \mathcal{E} $ be an arbitrary grammar. Then, either:

  • $ \mathcal{E} $ contains at most $\texttt{bound}(m)$ distinct $X \to Y$ functions,
  • or, for every learner $ \mathcal{L} $, there exists a specification $ \phi $ that admits a solution in $ \mathcal{E} $, but is not $ m $-learnable by $ \mathcal{L} $.

More details in our paper — finite and infinite $ X $ and $ Y $, the precise $ \texttt{bound} $, etc.

No Free Lunch: Examples

Two extreme cases:

 ► A singleton grammar $ \mathcal{E} $:

  • Any specification that admits a solution in $ \mathcal{E} $
    is $ 0 $-learnable by any learner
  • Only one $ X \to Y $ function is expressible in $ \mathcal{E} $

 ► A fully expressive grammar $ \mathcal{E} $:

  • Every $ X \to Y $ function is expressible in $ \mathcal{E} $
  • For every learner, there exists a specification
    that is not $ m $-learnable for any $ m < \lvert X \rvert $

Overfitting

 (Why some specifications require more examples to be learnable)

ML notion: When a learned function does not correctly generalize beyond the training data

SyGuS notion: When a learned function is consistent with the observed examples, but does not satisfy the given specification

Potential for Overfitting = Number of such functions in the grammar

The potential for overfitting increases
with grammar expressiveness

More details in our paper — precise bounds on number of examples and expressiveness
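As a toy model (ours; a pointwise specification over a finite domain stands in for the paper's general definition), this count can be computed by brute force:

```python
def potential_for_overfitting(grammar, examples, domain, target):
    """Count functions in `grammar` that are consistent with the observed
    `examples` but do not satisfy the (pointwise) specification."""
    return sum(
        1
        for f in grammar
        if all(f(x) == y for x, y in examples)       # fits the examples
        and any(f(x) != target(x) for x in domain)   # yet violates the spec
    )

# Toy example: the target is the identity on {0, 1, 2}, but the learner
# has only observed f(0) = 0.
grammar = [lambda x: x, lambda x: 0, lambda x: x * x, lambda x: 2 * x]
print(potential_for_overfitting(grammar, [(0, 0)], [0, 1, 2], lambda x: x))
# -> 3: (λx. 0), (λx. x*x), and (λx. 2*x) fit the example but are wrong
```

Growing the grammar can only add to this count, which is the sense in which the potential for overfitting increases with expressiveness.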

Contributions

 Theoretical Insights

  • Formal notions of learnability and overfitting for SyGuS
  • No free lunch — overfitting is inevitable at high expressiveness

 Practical Solutions

  • PLearn, a black-box technique inspired by ensemble learning — explores multiple grammars in parallel
  • Hybrid enumeration (HE) — emulates PLearn by interleaving exploration of multiple grammars in a single thread
  • When combined with HE, the winning solver from the Inv track of SyGuS-Comp'18 is $ 5\times $ faster and solves $2$ more benchmarks

PLearn

A technique inspired by ensemble methods [Dietterich, MCS'00] — run several learners and aggregate their results

Given a SyGuS problem $ \langle \phi, \mathcal{E} \rangle $ and grammars $ \mathcal{E}_1, \ldots, \mathcal{E}_n $ s.t. $ \mathcal{E}_i \subseteq \mathcal{E} $, create problems $ \langle \phi, \mathcal{E}_i \rangle $ and solve each in parallel.

  • A thin wrapper that generates subproblems
  • Agnostic to the underlying SyGuS learner

PLearn Reduces Overfitting:   Every subproblem has a lower potential for overfitting than the original problem.
(on any set of examples for the specification)
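A minimal sketch of this wrapper, assuming a hypothetical `solve(spec, grammar)` callable that returns a verified implementation or `None` (standing in for any SyGuS solver, not the actual interface of the tools above):

```python
from concurrent.futures import FIRST_COMPLETED, ProcessPoolExecutor, wait

def plearn(spec, grammars, solve):
    """Launch one SyGuS subproblem per grammar, in parallel, and return
    the first verified implementation; agnostic to the underlying solver."""
    with ProcessPoolExecutor(max_workers=len(grammars)) as pool:
        pending = {pool.submit(solve, spec, g) for g in grammars}
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                result = future.result()
                if result is not None:   # some subproblem succeeded
                    return result
    return None                          # every subproblem failed
```

A real implementation would also terminate the remaining subproblems once one succeeds; the sketch simply returns the first non-`None` result.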

PLearn: Evaluation

[Plots: failure counts across the six grammars, one panel per tool — LoopInvGen, CVC4, Stoch, SketchAC, EUSolver]

(✖) Original failure count
   (U-shaped trend)

(●) Failure count with PLearn
   (decreasing trend)

(Timeout = $ 30 $ mins per benchmark per tool per grammar)

$ \mathcal{E}_1 $ = Equalities $ \mathcal{E}_2 $ = Inequalities $ \mathcal{E}_3 $ = Octagons
$ \mathcal{E}_4 $ = Polyhedra $ \mathcal{E}_5 $ = Polynomials $ \mathcal{E}_6 $ = Peano

Limitations of PLearn

  • Extremely resource intensive
    runs multiple SyGuS instances in parallel

  • Many redundant computations
    functions common to different grammars are enumerated multiple times

Can we:

  • reuse expressions across different grammars
  • enumerate each expression at most once

in a single-threaded algorithm?

[Figure: a 2-D grid with grammars $ \mathcal{E}_1 \subseteq \mathcal{E}_2 \subseteq \mathcal{E}_3 \subseteq \mathcal{E}_4 \subseteq \mathcal{E}_5 \subseteq \mathcal{E}_6 $ on one axis and expression sizes $ 1, 2, 3, 4 $ on the other]

  • An ordering of all the points on this grid
  • Subexpressions should be enumerated before an expression

Expressions of size $ 4 $ that appear only in $ \mathcal{E}_6 $, but not in $ \mathcal{E}_1, \ldots, \mathcal{E}_5$

We call a relation $ \lhd $ on $ \{\mathcal{E}_1, \ldots, \mathcal{E}_n\} \times \mathbb{N} $ a well order
if $ \forall\, \mathcal{E}_p, \mathcal{E}_q, k_p, k_q : [\mathcal{E}_p \subseteq \mathcal{E}_q \wedge k_p < k_q] \Rightarrow (\mathcal{E}_p, k_p) \lhd (\mathcal{E}_q, k_q) $.

Hybrid Enumeration (HE)

An efficient implementation of this 2-D search for component-based grammars [Jha et al, ICSE'10]

Arguments to HE:

  • A SyGuS Problem: a specification $ \phi $, a grammar $ \mathcal{E} $
  • Component-based grammars: $ \mathcal{E}_1 \subset \cdots \subset \mathcal{E}_n \subseteq \mathcal{E} $
  • A well-ordering relation: $ \lhd $
  • A size bound: $ q $

Completeness:   HE can enumerate every expression in $ \mathcal{E}_n $ up to any size bound $ q $.

Efficiency:   HE enumerates each syntactically
distinct expression in $ \mathcal{E}_n $ at most once.
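The sketch below (ours; `well_ordered_cells` is an illustrative name, and the real algorithm additionally reuses cached subexpressions of component-based grammars) shows one traversal of the (grammar, size) grid that satisfies the well-order condition:

```python
def well_ordered_cells(n_grammars, max_size):
    """Visit the (grammar, size) grid so that (E_p, k_p) comes before
    (E_q, k_q) whenever E_p is a subgrammar of E_q and k_p < k_q --
    the well-order condition defined above."""
    for size in range(1, max_size + 1):      # smaller sizes first
        for g in range(1, n_grammars + 1):   # less expressive grammars first
            yield (g, size)

# Each cell (g, size) is charged with the expressions of that size whose
# components first appear in grammar E_g, so every syntactically distinct
# expression is enumerated at most once across the whole search.
print(list(well_ordered_cells(3, 2)))
# -> [(1, 1), (2, 1), (3, 1), (1, 2), (2, 2), (3, 2)]
```

Size-major order trivially satisfies the condition, since every cell with a strictly smaller size is visited first.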

HE: Performance

[Plots: per-grammar results (Inequalities, Octagons, Polyhedra, Polynomials, Peano); series: [L] LoopInvGen, [H] hybrid enumeration, [P] PLearn]
Grammar $ {\small \textbf{median}}\Big[\frac{\tau(\textbf{P})}{\tau(\textbf{H})}\Big] $ $ {\small \textbf{median}}\Big[\frac{\tau(\textbf{H})}{\tau(\textbf{L})}\Big] $
Equalities $ 1.00 $ $ 1.00 $
Inequalities $ 1.91 $ $ 1.04 $
Octagons $ 2.84 $ $ 1.03 $
Polyhedra $ 3.72 $ $ 1.01 $
Polynomials $ 4.62 $ $ 1.00 $
Peano $ 5.49 $ $ 0.97 $

(On all $ 180 $ benchmarks with $ 30 $ mins timeout per benchmark per tool per grammar)

  • Significantly more resilient against increasing expressiveness
  • Negligible impact on total time $ \tau = $ wall-clock time $ \times $ # threads
  • When combined with HE, the winning solver from the Inv track of SyGuS-Comp'18 is $ 5\times $ faster and solves $2$ more benchmarks

Related Work

► Bias-variance tradeoffs in program analysis [Sharma et al, POPL'14]

  • Observe overfitting empirically
    • We formally define it
  • Propose cross validation for hyperparameter tuning
    • We incrementally increase expressiveness

► A theory of formal synthesis [Jha and Seshia, Acta Informatica'17]

  • Synthesis with different kinds of counterexamples
  • And under various resource (memory) constraints
    • But do not consider variations in grammar

Conclusion

  • No free lunch in SyGuS — a fundamental tradeoff between grammar expressiveness and the number of counterexamples required for learning
  • This is due to overfitting — the potential for overfitting increases with expressiveness
  • PLearn is a black-box technique to combat overfitting by exploring multiple grammars in parallel
  • Hybrid enumeration (HE) emulates the behavior of PLearn, but with negligible performance overhead