## Overfitting in Synthesis

Saswat Padhi$\,^1$

Todd Millstein$^1$    Aditya Nori$\,^{2\,\texttt{UK}}$     Rahul Sharma$^{2\,\texttt{IN}}$

$^1$ University of California, Los Angeles, USA

$^2$ Microsoft Research   $^\texttt{UK}\,$Cambridge, UK   $^\texttt{IN}\,$Bengaluru, India

### Synopsis

• How does the choice of grammar affect the performance of synthesis tools?
• Theoretical results and experiments that demonstrate overfitting in CEGIS
• Practical mitigation strategies inspired by ML techniques

### State of the Art

• Grammars: $6$ commonly used arithmetic grammars
• Benchmarks: $180$ invariant-synthesis tasks over integers
• Tools: $5$ participants from SyGuS-Comp'18

(Timeout = $30$ mins per benchmark per tool per grammar)

With more expressive power, every tool fails on many benchmarks it could previously solve!

Is this simply due to larger search spaces? …

### Overfitting

ML notion:   A function does not correctly generalize beyond the training data

On increasing expressiveness:

| | Increase | No change |
|:--|--:|--:|
| Increase in # rounds $\Rightarrow$ ________ in synth. time | $79\,\%$ | $6\,\%$ |
| Increase in synth. time $\Rightarrow$ ________ in # rounds | $27\,\%$ | $67\,\%$ |

(For LoopInvGen on all $180$ benchmarks and all $6$ grammars with $30$ mins timeout per benchmark per grammar)

Synthesizers not only spend more time

• searching for a function within a larger space,
• but also collecting more examples from the verifier (see the CEGIS sketch below)
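
The round/example dynamic above is just the CEGIS loop. Below is a minimal sketch in Python; `learner` and `verifier` are hypothetical callables, not any particular tool's API: each round, the learner proposes a candidate consistent with the examples collected so far, and the verifier either accepts it or supplies a counterexample that starts another round.

```python
def cegis(learner, verifier, max_rounds=100):
    examples = []                          # counterexamples accumulated so far
    for _ in range(max_rounds):
        candidate = learner(examples)      # search the grammar for a consistent function
        if candidate is None:
            return None                    # grammar exhausted
        cex = verifier(candidate)          # check candidate against the full spec
        if cex is None:
            return candidate               # verified solution
        examples.append(cex)               # one more round, one more example
    return None
```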

### Contributions

Theoretical Insights

• Formal notions of learnability and overfitting for SyGuS
• No free lunch — overfitting is inevitable at high expressiveness

Practical Solutions

• PLearn, a black-box technique inspired by ensemble learning — explores multiple grammars in parallel
• Hybrid enumeration (HE) — emulates PLearn by interleaving exploration of multiple grammars in a single thread
• When combined with HE, the winning solver from the Inv track of SyGuS-Comp'18 is $5\times$ faster and solves $2$ more benchmarks

### $\boldsymbol{m}$-Learnability

(Learning from $m$ observations / examples)

#### Machine Learning

• Learned functions only need to be approximately correct
• Typically require learning from any set of $m$ i.i.d. samples

#### CEGIS-Based SyGuS

• Learned functions must match the specification exactly
• So requiring learnability from any set of $m$ examples is too strong for the CEGIS setting

A specification $\phi$ is $m$-learnable by a learner $\mathcal{L}$ if there exists a set of $m$ examples for $\phi$ with which $\mathcal{L}$ can learn a correct function for $\phi$.

(a significantly weaker notion of learnability)
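
To make the quantifier structure concrete, here is a brute-force sketch of this definition in Python, under the simplifying assumption that the specification is functional, i.e., given by a `target` function over a finite domain; `learner` is a hypothetical callable from example sets to a candidate function (or `None`).

```python
from itertools import combinations

def is_m_learnable(target, learner, domain, m):
    # m-learnable: SOME set of m examples (not every set, as in the
    # ML setting) lets the learner produce a function that is correct
    # everywhere on the domain.
    for xs in combinations(domain, m):
        examples = {x: target(x) for x in xs}
        candidate = learner(examples)
        if candidate is not None and all(candidate(x) == target(x) for x in domain):
            return True
    return False
```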

### No Free Lunch

Explicit tradeoff between grammar expressiveness and the minimum number of rounds required

Let $X$ and $Y$ be arbitrary domains, $m \in \mathbb{Z}$ be s.t. $0 \leq m < |X|$, and $\mathcal{E}$ be an arbitrary grammar. Then, either:

• $\mathcal{E}$ contains at most $\texttt{bound}(m)$ distinct $X \to Y$ functions,
• or, for every learner $\mathcal{L}$, there exists a specification $\phi$ that admits a solution in $\mathcal{E}$, but is not $m$-learnable by $\mathcal{L}$.

More details in our paper — finite and infinite $X$ and $Y$, the precise $\texttt{bound}$, etc.

### No Free Lunch: Examples

Two extreme cases:

► A singleton grammar $\mathcal{E}$:

• Only one $X \to Y$ function is expressible in $\mathcal{E}$
• So any specification that admits a solution in $\mathcal{E}$ is $0$-learnable by any learner

► A fully expressive grammar $\mathcal{E}$:

• Every $X \to Y$ function is expressible in $\mathcal{E}$
• For every learner, there exists a specification that is not learnable from fewer than $\lvert X \rvert$ examples (see the sketch below)
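
A tiny brute-force illustration of the second case, on toy domains chosen here for concreteness: after any $m < \lvert X \rvert$ examples, at least two expressible functions remain consistent, so an adversarial specification can always designate a function the learner did not pick.

```python
from itertools import product

X, Y = [0, 1, 2], [0, 1]
# A fully expressive grammar over finite X, Y contains all |Y|^|X| functions.
funcs = [dict(zip(X, ys)) for ys in product(Y, repeat=len(X))]

examples = {0: 1, 1: 0}  # any m = 2 < |X| = 3 observations
consistent = [f for f in funcs
              if all(f[x] == y for x, y in examples.items())]
# |Y|^(|X| - m) = 2 functions survive, so the learner can be forced wrong.
assert len(consistent) == len(Y) ** (len(X) - len(examples))
```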

### Overfitting

(Why some specifications require more examples to be learnable)

ML notion: When a learned function does not correctly generalize beyond the training data

SyGuS notion: When a learned function is consistent with the observed examples, but does not satisfy the given specification

Potential for Overfitting = Number of such functions in the grammar

The potential for overfitting increases
with grammar expressiveness

More details in our paper — precise bounds on number of examples and expressiveness
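
This definition suggests a direct, if naive, way to measure the potential for overfitting on a finite grammar. In this sketch, `grammar_funcs` and `satisfies_spec` are hypothetical stand-ins for an enumeration of the grammar's expressible functions and a full verifier call:

```python
def overfitting_potential(grammar_funcs, examples, satisfies_spec):
    # Count functions that fit the observed examples but violate the spec.
    return sum(1 for f in grammar_funcs
               if all(f(x) == y for x, y in examples.items())
               and not satisfies_spec(f))
```

Since enlarging the grammar only adds candidate functions, this count can only grow with expressiveness, which is exactly the claim above.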

### Contributions

Theoretical Insights

• Formal notions of learnability and overfitting for SyGuS
• No free lunch — overfitting is inevitable at high expressiveness

Practical Solutions

• PLearn, a black-box technique inspired by ensemble learning — explores multiple grammars in parallel
• Hybrid enumeration (HE) — emulates PLearn by interleaving exploration of multiple grammars in a single thread
• When combined with HE, the winning solver from the Inv track of SyGuS-Comp'18 is $5\times$ faster and solves $2$ more benchmarks

### PLearn

A technique inspired by ensemble methods [Dietterich, MCS’00] — run several learners and aggregate their results

Given a SyGuS problem $\langle \phi, \mathcal{E} \rangle$ and grammars $\mathcal{E}_1, \ldots, \mathcal{E}_n$ s.t. $\mathcal{E}_i \subseteq \mathcal{E}$, create problems $\langle \phi, \mathcal{E}_i \rangle$ and solve each in parallel.

• A thin wrapper that generates subproblems
• Agnostic to the underlying SyGuS learner

PLearn Reduces Overfitting:   Every subproblem has a lower potential for overfitting than the original problem.
(on any set of examples for the specification)
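
A minimal sketch of PLearn as such a wrapper, assuming a picklable top-level `solve(phi, grammar)` that returns a solution or `None`; this is illustrative, not the paper's implementation.

```python
from functools import partial
from multiprocessing import Pool

def plearn(solve, phi, subgrammars):
    # One SyGuS instance per subgrammar, all running in parallel;
    # return the first solution that any of them finds. Leaving the
    # `with` block terminates the remaining workers.
    with Pool(len(subgrammars)) as pool:
        for solution in pool.imap_unordered(partial(solve, phi), subgrammars):
            if solution is not None:
                return solution
    return None
```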

### PLearn: Evaluation

(Timeout = $30$ mins per benchmark per tool per grammar)

[Figure: PLearn evaluation results for each tool across the six grammars]

(Grammars: $\mathcal{E}_1$ = Equalities, $\mathcal{E}_2$ = Inequalities, $\mathcal{E}_3$ = Octagons, $\mathcal{E}_4$ = Polyhedra, $\mathcal{E}_5$ = Polynomials, $\mathcal{E}_6$ = Peano)

### Limitations of PLearn

• Extremely resource intensive: runs multiple SyGuS instances in parallel
• Many redundant computations: functions common to different grammars are enumerated multiple times

Can we instead:

• reuse expressions across different grammars, and
• enumerate each expression at most once?

We call a relation $\lhd$ on $\{\mathcal{E}_1, \ldots, \mathcal{E}_n\} \times \mathbb{N}$ a well order
if $\forall\, \mathcal{E}, \mathcal{E}', k, k' : [\mathcal{E} \subseteq \mathcal{E}' \wedge k < k'] \Rightarrow (\mathcal{E}, k) \lhd (\mathcal{E}', k')$.
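
One concrete relation satisfying this condition (an illustrative choice, not necessarily the paper's traversal) visits (grammar, size) pairs in increasing size, breaking ties by grammar index:

```python
def well_ordered_pairs(n_grammars, max_size):
    # Whenever k1 < k2, every pair at size k1 is visited before every
    # pair at size k2; in particular this holds when the first grammar
    # is contained in the second, which is all the condition requires.
    for size in range(1, max_size + 1):
        for i in range(n_grammars):
            yield (i, size)
```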

### Hybrid Enumeration (HE)

An efficient implementation of this 2-D search for component-based grammars [Jha et al, ICSE'10]

Arguments to HE:

• A SyGuS Problem: a specification $\phi$, a grammar $\mathcal{E}$
• Component-based grammars: $\mathcal{E}_1 \subset \cdots \subset \mathcal{E}_n \subseteq \mathcal{E}$
• A well-ordering relation: $\lhd$
• A size bound: $q$

Completeness:   HE can enumerate every expression in $\mathcal{E}_n$ up to any size bound $q$.

Efficiency:   HE enumerates each syntactically
distinct expression in $\mathcal{E}_n$ at most once.
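
A single-threaded sketch of this 2-D search, under assumed helper signatures; `enumerate_at` and `check` are hypothetical, and `well_ordered_pairs` is the generator sketched earlier.

```python
def hybrid_enumerate(grammars, enumerate_at, order, check):
    """grammars     -- nested grammars E_1 subset ... subset E_n
       enumerate_at -- (grammar, size) -> expressions of that size
       order        -- (grammar index, size) pairs obeying the well order
       check        -- expression -> bool (e.g. consistency with examples)"""
    seen = set()
    for i, size in order:
        for expr in enumerate_at(grammars[i], size):
            if expr in seen:      # HE proper never regenerates these;
                continue          # this sketch merely filters them out
            seen.add(expr)
            if check(expr):
                return expr
    return None
```

For example, `hybrid_enumerate(grammars, enumerate_at, well_ordered_pairs(len(grammars), q), check)` explores every expression in $\mathcal{E}_n$ up to size $q$ while touching each syntactically distinct expression once.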

### HE: Performance

| Grammar | ${\small \textbf{median}}\Big[\frac{\tau(\textbf{P})}{\tau(\textbf{H})}\Big]$ | ${\small \textbf{median}}\Big[\frac{\tau(\textbf{H})}{\tau(\textbf{L})}\Big]$ |
|:--|--:|--:|
| Equalities | $1.00$ | $1.00$ |
| Inequalities | $1.91$ | $1.04$ |
| Octagons | $2.84$ | $1.03$ |
| Polyhedra | $3.72$ | $1.01$ |
| Polynomials | $4.62$ | $1.00$ |
| Peano | $5.49$ | $0.97$ |

(Here $\tau(\textbf{P})$, $\tau(\textbf{H})$, and $\tau(\textbf{L})$ denote the total time for PLearn, HE, and plain LoopInvGen, respectively.)

(On all $180$ benchmarks with $30$ mins timeout per benchmark per tool per grammar)

• Significantly more resilient against increasing expressiveness
• Negligible impact on total time ($\tau$ = wall-clock time $\times$ # threads)
• When combined with HE, the winning solver from the Inv track of SyGuS-Comp'18 is $5\times$ faster and solves $2$ more benchmarks

### Related Work

► Bias-variance tradeoffs in program analysis [Sharma et al, POPL'14]

• They observe overfitting empirically; we formally define it
• They propose cross-validation for hyperparameter tuning; we incrementally increase expressiveness

► A theory of formal synthesis [Jha and Seshia, Acta Informatica'17]

• Studies synthesis with different kinds of counterexamples, and under various resource (memory) constraints
• But does not consider variations in the grammar

### Conclusion

• No free lunch in SyGuS: a fundamental tradeoff between grammar expressiveness and the number of counterexamples required for learning
• This is due to overfitting: the potential for overfitting increases with expressiveness
• PLearn is a black-box technique that combats overfitting by exploring multiple grammars in parallel
• Hybrid enumeration (HE) emulates the behavior of PLearn, but with negligible performance overhead