Fuzzing with Constraints¶
In previous chapters, we have seen how Grammar-Based Fuzzing allows us to efficiently generate myriads of syntactically valid inputs. However, there are semantic input features that cannot be expressed in a context-free grammar, such as
- "$X$ is the length of $Y$";
- "$X$ is an identifier previously declared"; or
- "$X$ should be longer than 4,096 bytes".
In this chapter, we show how the ISLa framework allows us to express such features as constraints added to a grammar. By having ISLa solve these constraints automatically, we produce inputs that are not only syntactically valid, but actually semantically valid. Furthermore, such constraints allow us to very precisely shape the inputs we want for testing.
from bookutils import YouTubeVideo
YouTubeVideo("dgaGuwn-1OU")
Prerequisites
- You should have read the chapter on grammars.
- The chapter on generators and filters addresses a similar problem, but with program code instead of constraints.
import bookutils.setup
Synopsis¶
To use the code provided in this chapter, write
>>> from fuzzingbook.FuzzingWithConstraints import <identifier>
and then make use of the following features.
This chapter introduces the ISLa framework, consisting of
- the ISLa specification language, allowing to add constraints to a grammar
- the ISLa solver, solving these constraints to produce semantically (and syntactically) valid inputs
- the ISLa checker, checking given inputs for whether they satisfy these constraints.
A typical usage of the ISLa solver is as follows. First, install ISLa, using
$ pip install isla-solver
Then, you can import the solver as
>>> from isla.solver import ISLaSolver
The ISLa solver needs two things. First, a grammar - say, US phone numbers.
>>> from Grammars import US_PHONE_GRAMMAR
Second, you need constraints – a string expressing a condition over one or more grammar elements. Common functions include
str.len()
, returning the length of a stringstr.to.int()
, converting a string to an integer
Here, we instantiate the ISLa solver with a constraint stating that the area code should be above 900:
>>> solver = ISLaSolver(US_PHONE_GRAMMAR,
>>> """
>>> str.to.int(<area>) > 900
>>> """)
With that, invoking solver.solve()
returns a solution for the constraints.
>>> str(solver.solve())
'(902)805-6934'
solve()
returns a derivation tree, which typically is converted into a string using str()
as above. The print()
function does this implicitly.
Subsequent calls of solve()
return more solutions:
>>> for _ in range(10):
>>> print(solver.solve())
(902)671-8520
(902)308-8044
(902)737-2584
(902)500-2834
(902)429-5794
(902)292-0499
(902)977-9111
(902)209-4775
(902)565-2710
(909)223-7794
We see that the solver produces a number of inputs that all satisfy the constraint - the area code is always more than 900.
The ISLaSolver()
constructor provides several additional parameters to configure the solver, as documented below.
Additional ISLaSolver
methods allow checking inputs against constraints, and provide additional functionality.
The ISLa functionality is also available on the command line:
>>> !isla --help
usage: isla [-h] [-v]
{solve,fuzz,check,find,parse,repair,mutate,create,config} ...
The ISLa command line interface.
options:
-h, --help show this help message and exit
-v, --version Print the ISLa version number
Commands:
{solve,fuzz,check,find,parse,repair,mutate,create,config}
solve create solutions to ISLa constraints or check their
unsatisfiability
fuzz pass solutions to an ISLa constraint to a test subject
check check whether an input satisfies an ISLa constraint
find filter files satisfying syntactic & semantic
constraints
parse parse an input into a derivation tree if it satisfies
an ISLa constraint
repair try to repair an existing input such that it satisfies
an ISLa constraint
mutate mutate an input such that the result satisfies an ISLa
constraint
create create grammar and constraint stubs
config dumps the default configuration file
Semantic Input Properties¶
In this book, we have frequently used grammars to systematically generate inputs that cover input structure and more. But while it is relatively easy to express the syntax of an input using a grammar, there are input properties that cannot be expressed using a grammar. Such input properties are called semantic properties.
Let us illustrate semantic properties using a simple example. We want to test some system that is configured by two settings, a page size and a buffer size. Both these come as integer numbers as part of a human-readable configuration file. The syntax of this file is given by the following grammar:
from Grammars import Grammar, is_valid_grammar, syntax_diagram, crange
import string
CONFIG_GRAMMAR: Grammar = {
"<start>": ["<config>"],
"<config>": [
"pagesize=<pagesize>\n"
"bufsize=<bufsize>"
],
"<pagesize>": ["<int>"],
"<bufsize>": ["<int>"],
"<int>": ["<leaddigit><digits>"],
"<digits>": ["", "<digit><digits>"],
"<digit>": list("0123456789"),
"<leaddigit>": list("123456789")
}
assert is_valid_grammar(CONFIG_GRAMMAR)
Here's a visualization of this grammar as a railroad diagram, showing its structure:
Using this grammar, we can now use any of our grammar-based fuzzers to generate valid inputs. For instance:
from GrammarFuzzer import GrammarFuzzer, DerivationTree
fuzzer = GrammarFuzzer(CONFIG_GRAMMAR)
for i in range(10):
print(i)
print(fuzzer.fuzz())
So far, so good - and indeed, these random values will help us test our (hypothetical) system. But what if we want to control these values further, putting our system to the test?
A grammar gives us some control. If we want to ensure a page size of at least 100,000, for instance, a rule like
"<bufsize>": ["<leaddigit><digit><digit><digit><digit><digit>"]
would do the job. We could also express that the page size should be an odd number, by having it end in an odd digit. But if we want to state that the page size should be, say, a multiple of 8, or larger or less than the buffer size, we are out of luck.
In the chapter on fuzzing with generators, we have seen how to attach program code to individual rules - program code that would either generate individual elements right away or filter only these that satisfy specific conditions.
Attaching code makes things very flexible, but also has several disadvantages:
- First, it is pretty hard to generate inputs that satisfy multiple constraints at once. In essence, you have to code your own strategy for generating inputs, which at some point negates the advantage of having an abstract representation such as a grammar.
- Second, your code is not portable. While a grammar can be easily adapted to any grammar-based fuzzer, adding, say, Python code, ties you to the Python environment forever.
- Third, program code can only be used for producing inputs or checking inputs, but not both. This, again, is a downside compared to a pure grammar representation.
Hence, we are looking to a more general way to express semantic properties - and also a more declarative way to express semantic properties.
Unrestricted Grammars
One very general solution to this problem would be to use unrestricted grammars rather than the context-free grammars we have used so far. In an unrestricted grammar, one can have multiple symbols also on the left-hand side of an expansion rule, making them very flexible. In fact, unrestricted grammars are Turing-universal, meaning that they can express any feature that could also be expressed in program code; and they could thus check and produce arbitrary strings with arbitrary features. (If they finish, that is – unrestricted grammars also suffer from the halting problem.) The downside is that there is literally no programming support for unrestricted grammars – we'd have to implement all arithmetic, strings, and other functionality from scratch in a grammar, which is - well - not fun.
Specifying Constraints¶
In recent work, Dominic Steinhöfel and Andreas Zeller (one of the authors of this book) have presented an infrastructure that allows producing inputs with arbitrary properties, but without having to go through the trouble of implementing producers or checkers. Instead, they suggest a dedicated language for specifying inputs, named ISLa (for input specification language). ISLa combines a standard context-free grammar with constraints that express semantic properties of the inputs and their elements. ISLa can be used as a fuzzer (producing inputs that satisfy the constraints) as well as a checker (checking inputs whether they satisfy the given constraints).
Let us illustrate ISLa by example. ISLa comes as a Python package named isla-solver
that can be easily installed using pip
:
$ pip install isla-solver
This also installs all dependent packages.
The core of ISLa is the ISLa Solver – the component that actually solves constraints to produce satisfying inputs.
import isla
from isla.solver import ISLaSolver
The constructor of an ISLaSolver
takes two mandatory arguments.
- The grammar is the grammar the solver should produce inputs from.
- The constraint is the constraint the produced inputs should satisfy.
To express a constraint, we have a variety of functions and predicates at our disposition.
These can be applied to individual elements of the grammar, notably their nonterminals.
The function str.len()
, for instance, returns the length of a string.
If we want to have inputs in which the page size has at least 6 digits, we can write:
solver = ISLaSolver(CONFIG_GRAMMAR, 'str.len(<pagesize>) >= 6')
The method solve()
returns the next produced string from the ISLa solver, as a derivation tree (seen in the Chapter on fuzzing with grammars).
To convert these into a string, we can use the str()
converter:
str(solver.solve())
The print()
method converts its arguments to strings implicitly.
To get, say, the next 10 solutions, we can thus write
for _ in range(10):
print(solver.solve())
... and we see that, indeed, each page size has exactly six digits.
Derivation Trees
If you inspect a derivation tree as returned from solve()
directly, you will get quite a structure:
solution = solver.solve()
solution
We can easily visualize the tree, revealing its structure:
display_tree(solution)
By converting the derivation tree into a string, we get the represented string:
str(solution)
print()
does this implicitly, so print
ing the solution gives us the string:
print(solution)
Unless you want to inspect the derivation tree or access its elements, converting it into a string makes it more manageable.
To express a minimum numeric value, we can use a more elegant way.
The function str.to.int()
, for instance, converts a string into an integer.
To obtain a page size that of at least 100000, we can thus also write
solver = ISLaSolver(CONFIG_GRAMMAR,
'str.to.int(<pagesize>) >= 100000')
print(solver.solve())
If we want the page size to be in the range of 100 to 200, we can state this as a logical conjunction (using and
)
solver = ISLaSolver(CONFIG_GRAMMAR,
'''
str.to.int(<pagesize>) >= 100 and
str.to.int(<pagesize>) <= 200
''')
print(solver.solve())
And if we want the page size to be a multiple of seven, we can write
solver = ISLaSolver(CONFIG_GRAMMAR,
'''
str.to.int(<pagesize>) mod 7 = 0
''')
print(solver.solve())
from bookutils import quiz
Indeed, ISLa constraints can also involve multiple elements. Expressing equality between two elements is easy, and uses a single equal sign. (There's no assignment in ISLa the =
could be confused with.)
solver = ISLaSolver(CONFIG_GRAMMAR,
'''
<pagesize> = <bufsize>
''')
print(solver.solve())
We can also use numerical constraints, stating that the buffer size should always be exactly one more than the page size:
solver = ISLaSolver(CONFIG_GRAMMAR,
'''
str.to.int(<pagesize>) > 1024 and
str.to.int(<bufsize>) = str.to.int(<pagesize>) + 1
''')
print(solver.solve())
All the above functions (like str.to.int()
) actually stem from the SMT-LIB library for satisfiability modulo theories (SMT), a standard for expressing constraints for constraint solvers (like ISLa).
The list of all theories defined in SMT-LIB lists dozens of functions and predicates that can be used in ISLa constraints.
Using SMT-LIB Syntax
Instead of the above "infix" syntax which is familiar to programmers, ISLa also supports full SMT-LIB syntax. Instead of writing $f(x_1, x_2, x_3, \dots)$ for a function $f$ and its arguments $x_1 \dots x_n$, SMT-LIB uses a "prefix" LISP-like syntax in which all functions and operators are written as $(f x_1 x_2 x_3 \dots)$. The above predicate would thus be written as
solver = ISLaSolver(CONFIG_GRAMMAR,
'''
(and
(> (str.to.int <pagesize>) 1024)
(= (str.to.int <bufsize>)
(+ (str.to.int <pagesize>) 1)))
''')
print(solver.solve())
Note that for boolean operators such as and
, we still use the ISLa infix syntax; having ISLa handle these operators is more efficient than passing them on to the constraint solver.
ISLa on the Command Line¶
When you install isla-solver
, you also get an isla
command-line tool.
This allows you to create inputs from the command line or shell scripts.
Let us first create a grammar file suitable for isla
. isla
accepts grammars in Fuzzingbook format; they need to define a variable named grammar
.
with open('grammar.py', 'w') as f:
f.write('grammar = ')
f.write(str(CONFIG_GRAMMAR))
from bookutils import print_file
print_file('grammar.py')
With this, we can use isla
as a grammar fuzzer, plain and simple
By default, isla solve
produces one single matching output:
!isla solve grammar.py
The true power of isla
, however, comes to be as we (again) add constraints to be solved - either in separate constraint files or (easier) directly on the command line:
!isla solve grammar.py --constraint '<pagesize> = <bufsize>'
!isla solve grammar.py \
--constraint '<pagesize> = <bufsize> and str.to.int(<pagesize>) > 10000'
The isla
executable provides several options and commands, and is a great alternative on the command line.
!isla --help
Accessing Elements¶
So far, we have accessed nonterminals simply by referencing their name, as in <bufsize>
or <pagesize>
.
However, in some cases, this simple method is not sufficient.
In our configuration grammar, for instance, we may want to access (or constrain) <int>
elements.
However, we do not want to constrain all integers at once, but only those in a particular context – say, those that occur as a part of a <pagesize>
element, or only those that occur as part of a <config>
element.
To this end, ISLa allows referencing parts of a given element, using two special operators.
The expression <a>.<b>
refers to the immediate subpart <b>
of some element <a>
.
That is, <b>
has to appear in one of the expansion rules of <a>
.
For instance, <pagesize>.<int>
refers to the <int>
element of a page size.
Here is an example using the dot operator:
solver = ISLaSolver(CONFIG_GRAMMAR,
'''
<pagesize>.<int> = <bufsize>.<int>
''')
print(solver.solve())
The expression <a>..<b>
, however, refers to any subpart <b>
of some element <a>
. That is, <b>
can appear in the expansion of <a>
, but also in the expansion of any subelement (and any subelement thereof).
Here is an example using the double dot operator, enforcing every digit in a <config>
element to be 7
:
from ExpectError import ExpectError
solver = ISLaSolver(CONFIG_GRAMMAR,
'''
<config>..<digit> = "7" and <config>..<leaddigit> = "7"
''')
print(solver.solve())
To reason about dots and double dots, it helps to visualize the string in question as a derivation tree discussed in the chapter on grammar-based fuzzing. The derivation tree of the input
pagesize=12
bufsize=34
for instance, looks like this:
In this tree, the .
syntax refers to immediate children. <bufsize>.<int>
is the one <int>
node that is the immediate descendant of <bufsize>
(but not any other <int>
node).
In contrast, <config>..<digit>
refers to all <digit>
descendants of the <config>
node.
If an element has multiple immediate children of the same type, one can use the special <a>[$n$]
syntax to access the $n$-th child of type <a>
. To access the first child, $n$ is equal to one, not zero, as in the XPath abbreviated syntax. In our configuration grammar, there is no expansion including the same nonterminal more than once, so we do not need this feature.
For a demonstration of indexed dots, consider the following grammar, which produces lines of three "a" or "b" characters:
LINES_OF_THREE_AS_OR_BS_GRAMMAR: Grammar = {
'<start>': ['<A>'],
'<A>': ['<B><B><B>', '<B><B><B>\n<A>'],
'<B>': ['a', 'b']
}
fuzzer = GrammarFuzzer(LINES_OF_THREE_AS_OR_BS_GRAMMAR)
for i in range(5):
print(i)
print(fuzzer.fuzz())
We can force, say, the second character in a line to always be a "b:"
solver = ISLaSolver(LINES_OF_THREE_AS_OR_BS_GRAMMAR,
'''
<A>.<B>[2] = "b"
''')
for i in range(5):
print(i)
print(solver.solve())
Quantifiers¶
By default, all nonterminals in ISLa constraints are universally quantified - that is, any constraint applying to, say, some <int>
element applies to all <int>
elements in the resulting string.
If you only want to constrain one element, though, you have to (and can) specify this in ISLa, using an existential quantifier.
To use an existential quantifier in ISLa, use the construct
exists TYPE VARIABLE in CONTEXT:
(CONSTRAINT)
where VARIABLE
is some identifier, TYPE
is its type (as a nonterminal), and CONTEXT
is the context (again a nonterminal) in which the constraint should hold.
CONSTRAINT
is again a constraint expression, in which you now can make use of VARIABLE
as the element whose existence you assume.
Let us illustrate existential quantification again using a simple example. We want to make sure that at least one integer in our generated string has a value of more than 1000. So we write
solver = ISLaSolver(CONFIG_GRAMMAR,
'''
exists <int> i in <start>:
str.to.int(i) > 1000
''')
for _ in range(10):
print(solver.solve())
We note that all generated inputs satisfy the constraint of having at least one integer that satisfies the constraint.
Specifying a variable name is optional; if you omit it, you can use the quantified nonterminal instead. The above constraint can thus also be expressed in a more compact fashion:
solver = ISLaSolver(CONFIG_GRAMMAR,
'''
exists <int> in <start>:
str.to.int(<int>) > 1000
''')
print(solver.solve())
Besides existential quantification, there also is universal quantification, using the forall
keyword instead of exists
.
If we want all elements in some context to satisfy a constraint, this comes in handy.
solver = ISLaSolver(CONFIG_GRAMMAR,
'''
forall <int> in <start>:
str.to.int(<int>) > 1000
''')
for _ in range(10):
print(solver.solve())
We see that all <int>
elements satisfy the constraint.
By default, all nonterminals that are re-used directly in constraints are universally quantified within the <start>
symbol, so the above can actually be simplified to
solver = ISLaSolver(CONFIG_GRAMMAR,
'''
str.to.int(<int>) > 1000
''')
for _ in range(10):
print(solver.solve())
... and you realize that in all our initial constraints, we always had an implicit universal quantification.
Picking Expansions¶
Sometimes, we'd like a quantifier to apply only for a specific expansion alternative of a nonterminal. The form
forall TYPE VARIABLE=PATTERN in CONTEXT:
(CONSTRAINT)
means that the CONSTRAINT only applies to a VARIABLE that actually matches the expansion given in PATTERN.
(Again, we can replace forall
with exists
, and make this an existential quantification rather than a universal quantification.)
Here's an example of using forall
:
solver = ISLaSolver(CONFIG_GRAMMAR,
'''
forall <int>="<leaddigit><digits>" in <start>:
(<leaddigit> = "7")
''')
This ensures that when <int>
is expanded to a lead digit followed by more digits, the lead digit becomes 7
.
The effect is that all <int>
values now start with a 7
digit:
str(solver.solve())
Likewise, we can constrain <int>
as a whole, and thus ensure that all numbers are greater than 100:
solver = ISLaSolver(CONFIG_GRAMMAR,
'''
forall <int> in <start>:
(str.to.int(<int>) > 100)
''')
str(solver.solve())
By default, all variables are universally quantified in <start>
, so the above can also be expressed as
solver = ISLaSolver(CONFIG_GRAMMAR,
'''
str.to.int(<int>) > 100
''')
str(solver.solve())
Matching Expansion Elements¶
In a quantification pattern, we can also name individual nonterminal elements and use them in our constraints.
This is done by replacing the nonterminal <ID>
with the special form {<ID> VARIABLE}
(in curly braces) which then makes the variable VARIABLE
a placeholder for the value matched by ID
; VARIABLE
can then be used in constraints.
Here is an example. In the expansion <leaddigit><int>
, we want to ensure that the <leaddigit>
is always 9
.
Using the special brace form, we make lead
a variable holding the value of <leaddigit>
, and can then use it in a constraint:
solver = ISLaSolver(CONFIG_GRAMMAR,
'''
forall <int> i="{<leaddigit> lead}<digits>" in <start>:
(lead = "9")
''')
This (again) ensures that all lead digits should be 9
:
for _ in range(10):
print(solver.solve())
Could we express the above in a simpler fashion? Yes! For one, we can refer to <leaddigit>
directly rather than introducing variables like i
and lead
:
solver = ISLaSolver(CONFIG_GRAMMAR,
'''
forall <int>="<leaddigit><digits>" in <start>:
(<leaddigit> = "9")
''')
print(solver.solve())
Furthermore, using implicit universal quantification and the dot notation introduced earlier, we could write, for instance
solver = ISLaSolver(CONFIG_GRAMMAR,
'''
<int>.<leaddigit> = "9"
''')
print(solver.solve())
or just
solver = ISLaSolver(CONFIG_GRAMMAR,
'''
<leaddigit> = "9"
''')
print(solver.solve())
and obtain the same result (not necessarily the exact same values, though, due to randomness):
But while universal quantification and dot notation are sufficient for many cases, the pattern matching notation is more general and more flexible – even if it may be harder to read.
Checking Strings¶
Using an ISLaSolver
, we can also check if an string satisfies the constraints.
This can be applied to inputs, but also to outputs; ISLa constraints can thus server as oracles – that is, predicates that check a test result.
Let us check if in a given string, <pagesize>
and <bufsize>
are the same.
constraint = '<pagesize> = <bufsize>'
solver = ISLaSolver(CONFIG_GRAMMAR, constraint)
To check the tree, we can pass it into the evaluate()
method of the solver
– and find that the given input does not satisfy our constraint.
solver.check('pagesize=12\nbufsize=34')
If we repeat the above, however, with an input that satisfies the constraint, we obtain a True
result.
solver.check('pagesize=27\nbufsize=27')
Checking constraints is much more efficient than solving them, as ISLa does not have to search for possible solutions.
Case Studies¶
Let us further illustrate ISLa using a few case studies.
Matching Identifiers in XML¶
The Extensible Markup Language (XML) is a typical example of an input language that cannot be fully expressed using a context-free grammar. The problem is not so much expressing the syntax of XML – the basics are fairly easy:
XML_GRAMMAR: Grammar = {
"<start>": ["<xml-tree>"],
"<xml-tree>": ["<open-tag><xml-content><close-tag>"],
"<open-tag>": ["<<id>>"],
"<close-tag>": ["</<id>>"],
"<xml-content>": ["Text", "<xml-tree>"],
"<id>": ["<letter>", "<id><letter>"],
"<letter>": crange('a', 'z')
}
assert is_valid_grammar(XML_GRAMMAR)
The problem becomes evident when we produce inputs from the grammar: The <id>
elements in <open-tag>
and <close-tag>
do not match.
fuzzer = GrammarFuzzer(XML_GRAMMAR)
fuzzer.fuzz()
If we want the tag IDs to match, we need to come up with a finite set of tags (as in, say, HTML); then we can extend the grammar with one rule for each tag - <body>...</body>
, <p>...</p>
, <strong>...</strong>
, and so on.
For an infinite set of tags, though, as in our grammar, expressing that the two tag IDs must match is not possible in a context-free grammar.
With ISLa, however, constraining the grammar is easy.
All we need is the rule that constrains the <xml-tree>
:
solver = ISLaSolver(XML_GRAMMAR,
'''
<xml-tree>.<open-tag>.<id> = <xml-tree>.<close-tag>.<id>
''', max_number_smt_instantiations=1)
for _ in range(3):
print(solver.solve())
and we see that the <id>
tags now indeed match each other.
Solver Configuration Parameters
The configuration parameter max_number_smt_instantiations
we passed to the ISLaSolver
object above limits the number of calls to ISLa's underlying SMT solver. Generally, higher numbers lead to more inputs generated per time. Many of those will look structurally similar, though. If we aim for structurally diverse inputs and do not care about, e.g., the names of tags, it can make sense to choose a lower value for this parameter. This is what happens with max_number_smt_instantiations=10
, which is the current default:
solver = ISLaSolver(XML_GRAMMAR,
'''
<xml-tree>.<open-tag>.<id> = <xml-tree>.<close-tag>.<id>
''', max_number_smt_instantiations=10)
for _ in range(3):
print(solver.solve())
The parameter max_number_free_instantiations
serves a similar purpose: ISla randomly instantiates nonterminal symbols whose values are not restricted by a constraint. It chooses—surprise!—at most max_number_free_instantiations
such random instantiations.
Other configuration parameters of interest are structural_predicates
and semantic_predicates
, which let you extend the ISLa language by passing custom structural and semantic predicates to the solver. You can use all the predicates in these sets inside the ISLa constraint to solve. Per default, the semantic predicate count(in_tree, NEEDLE, NUM)
and the following structural predicates are available:
after(node_1, node_2)
before(node_1, node_2)
consecutive(node_1, node_2)
count(in_tree, NEEDLE, NUM)
different_position(node_1, node_2)
direct_child(node_1, node_2)
inside(node_1, node_2)
level(PRED, NONTERMINAL, node_1, node_2)
nth(N, node_1, node_2)
same_position(node_1, node_2)
In contrast to the "input generator" solution in the chapter on generators, our constraint-based solution is purely declarative - and can also be used to parse and check inputs. Plus, of course, we can easily add more constraints:
solver = ISLaSolver(XML_GRAMMAR,
'''
<xml-tree>.<open-tag>.<id> = <xml-tree>.<close-tag>.<id>
and
str.len(<id>) > 10
''', max_number_smt_instantiations=1)
for _ in range(3):
print(solver.solve())
Definitions and Usages in Programming Languages¶
When testing compilers with generated program code, one often encounters the problem that before using an identifier, one has to declare it first - specifying its type, some initial value, and more.
This problem is easily illustrated in the following grammar, which produces sequences of assignments. Variable names consist of a single lowercase letter; values can only be digits; assignments are separated by semicolons.
LANG_GRAMMAR: Grammar = {
"<start>":
["<stmt>"],
"<stmt>":
["<assgn>", "<assgn>; <stmt>"],
"<assgn>":
["<lhs> := <rhs>"],
"<lhs>":
["<var>"],
"<rhs>":
["<var>", "<digit>"],
"<var>": list(string.ascii_lowercase),
"<digit>": list(string.digits)
}
assert is_valid_grammar(LANG_GRAMMAR)
syntax_diagram(LANG_GRAMMAR)
Here are some assignment sequences produced by the grammar:
fuzzer = GrammarFuzzer(LANG_GRAMMAR)
for _ in range(10):
print(fuzzer.fuzz())
We see that the assignment syntax is similar to what we have in common programming languages. The semantics, however, are, well, questionable, as we commonly access variables whose values have not been previously defined. Again, this is a semantic property that cannot be expressed in a context-free grammar alone.
What we need here is a constraint specifying that on the right-hand side of an assignment, we can only have variable names that occur on the left-hand side. In ISLa, we achieve this through the following constraint:
solver = ISLaSolver(LANG_GRAMMAR,
'''
forall <rhs> in <assgn>:
exists <assgn> declaration:
<rhs>.<var> = declaration.<lhs>.<var>
''',
max_number_smt_instantiations=1,
max_number_free_instantiations=1)
for _ in range(10):
print(solver.solve())
This is much better already, but not perfect yet - we might still have assignments like a := a
or a := b; b := 5
.
That is because our constraints do not yet take care of ordering – in a <rhs>
element, we can only use variables that are defined earlier.
For this purpose, ISLa provides a before()
predicate: before(A, B)
expresses that the element A
must occur before the element B
.
With before()
, we can rewrite our constraint as
solver = ISLaSolver(LANG_GRAMMAR,
'''
forall <rhs> in <assgn>:
exists <assgn> declaration:
(before(declaration, <assgn>) and
<rhs>.<var> = declaration.<lhs>.<var>)
''',
max_number_free_instantiations=1,
max_number_smt_instantiations=1)
... and thus ensure that on the right-hand-side of assignments, we only use identifiers defined earlier.
for _ in range(10):
print(solver.solve())
In case you find that the assignment sequences are too short, you can use the ISLa count()
predicate.
count(VARIABLE, NONTERMINAL, N)
ensures that the number of NONTERMINALs in VARIABLE is exactly N.
To have statements with exactly 5 assignments, write
solver = ISLaSolver(LANG_GRAMMAR,
'''
forall <rhs> in <assgn>:
exists <assgn> declaration:
(before(declaration, <assgn>) and
<rhs>.<var> = declaration.<lhs>.<var>)
and
count(start, "<assgn>", "5")
''',
max_number_smt_instantiations=1,
max_number_free_instantiations=1)
for _ in range(10):
print(solver.solve())
Lessons Learned¶
- Using ISLa, we can add and solve constraints to grammars, allowing to express semantic properties of our test inputs
- Declaring constraints (and have a solver solve them) is much more versatile than adding generator code, and language-independent, too
- Using ISLa is fun :-)
Next Steps¶
In the next chapters, we will continue to focus on semantics. Among others, we will learn how to
- mine grammars from existing inputs
- use symbolic fuzzing - that is, using constraint solvers to reach particular locations
- use concolic fuzzing - that is, combining symbolic fuzzing with concrete runs for higher efficiency
Background¶
- ISLa is presented in the paper "Input Invariants" at ESEC/FSE 2022.
- The ISLa project contains the full source code and a complete reference.
Exercises¶
Exercise 1: String Encodings¶
A common way of representing strings is length-prefixed strings, a representation made popular by the PASCAL programming language.
A length-prefixed string starts with a few bytes that encode the length $L$ of the string, followed by the $L$ actual characters.
For instance, assuming that two bytes are used to encode the length, the string "Hello"
could be represented as the sequence
0x00 0x05 'H' 'e' 'l' 'l' 'o'
Part 1: Syntax¶
Write a grammar that defines the syntax of length-prefixed strings.
Part 2: Semantics¶
Use ISLa to produce valid length-prefixed strings. Make use of the SMT-LIB string library to find appropriate conversion functions.
The content of this project is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. The source code that is part of the content, as well as the source code used to format and display that content is licensed under the MIT License. Last change: 2024-11-09 17:07:29+01:00 • Cite • Imprint