In the chapter on "Mutation-Based Fuzzing", we have seen how to use extra hints – such as sample input files – to speed up test generation. In this chapter, we take this idea one step further, by providing a specification of the legal inputs to a program. Specifying inputs via a grammar allows for very systematic and efficient test generation, in particular for complex input formats. Grammars also serve as the base for configuration fuzzing, API fuzzing, GUI fuzzing, and many more.
All possible behaviors of a program can be triggered by its input. "Input" here can be a wide range of possible sources: We are talking about data that is read from files, from the environment, or over the network, data input by the user, or data acquired from interaction with other resources. The set of all these inputs determines how the program will behave – including its failures. When testing, it is thus very helpful to think about possible input sources, how to get them under control, and how to systematically test them.
For the sake of simplicity, we will assume for now that the program has only one source of inputs; this is the same assumption we have been using in the previous chapters, too. The set of valid inputs to a program is called a language. Languages range from the simple to the complex: the CSV language denotes the set of valid comma-separated inputs, whereas the Python language denotes the set of valid Python programs. We commonly separate data languages and programming languages, although any program can also be treated as input data (say, to a compiler). The Wikipedia page on file formats lists more than 1,000 different file formats, each of which is its own language.
To formally describe languages, the field of formal languages has devised a number of language specifications. Regular expressions represent the simplest class of these specifications to denote sets of strings: The regular expression [a-z]*, for instance, denotes a (possibly empty) sequence of lowercase letters. Automata theory connects these languages to automata that accept these inputs; finite state machines, for instance, can be used to specify the languages described by regular expressions.
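As a quick illustration, Python's standard re module can decide membership in the language of [a-z]*. This is only a minimal sketch; the helper name in_lowercase_language() is an assumption made for this example.

```python
import re

# The regular expression from the text: a possibly empty run of lowercase letters
LOWERCASE = re.compile(r"[a-z]*")


def in_lowercase_language(s: str) -> bool:
    """Return True if the whole string `s` belongs to the language of [a-z]*."""
    # fullmatch() succeeds only if the entire string matches the pattern
    return LOWERCASE.fullmatch(s) is not None


print(in_lowercase_language("fuzz"))   # lowercase letters only
print(in_lowercase_language(""))       # the empty string is part of the language
print(in_lowercase_language("Fuzz"))   # contains an uppercase letter
```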
Regular expressions are great for not-too-complex input formats, and the associated finite state machines have many properties that make them great for reasoning. To specify more complex inputs, though, they quickly encounter limitations. At the other end of the language spectrum, we have universal grammars that denote the language accepted by Turing machines. A Turing machine can compute anything that can be computed; and with Python being Turing-complete, this means that we can also use a Python program $p$ to specify or even enumerate legal inputs. But then, computer science theory also tells us that each such testing program has to be written specifically for the program to be tested, which is not the level of automation we want.
The middle ground between regular expressions and Turing machines is covered by grammars. Grammars are among the most popular (and best understood) formalisms to formally specify input languages. Using a grammar, one can express a wide range of the properties of an input language. Grammars are particularly great for expressing the syntactical structure of an input, and are the formalism of choice to express nested or recursive inputs. The grammars we use are so-called context-free grammars, one of the easiest and most popular grammar formalisms.
A grammar consists of a start symbol and a set of expansion rules (or simply rules) which indicate how the start symbol (and other symbols) can be expanded. As an example, consider the following grammar, denoting a sequence of two digits:
<start> ::= <digit><digit>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
To read such a grammar, start with the start symbol (<start>). An expansion rule <A> ::= <B> means that the symbol on the left side (<A>) can be replaced by the string on the right side (<B>). In the above grammar, <start> would be replaced by <digit><digit>.
In this string again, <digit> would be replaced by the string on the right side of the <digit> rule. The special operator | denotes expansion alternatives (or simply alternatives), meaning that any of the digits can be chosen for an expansion. Each <digit> thus would be expanded into one of the given digits, eventually yielding a string between 00 and 99. There are no further expansions for 0 to 9, so we are all set.
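Since this grammar is finite, we can enumerate its entire language. The following sketch picks one alternative for each of the two <digit> symbols in every possible way; the variable names are chosen for illustration only.

```python
from itertools import product

# The alternatives of the <digit> rule
DIGITS = "0123456789"

# <start> ::= <digit><digit>: choose one alternative for each <digit>
two_digit_strings = ["".join(pair) for pair in product(DIGITS, repeat=2)]

print(len(two_digit_strings))
print(two_digit_strings[0], two_digit_strings[-1])
```

The language has exactly 10 × 10 = 100 members, ranging from 00 to 99.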
The interesting thing about grammars is that they can be recursive. That is, expansions can make use of symbols expanded earlier – which would then be expanded again. As an example, consider a grammar that describes integers:
<start> ::= <integer>
<integer> ::= <digit> | <digit><integer>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Here, an <integer> is either a single digit, or a digit followed by another integer. The number 1234 thus would be represented as a single digit 1, followed by the integer 234, which in turn is a digit 2, followed by the integer 34.
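To make the recursion concrete, here is a small sketch that lists the intermediate strings obtained when deriving a number from <integer>, always expanding the leftmost symbol first. The function name integer_derivation() is hypothetical and not part of the chapter's code.

```python
from typing import List


def integer_derivation(number: str) -> List[str]:
    """List the strings obtained while deriving `number` from <integer>."""
    assert number and all(c in "0123456789" for c in number)

    steps = ["<integer>"]
    prefix = ""
    for i, digit in enumerate(number):
        is_last = i == len(number) - 1
        # Choose an alternative of <integer>: the single <digit> for the
        # last digit, the recursive <digit><integer> otherwise
        steps.append(prefix + ("<digit>" if is_last else "<digit><integer>"))
        # Expand <digit> into the concrete digit
        prefix += digit
        steps.append(prefix + ("" if is_last else "<integer>"))
    return steps


for step in integer_derivation("1234"):
    print(step)
```

Each digit but the last uses the recursive alternative; the last digit uses the non-recursive one, which ends the derivation.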
If we wanted to express that an integer can be preceded by a sign (+ or -), we would write the grammar as
<start> ::= <number>
<number> ::= <integer> | +<integer> | -<integer>
<integer> ::= <digit> | <digit><integer>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
These rules formally define the language: Anything that can be derived from the start symbol is part of the language; anything that cannot is not.
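Membership in this language can also be checked mechanically. The following sketch mirrors the <number> and <integer> rules one to one as recursive functions; the function names are assumptions made for this example, and the approach works here only because the grammar is so simple.

```python
DIGITS = set("0123456789")


def is_integer(s: str) -> bool:
    """<integer> ::= <digit> | <digit><integer>"""
    if len(s) == 0 or s[0] not in DIGITS:
        return False
    # Either a single digit, or a digit followed by another integer
    return len(s) == 1 or is_integer(s[1:])


def is_number(s: str) -> bool:
    """<number> ::= <integer> | +<integer> | -<integer>"""
    if s[:1] in ("+", "-"):
        return is_integer(s[1:])
    return is_integer(s)


for candidate in ["1234", "-7", "+80", "12a", "-", ""]:
    print(repr(candidate), is_number(candidate))
```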
quiz("Which of these strings cannot be produced "
"from the above `<start>` symbol?",
[
"`007`",
"`-42`",
"`++1`",
"`3.14`"
], "[27 ** (1/3), 256 ** (1/4)]")
Let us expand our grammar to cover full arithmetic expressions – a poster child example for a grammar. We see that an expression (<expr>) is either a sum, or a difference, or a term; a term is either a product or a division, or a factor; and a factor is either a number or a parenthesized expression. Almost all rules can have recursion, and thus allow arbitrarily complex expressions such as (1 + 2) * (3.4 / 5.6 - 789).
<start> ::= <expr>
<expr> ::= <term> + <expr> | <term> - <expr> | <term>
<term> ::= <term> * <factor> | <term> / <factor> | <factor>
<factor> ::= +<factor> | -<factor> | (<expr>) | <integer> | <integer>.<integer>
<integer> ::= <digit><integer> | <digit>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
In such a grammar, if we start with <start> and then expand one symbol after another, randomly choosing alternatives, we can quickly produce one valid arithmetic expression after another. Such grammar fuzzing is highly effective when it comes to producing complex inputs, and this is what we will implement in this chapter.
quiz("Which of these strings cannot be produced "
"from the above `<start>` symbol?",
[
"`1 + 1`",
"`1+1`",
"`+1`",
"`+(1)`",
], "4 ** 0.5")
Our first step in building a grammar fuzzer is to find an appropriate format for grammars. To make the writing of grammars as simple as possible, we use a format that is based on strings and lists. Our grammars in Python take the format of a mapping between symbol names and expansions, where expansions are lists of alternatives. A one-rule grammar for digits thus takes the form
DIGIT_GRAMMAR = {
    "<start>":
        ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
}
We can capture the grammar structure in a Grammar type, in which each symbol (a string) is mapped to a list of expansions (strings):
Grammar = Dict[str, List[Expansion]]
With this Grammar type, the full grammar for arithmetic expressions looks like this:
EXPR_GRAMMAR: Grammar = {
    "<start>":
        ["<expr>"],

    "<expr>":
        ["<term> + <expr>", "<term> - <expr>", "<term>"],

    "<term>":
        ["<factor> * <term>", "<factor> / <term>", "<factor>"],

    "<factor>":
        ["+<factor>",
         "-<factor>",
         "(<expr>)",
         "<integer>.<integer>",
         "<integer>"],

    "<integer>":
        ["<digit><integer>", "<digit>"],

    "<digit>":
        ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
}
In the grammar, every symbol can be defined exactly once. We can access any rule by its symbol...
EXPR_GRAMMAR["<digit>"]
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
...and we can check whether a symbol is in the grammar:
"<identifier>" in EXPR_GRAMMAR
False
Note that we assume that on the left-hand side of a rule (i.e., the key in the mapping) is always a single symbol. This is the property that gives our grammars the characterization of context-free.
We assume that the canonical start symbol is <start>:
START_SYMBOL = "<start>"
The handy nonterminals() function extracts the list of nonterminal symbols (i.e., anything between < and >, except spaces) from an expansion.
import re

RE_NONTERMINAL = re.compile(r'(<[^<> ]*>)')

def nonterminals(expansion):
    # In later chapters, we allow expansions to be tuples,
    # with the expansion being the first element
    if isinstance(expansion, tuple):
        expansion = expansion[0]

    return RE_NONTERMINAL.findall(expansion)

assert nonterminals("<term> * <factor>") == ["<term>", "<factor>"]
assert nonterminals("<digit><integer>") == ["<digit>", "<integer>"]
assert nonterminals("1 < 3 > 2") == []
assert nonterminals("1 <3> 2") == ["<3>"]
assert nonterminals("1 + 2") == []
assert nonterminals(("<1>", {'option': 'value'})) == ["<1>"]
Likewise, is_nonterminal() checks whether some symbol is a nonterminal:
def is_nonterminal(s):
    return RE_NONTERMINAL.match(s)
assert is_nonterminal("<abc>")
assert is_nonterminal("<symbol-1>")
assert not is_nonterminal("+")
Let us now put the above grammars to use. We will build a very simple grammar fuzzer that starts with a start symbol (<start>) and then keeps on expanding it. To avoid expansion to infinite inputs, we place a limit (max_nonterminals) on the number of nonterminals. Furthermore, to avoid being stuck in a situation where we cannot reduce the number of symbols any further, we also limit the total number of expansion steps.
import random

class ExpansionError(Exception):
    pass

def simple_grammar_fuzzer(grammar: Grammar,
                          start_symbol: str = START_SYMBOL,
                          max_nonterminals: int = 10,
                          max_expansion_trials: int = 100,
                          log: bool = False) -> str:
    """Produce a string from `grammar`.
       `start_symbol`: use a start symbol other than `<start>` (default).
       `max_nonterminals`: the maximum number of nonterminals
         still left for expansion
       `max_expansion_trials`: maximum # of attempts to produce a string
       `log`: print expansion progress if True"""

    term = start_symbol
    expansion_trials = 0

    while len(nonterminals(term)) > 0:
        symbol_to_expand = random.choice(nonterminals(term))
        expansions = grammar[symbol_to_expand]
        expansion = random.choice(expansions)
        # In later chapters, we allow expansions to be tuples,
        # with the expansion being the first element
        if isinstance(expansion, tuple):
            expansion = expansion[0]

        # Replace only the first occurrence of the chosen symbol
        new_term = term.replace(symbol_to_expand, expansion, 1)

        if len(nonterminals(new_term)) < max_nonterminals:
            term = new_term
            if log:
                print("%-40s" % (symbol_to_expand + " -> " + expansion), term)
            expansion_trials = 0
        else:
            # Expansion would exceed the limit; try another one
            expansion_trials += 1
            if expansion_trials >= max_expansion_trials:
                raise ExpansionError("Cannot expand " + repr(term))

    return term
Let us see how this simple grammar fuzzer obtains an arithmetic expression from the start symbol:
simple_grammar_fuzzer(grammar=EXPR_GRAMMAR, max_nonterminals=3, log=True)
<start> -> <expr>                        <expr>
<expr> -> <term> + <expr>                <term> + <expr>
<term> -> <factor>                       <factor> + <expr>
<factor> -> <integer>                    <integer> + <expr>
<integer> -> <digit>                     <digit> + <expr>
<digit> -> 6                             6 + <expr>
<expr> -> <term> - <expr>                6 + <term> - <expr>
<expr> -> <term>                         6 + <term> - <term>
<term> -> <factor>                       6 + <factor> - <term>
<factor> -> -<factor>                    6 + -<factor> - <term>
<term> -> <factor>                       6 + -<factor> - <factor>
<factor> -> (<expr>)                     6 + -(<expr>) - <factor>
<factor> -> (<expr>)                     6 + -(<expr>) - (<expr>)
<expr> -> <term>                         6 + -(<term>) - (<expr>)
<expr> -> <term>                         6 + -(<term>) - (<term>)
<term> -> <factor>                       6 + -(<factor>) - (<term>)
<factor> -> +<factor>                    6 + -(+<factor>) - (<term>)
<factor> -> +<factor>                    6 + -(++<factor>) - (<term>)
<term> -> <factor>                       6 + -(++<factor>) - (<factor>)
<factor> -> (<expr>)                     6 + -(++(<expr>)) - (<factor>)
<factor> -> <integer>                    6 + -(++(<expr>)) - (<integer>)
<expr> -> <term>                         6 + -(++(<term>)) - (<integer>)
<integer> -> <digit>                     6 + -(++(<term>)) - (<digit>)
<digit> -> 9                             6 + -(++(<term>)) - (9)
<term> -> <factor> * <term>              6 + -(++(<factor> * <term>)) - (9)
<term> -> <factor>                       6 + -(++(<factor> * <factor>)) - (9)
<factor> -> <integer>                    6 + -(++(<integer> * <factor>)) - (9)
<integer> -> <digit>                     6 + -(++(<digit> * <factor>)) - (9)
<digit> -> 2                             6 + -(++(2 * <factor>)) - (9)
<factor> -> +<factor>                    6 + -(++(2 * +<factor>)) - (9)
<factor> -> -<factor>                    6 + -(++(2 * +-<factor>)) - (9)
<factor> -> -<factor>                    6 + -(++(2 * +--<factor>)) - (9)
<factor> -> -<factor>                    6 + -(++(2 * +---<factor>)) - (9)
<factor> -> -<factor>                    6 + -(++(2 * +----<factor>)) - (9)
<factor> -> <integer>.<integer>          6 + -(++(2 * +----<integer>.<integer>)) - (9)
<integer> -> <digit>                     6 + -(++(2 * +----<digit>.<integer>)) - (9)
<integer> -> <digit>                     6 + -(++(2 * +----<digit>.<digit>)) - (9)
<digit> -> 1                             6 + -(++(2 * +----1.<digit>)) - (9)
<digit> -> 7                             6 + -(++(2 * +----1.7)) - (9)
'6 + -(++(2 * +----1.7)) - (9)'
By increasing the limit of nonterminals, we can quickly get much longer productions:
for i in range(10):
    print(simple_grammar_fuzzer(grammar=EXPR_GRAMMAR, max_nonterminals=5))
7 / +48.5 -5.9 / 9 - 4 * +-(-+++((1 + (+7 - (-1 * (++-+7.7 - -+-4.0))))) * +--4 - -(6) + 64) 8.2 - 27 - -9 / +((+9 * --2 + --+-+-((-1 * +(8 - 5 - 6)) * (-((-+(((+(4))))) - ++4) / +(-+---((5.6 - --(3 * -1.8 * +(6 * +-(((-(-6) * ---+6)) / +--(+-+-7 * (-0 * (+(((((2)) + 8 - 3 - ++9.0 + ---(--+7 / (1 / +++6.37) + (1) / 482) / +++-+0)))) * -+5 + 7.513)))) - (+1 / ++((-84)))))))) * ++5 / +-(--2 - -++-9.0)))) / 5 * --++090 1 - -3 * 7 - 28 / 9 (+9) * +-5 * ++-926.2 - (+9.03 / -+(-(-6) / 2 * +(-+--(8) / -(+1.0) - 5 + 4)) * 3.5) 8 + -(9.6 - 3 - -+-4 * +77) -(((((++((((+((++++-((+-37))))))))))))) / ++(-(+++(+6)) * -++-(+(++(---6 * (((7)) * (1) / (-7.6 * 535338) + +256) * 0) * 0))) - 4 + +1 5.43 (9 / -405 / -23 - +-((+-(2 * (13))))) + +6 - +8 - 934 -++2 - (--+715769550) / 8 / (1)
Note that while our fuzzer does the job in most cases, it has a number of drawbacks.
quiz("What drawbacks does `simple_grammar_fuzzer()` have?",
[
"It has a large number of string search and replace operations",
"It may fail to produce a string (`ExpansionError`)",
"It often picks some symbol to expand "
"that does not even occur in the string",
"All of the above"
], "1 << 2")
Indeed, simple_grammar_fuzzer() is rather inefficient due to the large number of search and replace operations, and it may even fail to produce a string. On the other hand, the implementation is straightforward and does the job in most cases. For this chapter, we'll stick to it; in the next chapter, we'll show how to build a more efficient one.
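To see how such a failure can come about, consider a grammar whose only expansion of the start symbol already contains as many nonterminals as the limit allows. The sketch below restates the expansion loop in condensed form so that it runs on its own; compact_fuzzer(), try_fuzz(), and STUCK_GRAMMAR are hypothetical names introduced for this illustration.

```python
import random
import re

RE_NONTERMINAL = re.compile(r'(<[^<> ]*>)')


class ExpansionError(Exception):
    pass


def compact_fuzzer(grammar, start="<start>", max_nonterminals=10, max_trials=100):
    """Condensed restatement of the expansion loop of simple_grammar_fuzzer()."""
    term = start
    trials = 0
    while RE_NONTERMINAL.search(term):
        symbol = random.choice(RE_NONTERMINAL.findall(term))
        expansion = random.choice(grammar[symbol])
        new_term = term.replace(symbol, expansion, 1)
        if len(RE_NONTERMINAL.findall(new_term)) < max_nonterminals:
            term, trials = new_term, 0
        else:
            # The expansion is rejected; count the failed attempt
            trials += 1
            if trials >= max_trials:
                raise ExpansionError("Cannot expand " + repr(term))
    return term


# Hypothetical grammar: the only expansion of <start> has three nonterminals
STUCK_GRAMMAR = {
    "<start>": ["<a><a><a>"],
    "<a>": ["x"],
}


def try_fuzz(max_nonterminals):
    """Return the produced string, or None if the fuzzer gives up."""
    try:
        return compact_fuzzer(STUCK_GRAMMAR, max_nonterminals=max_nonterminals)
    except ExpansionError:
        return None


print(try_fuzz(max_nonterminals=3))   # every expansion is rejected
print(try_fuzz(max_nonterminals=10))  # the limit leaves enough room
```

With a limit of 3, the only available expansion always yields three nonterminals, which is not below the limit, so the fuzzer gives up after the maximum number of trials.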
With grammars, we can easily specify the format for several of the examples we discussed earlier. The above arithmetic expressions, for instance, can be directly sent into bc (or any other program that takes arithmetic expressions). Before we introduce a few additional grammars, let us give a means to visualize them, giving an alternate view to aid their understanding.
Railroad diagrams, also called syntax diagrams, are a graphical representation of context-free grammars. They are read left to right, following possible "rail" tracks; the sequence of symbols encountered on the track defines the language. To produce railroad diagrams, we implement a function syntax_diagram().
Let us use syntax_diagram() to produce a railroad diagram of our expression grammar:
syntax_diagram(EXPR_GRAMMAR)
(Railroad diagrams for the symbols start, expr, term, factor, integer, and digit.)
This railroad representation will come in handy when it comes to visualizing the structure of grammars – especially for more complex grammars.
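Where a graphical rendering is not available, a plain-text view in the ::= notation used at the beginning of this chapter can serve as a simple alternative. The following is only a sketch and not the syntax_diagram() function; grammar_to_bnf() and SIGNED_DIGIT_GRAMMAR are hypothetical names, and the function assumes that all expansions are plain strings.

```python
from typing import Dict, List


def grammar_to_bnf(grammar: Dict[str, List[str]]) -> str:
    """Render a grammar as text, one `<symbol> ::= a | b` line per rule."""
    lines = []
    for symbol, expansions in grammar.items():
        lines.append(symbol + " ::= " + " | ".join(expansions))
    return "\n".join(lines)


# A tiny grammar, made up for this example
SIGNED_DIGIT_GRAMMAR = {
    "<start>": ["<sign><digit>"],
    "<sign>": ["+", "-"],
    "<digit>": ["0", "1"],
}

print(grammar_to_bnf(SIGNED_DIGIT_GRAMMAR))
```

Note that an empty expansion ("") would show up as an empty alternative in this rendering, which is easy to overlook; a diagram makes such cases more visible.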
Let us create (and visualize) some more grammars and use them for fuzzing.
Here's a grammar for cgi_decode() introduced in the chapter on coverage.
CGI_GRAMMAR: Grammar = {
"<start>":
["<string>"],
"<string>":
["<letter>", "<letter><string>"],
"<letter>":
["<plus>", "<percent>", "<other>"],
"<plus>":
["+"],
"<percent>":
["%<hexdigit><hexdigit>"],
"<hexdigit>":
["0", "1", "2", "3", "4", "5", "6", "7",
"8", "9", "a", "b", "c", "d", "e", "f"],
"<other>": # Actually, could be _all_ letters
["0", "1", "2", "3", "4", "5", "a", "b", "c", "d", "e", "-", "_"],
}
syntax_diagram(CGI_GRAMMAR)
(Railroad diagrams for the symbols start, string, letter, plus, percent, hexdigit, and other.)
In contrast to basic fuzzing or mutation-based fuzzing, the grammar quickly produces all sorts of combinations:
for i in range(10):
    print(simple_grammar_fuzzer(grammar=CGI_GRAMMAR, max_nonterminals=10))
+%9a
+++%ce+
+_
+%c6c
++
+%cd+5
1%ee
%b9%d5
%96
%57d%42
The same properties we have seen for CGI input also hold for more complex inputs. Let us use a grammar to produce numerous valid URLs:
URL_GRAMMAR: Grammar = {
"<start>":
["<url>"],
"<url>":
["<scheme>://<authority><path><query>"],
"<scheme>":
["http", "https", "ftp", "ftps"],
"<authority>":
["<host>", "<host>:<port>", "<userinfo>@<host>", "<userinfo>@<host>:<port>"],
"<host>": # Just a few
["cispa.saarland", "www.google.com", "fuzzingbook.com"],
"<port>":
["80", "8080", "<nat>"],
"<nat>":
["<digit>", "<digit><digit>"],
"<digit>":
["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"],
"<userinfo>": # Just one
["user:password"],
"<path>": # Just a few
["", "/", "/<id>"],
"<id>": # Just a few
["abc", "def", "x<digit><digit>"],
"<query>":
["", "?<params>"],
"<params>":
["<param>", "<param>&<params>"],
"<param>": # Just a few
["<id>=<id>", "<id>=<nat>"],
}
syntax_diagram(URL_GRAMMAR)
(Railroad diagrams for the symbols start, url, scheme, authority, host, port, nat, digit, userinfo, path, id, query, params, and param.)
Again, within milliseconds, we can produce plenty of valid inputs.
for i in range(10):
    print(simple_grammar_fuzzer(grammar=URL_GRAMMAR, max_nonterminals=10))
https://user:password@cispa.saarland:80/
http://fuzzingbook.com?def=56&x89=3&x46=48&def=def
ftp://cispa.saarland/?x71=5&x35=90&def=abc
https://cispa.saarland:80/def?def=7&x23=abc
https://fuzzingbook.com:80/
https://fuzzingbook.com:80/abc?def=abc&abc=x14&def=abc&abc=2&def=38
ftps://fuzzingbook.com/x87
https://user:password@fuzzingbook.com:6?def=54&x44=abc
http://fuzzingbook.com:80?x33=25&def=8
http://fuzzingbook.com:8080/def
Finally, grammars are not limited to formal languages such as computer inputs, but can also be used to produce natural language. This is the grammar we used to pick a title for this book:
TITLE_GRAMMAR: Grammar = {
"<start>": ["<title>"],
"<title>": ["<topic>: <subtopic>"],
"<topic>": ["Generating Software Tests", "<fuzzing-prefix>Fuzzing", "The Fuzzing Book"],
"<fuzzing-prefix>": ["", "The Art of ", "The Joy of "],
"<subtopic>": ["<subtopic-main>",
"<subtopic-prefix><subtopic-main>",
"<subtopic-main><subtopic-suffix>"],
"<subtopic-main>": ["Breaking Software",
"Generating Software Tests",
"Principles, Techniques and Tools"],
"<subtopic-prefix>": ["", "Tools and Techniques for "],
"<subtopic-suffix>": [" for <reader-property> and <reader-property>",
" for <software-property> and <software-property>"],
"<reader-property>": ["Fun", "Profit"],
"<software-property>": ["Robustness", "Reliability", "Security"],
}
syntax_diagram(TITLE_GRAMMAR)
(Railroad diagrams for the symbols start, title, topic, fuzzing-prefix, subtopic, subtopic-main, subtopic-prefix, subtopic-suffix, reader-property, and software-property.)
titles: Set[str] = set()
while len(titles) < 10:
    titles.add(simple_grammar_fuzzer(
        grammar=TITLE_GRAMMAR, max_nonterminals=10))
titles
{'Fuzzing: Generating Software Tests', 'Fuzzing: Principles, Techniques and Tools', 'Generating Software Tests: Breaking Software', 'Generating Software Tests: Breaking Software for Robustness and Robustness', 'Generating Software Tests: Principles, Techniques and Tools', 'Generating Software Tests: Principles, Techniques and Tools for Profit and Fun', 'Generating Software Tests: Tools and Techniques for Principles, Techniques and Tools', 'The Fuzzing Book: Breaking Software', 'The Fuzzing Book: Generating Software Tests for Profit and Profit', 'The Fuzzing Book: Generating Software Tests for Robustness and Robustness'}
(If you find that there is redundancy ("Robustness and Robustness") in here: In our chapter on coverage-based fuzzing, we will show how to cover each expansion only once. And if you like some alternatives more than others, probabilistic grammar fuzzing will be there for you.)
One very useful property of grammars is that they produce mostly valid inputs. From a syntactical standpoint, the inputs are actually always valid, as they satisfy the constraints of the given grammar. (Of course, one needs a valid grammar in the first place.) However, there are also semantic properties that cannot be easily expressed in a grammar. If, say, for a URL, the port range is supposed to be between 1024 and 2048, this is hard to write in a grammar. If one has to satisfy more complex constraints, one quickly reaches the limits of what a grammar can express.
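To make the port example concrete: such a property is awkward to encode in a grammar, but easy to check in code once a string has been produced. The following sketch filters syntactically valid ports by the range mentioned above; the function name and the list of candidates are made up for this illustration.

```python
def satisfies_port_constraint(port: str, low: int = 1024, high: int = 2048) -> bool:
    """Semantic check: the numeric value of `port` must lie in [low, high]."""
    return low <= int(port) <= high


# Hypothetical candidates; all are syntactically valid sequences of digits
candidate_ports = ["80", "8080", "1024", "2000", "42"]

valid_ports = [p for p in candidate_ports if satisfies_port_constraint(p)]
print(valid_ports)
```

Such generate-and-filter checks become wasteful when only few generated inputs satisfy the property, which is one reason to look for better integrated solutions.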
One way around this is to attach constraints to grammars, as we will discuss later in this book. Another possibility is to put together the strengths of grammar-based fuzzing and mutation-based fuzzing. The idea is to use the grammar-generated inputs as seeds for further mutation-based fuzzing. This way, we can explore not only valid inputs, but also check out the boundaries between valid and invalid inputs. This is particularly interesting as slightly invalid inputs allow finding parser errors (which are often abundant). As with fuzzing in general, it is the unexpected which reveals errors in programs.
To use our generated inputs as seeds, we can feed them directly into the mutation fuzzers introduced earlier:
number_of_seeds = 10
seeds = [
simple_grammar_fuzzer(
grammar=URL_GRAMMAR,
max_nonterminals=10) for i in range(number_of_seeds)]
seeds
['ftps://user:password@www.google.com:80', 'http://cispa.saarland/', 'ftp://www.google.com:42/', 'ftps://user:password@fuzzingbook.com:39?abc=abc', 'https://www.google.com?x33=1&x06=1', 'http://www.google.com:02/', 'https://user:password@www.google.com/', 'ftp://cispa.saarland:8080/?abc=abc&def=def&abc=5', 'http://www.google.com:80/def?def=abc', 'http://user:password@cispa.saarland/']
m = MutationFuzzer(seeds)
[m.fuzz() for i in range(20)]
['ftps://user:password@www.google.com:80', 'http://cispa.saarland/', 'ftp://www.google.com:42/', 'ftps://user:password@fuzzingbook.com:39?abc=abc', 'https://www.google.com?x33=1&x06=1', 'http://www.google.com:02/', 'https://user:password@www.google.com/', 'ftp://cispa.saarland:8080/?abc=abc&def=def&abc=5', 'http://www.google.com:80/def?def=abc', 'http://user:password@cispa.saarland/', 'Eh4tp:www.coogle.com:80/def?d%f=abc', 'ftps://}ser:passwod@fuzzingbook.com:9?abc=abc', 'uftp//cispa.sRaarland:808&0?abc=abc&def=defabc=5', 'http://user:paswor9d@cispar.saarland/v', 'ftp://Www.g\x7fogle.cAom:42/', 'hht://userC:qassMword@cispy.csaarland/', 'httx://ww.googlecom:80defde`f=ac', 'htt://cispq.waarlnd/', 'htFtp\t://cmspa./saarna(md/', 'ft:/www.google.com:42\x0f']
While the first 10 fuzz() calls return the seeded inputs (as designed), the later ones again create arbitrary mutations. Using MutationCoverageFuzzer instead of MutationFuzzer, we could again have our search guided by coverage – and thus bring together the best of multiple worlds.
Let us now introduce a few techniques that help us write grammars.
With < and > delimiting nonterminals in our grammars, how can we actually express that some input should contain < and >? The answer is simple: Just introduce a symbol for them.
simple_nonterminal_grammar: Grammar = {
"<start>": ["<nonterminal>"],
"<nonterminal>": ["<left-angle><identifier><right-angle>"],
"<left-angle>": ["<"],
"<right-angle>": [">"],
"<identifier>": ["id"] # for now
}
In simple_nonterminal_grammar, neither the expansion for <left-angle> nor the expansion for <right-angle> can be mistaken for a nonterminal. Hence, we can produce as many as we want.
(Note that this does not work with simple_grammar_fuzzer(), but rather with the GrammarFuzzer class we'll introduce in the next chapter.)
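A likely reason for this limitation is that a purely string-based fuzzer searches the entire intermediate string for nonterminals, including the parts that have already been expanded. The following sketch illustrates this with the nonterminal pattern from above; the grammar is repeated here as a plain dictionary so that the example runs on its own.

```python
import re

RE_NONTERMINAL = re.compile(r'(<[^<> ]*>)')

# Same rules as simple_nonterminal_grammar, repeated for self-containment
grammar = {
    "<start>": ["<nonterminal>"],
    "<nonterminal>": ["<left-angle><identifier><right-angle>"],
    "<left-angle>": ["<"],
    "<right-angle>": [">"],
    "<identifier>": ["id"],
}

# The fully expanded string that the grammar describes
expanded = "<left-angle><identifier><right-angle>"
for symbol in ["<left-angle>", "<identifier>", "<right-angle>"]:
    expanded = expanded.replace(symbol, grammar[symbol][0], 1)
print(expanded)

# Searching the string again finds what looks like another nonterminal
symbols = RE_NONTERMINAL.findall(expanded)
undefined = [s for s in symbols if s not in grammar]
print(undefined)
```

The result <id> consists of terminals only, yet it matches the nonterminal pattern; a fuzzer that re-scans its output would then try to look up a rule for <id>, which does not exist.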
In the course of this book, we frequently run into the issue of creating a grammar by extending an existing grammar with new features. Such an extension is very much like subclassing in object-oriented programming.
To create a new grammar $g'$ from an existing grammar $g$, we first copy $g$ into $g'$, and then go and extend existing rules with new alternatives and/or add new symbols. Here's an example, extending the above nonterminal grammar with a better rule for identifiers:
nonterminal_grammar = copy.deepcopy(simple_nonterminal_grammar)
nonterminal_grammar["<identifier>"] = ["<idchar>", "<identifier><idchar>"]
nonterminal_grammar["<idchar>"] = ['a', 'b', 'c', 'd'] # for now
nonterminal_grammar
{'<start>': ['<nonterminal>'], '<nonterminal>': ['<left-angle><identifier><right-angle>'], '<left-angle>': ['<'], '<right-angle>': ['>'], '<identifier>': ['<idchar>', '<identifier><idchar>'], '<idchar>': ['a', 'b', 'c', 'd']}
Since such an extension of grammars is a common operation, we introduce a custom function extend_grammar() which first copies the given grammar and then updates it from a dictionary, using the Python dictionary update() method:
def extend_grammar(grammar: Grammar, extension: Grammar = {}) -> Grammar:
    """Create a copy of `grammar`, updated with `extension`."""
    new_grammar = copy.deepcopy(grammar)
    new_grammar.update(extension)
    return new_grammar
This call to extend_grammar() extends simple_nonterminal_grammar to nonterminal_grammar just like the "manual" example above:
nonterminal_grammar = extend_grammar(simple_nonterminal_grammar,
{
"<identifier>": ["<idchar>", "<identifier><idchar>"],
# for now
"<idchar>": ['a', 'b', 'c', 'd']
}
)
In the above nonterminal_grammar, we have enumerated only the first few letters; indeed, enumerating all letters or digits in a grammar manually, as in <idchar> ::= 'a' | 'b' | 'c' ... is a bit painful.
However, remember that grammars are part of a program, and can thus also be constructed programmatically. We introduce a function srange() which constructs a list of characters in a string:
import string
def srange(characters: str) -> List[Expansion]:
    """Construct a list with all characters in the string"""
    return [c for c in characters]
If we pass it the constant string.ascii_letters, which holds all ASCII letters, srange() returns a list of all ASCII letters:
string.ascii_letters
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
srange(string.ascii_letters)[:10]
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
We can use such constants in our grammar to quickly define identifiers:
nonterminal_grammar = extend_grammar(nonterminal_grammar,
{
"<idchar>": (srange(string.ascii_letters) +
srange(string.digits) +
srange("-_"))
}
)
[simple_grammar_fuzzer(nonterminal_grammar, "<identifier>") for i in range(10)]
['b', 'd', 'V9', 'x4c', 'YdiEWj', 'c', 'xd', '7', 'vIU', 'QhKD']
The shortcut crange(start, end) returns a list of all characters in the ASCII range of start to (including) end:
def crange(character_start: str, character_end: str) -> List[Expansion]:
    return [chr(i)
            for i in range(ord(character_start), ord(character_end) + 1)]
We can use this to express ranges of characters:
crange('0', '9')
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
assert crange('a', 'z') == srange(string.ascii_lowercase)
In the above nonterminal_grammar, as in other grammars, we have to express repetitions of characters using recursion, that is, by referring to the original definition:
nonterminal_grammar["<identifier>"]
['<idchar>', '<identifier><idchar>']
It could be a bit easier if we simply could state that a nonterminal should be a non-empty sequence of letters – for instance, as in
<identifier> = <idchar>+
where + denotes a non-empty repetition of the symbol it follows.
Operators such as + are frequently introduced as handy shortcuts in grammars. Formally, our grammars come in the so-called Backus-Naur form, or BNF for short. Operators extend BNF to so-called extended BNF, or EBNF for short:
<symbol>? indicates that <symbol> is optional – that is, it can occur 0 or 1 times.
<symbol>+ indicates that <symbol> can occur 1 or more times repeatedly.
<symbol>* indicates that <symbol> can occur 0 or more times. (In other words, it is an optional repetition.)
To make matters even more interesting, we would like to use parentheses with the above shortcuts. Thus, (<foo><bar>)? indicates that the sequence of <foo> and <bar> is optional.
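Each of the three operators can be expressed in plain BNF by introducing a fresh symbol whose alternatives encode the repetition. The following sketch shows one such mapping for a single symbol; operator_alternatives() is a hypothetical helper, and the fresh symbol names such as <digit-1> are chosen by the caller.

```python
from typing import List


def operator_alternatives(symbol: str, operator: str, new_symbol: str) -> List[str]:
    """Return BNF alternatives for `new_symbol`, which stands for `symbol` + `operator`."""
    if operator == "?":
        # Zero or one occurrence
        return ["", symbol]
    if operator == "+":
        # One occurrence, or one occurrence followed by more
        return [symbol, symbol + new_symbol]
    if operator == "*":
        # Nothing, or one occurrence followed by more
        return ["", symbol + new_symbol]
    raise ValueError("Unknown operator " + repr(operator))


print(operator_alternatives("<digit>", "+", "<digit-1>"))
print(operator_alternatives("<sign>", "?", "<sign-1>"))
print(operator_alternatives("<idchar>", "*", "<idchar-1>"))
```

In the rule that used the operator, the occurrence of, say, <digit>+ is then replaced by the fresh symbol <digit-1>.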
Using such operators, we can define the identifier rule in a simpler way. To this end, let us create a copy of the original grammar and modify the <identifier> rule:
nonterminal_ebnf_grammar = extend_grammar(nonterminal_grammar,
{
"<identifier>": ["<idchar>+"]
}
)
Likewise, we can simplify the expression grammar. Consider how signs are optional, and how integers can be expressed as sequences of digits.
EXPR_EBNF_GRAMMAR: Grammar = {
"<start>":
["<expr>"],
"<expr>":
["<term> + <expr>", "<term> - <expr>", "<term>"],
"<term>":
["<factor> * <term>", "<factor> / <term>", "<factor>"],
"<factor>":
["<sign>?<factor>", "(<expr>)", "<integer>(.<integer>)?"],
"<sign>":
["+", "-"],
"<integer>":
["<digit>+"],
"<digit>":
srange(string.digits)
}
Let us implement a function convert_ebnf_grammar() that takes such an EBNF grammar and automatically translates it into a BNF grammar.
Here's an example of using convert_ebnf_grammar():
convert_ebnf_grammar({"<authority>": ["(<userinfo>@)?<host>(:<port>)?"]})
{'<authority>': ['<symbol-2><host><symbol-1-1>'], '<symbol>': ['<userinfo>@'], '<symbol-1>': [':<port>'], '<symbol-2>': ['', '<symbol>'], '<symbol-1-1>': ['', '<symbol-1>']}
expr_grammar = convert_ebnf_grammar(EXPR_EBNF_GRAMMAR)
expr_grammar
{'<start>': ['<expr>'], '<expr>': ['<term> + <expr>', '<term> - <expr>', '<term>'], '<term>': ['<factor> * <term>', '<factor> / <term>', '<factor>'], '<factor>': ['<sign-1><factor>', '(<expr>)', '<integer><symbol-1>'], '<sign>': ['+', '-'], '<integer>': ['<digit-1>'], '<digit>': ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], '<symbol>': ['.<integer>'], '<sign-1>': ['', '<sign>'], '<symbol-1>': ['', '<symbol>'], '<digit-1>': ['<digit>', '<digit><digit-1>']}
Success! We have nicely converted the EBNF grammar into BNF.
With character classes and EBNF grammar conversion, we have two powerful tools that make the writing of grammars easier. We will use these again and again when it comes to working with grammars.
During the course of this book, we frequently want to specify additional information for grammars, such as probabilities or constraints. To support these extensions, as well as possibly others, we define an annotation mechanism.
Our concept for annotating grammars is to add annotations to individual expansions. To this end, we allow an expansion to be not only a string, but also a pair of a string and a set of attributes, as in
"<expr>":
[("<term> + <expr>", opts(min_depth=10)),
("<term> - <expr>", opts(max_depth=2)),
"<term>"]
Here, the opts() function would allow us to express annotations that apply to the individual expansions; in this case, the addition would be annotated with a min_depth value of 10, and the subtraction with a max_depth value of 2. The meaning of these annotations is left to the individual algorithms dealing with the grammars; the general idea, though, is that they can be ignored.
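One way to realize such annotations is to let opts() simply collect its keyword arguments in a dictionary. The following is a minimal sketch under that assumption; the accessors expansion_string() and expansion_options() are hypothetical names introduced for this example.

```python
from typing import Any, Dict, Tuple, Union

Option = Dict[str, Any]
Expansion = Union[str, Tuple[str, Option]]


def opts(**kwargs: Any) -> Option:
    """Assumed implementation: return the keyword arguments as a dictionary."""
    return kwargs


def expansion_string(expansion: Expansion) -> str:
    """Return the string part of an expansion, ignoring any annotations."""
    if isinstance(expansion, tuple):
        return expansion[0]
    return expansion


def expansion_options(expansion: Expansion) -> Option:
    """Return the annotations of an expansion (empty if there are none)."""
    if isinstance(expansion, tuple):
        return expansion[1]
    return {}


annotated = ("<term> + <expr>", opts(min_depth=10))
print(expansion_string(annotated), expansion_options(annotated))
print(expansion_string("<term>"), expansion_options("<term>"))
```

An algorithm that does not care about annotations only calls expansion_string() and thus treats annotated and plain expansions alike.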
Since grammars are represented as strings, it is fairly easy to introduce errors. So let us introduce a helper function that checks a grammar for consistency.
The helper function is_valid_grammar() iterates over a grammar to check whether all used symbols are defined, and vice versa, which is very useful for debugging; it also checks whether all symbols are reachable from the start symbol. You don't have to delve into details here, but as always, it is important to get the input data straight before we make use of it.
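To give an impression of what such checks involve, here is a simplified sketch that covers the three properties just mentioned. It is not the is_valid_grammar() implementation; the name check_grammar() is hypothetical, and the sketch assumes that all expansions are plain strings.

```python
import re
from typing import Dict, List, Set

RE_NONTERMINAL = re.compile(r'(<[^<> ]*>)')


def check_grammar(grammar: Dict[str, List[str]], start: str = "<start>") -> List[str]:
    """Return a list of problems found in `grammar` (empty if none)."""
    problems = []
    defined = set(grammar)

    # Collect all nonterminals that occur in some expansion
    used: Set[str] = set()
    for expansions in grammar.values():
        for expansion in expansions:
            used.update(RE_NONTERMINAL.findall(expansion))

    for symbol in sorted(used - defined):
        problems.append(symbol + ": used, but not defined")
    for symbol in sorted(defined - used - {start}):
        problems.append(symbol + ": defined, but not used")

    # Follow expansions from the start symbol to find reachable symbols
    reachable: Set[str] = set()
    todo = [start]
    while todo:
        symbol = todo.pop()
        if symbol in reachable or symbol not in grammar:
            continue
        reachable.add(symbol)
        for expansion in grammar[symbol]:
            todo.extend(RE_NONTERMINAL.findall(expansion))

    for symbol in sorted(defined - reachable):
        problems.append(symbol + ": unreachable from " + start)
    return problems


print(check_grammar({"<start>": ["<d><d>"], "<d>": ["0", "1"]}))
print(check_grammar({"<start>": ["<x>"], "<y>": ["1"]}))
```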
Let us make use of is_valid_grammar(). Our grammars defined above pass the test:
assert is_valid_grammar(EXPR_GRAMMAR)
assert is_valid_grammar(CGI_GRAMMAR)
assert is_valid_grammar(URL_GRAMMAR)
The check can also be applied to EBNF grammars:
assert is_valid_grammar(EXPR_EBNF_GRAMMAR)
These do not pass the test, though:
assert not is_valid_grammar({"<start>": ["<x>"], "<y>": ["1"]}) # type: ignore
'<y>': defined, but not used. Consider applying trim_grammar() on the grammar
'<x>': used, but not defined
'<y>': unreachable from <start>. Consider applying trim_grammar() on the grammar
assert not is_valid_grammar({"<start>": "123"}) # type: ignore
'<start>': expansion is not a list
assert not is_valid_grammar({"<start>": []}) # type: ignore
'<start>': expansion list empty
assert not is_valid_grammar({"<start>": [1, 2, 3]}) # type: ignore
'<start>': 1: not a string
(The # type: ignore annotations keep static checkers from flagging the above as errors.)
From here on, we will always use is_valid_grammar() when defining a grammar.
This chapter introduces grammars as a simple means to specify input languages, and to use them for testing programs with syntactically valid inputs. A grammar is defined as a mapping of nonterminal symbols to lists of alternative expansions, as in the following example:
US_PHONE_GRAMMAR: Grammar = {
"<start>": ["<phone-number>"],
"<phone-number>": ["(<area>)<exchange>-<line>"],
"<area>": ["<lead-digit><digit><digit>"],
"<exchange>": ["<lead-digit><digit><digit>"],
"<line>": ["<digit><digit><digit><digit>"],
"<lead-digit>": ["2", "3", "4", "5", "6", "7", "8", "9"],
"<digit>": ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
}
assert is_valid_grammar(US_PHONE_GRAMMAR)
Nonterminal symbols are enclosed in angle brackets (say, <digit>). To generate an input string from a grammar, a producer starts with the start symbol (<start>) and randomly chooses an expansion for this symbol. It continues the process until all nonterminal symbols are expanded. The function simple_grammar_fuzzer() does just that:
[simple_grammar_fuzzer(US_PHONE_GRAMMAR) for i in range(5)]
['(692)449-5179', '(519)230-7422', '(613)761-0853', '(979)881-3858', '(810)914-5475']
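The expansion process described above can be sketched in a few lines. This is an illustration under simplifying assumptions, not the actual simple_grammar_fuzzer(): the function name produce(), the max_steps bound, and the nonterminal pattern are all assumptions made for this example, and annotated expansions are not handled.

```python
import random
import re

# Assumed pattern: a nonterminal is anything in angle brackets without spaces
RE_NONTERMINAL = re.compile(r"<[^<> ]+>")

US_PHONE_GRAMMAR = {
    "<start>": ["<phone-number>"],
    "<phone-number>": ["(<area>)<exchange>-<line>"],
    "<area>": ["<lead-digit><digit><digit>"],
    "<exchange>": ["<lead-digit><digit><digit>"],
    "<line>": ["<digit><digit><digit><digit>"],
    "<lead-digit>": ["2", "3", "4", "5", "6", "7", "8", "9"],
    "<digit>": ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"],
}


def produce(grammar, start_symbol="<start>", max_steps=1000, rng=random):
    """Expand randomly chosen nonterminals until only terminals remain."""
    term = start_symbol
    for _ in range(max_steps):
        nonterminals = RE_NONTERMINAL.findall(term)
        if not nonterminals:
            return term  # fully expanded
        symbol = rng.choice(nonterminals)
        expansion = rng.choice(grammar[symbol])
        # Replace only the first occurrence, so repeated symbols expand independently
        term = term.replace(symbol, expansion, 1)
    raise RuntimeError("expansion did not terminate within max_steps")


print([produce(US_PHONE_GRAMMAR) for _ in range(3)])
```

The step bound is there because recursive grammars can keep growing; this phone number grammar has no recursion, so every run terminates after a fixed number of expansions.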
In practice, though, instead of simple_grammar_fuzzer(), you should use the GrammarFuzzer class or one of its coverage-based, probabilistic, or generator-based derivatives; these are more efficient, protect against infinite growth, and provide several additional features.
This chapter also introduces a grammar toolbox with several helper functions that ease the writing of grammars, such as using shortcut notations for character classes and repetitions, or extending grammars.
As they make a great foundation for generating software tests, we use grammars again and again in this book. As a sneak preview, we can use grammars to fuzz configurations:
<options> ::= <option>*
<option> ::= -h | --version | -v | -d | -i | --global-config <filename>
We can use grammars for fuzzing functions and APIs and fuzzing graphical user interfaces:
<call-sequence> ::= <call>*
<call> ::= urlparse(<url>) | urlsplit(<url>)
We can assign probabilities and constraints to individual expansions:
<term>: 50% <factor> * <term> | 30% <factor> / <term> | 20% <factor>
<integer>: <digit>+ { <integer> >= 100 }
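As a rough illustration of the first idea, expansion probabilities could be attached as annotations and used as weights when choosing an expansion. The prob key and the function choose_expansion() are assumptions for this sketch; the techniques introduced later in the book may differ in their details.

```python
import random

# Expansions of <term>, each paired with an assumed "prob" annotation
TERM_EXPANSIONS = [
    ("<factor> * <term>", {"prob": 0.5}),
    ("<factor> / <term>", {"prob": 0.3}),
    ("<factor>", {"prob": 0.2}),
]


def choose_expansion(expansions, rng=random):
    """Pick one expansion string, weighted by its 'prob' annotation."""
    strings = [string for string, _ in expansions]
    weights = [annotations["prob"] for _, annotations in expansions]
    return rng.choices(strings, weights=weights, k=1)[0]


rng = random.Random(0)
picks = [choose_expansion(TERM_EXPANSIONS, rng) for _ in range(2000)]
for string, _ in TERM_EXPANSIONS:
    print(f"{string!r}: chosen {picks.count(string)} times")
```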
All these extras become especially valuable as we can
which we also discuss for all techniques in this book.
To get there, however, we still have a bit of homework to do. In particular, we first have to learn how to
As one of the foundations of human language, grammars have been around as long as human language existed. The first formalization of generative grammars was by Dakṣiputra Pāṇini in 350 BC \cite{Panini350bce}. As a general means to express formal languages for both data and programs, their role in computer science cannot be overstated. The seminal work by Chomsky \cite{Chomsky1956} introduced the central models of regular languages, context-free grammars, context-sensitive grammars, and universal grammars, which have been used (and taught) in computer science ever since as a means to specify input and programming languages.
The use of grammars for producing test inputs goes back to Burkhardt \cite{Burkhardt1967}, to be later rediscovered and applied by Hanford \cite{Hanford1970} and Purdom \cite{Purdom1972}. The most important use of grammar testing since then has been compiler testing. Actually, grammar-based testing is one important reason why compilers and Web browsers work as they should:
The CSmith tool \cite{Yang2011} specifically targets C programs, starting with a C grammar and then applying additional steps, such as referring to variables and functions defined earlier or ensuring integer and type safety. Its authors have used it "to find and report more than 400 previously unknown compiler bugs."
The LangFuzz work \cite{Holler2012}, which shares two authors with this book, uses a generic grammar to produce outputs, and is used day and night to generate JavaScript programs and test their interpreters; as of today, it has found more than 2,600 bugs in browsers such as Mozilla Firefox, Google Chrome, and Microsoft Edge.
The EMI Project \cite{Le2014} uses grammars to stress-test C compilers, transforming known tests into alternative programs that should be semantically equivalent over all inputs. Again, this has led to more than 100 bugs in C compilers being fixed.
Grammarinator \cite{Hodovan2018} is an open-source grammar fuzzer (written in Python!), using the popular ANTLR format as grammar specification. Like LangFuzz, it uses the grammar for both parsing and producing, and has found more than 100 issues in the JerryScript lightweight JavaScript engine and an associated platform.
Domato is a generic grammar generation engine that is specifically used for fuzzing DOM input. It has revealed a number of security issues in popular Web browsers.
Compilers and Web browsers, of course, are not only domains where grammars are needed for testing, but also domains where grammars are well-known. Our claim in this book is that grammars can be used to generate almost any input, and our aim is to empower you to do precisely that.
Take a look at the JSON specification and derive a grammar from it:
Use is_valid_grammar() to ensure the grammar is valid.
Feed the grammar into simple_grammar_fuzzer(). Do you encounter any errors, and why?
The name simple_grammar_fuzzer() is no accident: The way it expands grammars is limited in several ways. What happens if you apply simple_grammar_fuzzer() to nonterminal_grammar and expr_grammar, as defined above, and why?
In a grammar extended with regular expressions, we can use the special form
/regex/
to include regular expressions in expansions. For instance, we can have a rule
<integer> ::= /[+-]?[0-9]+/
to quickly express that an integer is an optional sign, followed by a sequence of digits.
Write a converter convert_regex(r) that takes a regular expression r and creates an equivalent grammar. Support the following regular expression constructs:
*, +, ?, () should work just as in EBNFs, above.
a|b should translate into a list of alternatives [a, b].
. should match any character except newline.
[abc] should translate into srange("abc").
[^abc] should translate into the set of ASCII characters except srange("abc").
[a-b] should translate into crange(a, b).
[^a-b] should translate into the set of ASCII characters except crange(a, b).
Example: convert_regex(r"[0-9]+") should yield a grammar such as
{
"<start>": ["<s1>"],
"<s1>": [ "<s2>", "<s1><s2>" ],
"<s2>": crange('0', '9')
}
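As a starting point for this exercise, the following sketch translates only the contents of a character class, such as 0-9 or ^abc, into a list of alternatives. The function convert_char_class() is hypothetical, and crange() is re-defined locally under the assumption that it returns the list of characters between its two arguments, inclusive; escapes and the other regular expression constructs are not handled.

```python
ASCII_CHARACTERS = [chr(code) for code in range(128)]


def crange(character_start, character_end):
    """Assumed behavior: all characters from start to end, inclusive."""
    return [chr(code)
            for code in range(ord(character_start), ord(character_end) + 1)]


def convert_char_class(body):
    """Translate the text between '[' and ']' into a list of alternatives."""
    negated = body.startswith("^")
    if negated:
        body = body[1:]

    characters = []
    i = 0
    while i < len(body):
        if i + 2 < len(body) and body[i + 1] == "-":
            # A range such as a-z
            characters += crange(body[i], body[i + 2])
            i += 3
        else:
            # A single character (including a leading or trailing '-')
            characters.append(body[i])
            i += 1

    if negated:
        return [c for c in ASCII_CHARACTERS if c not in characters]
    return characters


print(convert_char_class("0-9"))
print(convert_char_class("+-"))
```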
Write a converter convert_regex_grammar(g) that takes an EBNF grammar g containing regular expressions in the form /.../ and creates an equivalent BNF grammar. Support the regular expression constructs as above.
Example: convert_regex_grammar({ "<integer>" : "/[+-]?[0-9]+/" }) should yield a grammar such as
{
"<integer>": ["<s1><s3>"],
"<s1>": [ "", "<s2>" ],
"<s2>": srange("+-"),
"<s3>": [ "<s4>", "<s4><s3>" ],
"<s4>": crange('0', '9')
}
Optional: Support escapes in regular expressions: \c translates to the literal character c; \/ translates to / (and thus does not end the regular expression); \\ translates to \.
To obtain a nicer syntax for specifying grammars, one can make use of Python constructs which then will be parsed by an additional function. For instance, we can imagine a grammar definition which uses | as a means to separate alternatives:
def expression_grammar_fn():
    start = "<expr>"
    expr = "<term> + <expr>" | "<term> - <expr>"
    term = "<factor> * <term>" | "<factor> / <term>" | "<factor>"
    factor = "+<factor>" | "-<factor>" | "(<expr>)" | "<integer>.<integer>" | "<integer>"
    integer = "<digit><integer>" | "<digit>"
    digit = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
If we execute expression_grammar_fn(), this will yield an error. Yet, the purpose of expression_grammar_fn() is not to be executed, but to be used as data from which the grammar will be constructed.
with ExpectError():
    expression_grammar_fn()
Traceback (most recent call last):
  File "/var/folders/n2/xd9445p97rb3xh7m1dfx8_4h0006ts/T/ipykernel_17157/1271268731.py", line 2, in <cell line: 1>
    expression_grammar_fn()
  File "/var/folders/n2/xd9445p97rb3xh7m1dfx8_4h0006ts/T/ipykernel_17157/3029408019.py", line 3, in expression_grammar_fn
    expr = "<term> + <expr>" | "<term> - <expr>"
TypeError: unsupported operand type(s) for |: 'str' and 'str' (expected)
To this end, we make use of the ast (abstract syntax tree) and inspect (code inspection) modules.
First, we obtain the source code of expression_grammar_fn() ...
import ast
import inspect
source = inspect.getsource(expression_grammar_fn)
source
'def expression_grammar_fn():\n start = "<expr>"\n expr = "<term> + <expr>" | "<term> - <expr>"\n term = "<factor> * <term>" | "<factor> / <term>" | "<factor>"\n factor = "+<factor>" | "-<factor>" | "(<expr>)" | "<integer>.<integer>" | "<integer>"\n integer = "<digit><integer>" | "<digit>"\n digit = \'0\' | \'1\' | \'2\' | \'3\' | \'4\' | \'5\' | \'6\' | \'7\' | \'8\' | \'9\'\n'
... which we then parse into an abstract syntax tree:
tree = ast.parse(source)
We can now parse the tree to find operators and alternatives. get_alternatives() iterates over all nodes op of the tree; if the node looks like a binary or (|) operation, we drill deeper and recurse. If not, we have reached a single production, and we try to get the expression from the production. We define the to_expr parameter depending on how we want to represent the production. In this case, we represent a single production by a single string.
def get_alternatives(op, to_expr=lambda o: o.value):
    # o.value holds the string of a constant node (o.s is a deprecated alias)
    if isinstance(op, ast.BinOp) and isinstance(op.op, ast.BitOr):
        return get_alternatives(op.left, to_expr) + [to_expr(op.right)]
    return [to_expr(op)]
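To see why recursing into op.left is sufficient, it helps to look at how Python parses a chain of | operators: the operator is left-associative, so the chain nests to the left. The following sketch inspects that shape and flattens it iteratively; flatten_alternatives() is a hypothetical name, and the sketch assumes that every alternative is a string constant.

```python
import ast

# Parse a single expression; .body is the expression node itself
expr = ast.parse('"a" | "b" | "c"', mode="eval").body

# The outermost node is ("a" | "b") | "c": left is a BinOp, right a constant
print(type(expr.left).__name__, type(expr.right).__name__)


def flatten_alternatives(node):
    """Collect alternatives by following the left spine of the | chain."""
    alternatives = []
    while isinstance(node, ast.BinOp) and isinstance(node.op, ast.BitOr):
        alternatives.append(node.right.value)
        node = node.left
    alternatives.append(node.value)
    # Alternatives were collected right to left; restore source order
    return alternatives[::-1]


print(flatten_alternatives(expr))
```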
funct_parser() takes the abstract syntax tree of a function (say, expression_grammar_fn()) and iterates over all assignments:
def funct_parser(tree, to_expr=lambda o: o.value):
    return {assign.targets[0].id: get_alternatives(assign.value, to_expr)
            for assign in tree.body[0].body}
The result is a grammar in our regular format:
grammar = funct_parser(tree)
for symbol in grammar:
    print(symbol, "::=", grammar[symbol])
start ::= ['<expr>']
expr ::= ['<term> + <expr>', '<term> - <expr>']
term ::= ['<factor> * <term>', '<factor> / <term>', '<factor>']
factor ::= ['+<factor>', '-<factor>', '(<expr>)', '<integer>.<integer>', '<integer>']
integer ::= ['<digit><integer>', '<digit>']
digit ::= ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
Write a single function define_grammar(fn) that takes a grammar defined as function (such as expression_grammar_fn()) and returns a regular grammar.
Solution. This is straightforward:
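A possible solution could look as follows; this is a sketch built on the get_alternatives() approach above, not necessarily the original code. The helper grammar_from_source() is a hypothetical addition that takes the source text directly, which makes the conversion usable without a function object. The sketch reads string constants via .value, on the assumption of a reasonably recent Python version.

```python
import ast
import inspect
import textwrap


def get_alternatives(op, to_expr):
    """Flatten a chain of | operators into a list of alternatives."""
    if isinstance(op, ast.BinOp) and isinstance(op.op, ast.BitOr):
        return get_alternatives(op.left, to_expr) + [to_expr(op.right)]
    return [to_expr(op)]


def grammar_from_source(source, to_expr=lambda o: o.value):
    """Build a grammar from the source text of a grammar function."""
    tree = ast.parse(textwrap.dedent(source))
    function = tree.body[0]
    return {assign.targets[0].id: get_alternatives(assign.value, to_expr)
            for assign in function.body if isinstance(assign, ast.Assign)}


def define_grammar(fn, to_expr=lambda o: o.value):
    """Build a grammar from a function whose body consists of assignments."""
    return grammar_from_source(inspect.getsource(fn), to_expr)


SOURCE = '''
def binary_grammar_fn():
    start = "<integer>"
    integer = "<digit><integer>" | "<digit>"
    digit = "0" | "1"
'''

print(grammar_from_source(SOURCE))
```

As in the output shown earlier, the keys of the resulting grammar are the plain variable names, without angle brackets.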
We note that the grammar representation we designed previously does not allow simple generation of alternatives such as srange() and crange(). Further, one may find the string representation of expressions limiting. It turns out that it is simple to extend our grammar definition to support grammars such as the one below:
def define_name(o):
    return o.id if isinstance(o, ast.Name) else o.value

def define_expr(op):
    if isinstance(op, ast.BinOp) and isinstance(op.op, ast.Add):
        return (*define_expr(op.left), define_name(op.right))
    return (define_name(op),)

def define_ex_grammar(fn):
    return define_grammar(fn, define_expr)
The grammar:
@define_ex_grammar
def expression_grammar():
    start = expr
    expr = (term + '+' + expr
            | term + '-' + expr)
    term = (factor + '*' + term
            | factor + '/' + term
            | factor)
    factor = ('+' + factor
              | '-' + factor
              | '(' + expr + ')'
              | integer + '.' + integer
              | integer)
    integer = (digit + integer
               | digit)
    digit = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
for symbol in expression_grammar:
    print(symbol, "::=", expression_grammar[symbol])
Note. The grammar data structure thus obtained is a little more detailed than the standard data structure. It represents each production as a tuple.
We note that we have not enabled srange() or crange() in the above grammar. How would you go about adding these? (Hint: wrap define_expr() to look for ast.Call)
Introduce an operator * that takes a pair (min, max) where min and max are the minimum and maximum number of repetitions, respectively. A missing value min stands for zero; a missing value max for infinity.
def identifier_grammar_fn():
    identifier = idchar * (1,)
With the * operator, we can generalize the EBNF operators: ? becomes (0,1), * becomes (0,), and + becomes (1,). Write a converter that takes an extended grammar defined using *, parses it, and converts it into BNF.
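As a hint for the conversion step, the following sketch expands a single repetition into BNF rules. The function repetition_rules(), the -tail naming scheme for helper symbols, and the use of None for a missing min are assumptions made for this example; extracting the * operator from the function's syntax tree is left open.

```python
def repetition_rules(symbol, name, bounds):
    """Return BNF rules that define `name` as a bounded repetition of `symbol`.

    `bounds` is a tuple (min,) or (min, max); a missing or None min means
    zero, and a missing max means an unbounded number of repetitions.
    """
    low = bounds[0] if len(bounds) > 0 and bounds[0] is not None else 0
    high = bounds[1] if len(bounds) > 1 else None

    if high is not None:
        # Bounded: enumerate every allowed number of repetitions
        return {name: [symbol * count for count in range(low, high + 1)]}

    # Unbounded: a mandatory prefix followed by a recursive, possibly empty tail
    tail = name[:-1] + "-tail>"
    return {name: [symbol * low + tail], tail: ["", symbol + tail]}


rules = repetition_rules("<idchar>", "<identifier>", (1,))
for nonterminal, expansions in rules.items():
    print(nonterminal, "::=", expansions)
```

Enumerating all counts is simple but grows with max; for large bounds, a chain of helper symbols would keep the grammar smaller.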