History

In April 2012, Google announced ClusterFuzz, a cloud-based fuzzing infrastructure for security-critical components of the Chromium web browser. Researchers can upload their own fuzzers and collect bug bounties if ClusterFuzz finds a crash with the uploaded fuzzer.

In September 2014, Shellshock was disclosed as a family of security bugs in the widely used UNIX shell Bash; most Shellshock vulnerabilities were found using the AFL fuzzer. (Many Internet-facing services, such as some web server deployments, use Bash to process certain requests, allowing an attacker to cause vulnerable versions of Bash to execute arbitrary commands. This can allow an attacker to gain unauthorized access to a computer system.)

In April 2015, Hanno Böck showed how the AFL fuzzer could have found the 2014 Heartbleed vulnerability. (The Heartbleed vulnerability was disclosed in April 2014. It is a serious vulnerability that lets attackers read server memory, including private keys, and thereby decrypt otherwise encrypted communication. The vulnerability was accidentally introduced into OpenSSL, which implements TLS and is used by most servers on the Internet. According to Shodan, 238,000 machines remained vulnerable in April 2016; 200,000 in January 2017.)

In August 2016, the Defense Advanced Research Projects Agency (DARPA) held the finals of the first Cyber Grand Challenge, a fully automated capture-the-flag competition that lasted 11 hours. The goal was to develop automatic defense systems that can detect, exploit, and patch software flaws in real time. Fuzzing was used as an effective offensive strategy to find flaws in opponents' software, and the competition showed considerable potential for automating vulnerability discovery. The winner was the Mayhem system, developed by the ForAllSecure team led by David Brumley.

In September 2016, Microsoft announced Project Springfield, a cloud-based fuzzing service for finding security-critical bugs in software.

In December 2016, Google announced OSS-Fuzz, which allows continuous fuzzing of several security-critical open-source projects.

At Black Hat 2018, Christopher Domas demonstrated using fuzzing to reveal the presence of a hidden RISC core in a processor. This core could bypass existing security checks to execute Ring 0 commands from Ring 3.

In September 2020, Microsoft released OneFuzz, a self-hosted fuzzing-as-a-service platform that automates software bug discovery. It supports Windows and Linux.

Early random testing

Testing programs with random inputs dates back to the 1950s, when data was still stored on punched cards. Programmers used punched cards retrieved from the trash or decks of random numbers as inputs to computer programs. If execution revealed undesirable behavior, a bug was discovered.

Executing random inputs is also called random testing or monkey testing.

In 1981, Duran and Ntafos formally investigated the effectiveness of testing a program with random inputs. Although random testing was widely perceived as the worst means of testing a program, the authors were able to show that it is a cost-effective alternative to more systematic testing methods.

In 1983, Steve Capps at Apple developed "The Monkey", a tool that generated random inputs for classic Mac OS applications such as MacPaint. The figurative "monkey" refers to the infinite monkey theorem, which states that a monkey randomly hitting keys on a typewriter keyboard for an infinite amount of time will eventually type all of Shakespeare's works. In the case of testing, the monkey will write a specific sequence of inputs that causes a crash.

In 1991, the crashme tool was released. It was intended to test the robustness of Unix and Unix-like operating systems by executing system calls at random with randomly chosen parameters.[29]


A fuzzer can be categorized in several ways:

  1. A fuzzer can be generation-based or mutation-based depending on whether inputs are generated from scratch or by modifying existing inputs.
  2. A fuzzer can be dumb or smart depending on whether it is aware of input structure.
  3. A fuzzer can be white-, gray-, or black-box, depending on whether it is aware of program structure.

Reuse of existing input seeds

A mutation-based fuzzer leverages an existing corpus of seed inputs during fuzzing. It generates inputs by modifying (or rather mutating) the provided seeds.[31] For example, when fuzzing the image library libpng, the user would provide a set of valid PNG image files as seeds while a mutation-based fuzzer would modify these seeds to produce semi-valid variants of each seed. The corpus of seed files may contain thousands of potentially similar inputs. Automated seed selection (or test suite reduction) allows users to pick the best seeds in order to maximize the total number of bugs found during a fuzz campaign.[32]
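The mutation step described above can be sketched in a few lines of Python. This is a minimal illustration, not the mutation engine of any particular fuzzer: the function name `mutate`, the list `INTERESTING_BYTES`, and the four operators are assumptions chosen for the example, and the seed is a placeholder that merely begins with the PNG signature rather than a complete image.

```python
import random

# Boundary values often worth trying; this particular list is an assumption.
INTERESTING_BYTES = [0x00, 0x01, 0x7F, 0x80, 0xFF]

def mutate(seed: bytes, rng: random.Random) -> bytes:
    """Return a variant of `seed` with one randomly chosen mutation applied."""
    data = bytearray(seed)
    if not data:
        # Nothing to mutate: fall back to a single random byte.
        return bytes([rng.randrange(256)])
    op = rng.choice(["flip_bit", "replace_byte", "delete_block", "duplicate_block"])
    pos = rng.randrange(len(data))
    end = min(len(data), pos + rng.randint(1, 4))  # block of 1 to 4 bytes
    if op == "flip_bit":
        data[pos] ^= 1 << rng.randrange(8)
    elif op == "replace_byte":
        data[pos] = rng.choice(INTERESTING_BYTES)
    elif op == "delete_block":
        del data[pos:end]
    else:  # duplicate_block: insert a copy of the block in front of itself
        data[pos:pos] = data[pos:end]
    return bytes(data)

rng = random.Random(0)
seed = b"\x89PNG\r\n\x1a\n" + b"IHDR-placeholder"  # not a real PNG file
variants = [mutate(seed, rng) for _ in range(5)]
```

Each variant differs from the seed by at most a few bytes, which is what keeps most of the file structure intact while still perturbing it.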

A generation-based fuzzer generates inputs from scratch. For instance, a smart generation-based fuzzer[33] takes the input model that was provided by the user to generate new inputs. Unlike mutation-based fuzzers, a generation-based fuzzer does not depend on the existence or quality of a corpus of seed inputs.

Some fuzzers can do both: generate inputs from scratch and generate inputs by mutating existing seeds.[34]

Aware of input structure

Typically, fuzzers are used to generate inputs for programs that take structured inputs, such as a file, a sequence of keyboard or mouse events, or a sequence of messages. This structure distinguishes valid input that is accepted and processed by the program from invalid input that is quickly rejected by the program. What constitutes a valid input may be explicitly specified in an input model. Examples of input models are formal grammars, file formats, GUI models, and network protocols. Even items not normally considered as input can be fuzzed, such as the contents of databases, shared memory, environment variables, or the precise interleaving of threads. An effective fuzzer generates semi-valid inputs that are "valid enough" that they are not directly rejected by the parser and "invalid enough" that they might stress corner cases and exercise interesting program behaviors.

A smart (model-based,[34] grammar-based,[33][35] or protocol-based[36]) fuzzer leverages the input model to generate a greater proportion of valid inputs. For instance, if the input can be modelled as an abstract syntax tree, then a smart mutation-based fuzzer[35] would employ random transformations to move complete subtrees from one node to another. If the input can be modelled by a formal grammar, a smart generation-based fuzzer[33] would instantiate the production rules to generate inputs that are valid with respect to the grammar. However, generally the input model must be explicitly provided, which is difficult to do when the model is proprietary, unknown, or very complex. If a large corpus of valid and invalid inputs is available, a grammar induction technique, such as Angluin's L* algorithm, would be able to generate an input model.[37][38]
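As a rough illustration of instantiating production rules, the following sketch generates arithmetic expressions from a small context-free grammar. The grammar, the name `generate`, and the depth limit are assumptions made for the example; real grammar-based fuzzers use far richer models.

```python
import random

# A toy grammar for arithmetic expressions. By convention in this sketch,
# the LAST alternative of each nonterminal is the least recursive one.
GRAMMAR = {
    "<expr>": [["<term>", "+", "<expr>"], ["<term>", "-", "<expr>"], ["<term>"]],
    "<term>": [["<factor>", "*", "<term>"], ["<factor>"]],
    "<factor>": [["(", "<expr>", ")"], ["<digit>"]],
    "<digit>": [[d] for d in "0123456789"],
}

def generate(symbol: str, rng: random.Random, depth: int = 0, max_depth: int = 8) -> str:
    """Expand `symbol` into a string that is valid with respect to GRAMMAR."""
    if symbol not in GRAMMAR:
        return symbol  # terminal symbol: emit as-is
    rules = GRAMMAR[symbol]
    # Past the depth limit, always take the least recursive rule so that
    # the expansion is guaranteed to terminate.
    rule = rules[-1] if depth >= max_depth else rng.choice(rules)
    return "".join(generate(s, rng, depth + 1, max_depth) for s in rule)

rng = random.Random(0)
samples = [generate("<expr>", rng) for _ in range(50)]
```

Every generated string parses as an expression, so the inputs exercise the evaluator of a hypothetical calculator program rather than only its parser's error paths.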

A dumb fuzzer[39][40] does not require the input model and can thus be employed to fuzz a wider variety of programs. For instance, AFL is a dumb mutation-based fuzzer that modifies a seed file by flipping random bits, by substituting random bytes with "interesting" values, and by moving or deleting blocks of data. However, a dumb fuzzer might generate a lower proportion of valid inputs and stress the parser code rather than the main components of a program. The disadvantage of dumb fuzzers can be illustrated by means of the construction of a valid checksum for a cyclic redundancy check (CRC). A CRC is an error-detecting code that ensures that the integrity of the data contained in the input file is preserved during transmission. A checksum is computed over the input data and recorded in the file. When the program processes the received file and the recorded checksum does not match the re-computed checksum, then the file is rejected as invalid. Now, a fuzzer that is unaware of the CRC is unlikely to generate the correct checksum. However, there are attempts to identify and re-compute a potential checksum in the mutated input, once a dumb mutation-based fuzzer has modified the protected data.[41]
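The checksum problem and the fix-up idea can be sketched as follows. The file layout (payload followed by a 4-byte big-endian CRC-32) and the names `pack`, `parse`, and `fix_checksum` are hypothetical; here the position of the checksum is simply assumed to be known, whereas the approaches cited above try to identify it automatically.

```python
import struct
import zlib

def pack(payload: bytes) -> bytes:
    """Hypothetical format: payload followed by its CRC-32 (big-endian)."""
    return payload + struct.pack(">I", zlib.crc32(payload))

def parse(blob: bytes) -> bytes:
    """Reject the input unless the recorded checksum matches the payload."""
    if len(blob) < 4:
        raise ValueError("too short")
    payload, recorded = blob[:-4], struct.unpack(">I", blob[-4:])[0]
    if zlib.crc32(payload) != recorded:
        raise ValueError("checksum mismatch")
    return payload

def fix_checksum(mutated: bytes) -> bytes:
    """Re-compute the trailing checksum over the (mutated) payload."""
    return pack(mutated[:-4])

valid = pack(b"hello")
mutated = bytes([valid[0] ^ 0x01]) + valid[1:]  # flip one bit of the payload

try:
    parse(mutated)
    rejected = False
except ValueError:
    rejected = True  # the unaware mutation is stopped at the checksum test

accepted_payload = parse(fix_checksum(mutated))  # passes after the fix-up
```

Without the fix-up, the mutated input never gets past the integrity check, so the code that consumes the payload is not exercised at all.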

Aware of program structure

Typically, a fuzzer is considered more effective if it achieves a higher degree of code coverage. The rationale is, if a fuzzer does not exercise certain structural elements in the program, then it is also not able to reveal bugs that are hiding in these elements. Some program elements are considered more critical than others. For instance, a division operator might cause a division by zero error, or a system call may crash the program.

A black-box fuzzer[39][35] treats the program as a black box and is unaware of internal program structure. For instance, a random testing tool that generates inputs at random is considered a black-box fuzzer. Because it performs no program analysis, a black-box fuzzer can execute several hundred inputs per second, can be easily parallelized, and can scale to programs of arbitrary size. However, black-box fuzzers may only scratch the surface and expose "shallow" bugs. Hence, there are attempts to develop black-box fuzzers that can incrementally learn about the internal structure (and behavior) of a program during fuzzing by observing the program's output given an input. For instance, LearnLib employs active learning to generate an automaton that represents the behavior of a web application.

A white-box fuzzer[40][34] leverages program analysis to systematically increase code coverage or to reach certain critical program locations. For instance, SAGE[42] leverages symbolic execution to systematically explore different paths in the program. If the program's specification is available, a white-box fuzzer might leverage techniques from model-based testing to generate inputs and check the program outputs against the program specification. A white-box fuzzer can be very effective at exposing bugs that hide deep in the program. However, the time used for analysis (of the program or its specification) can become prohibitive. If the white-box fuzzer takes too long to generate each input, a black-box fuzzer will be more efficient.[43] Hence, there are attempts to combine the efficiency of black-box fuzzers and the effectiveness of white-box fuzzers.[44]

A gray-box fuzzer leverages instrumentation rather than program analysis to glean information about the program. For instance, AFL and libFuzzer use lightweight instrumentation to trace the basic block transitions exercised by an input. This incurs a modest performance overhead but informs the fuzzer about increases in code coverage during fuzzing, which makes gray-box fuzzers extremely efficient vulnerability detection tools.[45]
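The coverage feedback loop can be sketched as below. This is a simplified model and not how AFL or libFuzzer are implemented: the toy `target`, the explicit `cov.add(...)` calls standing in for compiler-inserted instrumentation, and the names `mutate` and `fuzz` are all assumptions for illustration.

```python
import random

def target(data: bytes, cov: set) -> None:
    """Toy program under test; cov.add(...) stands in for instrumentation."""
    cov.add("entry")
    if len(data) > 0 and data[0] == ord("F"):
        cov.add("F")
        if len(data) > 1 and data[1] == ord("U"):
            cov.add("FU")
            if len(data) > 2 and data[2] == ord("Z"):
                cov.add("FUZ")
                if len(data) > 3 and data[3] == ord("Z"):
                    raise RuntimeError("bug reached")

def mutate(data: bytes, rng: random.Random) -> bytes:
    """Overwrite one randomly chosen byte with a random value."""
    buf = bytearray(data)
    buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)

def fuzz(seed: bytes, rng: random.Random, max_execs: int = 200_000):
    """Return (crashing_input, executions), or (None, max_execs) if none found."""
    corpus = [seed]
    seen = set()  # coverage elements observed so far
    for execs in range(1, max_execs + 1):
        candidate = mutate(rng.choice(corpus), rng)
        cov = set()
        try:
            target(candidate, cov)
        except RuntimeError:
            return candidate, execs
        if not cov <= seen:
            # The input reached something new: keep it as a future seed.
            seen |= cov
            corpus.append(candidate)
    return None, max_execs

crash, execs = fuzz(b"AAAA", random.Random(0))
```

Because inputs that reach a new branch are retained, the fuzzer can discover the four-byte prefix one byte at a time. A purely random 4-byte generator would need to guess all four bytes at once, a chance of 1 in 256^4 per input.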


Fuzzing is used mostly as an automated technique to expose vulnerabilities in security-critical programs that might be exploited with malicious intent.[9][19][20] More generally, fuzzing is used to demonstrate the presence of bugs rather than their absence. Running a fuzzing campaign for several weeks without finding a bug does not prove the program correct.[46] After all, the program may still fail for an input that has not been executed yet; executing a program for all inputs is prohibitively expensive. If the objective is to prove a program correct for all inputs, a formal specification must exist and techniques from formal methods must be used.

Exposing bugs

In order to expose bugs, a fuzzer must be able to distinguish expected (normal) from unexpected (buggy) program behavior. However, a machine cannot always distinguish a bug from a feature. In automated software testing, this is also called the test oracle problem.[47][48]

Typically, in the absence of specifications, a fuzzer distinguishes between crashing and non-crashing inputs, because a crash is a simple and objective measure. Crashes can be easily identified and might indicate potential vulnerabilities (e.g., denial of service or arbitrary code execution). However, the absence of a crash does not indicate the absence of a vulnerability. For instance, a program written in C may or may not crash when an input causes a buffer overflow. Rather, the program's behavior is undefined.

To make a fuzzer more sensitive to failures other than crashes, sanitizers can be used to inject assertions that crash the program when a failure is detected.[49][50] There are different sanitizers for different kinds of bugs:

  • to detect memory-related errors, such as buffer overflows and use-after-free (AddressSanitizer and similar memory debuggers),
  • to detect race conditions and deadlocks (ThreadSanitizer),
  • to detect undefined behavior (UndefinedBehaviorSanitizer),
  • to detect memory leaks (LeakSanitizer), or
  • to check control-flow integrity (CFISanitizer).

Fuzzing can also be used to detect "differential" bugs if a reference implementation is available. For automated regression testing,[51] the generated inputs are executed on two versions of the same program. For automated differential testing,[52] the generated inputs are executed on two implementations of the same program (e.g., lighttpd and httpd are both implementations of a web server). If the two variants produce different output for the same input, then one may be buggy and should be examined more closely.
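Differential testing can be sketched with two implementations of the same function. Both functions below are hypothetical stand-ins: `max_reference` plays the role of the reference implementation, and `max_buggy` contains a deliberately planted bug so that a disagreement exists to be found.

```python
import random

def max_reference(xs):
    """Reference implementation."""
    return max(xs)

def max_buggy(xs):
    """Second implementation with a planted bug: wrong for all-negative lists."""
    best = 0
    for x in xs:
        if x > best:
            best = x
    return best

def differential_fuzz(impl_a, impl_b, rng: random.Random, trials: int = 1000):
    """Return the first generated input on which the two implementations disagree."""
    for _ in range(trials):
        xs = [rng.randint(-100, 100) for _ in range(rng.randint(1, 5))]
        if impl_a(xs) != impl_b(xs):
            return xs
    return None

disagreement = differential_fuzz(max_reference, max_buggy, random.Random(0))
```

Neither implementation crashes on the reported input; the bug is visible only because the outputs differ, which is what makes the reference implementation usable as a test oracle.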

Validating static analysis reports

Static program analysis analyzes a program without actually executing it. This might lead to false positives where the tool reports problems with the program that do not actually exist. Fuzzing in combination with dynamic program analysis can be used to try to generate an input that actually witnesses the reported problem.[53]

Browser security

Modern web browsers undergo extensive fuzzing. The Chromium code of Google Chrome is continuously fuzzed by the Chrome Security Team with 15,000 cores.[54] For Microsoft Edge and Internet Explorer, Microsoft performed fuzz testing amounting to 670 machine-years during product development, generating more than 400 billion DOM manipulations from 1 billion HTML files.[55][54]


A fuzzer produces a large number of inputs in a relatively short time. For instance, in 2016 the Google OSS-Fuzz project produced around 4 trillion inputs a week.[20] Hence, many fuzzers provide a toolchain that automates the otherwise manual and tedious tasks that follow the automated generation of failure-inducing inputs.

Automated bug triage

Automated bug triage is used to group a large number of failure-inducing inputs by root cause and to prioritize each individual bug by severity. A fuzzer produces a large number of inputs, and many of the failure-inducing ones may effectively expose the same software bug. Only some of these bugs are security-critical and should be patched with higher priority. For instance, the CERT Coordination Center provides the Linux triage tools, which group crashing inputs by the produced stack trace and list each group according to its probability of being exploitable.[56] The Microsoft Security Engineering Center (MSEC) developed the !exploitable tool, which first creates a hash for a crashing input to determine its uniqueness and then assigns an exploitability rating:[57]

  • Exploitable
  • Probably Exploitable
  • Probably Not Exploitable, or
  • Unknown.
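Grouping crashes by stack trace can be sketched as follows. This is only an illustration of the bucketing idea and makes no claim about how the CERT triage tools or !exploitable work internally; it does not attempt any exploitability rating. The toy `target` and the names `crash_signature` and `triage` are assumptions.

```python
import hashlib
import traceback
from collections import defaultdict

def target(data: bytes) -> int:
    """Toy program with two distinct bugs."""
    if data.startswith(b"A"):
        return 1 // (len(data) - 1)  # ZeroDivisionError for b"A"
    if data.startswith(b"B"):
        return data[10]              # IndexError for short inputs
    return 0

def crash_signature(exc: BaseException) -> str:
    """Hash the exception type together with the function and line of each frame."""
    frames = traceback.extract_tb(exc.__traceback__)
    key = type(exc).__name__ + "|" + "|".join(f"{f.name}:{f.lineno}" for f in frames)
    return hashlib.sha1(key.encode()).hexdigest()[:12]

def triage(inputs):
    """Map each crash signature to the list of inputs that produce it."""
    buckets = defaultdict(list)
    for data in inputs:
        try:
            target(data)
        except Exception as exc:
            buckets[crash_signature(exc)].append(data)
    return dict(buckets)

buckets = triage([b"A", b"B", b"Bxy", b"C", b"B12"])
```

The three inputs beginning with `B` fail at the same location and land in one bucket, so a developer would see two distinct bugs rather than four crashing inputs.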

Previously unreported, triaged bugs might be automatically reported to a bug tracking system. For instance, OSS-Fuzz runs large-scale, long-running fuzzing campaigns for several security-critical software projects where each previously unreported, distinct bug is reported directly to a bug tracker.[20] The OSS-Fuzz bug tracker automatically informs the maintainer of the vulnerable software and checks at regular intervals whether the bug has been fixed in the most recent revision, using the uploaded minimized failure-inducing input.

Automated input minimization

Automated input minimization (or test case reduction) is an automated debugging technique that isolates the part of a failure-inducing input that actually induces the failure.[58][59] If the failure-inducing input is large and mostly malformed, it might be difficult for a developer to understand what exactly is causing the bug. Given the failure-inducing input, an automated minimization tool removes as many input bytes as possible while still reproducing the original bug. For instance, Delta Debugging is an automated input minimization technique that employs an extended binary search algorithm to find such a minimal input.[60]
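A simplified reduction loop in the spirit of Delta Debugging might look like the sketch below. It is a reduced variant, assumed for illustration, that only tries to remove chunks and refines the chunk size when no removal succeeds; the predicate `fails` is a hypothetical stand-in for running the program and checking that the original bug still reproduces.

```python
def ddmin(data: bytes, fails) -> bytes:
    """Shrink `data` while `fails(data)` remains true (simplified ddmin)."""
    assert fails(data), "the starting input must reproduce the failure"
    n = 2  # number of chunks the input is currently split into
    while len(data) >= 2:
        chunk = -(-len(data) // n)  # ceiling division: size of each chunk
        reduced = False
        for start in range(0, len(data), chunk):
            candidate = data[:start] + data[start + chunk:]  # drop one chunk
            if fails(candidate):
                data = candidate       # keep the smaller failing input
                n = max(n - 1, 2)      # coarsen again after a success
                reduced = True
                break
        if not reduced:
            if n >= len(data):
                break                  # single bytes tried: nothing removable
            n = min(n * 2, len(data))  # refine: try smaller chunks
    return data

# Hypothetical failure condition: the bug needs both "<" and ">" in the input.
def fails(data: bytes) -> bool:
    return b"<" in data and b">" in data

minimal = ddmin(b"aaaa<bbbb>cccc", fails)
```

The result is minimal in the sense that removing any single remaining byte no longer reproduces the failure; here only the two bytes that matter are left.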

See also

  • American fuzzy lop (fuzzer)
  • Concolic testing
  • Monkey testing
  • Random testing
  • Responsible disclosure
  • Runtime error detection
  • Security testing
  • Smoke testing (software)
  • Symbolic execution
  • System testing
  • Test automation


