Disassembler


A disassembler is a computer program that translates machine language into assembly language—the inverse operation to that of an assembler. Disassembly, the output of a disassembler, is often formatted for human-readability rather than suitability for input to an assembler, making it principally a reverse-engineering tool. Common uses of disassemblers include recovering source code of a program whose original source was lost, malware analysis, modifying software (such as ROM hacking), and software cracking.

A disassembler differs from a decompiler, which targets a high-level language rather than an assembly language.

Assembly language source code generally permits the use of constants and programmer comments. These are usually removed from the assembled machine code by the assembler. If so, a disassembler operating on the machine code would produce disassembly lacking these constants and comments; the disassembled output becomes more difficult for a human to interpret than the original annotated source code. Some disassemblers provide a built-in code commenting feature where the generated output gets enriched with comments regarding called API functions or parameters of called functions. Some disassemblers make use of the symbolic debugging information present in object files such as ELF. For example, IDA allows the human user to make up mnemonic symbols for values or regions of code in an interactive session: human insight applied to the disassembly process often parallels human creativity in the code writing process.

Writing a disassembler which produces code which, when assembled, produces exactly the original binary is possible; however, there are often differences. This poses demands on the expressivity of the assembler. For example, an x86 assembler takes an arbitrary choice between two binary codes for something as simple asMOV AX,BX. If the original code uses the other choice, the original code simply cannot be reproduced at any given point in time. However, even when a fully correct disassembly is produced, problems remain if the program requires modification. For example, the same machine language jump instruction can be generated by assembly code to jump to a specified location (for example, to execute specific code), or to jump a specified number of bytes (for example, to skip over an unwanted branch). A disassembler cannot know what is intended, and may use either syntax to generate a disassembly which reproduces the original binary. However, if a programmer wants to add instructions between the jump instruction and its destination, it is necessary to understand the program's operation to determine whether the jump should be absolute or relative, i.e.,whether its destination should remain at a fixed location, or be moved so as to skip both the original and added instructions.

Another challenge is that it is not always possible to identify which parts of the binary correspond to executable code, and which correspond to data. While common executable formats like ELF and PE divide the binary into executable and data sections, other formats such as flat binaries do not, so any given location in the binary may contain either executable instructions or non-executable data, making it difficult to decide whether it should be disassembled as instructions or left as data. Since CPUs generally allow dynamic jumps computed at runtime, it is not always possible to identify all possible locations in the binary that may be jumped to and therefore contain instructions.

On computer architectures with variable-width instructions, such as in many complex instruction set computer (CISC) architectures, more than one disassembly may be valid.