Disassembler Driver.
This is a low-level interface to the CFG reconstruction and disassembling engine. It is used by BAP's high-level components, such as the recursive-descent disassembler, so there is in general no need to use it directly, unless you're devising a custom disassembly pipe-line.
The disassembler is driven and controlled by the knowledge base, so it is possible to modify the behavior of the BAP disassembler layer through the knowledge base and turn the default recursive-descent mode into something more conservative, e.g., speculative, superset, shingled, or even probabilistic disassembler.
Algorithm
Memory Classification
When the driver is fed with a new memory region (using the Driver.scan
function), it uses the knowledge base to initially classify addresses that belong to this region.
For each byte in the region, it creates a temporary core-theory:program
object and sets is label-address
property to the address of that byte. It then queries if it is a function start (core-theory:is-function
) and if it is known to be code or data (if the core-theory:is-valid
property is (true)
then it is classified as code and if it (false)
then it is data, otherwise, it is undetermined). All objects created during classification are deleted immediately after the query and never committed to the knowledge base (they are scoped objects). Therefore it is fine to speculate and assume that all bytes are code by providing (true)
to the is-valid
property of each byte.
Disassembling
The disassembling starts from each function start (as identified by the previous step) and continues until there is no more unprocessed function starts and all addresses, which were classified as code, are either successfully disassembled or proved to be data.
During disassembling the driver will discover more jump destinations and add them to the worklist. The default mode is speculative, i.e., when we meet a barrier, we continue disassembling beyond it. If the worklist is empty, but the set of addresses that were classified as code (in the first step) is still not empty, which means that these addresses are not reachable from the initially provided starting points (function starts) then the minimal address is extracted from the set and is assumed to be a start of a basic block and added to the worklist.
The process continues until both the worklist and the set of code addresses are empty. When the process converges the knowledge base will contain all disassembled instructions (though some of them might be invalid). The result of the disassembly is the value that contains information sufficient to reconstruct the control-flow graph of the program. It could be queried directly, using various accessors or folded over with the explore
function, which is a generalized control-flow graph building function.
Backtracking
The disassembler has a backtracking mechanism that enables it to track each disassembled byte back to the memory byte that was initially marked as code or a function start. When we identify an instruction chain that is invalid, i.e., when data follow a machine instruction or when its destination is some data, we retract the whole chain of instructions. This ensures that all valid instructions belong to at least one valid instruction chain. The justification for not including invalid instructions chains into the disassembly is that such chains will unconditionally switch the CPU into the invalid instruction state which will terminate the program. Since such a chain can't include system calls or CPU exceptions (both are not barriers) it can't have any side-effects visible outside of the process so it could be safely ignored.
Delay slots
Any instruction could have a delay (core-theory:delay
) that is greater than zero. In that case the execution order of the instructions will not be equal to the linear order of the instructions addresses and m
instructions that follow the delayed instruction will be executed before that instruction (put in the basic block before it), where m
is the size of the delay slot, expressed in instructions.
information necessary to build the control-flow graph.
val bin_shape_state : Core_kernel.Bin_prot.Shape.t
val bin_size_state : state Core_kernel.Bin_prot.Size.sizer
val bin_write_state : state Core_kernel.Bin_prot.Write.writer
val bin_writer_state : state Core_kernel.Bin_prot.Type_class.writer
val bin_read_state : state Core_kernel.Bin_prot.Read.reader
val __bin_read_state__ : (int -> state) Core_kernel.Bin_prot.Read.reader
val bin_reader_state : state Core_kernel.Bin_prot.Type_class.reader
val bin_state : state Core_kernel.Bin_prot.Type_class.t
abstract type for a sequence of instructions.
abstract representation of a jump instruction.
init
the initial disassembler state.
merge x y
is a sum of information in states x
and y
.
equal x y
is true
if x
and y
denote equal graphs.
scan mem state
updates the state.
If bytes in mem
were already scanned then returns state
without changes. Nothing is stored in the knowledge base.
If it is a new memory region then classifies and disassembles the whole region. For each disassembled address p
an object Theory.Label.for_addr p
is stored in the knowledge base. It will contain a fully disassembled and lifted instruction. All instructions that are subroutine entry points will have is-subroutine
property set to (true)
.
See sec:Algorithm
for the detailed description of the algorithm.
val explore :
?entries:addr Core_kernel.Sequence.t ->
?entry:addr ->
?follow:(addr -> bool Bap_knowledge.knowledge) ->
block:(mem -> insns -> 'n Bap_knowledge.knowledge) ->
node:('n -> 'c -> 'c Bap_knowledge.knowledge) ->
edge:('n -> 'n -> 'c -> 'c Bap_knowledge.knowledge) ->
init:'c ->
state ->
'c Bap_knowledge.knowledge
explore ~block ~node ~edge ~init state
builds a control-flow graph from the state
.
This function is a generalized fold function that calls the user provided functions to construct the graph, which has abstract type 'g
.
block mem insns
creates a basic block of type 'n
which covers memory mem
; insns
is the sequence of instructions that constitute that memory, use list_insns insns
to get instructions in the linear order of their addresses, or execution_order
to get the execution order (which might be different from linear in the presence of delayed and speculative instructions).
node x g
inserts the node x
into graph g
;edge x y g
inserts an edge between nodes x
and y
;init
is the initial graph;follow x
if it returns true
then the function will follow this destination. Defaults to a function that always evaluates to true
.
entry
is the entry point from which to build the graph, if absent, then all basic blocks will be consecuitively, in the order of ascending addresses, used as the entry points.
entries
is the sequence of entry points, if both entry
and entries
are specified then entry
is consed with entries
.
val list_insns : ?rev:bool -> insns -> Bap_core_theory.Theory.Label.t list
list_insns xs
returns the list of instructions in the ascending order of their addresses (or descending if rev
is true
(defaults to false
.
val execution_order :
insns ->
Bap_core_theory.Theory.Label.t list Bap_knowledge.knowledge
execution_order xs
reruns a list of instructions in the order in which they will be executed by the target CPU, which could be different when instructions are delayed or speculatively executed.
Low-level interface
All functions in this interface were made availabe in BAP 2.2.0 unless stated otherwise.
val subroutines : state -> Core_kernel.Set.M(Addr).t
subroutines state
is a set of subroutine entry points that either were provided through the knowledge base or later discovered as destinations of call instructions.
val blocks : state -> Core_kernel.Set.M(Addr).t
blocks state
is the set of addresses of instructions that start basic blocks.
is_data state x
is true
if x
was classified as data.
is_subroutine s x
is Set.mem (subroutines s) x
.
is_block s x
is Set.mem (blocks s) x
.
is_jump s x
is true
if x
is the maximal address of an instruction in the basic block that terminates with a jump.
Note, that when jumps are delayed, the linear order of instructions differs from the execution order, so the address of the last instruction is not the maximal address in the basic block.
jump state src
if src
is the last in the linear order instruction of a basic block that terminates with a jump instruction then returns the descrption of that jump instruction.
val destinations : jump -> Core_kernel.Set.M(Addr).t
destinations jump
returns the set of resolved destinations.
val is_call : jump -> bool
is_call jump
is true
if jump is call
.
val is_barrier : jump -> bool
is_barrier jump
is true
if jump is barrier
.