Overview
Layered Architecture
The BAP library has the layered architecture consisting of four layers. Although the layers are not really observable from outside of the library, they make it easier to learn the library as they introduce new concepts sequentially. On top of these layers, the Project module is defined that consolidates all information about a target of analysis. The Project
module may be viewed as an entry point to the library.
+-----------------------------------------------------+
| +--------+ +-----------------------------------+ |
| | | | | |
| | | | Foundation Library | |
| | | | | |
| | | +-----------------------------------+ |
| | P | |
| | | +-----------------------------------+ |
| | R | | | |
| | | | Memory Model | |
| | O | | | |
| | | +-----------------------------------+ |
| | J | |
| | | +-----------------------------------+ |
| | E | | | |
| | | | Disassembly | |
| | C | | | |
| | | +-----------------------------------+ |
| | T | |
| | | +-----------------------------------+ |
| | | | | |
| | | | Semantic Analysis | |
| | | | | |
| +--------+ +-----------------------------------+ |
+-----------------------------------------------------+
The Foundation library defines BAP Instruction language data types, as well as other useful data structures, like Value
, Trie
, Vector
, etc. The Memory model layer is responsible for loading and parsing binary objects and representing them in a computer memory. It also defines a few useful data structures that are used extensively by later layers, e.g., Table
and Memmap
. The next layer performs disassembly and lifting to BIL. Finally, the semantic analysis layer transforms a binary into an IR representation, that is suitable for writing analysis.
Plugin Architecture
The standard library tries to be as extensible as possible. We are aware, that there are not good solutions for some problems, so we don't want to force our way of doing things. In short, we're trying to provide mechanisms, not policies. We achieve this by employing the dependency injection principle. By inversing the dependency we allow the library to depend on a user code. For example, a user code can teach the library how to disassemble the binary or even how to reconstruct the CFG. In fact, the library by itself doesn't contain the disassembler or lifter, or any architecture specific code. Everything is injected later by corresponding plugins.
The library defines a fixed set of extension points. (Other libraries, that constitute the Platform and follow the same principle, can define their own extension points, so the following set is not complete):
The Regular.Std
library, that forms a foundation for the BAP Standard Library, also follows the dependency injection principle, so every data type that implements regular interface, can be dynamically extended with:
- pretty printing function;
- serialization subroutines;
- caching.
Writing the analysis
A common use case, is to write some analysis that will take the program in some representation and then either output result of analysis in a human or machine readable way, or transform the program, in a way that can be employed by other analysis. Following a naming convention of a more established community of compiler writers, we name such analysis a _pass_.
The library itself doesn't run any analysis, it part of the job of a frontend to run it. In particular, the bap
frontend, will run the analyses based on a command line specification. See bap
--help
for more information.
We use Project
data structure to represent a program and all associated knowledge that we were capable to infer. To learn how to use the project data structure continue to Working with project.
Foundation Library
At this layer we define (Binary Instruction language) and few other useful data structures:
- arch - describes computer architecture;
- size - word and register sizes;
- var - BIL variable;
- typ - BIL type system;
- exp - BIL expression sub-language;
- stmt - BIL statements;
- bitvector - a bitvector data structure to represent immediate data, used usually by their aliases
word
and addr
; - value - an extensible variant type;
- dict - an extensible record;
- vector - an array that can grow;
- Trie - prefix trees;
Most of the types implement the Regular interface. This interface is very similar to Core's Identifiable
, and is supposed to represent a type that is as common as a built-in type. One should expect to find any function that is implemented for such types as int
, string
, char
, etc. Namely, this interface includes:
- comparison functions: (
<, >, <= , >= , compare, between, ...
); - each type defines a polymorphic
Map
with keys of type t
; - each type provides a
Set
with values of type t
; - hashtable is exposed via
Table
module; - hashset is available under
Hash_set
name - sexpable and binable interface;
to_string
, str
, pp
, ppo
, pps
functions for pretty-printing.
It is a convention, that for each type, there is a module with the same name that implements its interface. For example, type exp
is a type abbreviation for Exp.t
, and module Exp
contains all functions and types related to type exp
. For example, to create a hashtable of statements, just type:
let table = Exp.Table.create ()
If a type is a variant type (i.e., it defines constructors) then for each constructor named Name
, there exists a corresponding function named name
that will accept the same number of arguments as the arity of the constructor (also named a _functional constructor_). For example, a Bil.Int
can be constructed with the Bil.int
function that has type word ->
exp
. If a constructor has several arguments of the same type we usually disambiguate using labels, e.g., Bil.Load of
(exp,exp,endian,size)
has function Bil.load with type: mem:exp -> addr:exp -> endian -> size -> exp
Value
Universal values can be viewed as extensible variants on steroids. Not only they maybe extended, but they also can be serialized, compared with user-defined comparison function and even pretty printed.
Dict
Like value is an extensible sum type, dict can be viewed as an extensible product type. Dict is a sequence of values of type value
, with tags used as field names. Of course, fields are unique.
Vector
Vector
is an implementation of C++ STL like vectors with logarithmic push back.
Tries
The Foundation library also defines a prefix tree data structure that proves to be useful for binary analysis applications. Tries in BAP is a functor that derives a polymorphic trie data structure for a given Key.
For convenience we support instantiating tries for most of our data structures. For example, Word has several tries inside.
For the common string trie, there's Trie.String
.
Memory model
This layer is responsible for the representation of binaries. It provides interfaces for the memory objects:
- mem - a contiguous array of bytes, indexed with absolute addresses;
- 'a table - a mapping from a memory regions to arbitrary data (no duplicates or intersections);
- a memmap - a mapping from memory region to arbitrary data with duplicates and intersections allowed, aka segment tree or interval map;
- image - represents a binary object with all its symbols, segments, sections and other meta information.
The Image
module uses the plugin system to load binary objects. In order to add new loader, one should implement the Backend.t loader function and register it with the Image.register_backend function.
Disassembler
This layer defines the interfaces for disassemblers. Two interfaces are provided:
- Disasm - a regular interface that hides all complexities, but may not always be very flexible.
- Disasm_expert - an expert interface that provides access to a low-level representation. It is very flexible and fast, but harder to use.
To disassemble files or data with the regular interface, use one of the following functions:
All these functions perform disassembly by recursive descent, reconstruct the control flow graph, and perform lifting.
The result of disassembly is represented by the abstract value of type disasm. Two main data structures that are used to represent disassembled program are:
- insn - a machine instruction;
- block - a basic block, i.e., a linear sequence of instructions.
The following figure shows the relationship between basic data structures of the disassembled program.
+-----------------+
| +-------------+ |
| | disasm | |
| +-------------+ |
| | |
| | * |
| +-------------+ |
| | block | |
| +-------------+ |
| | |
| | * |
| +-------------+ |
| | insn | |
| +-------------+ |
| | |
| | * |
| +-------------+ |
| | stmt | |
| +-------------+ |
+-----------------+
A disassembled program is represented as a set of interconnected basic blocks, called a whole program control flow graph (CFG) and it is indeed represented as a graph Graphs.Cfg
. See graphlib for more information on graphs.
Each block is a container to a sequence of machine instructions. It is guaranteed that there's at least one instruction in the block, thus the Block.leader and Block.terminator functions are total.
Each machine instruction is represented by its opcode
, name
and array
of operands (these are machine and disassembler specific), a set of predicates (that approximates instruction semantics on a very high level), and a sequence of BIL statements that precisely define the semantics of the instruction.
The expert interface exposes low level interface that provides facilities for building custom implementations of disassemblers. The interface to the disassembler backend is exposed via the Disasm_expert.Basic
module. New backends can be added by implementing the 'disasm.hpp' interface.
Modules of type CPU provide a high level abstraction of the machine CPU and allow one to reason about the instruction semantics independently from the target platform. The module type Target brings CPU
and ABI
together. To get an instance of this module, you can use the target_of_arch function. Architecture specific implementations of the Target
interface may (and usually do) provide more information, see corresponding support libraries for ARM
and x86 architectures.
Semantic Analysis
On the semantic level the disassembled program is lifted into the BAP Intermediate Representation (BIR). BIR is a semi-graphical representation of BIL (where BIL represents a program as Abstract Syntax Tree). The BIR provides mechanisms to express richer relationships between program terms and it also easier to use for most use cases, especially for data dependency analysis.
The program in IR is build of terms. In fact the program itself is also a term. There're only 7 kinds of terms:
- program - the program in whole;
- sub - subroutine;
- arg - subroutine argument;
- blk - basic block;
- def - definition of a variable;
- phi - phi-node in the SSA form;
- jmp - a transfer of control.
Unlike expressions and statements in BIL, IR's terms are concrete entities. Concrete entity is such entity that can change in time and space, as well as come in and out of existence. Contrary, abstract entity is eternal and unchangeable. Identity denotes the sameness of a concrete entity as it changes in time. Abstract entities don't have an identity since they are immutable. Program is built of concrete entities called terms. Terms have attributes that can change in time, without affecting the identity of a term. Attributes are abstract entities. In each particular point of space and time a term is represented by a snapshot of all its attributes, colloquially called value. Functions that change the value of a term in fact return a new value with different set of attributes. For example, def
term has two attributes: left hand side (lhs), that associates definition with abstract variable, and right hand side (rhs) that associates def
with an abstract expression. Suppose, that the definition was:
# let d_1 = Def.create x Bil.(var y + var z);;
val d_1 : Def.t = 00000001: x := y + z
To change the right hand side of a definition we use Def.with_rhs
that returns the same definition but with different value:
# let d_2 = Def.with_rhs d_1 Bil.(int Word.b1);;
val d_2 : Def.t = 00000001: x := true
d_1
and d_2
is different values
# Def.equal d_1 d_2;;
- : bool = false
of the same term
# Term.same d_1 d_2;;
- : bool = true
The identity of this terms is denoted by the term identifier (tid
). In the textual representation term identifiers are printed as ordinal numbers.
Terms, can contain other terms. But unlike BIL expressions or statements, this relation is not truly recursive, since the structure of program term is fixed: arg
, phi
, def
, jmp
are leaf terms; sub
can only contain arg
's or blk
's; blk
consists of phi
, def
and jmp
sequences of terms, as pictured in the figure below. Although, the term structure is closed to changes, you still can extend particular term with attributes, using set_attr
and get_attr
functions of the Term module. This functions are using extensible variant type to encode attributes.
+--------------------------------------------------------+
| +-------------------+ |
| | program | |
| +---------+---------+ |
| |* |
| +---------+---------+ |
| | sub | |
| +---------+---------+ |
| | |
| +-----------------+---------------+ |
| |* |* |
| +-----+-------+ +-------+-------+ |
| | arg | | blk | |
| +-------------+ +-------+-------+ |
| | |
| +---------------+--------------+ |
| |* |* | * |
| +-----+-----+ +-----+-----+ +----+-----+ |
| | phi | | def | | jmp | |
| +-----------+ +-----------+ +----------+ |
+--------------------------------------------------------+
Working with project
There're two general approaches to obtain a value of type project:
- create it manually using
Project.create
function; - write a plugin to the
bap
frontend.
Although the first approach is simplistic and gives you a full control, we still recommend to use the latter.
To write a program analysis plugin (or pass in short) you need to implement a function with one of the following interfaces:
Once loaded from the bap
frontend (see bap --help
) this function will be invoked with a value of type project that provides access to all information gathered from the input source. If the registered function returns a non unit
type, then it can functionally update the project state, e.g., add annotations, discover new symbols, transform program representation, etc.
Example
The following plugin prints all sections in a file:
open Core_kernel[@@warning "-D"]
open Bap.Std
open Format
let print_sections p =
Project.memory p |> Memmap.to_sequence |> Seq.iter ~f:(fun (mem,x) ->
Option.iter (Value.get Image.section x) ~f:(fun name ->
printf "Section: %s@.%a@." name Memory.pp mem))
let () = Project.register_pass' print_sections
Note: this functionality is provided by the print
plugin.
To pass data from one pass to another in a type safe manner, we use universal values. Values can be attached to a particular memory region, IR terms, or put into the storage
dictionary. For the first case we use the memmap data structure. It is an interval tree containing all the memory regions that are used during analysis. For the storage
we use Dict
data structure. Also, each program term, has its own dictionary.
Memory annotations
By default the memory is annotated with the following attributes:
- section -- for regions of memory that had a particular name in the original binary. For example, in ELF, sections have names that annotate a corresponding memory region. If project was created from memory object, then the overall memory will be marked as a
"bap.user"
section.
- segment -- if the binary data was loaded from a binary format that contains segments, then the corresponding memory regions are be marked. Segments provide access to permission information.
BAP API
type abbreviation for 'a Sequence.t
val compare_seq : ('a -> 'a -> int) -> 'a seq -> 'a seq -> int
module Trie : sig ... end
Type to represent machine word
A synonym for word
, that should be used for words that are addresses
module Size : sig ... end
Type safe operand and register sizes.
type addr_size = [ `r32 | `r64 ] Size.p
Bitvector -- an integer with modular arithmentics.
Expose endian
constructors to Bap.Std
namespace
Shortcut for bitvectors that represent words
module Addr : sig ... end
Shortcut for bitvectors that represent addresses
val compare_typ : typ -> typ -> int
val compare_var : var -> var -> int
val compare_bil : bil -> bil -> int
val compare_exp : exp -> exp -> int
The type of a BIL expression.
Each BIL expression is either an immediate value of a given width, or a chunk of memory of a give size. The following predefined constructors are brought to the scope:
module Type : sig ... end
The type of a BIL expression.
8-bit width value
16-bit width value
16-bit width value
32-bit width value
32-bit width value
64-bit width value
64-bit width value
128-bit width value
128-bit width value
256-bit width value
mem32_t size
creates a type for memory with 32
-bit addresses and elements of the given size
.
mem64_t size
creates a type for memory with 64
-bit addresses and elements of the given size
.
module Context : sig ... end
Base class for evaluation contexts.
module Eval : sig ... end
Basic and generic expression evaluator.
module Expi : sig ... end
Expression Language Interpreter.
module Bili : sig ... end
Regular
interface for BIL expressions
module Stmt : sig ... end
Regular
interface for BIL statements
module Arch : sig ... end
module Value : sig ... end
module Dict : sig ... end
Universal Heterogeneous Map.
BAP IR.
Program is a tree of terms.
val compare_term : ('a -> 'a -> int) -> 'a term -> 'a term -> int
val compare_sub : sub -> sub -> int
val compare_arg : arg -> arg -> int
val compare_blk : blk -> blk -> int
val compare_phi : phi -> phi -> int
val compare_def : def -> def -> int
val compare_jmp : jmp -> jmp -> int
val compare_nil : nil -> nil -> int
val compare_tid : tid -> tid -> int
type label =
| Direct of tid
| Indirect of exp
target of control transfer
type jmp_kind =
| Call of call
| Goto of label
| Ret of label
return from call to label
| Int of int * tid
interrupt and return to tid
control transfer variants
type intent =
| In
| Out
| Both
Term type classes
module Biri : sig ... end
type color = [
| `black
| `red
| `green
| `yellow
| `blue
| `magenta
| `cyan
| `white
| `gray
]
Color something with a color
print marked entity with the specified color. (the same as color, but pretty printing function will output ascii escape sequence of corresponding color.
print marked entity with specified color. See foreground
.
A command in python language
A command in shell language
A virtual address of an entity
val filename : string tag
an image loaded into memory
a table from memory to 'a
interval trees from memory regions to 'a
Iterators lifted into monad
module Table : sig ... end
A locations of a chunk of memory
module Image : sig ... end
Memory maps. Memory map is an assosiative data structure that maps memory regions to values. Unlike in the Table, memory regions in the Memmap can intersect in an arbitrary ways. This data structure is also known as an Interval Tree.
Symbolizer defines a method for assigning symbolic names to addresses
Rooter defines a method for finding function starts in a program
Brancher defines a method for resolving branch instruction
Reconstructor defines a method for reconstructing symbol tables
value of type disasm
is a result of the disassembling of a memory region.
values of type insn
represents machine instructions decoded from a given piece of memory
block
is a region of memory that is believed to be a basic block of control flow graph to the best of our knowledge.
val compare_cfg : cfg -> cfg -> int
type jump = [
| `Jump
| `Cond
]
a jump kind. A jump to another block can be conditional or unconditional.
This type defines a relation between two basic blocks.
type edge = [
| jump
| `Fall
]
This type defines a relation between two basic blocks.
module Kind : sig ... end
abstract and opaque register
val compare_reg : reg -> reg -> int
val compare_imm : imm -> imm -> int
val compare_fmm : fmm -> fmm -> int
Integer immediate operand
Floating point immediate operand
val compare_op : op -> op -> int
Expert interface to disassembler.
module Insn : sig ... end
module Block : sig ... end
The interface to the disassembler level.
Reconstructed symbol table.
module type CPU = sig ... end
module type Target = sig ... end
Abstract interface for all targets.
target_of_arch arch
returns a module packed into value, that abstracts target architecture. The returned module has type Target
and can be unpacked locally with:
let module Target = (val target_of_arch arch) in
val register_target : arch -> (module Target) -> unit
Register new target architecture. If target for the given arch already exists, then it will be superseded by the new target.
module Live : sig ... end
module Term : sig ... end
Program in Intermediate representation.
A control transfer operation.
module Call : sig ... end
A control transfer to another subroutine.
module Label : sig ... end
Target of a control flow transfer.
module Taint : sig ... end
Symbolizer maps addresses to function names
Rooter finds starts of functions in the binary.
Brancher is responsible for resolving destinations of branch instructions.
Reconstructor is responsible for reconstructing symbol table from a CFG. It should partition a CFG into a set of possibly intersecting functions. See Symtab
module for more information about symbol table and functions.
module Event : sig ... end
The interface to the BAP toplevel state.
module Self () : sig ... end