pacomb

Parsing library based on combinators and ppx extension to write languages

pacomb

Overview

PaComb is a parsing libraries that compiles grammars with semantic actions to combintators that can be used for parsing. Languages are defined with the Pacomb.Grammar module or preferably through our PPX extension. The library offers scanner less parsing, but the Pacomb.Lex module provides a notion of terminals and the Pacomb.Blank module allows to define function ignoring spaces and comments. This and other modules like PAcomb.keywords and PAcomb.Word_list makes the scanner less approach very easy to use.

The main advantage of PaComb is to offer many features in one tool:

  1. BNF/EBNF like syntax to define grammars directly in your ML file using our ppx extension
  2. Parsing from left to right despite the use of combinators (hence the beginning of the input buffer can be collected by the GC).
  3. Support for ambiguous grammars with polynomial complexity using cache and merge (see Cache and merge for ambiguous grammars)
  4. Support for self extensible grammars via Dependent sequences
  5. Rejection of rule from action code using Pacomb.Lex.give_up
  6. Support for utf8 characters and graphemes (grapheme is the correct notion as it is atomic for unicode normalisation)
  7. Support for preprocessing of the buffer (to implement line or include directives) using Pacomb.Input.WithPP.
  8. Lexical convention (blank, comments, etc...) can change during parsing using Pacomb.grammar.layout and the fact that we use scanner less parsing.
  9. Possibility to print the grammar (not perfect yet). This is useful to implement helps in your program using Grammar.print_grammar.

Importantly, performances of PaComb are very good: it is often less than two times slower than grammars generated by ocamlyacc. However, using specific Pacomb features like dependant sequences or cache will result in much slower grammar ... that you can not in general write with ocamlyacc anyway.

Pacomb modules

  1. Pacomb.Grammar The main module to build grammars
  2. Pacomb.Lex lexing and Pacomb.Lex.give_up
  3. Pacomb.Blank blanks and layout record
  4. Pacomb.Pos positions in file and handling of exceptions
  5. Pacomb.Keyword Some function to parse and reserve keywords
  6. Pacomb.Word_list An efficient extensible dictionnary to parse words
  7. Pacomb.Regexp Regexp implementation above Pacomb.Lex
  8. Pacomb.Input build input buffer from string, channel, ...
  9. Pacomb.Charset an efficient representation of character sets

PPX syntax extension

Defining languages using the Pacomb.Grammar module directly is cumbersome. For that reason, PaComb provides a BNF-like PPX syntax extension. An example with arithmetic expressions is given below. It can be compiled with command ocamlfind ocamlopt -package pacomb,pacomb.ppx -o calc -linkpkg calc.ml (if it is written to a file calc.ml).

type p = Atom | Prod | Sum
let%parser rec expr p =
  Atom < Prod < Sum
  ; (p=Atom) (x::FLOAT)                        => x
  ; (p=Atom) '(' (e::expr Sum) ')'             => e
  ; (p=Prod) (x::expr Prod) '*' (y::expr Atom) => x *. y
  ; (p=Prod) (x::expr Prod) '/' (y::expr Atom) => x /. y
  ; (p=Sum ) (x::expr Sum ) '+' (y::expr Prod) => x +. y
  ; (p=Sum ) (x::expr Sum ) '-' (y::expr Prod) => x -. y

To use the above parser, you need to define a toplevel rule, say which characteres to ignore and call the parser on a file or string. For instance the following will work and handle parse error without exiting the program:

(* blanks, i.e. characteres to be ignored *)
let blank = Blank.from_charset (Charset.singleton ' ')

let _ =
  try
    while true do
      let f () =
        Printf.printf "=> %!"; (* prompt *)
        Grammar.parse_string (expr Sum) blank (input_line stdin)
      in
      (* [Pos] module provides a function to handle exception with
         an optional argument to call for error (default is to exit with
         code 1 *)
      Pos.handle_exception ~error:(fun _ -> ()) f ()
    done
  with
    End_of_file -> ()

This works using the extensions [%%parser ...] for structures and [%parser ...] for expression. These are also accessible by suffixing keywords with %parser as in the above example. These ppx extensions extends ocaml expressions with a syntax for grammars of type 'a Grammar.t and modifies the behaviour of let-bindings especially recursive ones to use declare_grammar, set_grammar and grammar_family. Recall that due to the limitation of ppx, we use a sub-syntax of OCaml expressions for grammars. It is therefore not a good idea to use "=>" as an infix inside [%parser ...].

We give below the BNF grammar for the extension, together with a sketch of its semantics.

grammar ::= rule                                                          itself
       | grammar ; rule                                              Grammar.alt
rule ::= qitems => expr                                   A rule with its action
       | expr < ... < expr                              priority order see below
       | ERROR("m")                        report an error if parsing fails here
       | ERROR(ls)                           report errors if parsing fails here
qitems ::= ()                                                      Grammar.empty
       | non_empty_qitems                                                 itself
       | cond non_empty_qitems                                  conditional rule
cond ::= expr bool_op expr
       | not expr
       | expr =| expr                                           pattern matching
non_empty_qitems ::= qitem                                                itself
       | non_empty_qitems qitems                                     Grammar.seq
qitem ::= item                                                            itself
       | (epat :: item)                        give a name if used in the action
       | ((epat,epat) >: item)                            as above, but for dseq
       | (lazy (epat,epat) >: item)                  as above, but for lazy dseq
       | (epat <: item)                                   dseq with no real deps
item ::= CHAR                                           Grammar.term(Lex.any ())
       | CHAR(c) or 'c'                                 Grammar.term(Lex.char c)
       | CHARSET(s)                                  Grammar.term(Lex.charset s)
       | STRING(s) or "s"                             Grammar.term(Lex.string s)
       | UTF8                                      Grammar.term(Lex.any_utf8 ())
       | UTF8(s)                                        Grammar.term(Lex.utf8 s)
       | GRAPHEME                              Grammar.term(Lex.any_grapheme ())
       | GRAPHEME(s)                                Grammar.term(Lex.grapheme s)
       | EOF                                            Grammar.term(Lex.eof ())
       | RE(expr)             Grammar.term(Lex.regexp (Regexp.from_string expr))
       | NAT                                            Grammar.term(Lex.nat ())
       | INT                                            Grammar.term(Lex.int ())
       | FLOAT                                        Grammar.term(Lex.float ())
       | STRING_LIT                              Grammar.term(Lex.string_lit ())
       | CHAR_LIT                                  Grammar.term(Lex.char_lit ())
       | RE(expr)             Grammar.term(Lex.regexp (Regexp.from_string expr))
       | ~? expr                                             Grammar.option expr
       | ~? [expr] expr                         Grammar.option_default expr expr
       | ~* expr                                               Grammar.star expr
       | ~* [expr] expr                               Grammar.star_sep expr expr
       | ~+ expr                                               Grammar.plus expr
       | ~+ [expr] expr                               Grammar.plus_sep expr expr
       | other expr                                                       itself
epat ::= lid
       | __                                                        encoding of _
       | (lazy epat)
       | (epat : coretype)
       | epat = lid                                     encoding of [pat as lid]
       | (epat, ..., epat)
       | uid(epat)
       | M.epat

epat correspond to an encoding of patterns in expressions. Beware that _ is invalid, use __ instead. bool_op is an expression using any test operator: "="|"<"|">"|"<="|">="|"<>"|"=="|"!="|"&&"|"||".

Condition using (e =| epat) allows for rule garder by pattern matching. This has also been tested with GADT (see examples/calc_ext2.ml).

Action code needs parenthesis or begin ... end if it uses if .. then, pattern matching or sequences.

Anything which does not correspond to this grammar will be unchanged in the ocaml code (like the type definition in the example above). A mutually recursive definition can also mix the definition of grammars (parametric of not) with the definition of normal ocaml values. This means you could put the whole file inside %%parser ....

Beware that inside the scope of the extension, you can use the syntax for grammars everywhere. This allows for some nesting as in:

type p = Atom | Prod | Sum
let%parser rec
     expr p = Atom < Prod < Sum
            ; (p=Atom) (x::FLOAT)                             => x
            ; (p=Atom) '(' (x::expr Sum) ')'                  => x
            ; (p=Prod) (x::expr Prod) => ( '*' (y::expr Atom) => x*.y
                                         ; '/' (y::expr Atom) => x/.y)
            ; (p=Sum ) (x::expr Sum ) => ( '+' (y::expr Prod) => x+.y
                                         ; '-' (y::expr Prod) => x-.y)

Remark: the left factorisation in this example is useless: in will be automatically performed by Pacomb.

Here is the meaning of let bindings for grammars through the ppx extension:

  • non recursive let bindings correspond to just a name for the grammar.
  • recursive let bindings correspond either to
  • Grammar.declare_grammar + Grammar.set_grammar (if no parameter)
  • Grammar.grammar_family + setting the grammar if parameters are given. multiple parameters and using label are supported through curryfication by the ppx extension.

For recursive grammar with exactly one parameter, a rule p_1 < p_2 < ... < p_n will automatically add rules to include the grammar parametrized by p_i in the grammar parametrized by p_(i+1). This was used by the calculator example above.

let%parser accepts the following attribute:

  • [@cache] to cache the grammar (that is call Grammar.cache on the grammar)
  • [@merge f] to apply f if two parsetrees are possible for the same input. This corresponds to Grammar.cache ~merge:f.
  • [@layout blank] or [@layout blank ~config:expr] to apply Grammar.layout and change the blank characters for that grammar.
  • [@print_param f] to specify a printing function for the parameters of a recursive grammar. The feature is used in examples/calc_prio.ml to print the grammar when provinding the -help option.

Controling evaluation of action and right recursion

Action are evaluated as soon as the rule is reduced. This may be a problem for grammar whose prefix are ambiguous, even if the grammar is not really ambiguous. This is notably true for right recursion (which as usual should be avoided). If nothing is done, right recursion will be quadratic in time, while linear in space and time is exptected (linear space is the reason to prefer left recursion). To delay evaluation of action, you should use lazy action. For instance, the right recursive grammar for sexpr below:

let%parser rec sexp =
  (x::RE id)         => Idt x
; '(' (l::sexps) ')' => Lst l
and sexps =
  () => lazy []
; (e::sexp) (l::sexps) => (e::l)

Does not work efficiently because in an expression like "a b c ..." all the lists [Idt "a"], [Idt "a"; Idt "b"], etc are constructed. It should be rewritten using lazy:

let%parser rec sexp =
  (x::RE id)              => Idt x
; '(' (lazy l::sexps) ')' => Lst l
and sexps =
  ()                        => lazy []
; (e::sexp) (lazy l::sexps) => lazy (e::l)

or left recursion:

let%parser rec sexp =
  (x::RE id)         => Idt x
; '(' (l::sexps) ')' => Lst (List.rev l)
and sexps =
  ()                   => []
; (l::sexps) (e::sexp) => e::l

Remark: there are some important optimisation for lazy so use the lazy keyword as above as much as possible and do not use Lazy.from_fun or Lazy.force if you can avoid it.

A ssimilar problem arises if you want to evaluate an action immediatly after a newline has been parsed. As Pacomb is trying to read the blank characteres after the newline, the evaluation of the action is retarted.

Remark: right recursion for sequence not using any semantics is optimized and works in O(1) memory. For instance, the following to parse a list of command with dependant grammar (hence right recursion is mandatory) works:

let%parser rec cmds env =
    ()                                          => ()
  ; (top_expr env) '\n' (cmds env)              => ()
  ; ((env,()) >: new_rule env) '\n' (cmds env)  => ()
  ; ((env,()) >: rem_rule env) '\n' (cmds env)  => ()
}]
It works for one main reason: we do not use any semantics
for any item in the rule. The first member of the pair for
dependant grammar is used soon enough and does not need a stack.

For instance, if you do not want to use [input_line] as above for your calculator,
you may want to write:
{[
let%parser rec exprs = () => ()
                     ; exprs (x::expr Sum) '\n' => print_float x

This will not work and printing will not occur just after reading the newline. A solution is to test for newline without reading it to trigger printing:

let nl _ b i _ _ = let (c,_,_) = Input.read b i in c = '\n'
let%parser rec top_expr =
  (t::Grammar.test_after nl expr) => Printf.printf "%g\n=> %!" t
let%parser rec exprs = () => () ; exprs top_expr '\n' => ()

Dependent sequences

Variable binding in the left part of a rule are not available before the final action code. Rherefore, it can not be used for selecting grammar rule. For instance, (p::prio) => (x::g p) => x will report p as unbounded. To solve this, you can use dependent sequences, using (x,y)>:item will allow x (but not y) to be used both in the action and the rule. The separation with dependent and non dependent part is crucial as dependent grammar are memoised and you don't want "noise". Here is an example of grammar using this to implement an extensible calculator (see tests/calc_ext.ml):

let%parser rec expr prio =
  ((pe1,e1)>:expr prio) ((pop,b)>:op pe1 prio) ((__,e2)::expr pop)
                             => (pop, b e1 e2)
; (x::FLOAT)                 => (0.0,x)
; '(' (e::expr max_prio) ')' => (0.0,e)

where op pe1 prio parse binary operator with priorities between pe1 and prio. The action returns both the ewxpression and its priority level.

In principle, you can write a grammar bnf to parse your favorite syntax for BNF and use a rule (g,__)>:bnf (x::parse_my_bnf g) => x to parse any BNF! The function parse_my_bnf should call function in the Pacomb.Grammar module to build the usable grammar and as dependent grammar are memoised, each bnf will be only compiled once. This really gives the power to write self extensible language easily.

Note: one can fairly well control when action are evaluated and therefore the parsing may depends upon global references that are modified by your parser itsel. However, purely functional code should be prefered, even for extensible grammars.

Dependant sequences are also useful to prevent construction of infinite grammar. For instance in this code from examples/calc_ext2.ml:

let%parser rec rule : type a. a ty -> (env -> a Grammar.t) Grammar.t
  = fun t ->
    "Exp" (prio<:FLOAT) (r::rule (Arr(Flt,t))) =>
      (fun env -> (x::expr env (get_prio prio env)) (f::r env) => f x)
  ; "Str" (s<:STRING_LIT) (r::rule t) =>
      (fun env -> (STR s) (x::r env) => x)
  ; "=>" (a::action t) => (fun _ -> () => a)

The construction for the grammar for rule would loop in the "Exp" rule if we were not using <:. epat<:item is equivalent to ((),epat)>:((x::item) => ((),x)). This represent a grammar depending from (), i.e. a constant grammar. The only effect if to delay the construction of the grammar right of <: until it is really used.

Cache and merge for ambiguous grammars

Pacomb can manage ambiguous grammars, but this will result in general in very poor performances and issue a warning: "Parsing ambiguity, use cache with merge" on stderr. You can use Pacomb.Grammar.parse_all_buffer to remove the warning, but you should preferably use cache and merge.

Pacomb.Grammar.cache can be apply to a grammar to produce a cached grammar. This guaranty that this grammar will at most be called once for each position in the parsed buffer. This will not solve ambiguity, but may improve performance. For instance, there are some non ambiguous grammars that Pacomb can not left factorise (left factorisation is not performed to the left of a dependent sequence). The annotation @cache on let%parser will automatically wrap the grammar with a call to Pacomb.Grammar.cache.

When your grammar is ambiguous, you can give an optional merge function to Pacomb.Grammar.cache. This function will be called to merge all results conrresponding to exactly the same portion of the input buffer with the cached grammar.

Here is a toy example (see example/catalan.ml):

let%parser [@merge (+.)] rec bin_seq =
    ()                              => 1.0
  ; (t1::bin_seq) 'a' (t2::bin_seq) => t1 *. t2

This grammar will parse any sequence of 'a' of length n and return the number of binary tree with n nodes (i.e. catalan number)! It does run in polynomial time (O(N^3) in fact, which is not the best that can be done).

In general, using cache and merge, you should be able to reach O(N^3) time complexity and O(N^2) space complexity for any BNF grammar. This seems confirmed by our benchmarks.

Getting positions

In action code (expression right of =>), a lid_lpos or lid_rpos will denote respectively the left and right position of the item named lid. a lid_pos will group both lid_lpos and lid_rpos in a record of type Pos.t * Pos.t. If the item is matched by a tuple and you want to use its position you must use pat = lid syntax to give a name to the whole item. The variables _lpos, _rpos and _pos corresponds to the position of the input parser by the whole rule.

Here is an example parsing sexpr with positions:

type sexp = { p: Pos.t * Pos.t; e : sexp' }
and sexp' =
  | Idt of string
  | Lst of sexp list

let id = "[a-zA-Z_][a-zA-Z_0-9]*[']*"

let%parser rec sexp
   =    ; (x::RE id)     => { p = _pos; e = Idt x }
   ; '(' (l::sexps) ')'  => { p = _pos; e = Lst (List.rev l) })
and sexps = () => []
          ; (l::sexps) (e::sexp) => e::l

The type Pacomb.Pos.t is a very light type to represent position. It contains a record of type Pacomb.Input.infos that is the same for the whole input and the position in bytes in the input of type Pacomb.Input.byte_pos. If you need line numbers and column number, the Pacomb.Pos module contains the necessary function to rescan the input and compute it (with a cache to avoid always rescanning from beginning).

This is important because in general positions are needed only in error messages and computing line and column numer is costly, especially for utf8.

When parsing a stream that is not a regular file, the whole buffer is kept to allow rescanning. If your stream is very large, this is not reasonnable. Two solutions: restart parsing at regular interval, or use the ~rescan:false optionnal parameters and use only position in bytes.

Another problem with rescanning: if you use Grammar.parse_channel or Grammar.parse_fd, you can not get line and column number after closing the channel or file descriptor. However, if you use Grammar.parse_string or Grammar.parse_file you can. In this case, It is even possible to marshal (with closures) the type Pos.t. With Grammar.parse_file position will remain correct as long as the file is unchanged.

Producing nice error messages

The ERROR(m) syntax in the ppx extension allows to add well chosen error messages in your code. Here is a way to use it on the sexp grammar above:

let%parser rec sexp
   = ERROR(["id";"("])
   ; (x::RE id)         => { p = _pos; e = Idt x }
   ; '(' (l::sexps)     => (ERROR(")") | ')'
                        => { p = _pos; e = Lst (List.rev l) })
and sexps = () => []
          ; (l::sexps) (e::sexp) => e::l

Parsing "a b (c + ..." with the above grammar will give

Parse error: Line 1, character 7.
expecting: ) id (

Banchmarks seems to show that adding error cost more or less 15% on the above example.

Limitations

Pacomb must eliminate left recursion in grammars in order to use combinators that would loop otherwise. However, left recursion is not supported if it traverses A Pacomb.Grammar.layout constructor to change blanks (probably possible to solve this, but probably not worth it).

Ambiguous grammar will be analysed in polynomial time if and only if you use a cache for all ambiguous non terminals of your grammar. Pacomb does perform some left factorisation, but it is not complete. Using cache is also a solution to that problem. Pacomb has function to print the grammar to see if left factorisation was performed.

Note: left recursion do not need and is not eliminated if the grammar uses a cache. However, this solution to use cache in general is too slow for non ambiguous grammars so we do not impose a cache to all left recursive grammars.

The ppx extension is not too bad but still suffers from the fact that it uses a sub-language of OCaml to describe grammars. For instance let%parser g = ((_,x)::g) => x is not legal because _ cannot be used in an Ocaml expression. Though the following works: let%parser g = ((__,x)::g) => x. The syntax (((x,y) = z) :: g) => (x,y,z_pos) is not very nice as we use = to replace the as keyword and we also need a lot of parentheses.