Conversion between character encodings
Preliminaries
A character set is a set of characters where every character is identified by a code point. An encoding is a way of representing characters from a set as byte strings. For example, the Unicode character set has more than 96000 characters, and the code points have values from 0 to 0x10ffff (not all code points are assigned yet). The UTF-8 encoding represents the code points by sequences of 1 to 4 bytes. There are also encodings that represent code points from several sets, e.g. EUC-JP covers four sets.

Encodings are enumerated by the type encoding, and their names follow the convention `Enc_*, e.g. `Enc_utf8. Character sets are enumerated by the type charset, and their names follow the convention `Set_*, e.g. `Set_unicode.

This module deals mainly with encodings. It is important to know that the same character set may have several encodings. For example, the Unicode character set can be encoded as UTF-8 or UTF-16. For the 8-bit character sets, however, there is usually only one encoding, e.g. `Set_iso88591 is always encoded as `Enc_iso88591.
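To illustrate the naming conventions (a minimal sketch, not part of the original text; it only assumes the two variant types described above):

  (* Both enumerations are polymorphic variants, so values are written directly. *)
  let enc : Netconversion.encoding = `Enc_utf8      (* an encoding *)
  let set : Netconversion.charset  = `Set_unicode   (* a character set *)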
In a single-byte encoding every code point is represented by one byte. This is what many programmers are accustomed to, and what the OCaml language directly supports: a string is a sequence of chars, where a char is an 8-bit quantity interpreted as a character. For example, the following piece of code allocates a string of four chars and assigns them individually:
  let s = String.create 4 in
  s.[0] <- 'G';
  s.[1] <- 'e';
  s.[2] <- 'r';
  s.[3] <- 'd';
In a multi-byte encoding there are code points that are represented by several bytes. As we still represent such text as string, the problem arises that a single char, actually a byte, often represents only a fraction of a full multi-byte character. There are two solutions:

- Give up the principle that text is represented by string. This is, for example, the approach chosen by Camomile, another OCaml library dealing with Unicode. Instead, text is represented as int array. This way, the algorithms processing the text can remain the same.
- Give up the principle that individual characters can be directly accessed in a text. This is the primary way chosen by Ocamlnet. It means that it is no longer possible to read or write the n-th character of a text. One can, however, still compose texts by concatenating the strings representing individual characters. Furthermore, it is possible to define a cursor for a text that moves sequentially along the text. The consequence is that programmers are restricted to sequential algorithms. Note that the majority of text processing falls into this class.

The corresponding piece of code for Ocamlnet's Unicode implementation is:
  let s =
    let b = Buffer.create 80 in
    Buffer.add_string b (ustring_of_uchar `Enc_utf8 71);   (* 71 = code point of 'G' *)
    Buffer.add_string b (ustring_of_uchar `Enc_utf8 101);  (* 101 = code point of 'e' *)
    Buffer.add_string b (ustring_of_uchar `Enc_utf8 114);  (* 114 = code point of 'r' *)
    Buffer.add_string b (ustring_of_uchar `Enc_utf8 100);  (* 100 = code point of 'd' *)
    Buffer.contents b
It is important to always remember that a char is no longer a character but simply a byte. In many of the following explanations, we strictly distinguish between byte positions or byte counts, and character positions or character counts.

There are a number of special effects that usually only occur in multi-byte encodings:

- Bad encodings: Not every byte sequence is legal. When scanning such text, the functions will raise the exception Malformed_code when they find illegal bytes (see the sketch below).
- Unassigned code points: It may happen that a byte sequence is a correct representation of a code point, but that the code point is unassigned in the character set. When scanning, this is also covered by the exception Malformed_code. When converting from one encoding to another, it is also possible that the code point is only unassigned in the target character set. This case is usually handled by a substitution function subst, and if no such function is defined, by the exception Cannot_represent.
- Incomplete characters: The trailing bytes of a string may be the correct beginning of a byte sequence for a character, but not a complete sequence. Of course, if that string is the end of a text, this is simply illegal, and also a case for Malformed_code. However, when text is processed chunk by chunk, this phenomenon may legally occur for all chunks but the last. For this reason, some of the functions below handle this case specially.
- Byte order marks: Some encodings have both big and little endian variants. A byte order mark at the beginning of the text declares which variant is actually used. This byte order mark is a declaration written like a character, but it is not actually a character.

There is a special class of encodings known as ASCII-compatible. They are important because there are lots of programs and protocols that only interpret bytes from 0 to 127, and treat the bytes from 128 to 255 as data. These programs can process texts as long as the bytes from 0 to 127 are used as in ASCII. Fortunately, many encodings are ASCII-compatible, including UTF-8.
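The following is a minimal sketch (not from the original documentation) of how such errors surface in practice, using uarray_of_ustring and the Malformed_code exception described further below; the byte string is an assumed example of a truncated UTF-8 sequence:

  (* Sketch: a truncated UTF-8 sequence triggers Malformed_code when scanned. *)
  let () =
    let bad = "Gerd\xc3" in     (* the final byte starts a two-byte sequence *)
    try
      ignore (Netconversion.uarray_of_ustring `Enc_utf8 bad)
    with Netconversion.Malformed_code ->
      prerr_endline "input is not valid UTF-8"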
Unicode
Netconversion is centred around Unicode. The conversion from one encoding to another works by finding the Unicode code point of the character to convert, and by representing that code point in the target encoding, even if neither encoding has anything to do with Unicode. Of course, this approach requires that all character sets handled by Netconversion are subsets of Unicode.

The supported range of Unicode code points: 0 to 0xd7ff, 0xe000 to 0xfffd, 0x10000 to 0x10ffff. All these code points can be represented in UTF-8 and UTF-16. Netconversion does not know which of the code points are assigned and which are not, and because of this, it simply allows all code points of the mentioned ranges (for the other character sets, however, the necessary lookup tables exist).

UTF-8: The UTF-8 representation can have one to four bytes. Malformed byte sequences are always rejected, even those that try to cheat the reader, like "0xc0 0x80" for the code point 0. There is special support for the Java variant of UTF-8 (`Enc_java). `Enc_utf8 strings must not have a byte order mark (it would be interpreted as the "zero-width space" character). However, the Unicode standard allows byte order marks at the very beginning of texts; use `Enc_utf8_opt_bom in this case.

UTF-16: When reading from a string encoded as `Enc_utf16, a byte order mark is expected at the beginning. The detected variant (`Enc_utf16_le or `Enc_utf16_be) is usually returned by the parsing function. The byte order mark is not included in the output string. Some functions of this module cannot cope with `Enc_utf16 (i.e. UTF-16 without endianness annotation), and will fail.

Once the endianness is determined, the code point 0xfeff is no longer interpreted as a byte order mark, but as "zero-width no-break space".

Some code points are represented by pairs of 16-bit values; these are the so-called "surrogate pairs". They can only occur in UTF-16.

UTF-32: This is very much the same as for UTF-16. There is a little endian version `Enc_utf32_le and a big endian version `Enc_utf32_be.
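As an illustration (a minimal sketch, not part of the original text), converting ISO-8859-1 to UTF-8 with the convert function documented below goes through the Unicode code points of the characters:

  (* Sketch: convert a Latin-1 string to UTF-8; the byte 0xe9 is U+00E9 in ISO-8859-1. *)
  let utf8_text =
    Netconversion.convert ~in_enc:`Enc_iso88591 ~out_enc:`Enc_utf8 "Caf\xe9"
  (* utf8_text = "Caf\xc3\xa9" *)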
Subsets of Unicode
The non-Unicode character sets are subsets of Unicode. Here, it may happen that a Unicode code point has no corresponding code point in the subset. In this case, certain rules are applied to handle this (see below). It is, however, ensured that every non-Unicode code point has a corresponding Unicode code point. (In other words, character sets for which this property does not hold cannot be supported.)

It is even possible to create further subsets artificially. The encoding `Enc_subset(e,def) derives a new encoding from the existing one e, but accepts only the code points for which the definition function def yields true. For example, the encoding

  `Enc_subset(`Enc_usascii,
              fun i -> i <> 34 && i <> 38 && i <> 60 && i <> 62)

is ASCII without the angle brackets, the quotation mark, and the ampersand character, i.e. the subset of ASCII that can be included in HTML text without escaping.

If a code point is not defined by the encoding but found in a text, the reader will raise the exception Malformed_code. When text is output, however, the subst function will be called for undefined code points (which raises Cannot_represent by default). The subst function is an optional argument of many conversion functions that allows one to insert a substitution text for undefined code points. Note, however, that the substitution text is restricted to at most 50 characters (because unlimited length would lead to difficult problems we would like to avoid).
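A minimal sketch (not from the original text) of using such a subset together with a subst function; the helper name escape_html and the choice of HTML numeric character references are just an illustration:

  (* Sketch: HTML-safe ASCII as output encoding. Code points outside the
     subset (less-than, greater-than, double quote, ampersand, and every
     non-ASCII character) are replaced by HTML numeric character references. *)
  let html_ascii =
    `Enc_subset (`Enc_usascii,
                 fun i -> i <> 34 && i <> 38 && i <> 60 && i <> 62)

  let escape_html s =
    Netconversion.convert
      ~subst:(fun code -> Printf.sprintf "&#%d;" code)
      ~in_enc:`Enc_utf8
      ~out_enc:html_ascii
      s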
Linking this module
Many encodings require lookup tables. The following encodings are built-in and always supported:

- Unicode: `Enc_utf8, `Enc_java, `Enc_utf16, `Enc_utf16_le, `Enc_utf16_be, `Enc_utf32, `Enc_utf32_le, `Enc_utf32_be
- Other: `Enc_usascii, `Enc_iso88591, `Enc_empty

The lookup tables for the other encodings are usually loaded at runtime, but it is also possible to embed them in the generated binary executable. See Netunidata for details. The functions available_input_encodings and available_output_encodings can be invoked to find out which encodings can be loaded, or are otherwise available.
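For example (a sketch; it assumes, as in Ocamlnet's Netconversion, that both functions take unit and that string_of_encoding is available to print encoding names):

  (* Sketch: print the names of all encodings that can currently be read. *)
  let () =
    List.iter
      (fun enc -> print_endline (Netconversion.string_of_encoding enc))
      (Netconversion.available_input_encodings ())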
Supported Encodings, Restrictions
I took the mappings from www.unicode.org, and the standard names of the character sets from IANA. Obviously, many character sets that could be supported are still missing, especially ISO646 character sets and many EBCDIC code pages. Stateful encodings like generic ISO-2022 have been omitted (stateless subsets of ISO-2022 like EUC can be supported, however; currently we support EUC-JP and EUC-KR).

Because of the copyright statement from Unicode, I cannot put the source tables that describe the mappings into the distribution. They are publicly available from www.unicode.org.
Known Problems
- The following charsets do not have a bijective mapping to Unicode: adobe_standard_encoding, adobe_symbol_encoding, adobe_zapf_dingbats_encoding, cp1002 (0xFEBE). The current implementation simply removes one of the conflicting code point pairs; this might not be what you want.
- Japanese encodings: JIS X 0208: The character 1/32 is mapped to 0xFF3C, and not to 0x005C.
Interface
Naming conventions:

As it is possible to refer to substrings by either giving a byte offset or by counting whole characters, these naming conventions are helpful:

- Labels called range_pos and range_len refer to byte positions of characters or substrings
- Labels called count refer to positions given as the number of characters relative to an origin

Furthermore:

- A uchar is a single Unicode code point represented as an int
- A ustring is a string of encoded characters
- A uarray is an array of int representing a string
exception Malformed_code

Raised when an illegal byte sequence is found
exception Cannot_represent of int
Raised when a certain Unicode code point cannot be represented in the selected output encoding
type encoding = [
  | `Enc_utf8
  (* ... variants for the other supported encodings ... *)
  | `Enc_subset of encoding * (int -> bool)
]
The polymorphic variant type enumerating the supported encodings. We have:

- `Enc_utf8: UTF-8
- `Enc_subset(e,def): the subset of e obtained by applying the definition function def
Unicode String Functions
val uarray_of_ustring :
  encoding ->
  ?range_pos:int ->
  ?range_len:int ->
  string ->
  int array
Returns the characters of the string as an array of Unicode code points.
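For example (a sketch, not from the original documentation; the input is an assumed UTF-8 literal):

  (* Sketch: the four ASCII bytes of Gerd plus the two UTF-8 bytes 0xc3 0xa9
     (U+00E9) decode to five code points. *)
  let cps = Netconversion.uarray_of_ustring `Enc_utf8 "Gerd\xc3\xa9"
  (* cps = [| 71; 101; 114; 100; 233 |] *)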
val ustring_of_uarray :
  ?subst:(int -> string) ->
  encoding ->
  ?pos:int ->
  ?len:int ->
  int array ->
  string
Returns the array of Unicode code points as an encoded string.
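For example (a sketch, not from the original documentation):

  (* Sketch: encode the code points of the word Gerd as a UTF-8 string. *)
  let s = Netconversion.ustring_of_uarray `Enc_utf8 [| 71; 101; 114; 100 |]
  (* s = "Gerd" *)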
val convert :
  ?subst:(int -> string) ->
  in_enc:encoding ->
  out_enc:encoding ->
  ?range_pos:int ->
  ?range_len:int ->
  string ->
  string
Converts the string from in_enc to out_enc, and returns it. The string must consist of a whole number of characters. If it ends with an incomplete multi-byte character, however, this is detected, and the exception Malformed_code will be raised. This exception is also raised for other encoding errors in the input string.
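A minimal sketch (not from the original documentation) of what happens without a subst function when the input contains a character the output encoding cannot represent:

  (* Sketch: converting non-ASCII input to US-ASCII without ~subst raises
     Cannot_represent, here with code point 233 (U+00E9). *)
  let () =
    try
      ignore
        (Netconversion.convert
           ~in_enc:`Enc_utf8 ~out_enc:`Enc_usascii "Caf\xc3\xa9")
    with Netconversion.Cannot_represent code ->
      Printf.eprintf "cannot represent code point %d in US-ASCII\n" code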
val makechar : encoding -> int -> string
makechar enc i: Creates the string representing the Unicode code point i in the encoding enc. Raises Not_found if the character is legal but cannot be represented in enc.

Possible encodings: everything but `Enc_utf16 and `Enc_utf32.

Evaluation hints:
- PRE_EVAL(encoding)
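For example, for makechar above (a sketch, not from the original documentation):

  (* Sketch: the UTF-8 representation of U+00E9 is the two bytes 0xc3 0xa9. *)
  let e_acute = Netconversion.makechar `Enc_utf8 0xe9
  (* e_acute = "\xc3\xa9" *)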