package rosetta
Sources
sha256=dffb638f78af6b03f23b285137538bb2575f67ecddbecab3ee7248b4de284564
sha512=c48851ccc623a24b7c2c6e6db959382933a5b4caa8e19abf07b2fdb787d9e3736edd572f8b5a479fdec943c1df3e96ec6a163e28c98fdfb144d15a181e76a15a
Rosetta - universal decoder of an encoded flow to Unicode
Rosetta is a merge point between uuuu, coin and yuscii. It can decode UTF-7, ISO-8859 and KOI8 input and return Unicode code points; the end user can then normalize them to UTF-8, for example with uutf.
The final goal is to provide a universal decoder for any encoding. This project is part of mrmime, an email parser, where it is used to decode encoded-words (according to RFC 2047).
If you want to handle a new encoding (like, say, APL-ISO-IR-68...), you can open a new issue. The process is then to write a small new library and integrate it into rosetta.
How to use it?
rosetta follows the same design as the underlying libraries it uses. More precisely, it mirrors the decoding API of uutf. Here is a small example that transcodes a latin1 flow to UTF-8:
let trans ic oc =
  let decoder = Rosetta.decoder (Rosetta.encoding_of_string "latin1") (`Channel ic) in
  let encoder = Uutf.encoder `UTF_8 (`Channel oc) in
  let rec go () = match Rosetta.decode decoder with
    | `Await -> assert false (* XXX(dinosaure): impossible when you use `String or `Channel as source. *)
    | `Uchar _ as uchar -> ignore @@ Uutf.encode encoder uchar ; go ()
    | `End -> ignore @@ Uutf.encode encoder `End (* flush the encoder *)
    | `Malformed err -> failwith err in
  go ()

let () = trans stdin stdout
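The same loop can also run over an in-memory string. The following is a minimal sketch, assuming the `String source accepted by Rosetta.decoder (as hinted by the comment in the example above) and the standard library's Buffer.add_utf_8_uchar (OCaml 4.06 or later). The helper name to_utf_8 is hypothetical and not part of rosetta.

```ocaml
(* [to_utf_8] is a hypothetical helper, not part of rosetta.
   It decodes [input] from [charset] and re-encodes it as UTF-8. *)
let to_utf_8 ~charset input =
  let decoder =
    Rosetta.decoder (Rosetta.encoding_of_string charset) (`String input) in
  let buf = Buffer.create (String.length input) in
  let rec go () =
    match Rosetta.decode decoder with
    | `Await -> assert false (* not expected with a `String source *)
    | `Uchar u -> Buffer.add_utf_8_uchar buf u ; go ()
    | `Malformed err -> Error err (* stop at the first malformed sequence *)
    | `End -> Ok (Buffer.contents buf)
  in
  go ()
```

Returning a result instead of calling failwith lets the caller decide how to react to malformed input.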
About encoding_of_string
rosetta follows the aliases available in the IANA character sets database: https://www.iana.org/assignments/character-sets.xhtml
Any other alias raises an exception. This function is case-insensitive.
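When the charset name comes from untrusted input (an email header, for instance), it can be convenient to turn the exception into an option. The sketch below is only illustrative: the README does not say which exception is raised, so it catches any exception, and the name encoding_of_string_opt is hypothetical.

```ocaml
(* [encoding_of_string_opt] is a hypothetical wrapper, not part of rosetta.
   The exact exception raised for an unknown alias is not documented here,
   so this sketch conservatively catches all of them. *)
let encoding_of_string_opt name =
  match Rosetta.encoding_of_string name with
  | encoding -> Some encoding
  | exception _ -> None
```

If you know the precise exception raised by your version of rosetta, matching on it explicitly is preferable to a catch-all.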
About translation tables
rosetta relies on underlying libraries such as uuuu or coin. They integrate the translation tables provided by the Unicode Consortium. These tables are not expected to change, so we store them statically in an int array.
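To illustrate the idea of a static translation table, here is a sketch that uses only the standard library. It is not the code of uuuu or coin: the table is built with Array.init for brevity (ISO-8859-1 maps every byte to the code point of the same value), and the names latin1_table and decode_with_table are hypothetical.

```ocaml
(* Hypothetical table for ISO-8859-1: byte [b] maps to code point U+00[b].
   Real tables are derived from the Unicode Consortium mapping files. *)
let latin1_table : int array = Array.init 256 (fun byte -> byte)

(* Look up each input byte in [table] and append the resulting
   code point to a buffer as UTF-8. *)
let decode_with_table table input =
  let buf = Buffer.create (String.length input) in
  String.iter
    (fun chr ->
      let code_point = table.(Char.code chr) in
      Buffer.add_utf_8_uchar buf (Uchar.of_int code_point))
    input ;
  Buffer.contents buf

let () = assert (decode_with_table latin1_table "caf\xe9" = "caf\xc3\xa9")
```

A single array lookup per byte is enough for any single-byte encoding, which is why a plain int array is a reasonable representation.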
About encoding
rosetta only supports decoding to Unicode code points. Support for encoding is not planned, since people should only use Unicode now. Dealing with many encodings is a pain, and we should only produce Unicode output rather than legacy encodings such as latin1.