package molenc

Molecular encoder/featurizer using rdkit and OCaml

Sources

v16.2.0.tar.gz
sha256=3c8530da23d2e646c967cf2fe531cf10b0f2f472d1450e85a51c3fdaf14add00
md5=b220f63cadabbd908ea99932028f7ce9

Description

Chemical fingerprints are lossy encodings of molecules. molenc encodes molecules as unfolded-counted fingerprints (i.e. potentially very long but sparse vectors of positive integers).
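As a rough illustration of what "unfolded-counted" and "sparse" mean here, the sketch below counts how often each feature occurs in one molecule and stores only the features that are present. The feature labels and the function name counted_fingerprint are hypothetical and chosen for illustration; they are not molenc output or molenc code.

```python
from collections import Counter

def counted_fingerprint(features):
    """Return a sparse counted fingerprint: feature -> number of occurrences.

    Only features that occur at least once are stored, so every value is a
    positive integer and absent features take no space.
    """
    return dict(Counter(features))

# Hypothetical feature labels for one molecule (not real molenc features).
features = ["feat_c", "feat_c", "feat_o", "feat_c"]
fp = counted_fingerprint(features)
```

Because nothing is hashed into a fixed number of bins, the vector has no preset length: it grows with the number of distinct features seen.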

Currently, Faulon fingerprints and atom pairs are supported.

Currently, an atom type is the quadruplet (number of pi electrons, element symbol, number of heavy-atom neighbors, formal charge). In the future, pharmacophore features might be supported (a more abstract/fuzzy atom typing scheme).
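To make the atom typing and the atom pair idea more concrete, here is a minimal sketch under stated assumptions. It represents each atom type as the quadruplet described above, takes the molecule as a hand-written adjacency list (no rdkit involved), and counts atom pairs as (type, distance, type) triples. Measuring the distance in bonds along the shortest path is an assumption about the convention, and the function names are hypothetical; this is not how molenc itself is implemented.

```python
from collections import Counter, deque

def bond_distances(adjacency, start):
    """Breadth-first search: shortest distance in bonds from start to each atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        a = queue.popleft()
        for b in adjacency[a]:
            if b not in dist:
                dist[b] = dist[a] + 1
                queue.append(b)
    return dist

def atom_pairs(atom_types, adjacency):
    """Count (type, distance, type) features over all pairs of connected atoms."""
    counts = Counter()
    n = len(atom_types)
    for i in range(n):
        dist = bond_distances(adjacency, i)
        for j in range(i + 1, n):
            if j in dist:
                # Sort the two types so the feature does not depend on atom order.
                lo, hi = sorted((atom_types[i], atom_types[j]))
                counts[(lo, dist[j], hi)] += 1
    return counts

# Propane heavy atoms C0-C1-C2, typed by hand as
# (pi electrons, element symbol, heavy-atom neighbors, formal charge).
CH3 = (0, "C", 1, 0)
CH2 = (0, "C", 2, 0)
atom_types = [CH3, CH2, CH3]
adjacency = {0: [1], 1: [0, 2], 2: [1]}
pairs = atom_pairs(atom_types, adjacency)
```

For propane this gives two occurrences of the CH3/CH2 pair at distance 1 and one occurrence of the CH3/CH3 pair at distance 2.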

Bibliography:

Faulon, J. L., Visco, D. P., & Pophale, R. S. (2003). The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies. Journal of Chemical Information and Computer Sciences, 43(3), 707-720.

Carhart, R. E., Smith, D. H., & Venkataraghavan, R. (1985). Atom pairs as molecular features in structure-activity studies: definition and applications. Journal of Chemical Information and Computer Sciences, 25(2), 64-73.

Kearsley, S. K., Sallamack, S., Fluder, E. M., Andose, J. D., Mosley, R. T., & Sheridan, R. P. (1996). Chemical similarity using physiochemical property descriptors. Journal of Chemical Information and Computer Sciences, 36(1), 118-127.

OpenSMILES specification. Craig A. James et al. v1.0 2016-05-15. http://opensmiles.org/opensmiles.html

Published: 07 Oct 2021

README

Introduction

MolEnc: a molecular encoder using rdkit and OCaml.

The implemented fingerprint is J.-L. Faulon's "Signature Molecular Descriptor" (SMD[1]), an unfolded-counted chemical fingerprint. Such fingerprints are less lossy than widely used chemical fingerprints such as ECFP4: SMD encoding does not introduce feature collisions. A feature dictionary is also created at encoding time. This dictionary can be used later to map a given feature index back to an atom environment.
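The role of the feature dictionary can be sketched as follows. This is a minimal illustration, not molenc's code: the class FeatureDictionary, the function names, and the environment labels are all hypothetical. Each previously unseen feature receives the next free index, so distinct features can never share an index, and the index can be mapped back to the feature. For contrast, folded_index shows a generic hash-and-fold scheme (CRC32 modulo a fixed number of bins), used here only as a stand-in for fixed-length fingerprints, where collisions become unavoidable once there are more features than bins.

```python
import zlib

class FeatureDictionary:
    """Assigns a unique integer index to each distinct feature, in order seen."""

    def __init__(self):
        self.index_of = {}    # feature -> index
        self.feature_at = []  # index -> feature (reverse lookup)

    def index(self, feature):
        if feature not in self.index_of:
            self.index_of[feature] = len(self.feature_at)
            self.feature_at.append(feature)
        return self.index_of[feature]

def encode(features, dictionary):
    """Sparse counted fingerprint keyed by dictionary index."""
    fp = {}
    for f in features:
        i = dictionary.index(f)
        fp[i] = fp.get(i, 0) + 1
    return fp

def folded_index(feature, n_bins):
    """Generic hash-and-fold indexing; distinct features may collide."""
    return zlib.crc32(feature.encode("utf-8")) % n_bins

# Hypothetical atom environment labels for one molecule.
environments = ["env_a", "env_b", "env_a", "env_c", "env_d", "env_e"]
dix = FeatureDictionary()
fp = encode(environments, dix)
folded = {folded_index(f, 4) for f in set(environments)}
```

With five distinct environments, the dictionary yields five distinct indexes, while folding into four bins necessarily merges at least two of them.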

We recommend using a radius of zero to one (molenc.sh -r 0:1 ...) or zero to two.

Currently, the fingerprint is computed using atom types (#pi-electrons, element symbol, #HA neighbors, formal charge).

In the future, we might add pharmacophore feature points[3] (Donor, Acceptor, PosIonizable, NegIonizable, Aromatic, Hydrophobe), to allow a fuzzier description of molecules. It is also planned to support atom pairs[2] in addition to or in combination with SMD.

How to install the software

For beginners/non-opam users: download the latest self-installer shell script from https://github.com/UnixJunkie/molenc/releases.

Then execute:

./molenc-5.0.1.sh ~/usr/molenc-5.0.1

This will create ~/usr/molenc-5.0.1/bin/molenc.sh, among other things inside the same directory.

For opam users:

opam install molenc

Do not hesitate to contact the author if you have problems installing or using the software, or if you have any questions.

Usage

molenc.sh -i input.smi -o output.txt
         [-d encoding.dix]; reuse existing dictionary
         [-r i:j]; fingerprint radius (default=0:1)
         [--seq]; sequential mode (disable parallelization)
         [--no-std]; don't standardize input file molecules
                     ONLY USE IF THEY HAVE ALREADY BEEN STANDARDIZED

How to encode a database of molecules:

molenc.sh -i molecules.smi -o molecules.txt

How to encode another database of molecules, but reusing the feature dictionary from another database:

molenc.sh -i other_molecules.smi -o other_molecules.txt -d molecules.txt.dix
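Reusing a dictionary keeps feature indexes consistent across databases, so that vectors encoded from both can be compared index by index. The sketch below illustrates the idea with a frozen dictionary. Note the assumptions: the function name and data are hypothetical, the .dix file format is not parsed here, and counting features absent from the dictionary separately is only one possible policy; how molenc treats such features is not stated in this document.

```python
def encode_with_frozen_dictionary(features, index_of):
    """Encode features using an existing feature -> index mapping.

    The mapping is never modified. Features missing from it are only
    counted (an assumed policy, not necessarily what molenc does).
    """
    fp = {}
    unknown = 0
    for f in features:
        i = index_of.get(f)
        if i is None:
            unknown += 1
        else:
            fp[i] = fp.get(i, 0) + 1
    return fp, unknown

# Hypothetical dictionary built while encoding a first database.
index_of = {"env_a": 0, "env_b": 1}
# Hypothetical features of a molecule from a second database.
fp, unknown = encode_with_frozen_dictionary(["env_b", "env_b", "env_z"], index_of)
```

Here index 1 refers to the same environment in both databases, which is the point of passing the existing dictionary.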

Bibliography

[1] Faulon, J. L., Visco, D. P., & Pophale, R. S. (2003). The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies. Journal of Chemical Information and Computer Sciences, 43(3), 707-720.

[2] Carhart, R. E., Smith, D. H., & Venkataraghavan, R. (1985). Atom pairs as molecular features in structure-activity studies: definition and applications. Journal of Chemical Information and Computer Sciences, 25(2), 64-73.

[3] Kearsley, S. K., Sallamack, S., Fluder, E. M., Andose, J. D., Mosley, R. T., & Sheridan, R. P. (1996). Chemical similarity using physiochemical property descriptors. Journal of Chemical Information and Computer Sciences, 36(1), 118-127.

Dependencies (15)

  1. line_oriented
  2. vector3
  3. parany >= "12.1.1"
  4. ocamlgraph
  5. ocaml >= "4.07" & < "5.0"
  6. minicli >= "5.0.0"
  7. dune >= "1.11"
  8. dolog >= "5.0.0"
  9. dokeysto_camltc
  10. cpm >= "9.0.0"
  11. conf-rdkit
  12. conf-python-3
  13. conf-graphviz
  14. bst >= "2.0.0"
  15. batteries >= "3.2.0" & < "3.4.0"

Dev Dependencies

None

Used by (5)

  1. hts_shrink >= "3.0.1"
  2. linwrap >= "9.0.3"
  3. oranger >= "3.0.1"
  4. rankers < "2.0.9"
  5. svmwrap

Conflicts

None
