ConClusion: Cluster algorithms, PCA, and chemical conformer analysis


Please see the README on GitLab at https://gitlab.com/theoretical-chemistry-jena/quantum-chemistry/ConfoCluster



Properties

Versions 0.0.1, 0.0.1, 0.0.2, 0.1.0
Change log Changelog.md
Dependencies aeson (==1.5.*), attoparsec (>=0.13.0.0 && <0.15), base (>=4.7 && <4.15), cmdargs (>=0.10.0 && <0.11), ConClusion, containers (>=0.6.0.0 && <0.7), formatting (>=7.1.0 && <7.2), hmatrix (>=0.20.0 && <0.21), massiv (>=0.6.0.0 && <0.7), optics (>=0.3 && <0.5), PSQueue (>=1.1.0.1 && <1.2), rio (>=0.1.13.0 && <0.2), text (>=1.2.0.0 && <1.3) [details]
License AGPL-3.0-only
Copyright 2021 Phillip Seeber
Author Phillip Seeber
Maintainer phillip.seeber@googlemail.com
Category Statistics, Chemistry
Bug tracker https://gitlab.com/theoretical-chemistry-jena/quantum-chemistry/ConfoCluster/-/issues
Source repo head: git clone https://gitlab.com/theoretical-chemistry-jena/quantum-chemistry/ConfoCluster
Uploaded by phillipseeber at 2021-04-26T08:10:34Z


Readme for ConClusion-0.0.1


ConClusion

ConClusion provides principal component analysis, hierarchical clustering and DBSCAN in Haskell. There is also a command line interface for processing CREST conformer trajectories, hence the name: CONformer CLUStering. Analysing conformer data takes three steps:

  1. Read the trajectory and calculate a set of features for each conformer. The features can include the energy, a set of bond lengths, a set of bond angles, and a set of dihedral angles in arbitrary combination. These descriptors form a feature matrix.
  2. Optionally, perform a principal component analysis of the feature matrix to reduce the number of dimensions and remove redundancies.
  3. Cluster the (potentially PCA-processed) feature matrix. Different distance measures are available. Either DBSCAN or hierarchical clustering can be used to group the conformers.
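
As an illustration of step 1, the following is a minimal sketch of how a dihedral angle feature could be computed from four atom positions. It is not ConClusion's API: the type `Vec` and the functions `dihedral` and `toDegrees` are hypothetical names chosen for this example, and the sign convention is only one of the common ones.

```haskell
-- Hypothetical helper code, not part of ConClusion's API.
type Vec = (Double, Double, Double)

sub :: Vec -> Vec -> Vec
sub (a, b, c) (x, y, z) = (a - x, b - y, c - z)

dot :: Vec -> Vec -> Double
dot (a, b, c) (x, y, z) = a * x + b * y + c * z

cross :: Vec -> Vec -> Vec
cross (a, b, c) (x, y, z) = (b * z - c * y, c * x - a * z, a * y - b * x)

norm :: Vec -> Double
norm v = sqrt (dot v v)

-- | Signed dihedral angle in radians for the atom chain p0-p1-p2-p3,
-- computed as atan2 (|b2| b1.(b2 x b3)) ((b1 x b2).(b2 x b3)).
dihedral :: Vec -> Vec -> Vec -> Vec -> Double
dihedral p0 p1 p2 p3 = atan2 y x
  where
    b1 = p1 `sub` p0
    b2 = p2 `sub` p1
    b3 = p3 `sub` p2
    n1 = b1 `cross` b2 -- normal of the plane p0, p1, p2
    n2 = b2 `cross` b3 -- normal of the plane p1, p2, p3
    x = n1 `dot` n2
    y = norm b2 * (b1 `dot` n2)

toDegrees :: Double -> Double
toDegrees r = r * 180 / pi

main :: IO ()
main = do
  let p0 = (1, 0, 0)
      p1 = (0, 0, 0)
      p2 = (0, 1, 0)
  -- cis arrangement: p3 on the same side as p0, angle close to 0 degrees
  print (toDegrees (dihedral p0 p1 p2 (1, 1, 0)))
  -- trans arrangement: p3 on the opposite side, magnitude close to 180 degrees
  print (toDegrees (dihedral p0 p1 p2 (-1, 1, 0)))
```

Each conformer then contributes one row to the feature matrix, with one such angle per selected atom quadruple.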

While the command line interface only fits the workflow described above, the underlying clustering algorithms and the PCA are implemented in a general way and can be used independently as a library.
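
To show what the PCA step does independently of the command line interface, here is a minimal sketch on plain lists. It is an assumption-laden toy and not the library's implementation: it finds only the first principal axis, uses power iteration with a fixed number of steps, and all names (`centre`, `covariance`, `leadingAxis`) are hypothetical.

```haskell
import Data.List (transpose)

-- Rows are observations (conformers), columns are features.
type Matrix = [[Double]]

dot :: [Double] -> [Double] -> Double
dot a b = sum (zipWith (*) a b)

normalise :: [Double] -> [Double]
normalise v = map (/ sqrt (dot v v)) v

-- | Subtract the column mean from every column.
centre :: Matrix -> Matrix
centre rows = map (\r -> zipWith (-) r means) rows
  where
    n = fromIntegral (length rows)
    means = map ((/ n) . sum) (transpose rows)

-- | Sample covariance matrix of already mean-centred data.
covariance :: Matrix -> Matrix
covariance xs = [[dot ci cj / (n - 1) | cj <- cols] | ci <- cols]
  where
    cols = transpose xs
    n = fromIntegral (length xs)

-- | Leading eigenvector of a symmetric matrix by power iteration.
-- Assumption: the all-ones start vector is not orthogonal to that eigenvector.
leadingAxis :: Matrix -> [Double]
leadingAxis c = iterate step start !! 200
  where
    start = normalise (replicate (length c) 1)
    step v = normalise (map (`dot` v) c)

main :: IO ()
main = do
  -- Made-up data: two strongly correlated features.
  let features = [[1, 2], [2, 4.1], [3, 5.9], [4, 8]]
      centred = centre features
      axis = leadingAxis (covariance centred)
  putStrLn ("first principal axis: " ++ show axis)
  -- Projecting onto the axis reduces two features to one score per row.
  putStrLn ("scores: " ++ show (map (`dot` axis) centred))
```

A production implementation would normally rely on an eigendecomposition or SVD from a linear algebra library and keep several components.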

Installation

Bundled Archive

A self-contained executable archive is built for the main branch and for each release. It only has to be downloaded and can be executed directly on any Linux system. Go to the releases page and download an archive. Make it executable (e.g. chmod +x conclusion) and you are done!

From Source

If you have the Haskell toolchain installed, and therefore a working Cabal and GHC, you may build ConClusion from source. This also requires working BLAS and LAPACK libraries on your system.

git clone https://gitlab.com/theoretical-chemistry-jena/quantum-chemistry/ConfoCluster.git ConClusion
cd ConClusion
cabal install --installdir="$PREFIX"

Choose a PREFIX where to install the executable. $HOME/.local/bin/ is often a good choice.

If you would like to use ConClusion on systems where Nix is not available (Windows, BSD, ...) this is the way to go.

With Nix

If Nix is available on your system, everything can be built with Nix:

git clone https://gitlab.com/theoretical-chemistry-jena/quantum-chemistry/ConfoCluster.git ConClusion
cd ConClusion/nix
nix-build -A ConClusion.components.exes.conclusion

Usage

The command line interface of the conclusion executable offers full control over all three steps described above.

Each processing step produces a Gnuplot-compatible file (space-separated columns). The raw feature matrix is written to features.dat, the results of the PCA to pca.dat and the clustering results to cluster.dat. The first column of cluster.dat is an integer giving the number of the cluster the point belongs to, which can be used for colour-coding in Gnuplot.
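
Besides plotting, the cluster column can be post-processed directly. The sketch below groups data rows by the integer in the first column. It makes several assumptions that are not confirmed by the text above: lines that do not start with an integer (for example blank or comment lines) are skipped, the remaining columns are ignored, and the sample contents are made up. The names `clusterId` and `clusters` are hypothetical.

```haskell
import Data.Function (on)
import Data.List (groupBy, sortOn)
import Data.Maybe (mapMaybe)
import Text.Read (readMaybe)

-- | Read the cluster number from the first column of a line, if present.
clusterId :: String -> Maybe Int
clusterId line = case words line of
  (c : _) -> readMaybe c
  [] -> Nothing

-- | Group 0-based data row indices by cluster number.
-- sortOn is stable, so indices stay in ascending order within each cluster.
clusters :: String -> [(Int, [Int])]
clusters contents =
  [ (c, map snd grp)
  | grp@((c, _) : _) <- groupBy ((==) `on` fst) (sortOn fst (zip ids [0 ..]))
  ]
  where
    ids = mapMaybe clusterId (lines contents)

main :: IO ()
main = do
  -- Made-up stand-in for the contents of cluster.dat;
  -- with a real file one would use: contents <- readFile "cluster.dat"
  let contents = unlines ["0 1.20 -0.31", "1 -2.05 0.88", "0 1.15 -0.27", "1 -1.98 0.91"]
  mapM_ print (clusters contents)
```

Whether a row index maps one-to-one onto a conformer index in the trajectory depends on the order in which rows are written, which should be checked against an actual output file.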

Example

A perylene dye with four phenoxy groups has several conformers with different spectral properties. For solubility, the dye also carries some alkyl groups. CREST finds about 1400 conformers, most of which differ only in the alkyl side chains, which do not influence the spectral properties. The number of conformers that differ in the positions of the phenoxy groups is therefore much smaller. From each of those groups, the lowest-energy conformer shall be obtained. We therefore select eight dihedral angles and the energy as features: two dihedrals per phenoxy group, one describing the rotation around the perylene-O bond and one describing the rotation around the O-Ph bond. The dihedral angles are not independent of each other, because some orientations of the phenoxy groups are not possible, so we use a PCA to reduce the dimensionality and remove redundancies. After the PCA, DBSCAN is used to obtain clusters of similar conformers. Because CREST sorts conformers by energy, the lowest index in each cluster is also the lowest-energy conformer of that group.

conclusion \
  --xyz=crest_conformers.xyz \
  --pca=3 \
  --dim="e, d 19 18 2 56, d 18 2 56 65, d 16 15 1 45, d 15 1 45 46, d 11 13 0 34, d 13 0 34 43, d 29 31 3 67, d 31 3 67 76" \
  --measure=manhattan \
  --cluster=dbscan \
  --distance=0.3 \
  --minsize=5
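
The idea behind the last two steps of this example, density-based clustering followed by picking the lowest index per cluster, can be sketched as follows. This is a textbook-style DBSCAN on plain lists with the Manhattan distance, not ConClusion's implementation. The parameters `eps` and `minPts` are only loosely analogous to `--distance` and `--minsize` (an assumption), the names are hypothetical, and the sample points are made up.

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as M

type Point = [Double]

manhattan :: Point -> Point -> Double
manhattan a b = sum (zipWith (\x y -> abs (x - y)) a b)

-- | Map from point index to label: Nothing = noise, Just k = cluster k.
-- A point counts as a core point if at least minPts points (itself included)
-- lie within eps.
dbscan :: Double -> Int -> [Point] -> M.Map Int (Maybe Int)
dbscan eps minPts pts = fst (foldl' visit (M.empty, 0) idx)
  where
    idx = [0 .. length pts - 1]
    neighbours i = [j | j <- idx, manhattan (pts !! i) (pts !! j) <= eps]
    visit (labels, k) i
      | i `M.member` labels = (labels, k) -- already handled
      | length ns < minPts = (M.insert i Nothing labels, k) -- noise for now
      | otherwise = (expand k ns (M.insert i (Just k) labels), k + 1)
      where
        ns = neighbours i
    expand _ [] labels = labels
    expand k (q : queue) labels = case M.lookup q labels of
      Just (Just _) -> expand k queue labels -- already in a cluster
      Just Nothing -> expand k queue (M.insert q (Just k) labels) -- border point
      Nothing ->
        let ns = neighbours q
            queue' = if length ns >= minPts then queue ++ ns else queue
         in expand k queue' (M.insert q (Just k) labels)

-- | Lowest point index in each cluster; noise points are ignored.
representatives :: M.Map Int (Maybe Int) -> M.Map Int Int
representatives labels =
  M.fromListWith min [(k, i) | (i, Just k) <- M.toList labels]

-- Made-up data: two dense groups and one outlier (index 5).
samplePoints :: [Point]
samplePoints = [[0, 0], [5, 5], [0.1, 0], [5.1, 5], [0, 0.1], [10, 10], [5, 5.1]]

main :: IO ()
main = do
  let labels = dbscan 0.3 3 samplePoints
  print (M.toList labels)
  print (M.toList (representatives labels))
```

If the input rows are ordered by energy, as in a CREST trajectory, the representative of each cluster is then its lowest-energy member.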

Library/Haskell Package

ConClusion provides principal component analysis and the clustering algorithms DBSCAN and hierarchical clustering. The algorithms are implemented on efficient parallel arrays and perform quite well. For the API, see the Haddock documentation, which can be generated with:

cabal haddock
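
For readers unfamiliar with hierarchical clustering, the following naive sketch shows the agglomerative idea: start with one cluster per point and repeatedly merge the two closest clusters. It is purely illustrative and does not reflect the library's API or performance. Single linkage, the Euclidean distance, and the stopping rule (a target number of clusters) are assumptions made for this example, and all names are hypothetical.

```haskell
import Data.List (minimumBy)
import Data.Ord (comparing)

type Point = [Double]

euclidean :: Point -> Point -> Double
euclidean a b = sqrt (sum (zipWith (\x y -> (x - y) ^ (2 :: Int)) a b))

-- | Single linkage: distance between the closest members of two clusters.
linkage :: [Point] -> [Point] -> Double
linkage as bs = minimum [euclidean a b | a <- as, b <- bs]

-- | Merge the two closest clusters until at most k clusters remain.
agglomerate :: Int -> [[Point]] -> [[Point]]
agglomerate k cs
  | length cs <= max 1 k = cs
  | otherwise = agglomerate k (merged : rest)
  where
    ics = zip [0 :: Int ..] cs
    pairs = [(linkage a b, i, j) | (i, a) <- ics, (j, b) <- ics, i < j]
    (_, i0, j0) = minimumBy (comparing (\(d, _, _) -> d)) pairs
    merged = (cs !! i0) ++ (cs !! j0)
    rest = [c | (i, c) <- ics, i /= i0, i /= j0]

-- Made-up data: two well-separated groups of three points each.
samplePoints :: [Point]
samplePoints = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]

main :: IO ()
main = mapM_ print (agglomerate 2 (map (: []) samplePoints))
```

This version recomputes all pairwise cluster distances in every step, so it is only suitable for very small inputs.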