# ConClusion ConClusion provides principal component analysis, hierarchical clustering and DBScan in Haskell. There is also a command line interface for processing of [CREST](https://github.com/grimme-lab/crest) conformere trajectories. Hence the name: CONformere CLUStering. The procedure to analyse conformere data has three steps: 1. Read the trajectory and calculate a set of features for each conformere. The features can include the energy, a set of bond lengths, a set of bond angles, and a set of dihedral angles in arbitrary combination. Those descriptors form a feature matrix. 2. A [principal component analysis](https://en.wikipedia.org/wiki/Principal_component_analysis) of the feature matrix might be perfomed to reduce the number of dimensions and remove redundancies. 3. The (potentially PCA-processed) feature matrix is being clustered. Different distance measures are available. Either [DBScan](https://en.wikipedia.org/wiki/DBSCAN) or [hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering) can be used to group different conformeres. While the command line interface only fits the work flow described above, the underlying clustering algorithms and PCA are implemented in a general way and can be utilised independently as library. ## Installation ### Bundled Archive A self-contained executable archive is build for the main branch and for releases. This can be executed directly on any Linux and has just to be downloaded. Go to the [page of releaes](https://gitlab.com/theoretical-chemistry-jena/quantum-chemistry/ConfoCluster/-/releases) and download an archive. Make it executable (e.g. `chmod +x conclusion`) and you are done! ### From Source If you have the Haskell toolchain intalled and therefore working Cabal and GHC, you may build ConClusion from source. This also requires working BLAS and LAPACK libraries on your system. ```bash git clone https://gitlab.com/theoretical-chemistry-jena/quantum-chemistry/ConfoCluster.git ConClusion cd ConClusion cabal install --installdir=$(PREFIX) ``` Choose a `PREFIX` where to install the executable. `$HOME/.local/bin/` is often a good choice. If you would like to use ConClusion on systems where Nix is not available (Windows, BSD, ...) this is the way to go. ### With Nix When you have Nix available on your system, everything can be build by Nix: ```bash git clone https://gitlab.com/theoretical-chemistry-jena/quantum-chemistry/ConfoCluster.git ConClusion cd ConClusion/nix nix-build -A ConClusion.components.exes.conclusion ``` ## Usage The command line interface to the `conclusion` executable offers full control about all three steps described above. - `-x --xyz` takes the path to the XYZ trajectory that is to be processed. It must contain the energy as comment line (which is the case for CREST trajectories). - `--dim` specifies the characterising features; therefore a set of internal coordinates and maybe the energy. A comma separated list of an arbitrary amount of features can be specified with the following syntax: - `e` for the energy - `b m n` for a bond length, where `m` and `n` are the 0-based atom indices - `a m n o` for an angle, where `m`, `n` and `o` are 0-based atom indices. Calculates the angle around `n` - `d m n o p` for a dihedral angle, where `m`, `n`, `o` and `p` are 0-based atom indices. Calculates the rotation of the bond between `n` and `o`. Dihedrals use a metric, that respects periodicity and direction of the rotation. For each dihedral there will be two rows in the feature matrix, therefore. See [this paper](https://doi.org/10.1002/prot.20310). - (indices are 0-based) - `-p --pca` activates dimensionalty reduction by principal component analysis. Give an integer to specify how many principal components are kept. During the execution of ConClusion the error introduced by PCA will be printed. - `-c --cluster` activates clustering of the results. The clustering algorithm can be selected by giving either `dbscan` or `hca` - `--measure` specifies the distance measure between the conformeres. By default an euclidean distance is used, but Manhattan and Mahalanobis distances are also available, as well as a general L_p norm. If Mahalanobis distances are used, it might be worth a try to disable PCA. - `--joinstrat` controls how inter-cluster distances are calculated in hierarchical clustering. `single` might be the best choice to get dense groups of conformeres. - `--distance` gives the search radius in DBScan or the dendrogram cut distance in hierarchical clustering. - `--minsize` is the minimum size of a cluster in DBScan. If `--forcemin` is given, the clusters obtained by HCA are also filtered for a minimum size. - `--forcemin` forces filtering of HCA clusters for their minimum size as given by `--minsize`. Disabled by default. Each processing step will produce a Gnuplot compatible file (space separated columns). The pure feature matrix will be `features.dat`, the results of the PCA will be in `pca.dat` and the clustering results will be in `cluster.dat`. The first column in `cluster.dat` will be an integer giving the cluster number this point belongs to, that can be used for colour-coding in Gnuplot. ## Example A perylene dye with four phenoxy groups has different conformeres, that have different spectral properties. For solubility the dye has also some alkyl groups. Crest finds about 1400 conformeres, most of them being different only in the alkyl side-chains, that do not influence spectral properties. Therefore, a much smaller group of different conformeres with respect to different positions of phenoxy groups exist. From each of those groups the lowest energy conformere shall be obtained. We therefore select eight dihedral angles and the energy as features; 2 dihedrals for each phenoxy group. One dihedral per phenoxy group describing the rotation of the perylene-O bond, the second one describing the rotation around the O-Ph bond. As the dihedral angles are not independent from each other, as some orientations of phenoxy groups are not possible, we use a PCA to reduce dimensionalty and remove redundancies. After the PCA, DBScan is used to obtain clusters of similar conformeres. The lowest index in each cluster is also the lowest energy conformere in each group, as CREST sorts conformeres by energy. ```bash conclusion \ --xyz=crest_conformers.xyz \ --pca=3 \ --dim="e, d 19 18 2 56, d 18 2 56 65, d 16 15 1 45, d 15 1 45 46, d 11 13 0 34, d 13 0 34 43, d 29 31 3 67, d31 3 67 76" \ --measure=manhattan \ --cluster=dbscan \ --distance=0.3 \ --minsize=5 ``` ## Library/Haskell Package ConClusion provides principal components analysis and the clustering algorithms DBScan and hierarchical clustering. The algorithms are implemented in efficient parallel arrays and perform quite well. For the API see the haddock documentation, which can be generated by: ```bash cabal haddock ```