biohazard-0.6.5: bioinformatics support library

Safe HaskellNone




type DamageModel a = Bool -> Int -> Vector (Mat44 a) Source

Things specific to ancient DNA, e.g. damage models.

For aDNA, we need a substitution probability. We have three options: use an empirically determined PSSM, use an arithmetically defined PSSM based on the Johnson model, use a context sensitive PSSM based on the Johnson model and an alignment. Using Dindel, actual substitutions relative to a called haplotype would be taken into account. Since we're not going to do that, taking alignments into account is difficult, somewhat approximate, and therefore not worth the hassle.

We represent substitution matrices by the type Mat44D. Internally, this is a vector of packed vectors. Conveniently, each of the packed vectors represents all transition into the given nucleotide.

A DamageModel is a function that gives substitution matrices for each position in a read. The DamageModel can depend on the length of the read and whether its alignment is reversed. In practice, we should probably memoize precomputed damage models somehow.

data To Source


Nucleotide :-> Nucleotide infix 9 

(!) :: Mat44D -> To -> Double infix 8 Source

Convenience function to access a substitution matrix that has a mnemonic reading.

noDamage :: Num a => DamageModel a Source

DamageModel for undamaged DNA. The likelihoods follow directly from the quality score. This needs elaboration to see what to do with amibiguity codes (even though those haven't actually been observed in the wild).

data DamageParameters float Source

Parameters for the universal damage model.

We assume the correct model is either no damage, or single strand damage, or double strand damage. Each of them comes with a probability. It turns out that blending them into one is simply accomplished by multiplying these probabilities onto the deamination probabilities.

For single stranded library prep, only one kind of damage occurs (C to T), it occurs at low frequency (ssd_delta) everywhere, at high frequency (ssd_sigma) in single stranded parts, and the overhang length is distributed exponentially with parameter ssd_lambda at the 5' end and ssd_kappa at the 3' end. (Without UDG treatment, those will be equal. With UDG, those are much smaller and in fact don't literally represent overhangs.)

For double stranded library prep, we get C->T damage at the 5' end and G->A at the 3' end with rate dsd_sigma and both in the interior with rate dsd_delta. Everything is symmetric, and therefore the orientation of the aligned read doesn't matter either. Both overhangs follow a distribution with parameter dsd_lambda.




ssd_sigma :: !float
ssd_delta :: !float
ssd_lambda :: !float
ssd_kappa :: !float
dsd_sigma :: !float
dsd_delta :: !float
dsd_lambda :: !float


Read float => Read (DamageParameters float) Source 
Show float => Show (DamageParameters float) Source 

genSubstMat :: Fractional a => a -> a -> Mat44 a Source

Generic substitution matrix, has C->T and G->A deamination as parameters. Setting p or q to 0 as appropriate makes this apply to the single stranded or undamaged case.