Safe Haskell | None |
---|---|
Language | Haskell2010 |
Things specific to ancient DNA, e.g. damage models.
For aDNA, we need a substitution probability. We have three options: use an empirically determined PSSM, use an arithmetically defined PSSM based on the Johnson model, use a context sensitive PSSM based on the Johnson model and an alignment. Using Dindel, actual substitutions relative to a called haplotype would be taken into account. Since we're not going to do that, taking alignments into account is difficult, somewhat approximate, and therefore not worth the hassle.
Synopsis
- data DmgStats = DmgStats {
- basecompo5 :: CompositionStats
- basecompo3 :: CompositionStats
- substs5 :: SubstitutionStats
- substs3 :: SubstitutionStats
- substs5d5 :: SubstitutionStats
- substs3d5 :: SubstitutionStats
- substs5d3 :: SubstitutionStats
- substs3d3 :: SubstitutionStats
- substs5dd :: SubstitutionStats
- substs3dd :: SubstitutionStats
- substs5cpg :: SubstitutionStats
- substs3cpg :: SubstitutionStats
- type CompositionStats = [(Maybe Nucleotide, Vector Int)]
- type SubstitutionStats = [(Subst, Vector Int)]
- addFragType :: Monad m => BamMeta -> Stream (Of BamRaw) m b -> Stream (Of (BamRaw, FragType)) m b
- damagePatternsIter :: MonadIO m => Int -> Int -> Stream (Of (BamRec, FragType, Vector Word8, Vector NPair)) m x -> Stream (Of Alignment) m DmgStats
- damagePatternsIterMD :: MonadIO m => Int -> Stream (Of (BamRaw, FragType)) m x -> Stream (Of Alignment) m DmgStats
- damagePatternsIter2Bit :: MonadIO m => Refs -> TwoBitFile -> Int -> Int -> Stream (Of (BamRaw, FragType)) m x -> Stream (Of Alignment) m DmgStats
- alnFromMd :: Vector_Nucs_half Nucleotides -> Vector Cigar -> [MdOp] -> Vector NPair
- data DamageParameters float = DP {
- ssd_sigma :: !float
- ssd_delta :: !float
- ssd_lambda :: !float
- ssd_kappa :: !float
- dsd_sigma :: !float
- dsd_delta :: !float
- dsd_lambda :: !float
- data NewDamageParameters vec float = NDP {}
- data GenDamageParameters vec float
- = UnknownDamage
- | OldDamage (DamageParameters float)
- | NewDamage (NewDamageParameters vec float)
- type DamageModel = Bool -> Int -> Int -> Mat44D
- bang :: Mat44D -> Subst -> Double
- nudge :: MMat44D -> Subst -> Double -> IO ()
- data Alignment = ALN {
- a_sequence :: !(Vector NPair)
- a_fragment_type :: !FragType
- data FragType
- data Subst = Nucleotide :-> Nucleotide
- data NPair
- npair :: Nucleotides -> Nucleotides -> NPair
- fst_np :: NPair -> Nucleotides
- snd_np :: NPair -> Nucleotides
- noDamage :: DamageModel
- univDamage :: DamageParameters Double -> DamageModel
- empDamage :: NewDamageParameters Vector Double -> DamageModel
- newtype Mat44D = Mat44D (Vector Double)
- newtype MMat44D = MMat44D (IOVector Double)
- scalarMat :: Double -> Mat44D
- complMat :: Mat44D -> Mat44D
- freezeMats :: MMat44D -> MMat44D -> IO Mat44D
- bwa_cal_maxdiff :: Double -> Int -> Int
Documentation
Collected "traditional" statistics:
- Base composition near 5' end and near 3' end. Each consists of
five vectors of counts of A,C,G,T, and everything else.
basecompo5
begins withcontext
bases to the left of the 5' end,basecompo3
ends withcontext
bases to the right of the 3' end. - Substitutions. Counted from the reconstructed alignment, once around the 5' end and once around the 3' end. For a total of 2*4*4 different substitutions. Positions where the query has a gap are skipped.
- Substitutions at CpG motifs. Also counted from the reconstructed alignment, and a CpG site is simply the sequence CG in the reference. Gaps may confuse that definition, so that CpHpG still counts as CpG, because the H is gapped. That might actually be desirable.
- Conditional substitutions. The 5' and 3' ends count as damaged if
the very last position has a C-to-T substitution. With that in
mind,
substs5d5
,substs5d3
,substs5dd
are likesubsts5
, but counting only reads where the 5' end is damaged, where the 3' end is damaged, and where both ends are damaged, respectively.
type CompositionStats = [(Maybe Nucleotide, Vector Int)] Source #
addFragType :: Monad m => BamMeta -> Stream (Of BamRaw) m b -> Stream (Of (BamRaw, FragType)) m b Source #
damagePatternsIter :: MonadIO m => Int -> Int -> Stream (Of (BamRec, FragType, Vector Word8, Vector NPair)) m x -> Stream (Of Alignment) m DmgStats Source #
Common logic for statistics. The function get_ref_and_aln
reconstructs reference sequence and alignment from a Bam record. It
is expected to construct the alignment with respect to the forwards
strand of the reference; we reverse-complement it if necessary.
damagePatternsIterMD :: MonadIO m => Int -> Stream (Of (BamRaw, FragType)) m x -> Stream (Of Alignment) m DmgStats Source #
Stream transformer that computes some statistics from plain BAM
with a valid MD field. The Alignment
is also reconstructed and
passed downstream. The final value of the source stream becomes
the stats_more
field of the result.
- Reconstruct the alignment from CIGAR, SEQ, and MD.
- Filter the alignment to get the reference sequence, accumulate it.
- Accumulate everything over the alignment.
The argument is the amount of context inside desired.
For Complete
fragments, we cut the read in the middle, so the 5'
and 3' plots stay clean from each other's influence. Leading
and
Trailing
fragments count completely towards the appropriate end.
damagePatternsIter2Bit :: MonadIO m => Refs -> TwoBitFile -> Int -> Int -> Stream (Of (BamRaw, FragType)) m x -> Stream (Of Alignment) m DmgStats Source #
Stream transformer that computes some statistics from plain BAM
(no MD field needed) and a 2bit file. The Alignment
is also
reconstructed and passed downstream. The final value of the source
stream ends up in the stats_more
field of the result.
- Get the reference sequence including both contexts once. If this includes invalid sequence (negative coordinate), pad suitably.
- Accumulate counts for the valid parts around 5' and 3' ends as appropriate from flags and config.
- Combine the part that was aligned to (so no context) with the read to reconstruct the alignment.
Arguments are the table of reference names, the 2bit file with the reference, the amount of context outside the alignment desired, and the amount of context inside desired.
For Complete
fragments, we cut the read in the middle, so the 5'
and 3' plots stay clean from each other's influence. Leading
and
Trailing
fragments count completely towards the appropriate end.
alnFromMd :: Vector_Nucs_half Nucleotides -> Vector Cigar -> [MdOp] -> Vector NPair Source #
Reconstructs the alignment from query, cigar, and md. Only positions where the query is not gapped are produced.
data DamageParameters float Source #
Parameters for the universal damage model.
We assume the correct model is either no damage, or single strand damage, or double strand damage. Each of them comes with a probability. It turns out that blending them into one is simply accomplished by multiplying these probabilities onto the deamination probabilities.
For single stranded library prep, only one kind of damage occurs (C
frequency (ssd_sigma
) in single stranded parts, and the overhang
length is distributed exponentially with parameter ssd_lambda
at
the 5' end and ssd_kappa
at the 3' end. (Without UDG treatment,
those will be equal. With UDG, those are much smaller and in fact
don't literally represent overhangs.)
For double stranded library prep, we get C->T damage at the 5' end
and G->A at the 3' end with rate dsd_sigma
and both in the interior
with rate dsd_delta
. Everything is symmetric, and therefore the
orientation of the aligned read doesn't matter either. Both
overhangs follow a distribution with parameter dsd_lambda
.
DP | |
|
Instances
data NewDamageParameters vec float Source #
Instances
data GenDamageParameters vec float Source #
UnknownDamage | |
OldDamage (DamageParameters float) | |
NewDamage (NewDamageParameters vec float) |
Instances
type DamageModel = Bool -> Int -> Int -> Mat44D Source #
A DamageModel
is a function that gives substitution matrices for
each position in a read. The DamageModel
can depend on whether the
alignment is reversed, the length of the read and the position. (In
practice, we should probably memoize precomputed damage models
somehow.)
bang :: Mat44D -> Subst -> Double infix 8 Source #
Convenience function to access a substitution matrix that has a mnemonic reading.
Alignment record. The reference sequence is filled with Ns if missing.
ALN | |
|
Nucleotide :-> Nucleotide infix 9 |
Compact storage of a pair of ambiguous Nucleotides
. Used to
represent alignments in a way that is accessible even to assembly
code. The first and sencond field are stored in the low and high
nybble, respectively. See fst_np
, snd_np
, npair
.
Instances
Eq NPair Source # | |
Ord NPair Source # | |
Show NPair Source # | |
Storable NPair Source # | |
npair :: Nucleotides -> Nucleotides -> NPair Source #
fst_np :: NPair -> Nucleotides Source #
snd_np :: NPair -> Nucleotides Source #
noDamage :: DamageModel Source #
DamageModel
for undamaged DNA. The likelihoods follow directly
from the quality score. This needs elaboration to see what to do
with amibiguity codes (even though those haven't actually been
observed in the wild).
We represent substitution matrices by the type Mat44D
. Internally,
this is a vector of packed vectors. Conveniently, each of the packed
vectors represents all transitions into the given nucleotide.
freezeMats :: MMat44D -> MMat44D -> IO Mat44D Source #
Adds the two matrices of a mutable substitution model (one for each strand) appropriately, normalizes the result (to make probabilities from pseudo-counts), and freezes that into one immutable matrix. We add a single count everywhere to avoid getting NaN from bizarre data.
bwa_cal_maxdiff :: Double -> Int -> Int Source #
Number of mismatches allowed by BWA.
bwa_cal_maxdiff thresh len
returns the number of mismatches
bwa aln -n $tresh
would allow in a read of length len
. For
reference, here is the code from BWA that computes it (we assume err
= 0.02
, just like BWA):
int bwa_cal_maxdiff(int l, double err, double thres) { double elambda = exp(-l * err); double sum, y = 1.0; int k, x = 1; for (k = 1, sum = elambda; k < 1000; ++k) { y *= l * err; x *= k; sum += elambda * y / x; if (1.0 - sum < thres) return k; } return 2; }