biohazard-0.6.5: bioinformatics support library

Bio.Genocall

Synopsis

# Documentation

simple_indel_call :: Int -> IndelPile -> (GL, [IndelVariant]) Source

Simple indel calling. We don't bother with it too much, so here's the gist: We collect variants (simply different variants, details don't matter), so (n) variants give rise to ((n+1)*n/2) GL values. (That's two out of ((n+1)), the reference allele, represented here as no deletion and no insertion, is there, too.) To assign these, we need a likelihood for an observed variant given an assumed genotype.

For variants of equal length, the likelihood is the sum of qualities of mismatching bases, but no higher than the mapping quality. That is roughly the likelihood of getting the observed sequence even though the real sequence is a different variant. For variants of different length, the likelihood is the map quality. This corresponds to the assumption that indel errors in sequencing are much less likely than mapping errors. Since this is hardly our priority, the approximations are hereby declared good enough.

simple_snp_call :: (Qual -> Double) -> Int -> BasePile -> Snp_GLs Source

Naive SNP call; essentially the GATK model. We create a function that computes a likelihood for a given base, then hand over to simple call. Since everything is so straight forward, this works even in the face of damage.

simple_call :: Int -> (a -> [Prob]) -> [a] -> GL Source

Compute `GL` values for the simple case. The simple case is where we sample `ploidy` alleles with equal probability and assume that errors occur independently from each other.

The argument `pls` is a function that computes the likelihood for getting the current read, for every variant assuming that variant was sampled.

XXX This eats up ~40% of total runtime; it *screams out* for specialization to diploidy and four alleles (common SNPs) and maybe diploidy and two alleles (common indels).

mk_snp_gts :: Int -> [Vec4D] Source

Make a list of genotypes, each represented as a vector of allele probabilities, from ploidy and four possible alleles.

This makes the most sense for SNPs. The implied order of alleles is A,C,G,T, and the resulting genotype vectors can straight forwardly be mutiplied with a substitution matrix to give a sensible result. (Something similar for indels could be imagined, but doesn't seem all that useful. We specialize for SNPs to get simpler types and efficient code.)

"For biallelic sites the ordering is: AA,AB,BB; for triallelic sites the ordering is: AA,AB,BB,AC,BC,CC, etc."

SNP call according to maqsamtoolsbsnp model. The matrix k counts how many errors we made, approximately.

data Snp_GLs Source

This pairs up GL values and the reference allele. When constructing it, we make sure the GL values are in the correct order if the reference allele is listed first.

Constructors

 Snp_GLs !GL !Nucleotides

Instances

 Source