biohazard-0.6.2: bioinformatics support library

Bio.Genocall

Synopsis

Documentation

simple_indel_call :: Int -> IndelPile -> (GL, [IndelVariant]) Source

Simple indel calling. We don't bother with it too much, so here's the gist: We collect variants (simply different variants, details don't matter), so n variants give rise to (n+1)*n/2 GL values. (That's two out of (n+1), the reference allele, represented here as no deletion and no insertion, is there, too.) To assign these, we need a likelihood for an observed variant given an assumed genotype.

For variants of equal length, the likelihood is the sum of qualities of mismatching bases, but no higher than the mapping quality. That is roughly the likelihood of getting the observed sequence even though the real sequence is a different variant. For variants of different length, the likelihood is the map quality. This corresponds to the assumption that indel errors in sequencing are much less likely than mapping errors. Since this hardly our priority, the approximations are declared good enough.

Naive SNP call; essentially the GATK model. We create a function that computes a likelihood for a given base, then hand over to simple call. Since everything is so straight forward, this works even in the face of damage.

simple_call :: Int -> (a -> [Prob]) -> [a] -> GL Source

Compute GL values for the simple case. The simple case is where we sample ploidy alleles with equal probability and assume that errors occur independently from each other.

The argument pls is a function that computes the likelihood for getting the current read, for every variant assuming that variant was sampled.

NOTE, this may warrant specialization to diploidy and four alleles (common SNPs) and diploidy and two alleles (common indels).

mk_snp_gts :: Int -> [Vec4D] Source

Make a list of genotypes, each represented as a vector of allele probabilities, from ploidy and four possible alleles.

This makes the most sense for SNPs. The implied order of alleles is A,C,G,T, and the resulting genotype vectors can straight forwardly be mutiplied with a substitution matrix to give a sensible result. (Something similar for indels could be imagined, but doesn't seem all that useful. We specialize for SNPs to get simpler types and efficient code.)

"For biallelic sites the ordering is: AA,AB,BB; for triallelic sites the ordering is: AA,AB,BB,AC,BC,CC, etc."

SNP call according to maqsamtoolsbsnp model. The matrix k counts how many errors we made, approximately.