biohazard-0.6.13: bioinformatics support library

Safe HaskellNone
LanguageHaskell2010

Bio.Bam.Pileup

Synopsis

Documentation

data PrimChunks Source #

Genotype Calling: like Samtools(?), but for aDNA

The goal for this module is to call haploid and diploid single nucleotide variants the best way we can, including support for aDNA. Indel calling is out of scope, we only do it "on the side".

The cleanest way to call genotypes under all circumstances is probably the Dindel approach: define candidate haplotypes, align each read to each haplotype, then call the likely haplotypes with a quality derived from the quality scores. This approach neatly integrates indel calling with ancient DNA and makes a separate indel realigner redundant. However, it's rather expensive in that it requires inclusion of an aligner, and we'd need an aligner that is compatible with the chosen error model, which might be hard.

Here we'll take a short cut: We do not really call indels. Instead, these variants are collected and are assigned an affine score. This works best if indels are 'left-aligned' first. In theory, one indel variant could be another indel variant with a sequencing error---we ignore that possibility for the most part. Once indels are taken care off, SNVs are treated separately as independent columns of the pileup.

Regarding the error model, there's a choice between samtools or the naive model everybody else (GATK, Rasmus Nielsen, etc.) uses. Naive is easy to marry to aDNA, samtools is (probably) better. Either way, we introduce a number of parameters (eta and kappa for samtools, lambda, delta, delta_ss for Johnson). Running a maximum likehood fit for those may be valuable. It would be cool, if we could do that without rerunning the complete genotype caller, but it's not a priority.

So, outline of the genotype caller: We read BAM (minimally filtering; general filtering is somebody else's problem, but we might want to split by read group). We will scan each read's CIGAR line in concert with the sequence and effective quality. Effective quality is the lowest available quality score of QUAL, MAPQ, and BQ. For aDNA calling, the base is transformed into four likelihoods based on the aDNA substitution matrix.

So, either way, we need something like "pileup", where indel variants are collected as they are (any length), while matches are piled up.

Regarding output, we certainly don't want to write VCF or BCF. (No VCF because it's ugly, no BCF, because the tool support is non-existent.) It will definitely be something binary. For the GL values, small floating point formats may make sense: half-precision floating point's representable range would be 6.1E-5 to 6.5E+5, 0.4.4 minifloat from Bio.Util goes from 0 to 63488.

The primitive pieces for genotype calling: A position, a base represented as four likelihoods, an inserted sequence, and the length of a deleted sequence. The logic is that we look at a base followed by some indel, and all those indels are combined into a single insertion and a single deletion.

Constructors

Seek Int PrimBase

skip to position (at start or after N operation)

Indel [Nucleotides] [DamagedBase] PrimBase

observed deletion and insertion between two bases

EndOfRead

nothing anymore

data PrimBase Source #

Constructors

Base

more chunks

Fields

data DamagedBase Source #

Represents our knowledge about a certain base, which consists of the base itself (A,C,G,T, encoded as 0..3; no Ns), the quality score (anything that isn't A,C,G,T becomes A with quality 0), and a substitution matrix representing post-mortem but pre-sequencing substitutions.

Unfortunately, none of this can be rolled into something more simple, because damage and sequencing error behave so differently.

Damage information is polymorphic. We might run with a simple version (a matrix) for calling, but we need more (a matrix and a mutable matrix, I think) for estimation.

Constructors

DB

reference base from MD field

Fields

newtype DmgToken Source #

Constructors

DmgToken 

Fields

decompose :: DmgToken -> BamRaw -> [PosPrimChunks] Source #

Decomposes a BAM record into chunks suitable for piling up. We pick apart the CIGAR and MD fields, and combine them with sequence and quality as appropriate. Clipped bases are removed/skipped as appropriate. We also do apply a substitution matrix to each base, which must be supplied along with the read.

data CallStats Source #

Statistics about a genotype call. Probably only useful for fitlering (so not very useful), but we keep them because it's easy to track them.

Constructors

CallStats 

type GL = Vector Prob Source #

Genotype likelihood values. A variant call consists of a position, some measure of qualities, genotype likelihood values, and a representation of variants. A note about the GL values: VCF would normalize them so that the smallest one becomes zero. We do not do that here, since we might want to compare raw values for a model test. We also store them in a Double to make arithmetics easier. Normalization is appropriate when converting to VCF.

If GL is given, we follow the same order used in VCF: "the ordering of genotypes for the likelihoods is given by: F(jk) = (k*(k+1)2)+j. In other words, for biallelic sites the ordering is: AA,AB,BB; for triallelic sites the ordering is: AA,AB,BB,AC,BC,CC, etc."

newtype V_Nuc Source #

Constructors

V_Nuc (Vector Nucleotide) 

Instances

Eq V_Nuc Source # 

Methods

(==) :: V_Nuc -> V_Nuc -> Bool #

(/=) :: V_Nuc -> V_Nuc -> Bool #

Ord V_Nuc Source # 

Methods

compare :: V_Nuc -> V_Nuc -> Ordering #

(<) :: V_Nuc -> V_Nuc -> Bool #

(<=) :: V_Nuc -> V_Nuc -> Bool #

(>) :: V_Nuc -> V_Nuc -> Bool #

(>=) :: V_Nuc -> V_Nuc -> Bool #

max :: V_Nuc -> V_Nuc -> V_Nuc #

min :: V_Nuc -> V_Nuc -> V_Nuc #

Show V_Nuc Source # 

Methods

showsPrec :: Int -> V_Nuc -> ShowS #

show :: V_Nuc -> String #

showList :: [V_Nuc] -> ShowS #

type BasePile = [DamagedBase] Source #

Map quality and a list of encountered bases, with damage information and reference base if known.

type IndelPile = [(Qual, ([Nucleotides], [DamagedBase]))] Source #

Map quality and a list of encountered indel variants. The deletion has the reference sequence, if known, an insertion has the inserted sequence with damage information.

data Pile' a b Source #

Running pileup results in a series of piles. A Pile has the basic statistics of a VarCall, but no GL values and a pristine list of variants instead of a proper call. We emit one pile with two BasePiles (one for each strand) and one IndelPile (the one immediately following) at a time.

Constructors

Pile 

Instances

(Show b, Show a) => Show (Pile' a b) Source # 

Methods

showsPrec :: Int -> Pile' a b -> ShowS #

show :: Pile' a b -> String #

showList :: [Pile' a b] -> ShowS #

type Pile = Pile' (BasePile, BasePile) (IndelPile, IndelPile) Source #

Raw pile. Bases and indels are piled separately on forward and backward strands.

data SinglePop Source #

Simple single population model. prob_div is the fraction of homozygous divergent sites, prob_het is the fraction of heterozygous variant sites among sites that are not homozygous divergent.

Constructors

SinglePop 

Fields

single_pop_posterior :: (Unbox a, Ord a, Floating a) => SinglePop -> Int -> Vector (Prob' a) -> Vector (Prob' a) Source #

Computes posterior genotype probabilities from likelihoods under the SinglePop model.

pileup :: Enumeratee [PosPrimChunks] [Pile] IO b Source #

The pileup enumeratee takes BamRaws, decomposes them, interleaves the pieces appropriately, and generates Piles. The output will contain at most one BasePile and one IndelPile for each position, piles are sorted by position.

This top level driver receives BamRaws. Unaligned reads and duplicates are skipped (but not those merely failing quality checks). Processing stops when the first read with invalid br_rname is encountered or a t end of file.

newtype PileM m a Source #

The pileup logic keeps a current coordinate (just two integers) and two running queues: one of active PrimBases that contribute to current genotype calling and on of waiting PrimBases that will contribute at a later point.

Oppan continuation passing style! Not only is the CPS version of the state monad (we have five distinct pieces of state) somewhat faster, we also need CPS to interact with the mechanisms of Iteratee. It makes implementing yield, peek, and bump straight forward.

Constructors

PileM 

Fields

Instances

Monad (PileM m) Source # 

Methods

(>>=) :: PileM m a -> (a -> PileM m b) -> PileM m b #

(>>) :: PileM m a -> PileM m b -> PileM m b #

return :: a -> PileM m a #

fail :: String -> PileM m a #

Functor (PileM m) Source # 

Methods

fmap :: (a -> b) -> PileM m a -> PileM m b #

(<$) :: a -> PileM m b -> PileM m a #

Applicative (PileM m) Source # 

Methods

pure :: a -> PileM m a #

(<*>) :: PileM m (a -> b) -> PileM m a -> PileM m b #

(*>) :: PileM m a -> PileM m b -> PileM m b #

(<*) :: PileM m a -> PileM m b -> PileM m a #

type PileF m r = Refseq -> Int -> ([PrimBase], [PrimBase]) -> (Heap, Heap) -> (Stream [Pile] -> Iteratee [Pile] m r) -> Stream [PosPrimChunks] -> Iteratee [PosPrimChunks] m (Iteratee [Pile] m r) Source #

The things we drag along in PileM. Notes: * The active queue is a simple stack. We add at the front when we encounter reads, which reverses them. When traversing it, we traverse reads backwards, but since we accumulate the BasePile, it gets reversed back. The new active queue, however, is no longer reversed (as it should be). So after the traversal, we reverse it again. (Yes, it is harder to understand than using a proper deque type, but it is cheaper. There may not be much point in the reversing, though.)

upd_pos :: (Int -> Int) -> PileM m () Source #

yieldPile :: CallStats -> BasePile -> BasePile -> CallStats -> IndelPile -> IndelPile -> PileM m () Source #

Sends one piece of output downstream. You are not expected to understand how this works, but inlining eneeCheckIfDone plugged an annoying memory leak.

pileup' :: PileM m () Source #

The actual pileup algorithm. If active contains something, continue here. Else find the coordinate to continue from, which is the minimum of the next waiting coordinate and the next coordinate in input; if found, continue there, else we're all done.

p'feed_input :: PileM m () Source #

Feeds input as long as it starts at the current position

p'check_waiting :: PileM m () Source #

Checks waiting queue. If there is anything waiting for the current position, moves it to active queue.

p'scan_active :: PileM m ((CallStats, BasePile), (CallStats, BasePile), (CallStats, IndelPile), (CallStats, IndelPile)) Source #

Separately scans the two active queues and makes one BasePile from each. Also sees what's next in the PrimChunks: Indels contribute to two separate IndelPiles, Seeks are pushed back to the waiting queue, EndOfReads are removed, and everything else is added to two fresh active queues.

data Heap Source #

We need a simple priority queue. Here's a skew heap (specialized to strict Int priorities and PrimBase values).

Constructors

Empty 
Node !Int PrimBase Heap Heap