Safe Haskell	None
Language	Haskell98

Bio.Bam.Pileup

Contents

TODO*
A whole lot of testing.
ML fitting and evaluation of parameters for different possible
Maybe specialize to ploidy one and two.

Synopsis

TODO*

A whole lot of testing.

ML fitting and evaluation of parameters for different possible

Maybe specialize to ploidy one and two.

data PrimChunks Source

The primitive pieces for genotype calling: A position, a base represented as four likelihoods, an inserted sequence, and the length of a deleted sequence. The logic is that we look at a base followed by some indel, and all those indels are combined into a single insertion and a single deletion.

Constructors

Seek Int PrimBase	skip to position (at start or after N operation)
Indel [Nucleotides] [DamagedBase] PrimBase	observed deletion and insertion between two bases
EndOfRead	nothing anymore

Instances

Show PrimChunks Source

data PrimBase Source

Constructors

Base	more chunks
Fields _pb_wait :: Int number of bases to wait due to a deletion _pb_likes :: DamagedBase four likelihoods _pb_mapq :: Qual map quality _pb_rev :: Bool reverse strand? _pb_chunks :: PrimChunks

Instances

Show PrimBase Source

type PosPrimChunks = (Refseq, Int, PrimChunks) Source

data DamagedBase Source

Represents our knowledge about a certain base, which consists of the base itself (A,C,G,T, encoded as 0..3; no Ns), the quality score (anything that isn't A,C,G,T becomes A with quality 0), and a substitution matrix representing post-mortem but pre-sequencing substitutions.

Unfortunately, none of this can be rolled into something more simple, because damage and sequencing error behave so differently.

Constructors

DB	damage matrix
Fields db_call :: !Nucleotide called base db_qual :: !Qual quality of called base db_ref :: !Nucleotides reference base from MD field db_dmg :: !Mat44D

Instances

Show DamagedBase Source

decompose :: [Mat44D] -> BamRaw -> Maybe PosPrimChunks Source

Decomposes a BAM record into chunks suitable for piling up. We pick apart the CIGAR and MD fields, and combine them with sequence and quality as appropriate. Clipped bases are removed/skipped as appropriate. We also do apply a substitution matrix to each base, which must be supplied along with the read.

data CallStats Source

Statistics about a genotype call. Probably only useful for fitlering (so not very useful), but we keep them because it's easy to track them.

Constructors

CallStats
Fields read_depth :: !Int reads_mapq0 :: !Int sum_mapq :: !Int sum_mapq_squared :: !Int

Instances

Eq CallStats Source
Show CallStats Source
Monoid CallStats Source

type GL = Vector Prob Source

Genotype likelihood values. A variant call consists of a position, some measure of qualities, genotype likelihood values, and a representation of variants. A note about the GL values: VCF would normalize them so that the smallest one becomes zero. We do not do that here, since we might want to compare raw values for a model test. We also store them in a Double to make arithmetics easier. Normalization is appropriate when converting to VCF.

If GL is given, we follow the same order used in VCF: "the ordering of genotypes for the likelihoods is given by: F(jk) = (k*(k+1)2)+j. In other words, for biallelic sites the ordering is: AA,AB,BB; for triallelic sites the ordering is: AA,AB,BB,AC,BC,CC, etc."

newtype V_Nuc Source

Constructors

V_Nuc (Vector Nucleotide)

Instances

Eq V_Nuc Source
Ord V_Nuc Source
Show V_Nuc Source

newtype V_Nucs Source

Constructors

V_Nucs (Vector Nucleotides)

Instances

Eq V_Nucs Source
Ord V_Nucs Source
Show V_Nucs Source

data IndelVariant Source

Constructors

IndelVariant
Fields deleted_bases :: !V_Nucs inserted_bases :: !V_Nuc

Instances

Eq IndelVariant Source
Ord IndelVariant Source
Show IndelVariant Source

type BasePile = [(Qual, DamagedBase)] Source

Map quality and a list of encountered bases, with damage information and reference base if known.

type IndelPile = [(Qual, ([Nucleotides], [DamagedBase]))] Source

Map quality and a list of encountered indel variants. The deletion has the reference sequence, if known, an insertion has the inserted sequence with damage information.

data Pile' a b Source

Running pileup results in a series of piles. A Pile has the basic statistics of a VarCall, but no GL values and a pristine list of variants instead of a proper call. We emit one pile with two BasePiles (one for each strand) and one IndelPile (the one immediately following) at a time.

Constructors

Pile
Fields p_refseq :: !Refseq p_pos :: !Int p_snp_stat :: !CallStats p_snp_pile :: a p_indel_stat :: !CallStats p_indel_pile :: b

Instances

(Show a, Show b) => Show (Pile' a b) Source

type Pile = Pile' (BasePile, BasePile) IndelPile Source

pileup :: Monad m => Enumeratee [PosPrimChunks] [Pile] m a Source

The pileup enumeratee takes BamRaws, decomposes them, interleaves the pieces appropriately, and generates Piles. The output will contain at most one BasePile and one IndelPile for each position, piles are sorted by position.

This top level driver receives BamRaws. Unaligned reads and duplicates are skipped (but not those merely failing quality checks). Processing stops when the first read with invalid br_rname is encountered or a t end of file.

newtype PileM m a Source

The pileup logic keeps a current coordinate (just two integers) and two running queues: one of active PrimBases that contribute to current genotype calling and on of waiting PrimBases that will contribute at a later point.

Oppan continuation passing style! Not only is the CPS version of the state monad (we have five distinct pieces of state) somewhat faster, we also need CPS to interact with the mechanisms of Iteratee. It makes implementing yield, peek, and bump straight forward.

Constructors

PileM
Fields runPileM :: forall r. (a -> PileF m r) -> PileF m r

Instances

Monad (PileM m) Source
Functor (PileM m) Source
Applicative (PileM m) Source

type PileF m r = Refseq -> Int -> [PrimBase] -> Heap -> (Stream [Pile] -> Iteratee [Pile] m r) -> Stream [PosPrimChunks] -> Iteratee [PosPrimChunks] m (Iteratee [Pile] m r) Source

The things we drag along in PileM. Notes: * The active queue is a simple stack. We add at the front when we encounter reads, which reverses them. When traversing it, we traverse reads backwards, but since we accumulate the BasePile, it gets reversed back. The new active queue, however, is no longer reversed (as it should be). So after the traversal, we reverse it again. (Yes, it is harder to understand than using a proper deque type, but it is cheaper. There may not be much point in the reversing, though.)

get_refseq :: PileM m Refseq Source

get_pos :: PileM m Int Source

upd_pos :: (Int -> Int) -> PileM m () Source

set_pos :: (Refseq, Int) -> PileM m () Source

get_active :: PileM m [PrimBase] Source

upd_active :: ([PrimBase] -> [PrimBase]) -> PileM m () Source

add_active :: PrimBase -> PileM m () Source

clr_active :: PileM m [PrimBase] Source

ins_waiting :: Int -> PrimBase -> PileM m () Source

get_waiting :: PileM m Heap Source

set_waiting :: Heap -> PileM m () Source

yield :: Monad m => Pile -> PileM m () Source

Sends one piece of output downstream. You are not expected to understand how this works, but at last it doesn't leak memory.

peek :: PileM m (Maybe PosPrimChunks) Source

Inspect next input element, if any. Returns Just b if b is the next input element, Nothing if no such element exists. Waits for more input if nothing is available immediately.

bump :: PileM m () Source

Discard next input element, if any. Does nothing if input has already ended. Waits for input to discard if nothing is available immediately.

consume_active :: a -> (a -> PrimBase -> PileM m a) -> PileM m a Source

pileup' :: Monad m => PileM m () Source

The actual pileup algorithm. If active contains something, continue here. Else find the coordinate to continue from, which is the minimum of the next waiting coordinate and the next coordinate in input; if found, continue there, else we're all done.

pileup'' :: Monad m => PileM m () Source

p'feed_input :: PileM m () Source

Feeds input as long as it starts at the current position

p'check_waiting :: PileM m () Source

Checks waiting queue. If there is anything waiting for the current position, moves it to active queue.

p'scan_active :: PileM m ((CallStats, [(Qual, Either DamagedBase DamagedBase)]), (CallStats, [(Qual, ([Nucleotides], [DamagedBase]))])) Source

Scans active queue and makes a BasePile. Also sees what's next in the PrimChunks: Indels contribute to an IndelPile, Seeks and deletions are pushed back to the waiting queue, EndOfReads are removed, and everything else is added to the fresh active queue.

partitionPairEithers :: [(a, Either b c)] -> ([(a, b)], [(a, c)]) Source

data Heap Source

We need a simple priority queue. Here's a skew heap (specialized to strict Int priorities and PrimBase values).

Constructors

Empty
Node !Int !PrimBase Heap Heap

union :: Heap -> Heap -> Heap Source

getMinKey :: Heap -> Maybe Int Source

viewMin :: Heap -> Maybe (Int, PrimBase, Heap) Source