| Safe Haskell | None | 
|---|---|
| Language | Haskell2010 | 
Bio.Bam.Pileup
Description
Pileup, similar to Samtools
Pileup turns a sorted sequence of reads into a sequence of "piles", one for each site where a genetic variant might be called. We will scan each read's CIGAR line and MD field in concert with the sequence and effective quality. Effective quality is the lowest available quality score of QUAL, MAPQ, and BQ. For aDNA calling, a base is represented as four probabilities, derived from a position dependent damage model.
- data PrimChunks- = Seek !Int PrimBase
- | Indel [Nucleotides] [DamagedBase] PrimBase
- | EndOfRead
 
- data PrimBase = Base {- _pb_wait :: !Int
- _pb_base :: !DamagedBase
- _pb_mapq :: !Qual
- _pb_chunks :: PrimChunks
 
- type PosPrimChunks = (Refseq, Int, Bool, PrimChunks)
- data DamagedBase = DB {- db_call :: !Nucleotide
- db_qual :: !Qual
- db_dmg_tk :: !DmgToken
- db_dmg_pos :: !Int
- db_ref :: !Nucleotides
 
- newtype DmgToken = DmgToken {- fromDmgToken :: Int
 
- decompose :: DmgToken -> BamRaw -> [PosPrimChunks]
- data CallStats = CallStats {- read_depth :: !Int
- reads_mapq0 :: !Int
- sum_mapq :: !Int
- sum_mapq_squared :: !Int
 
- newtype V_Nuc = V_Nuc (Vector Nucleotide)
- newtype V_Nucs = V_Nucs (Vector Nucleotides)
- data IndelVariant = IndelVariant {- deleted_bases :: !V_Nucs
- inserted_bases :: !V_Nuc
 
- type BasePile = [DamagedBase]
- type IndelPile = [(Qual, ([Nucleotides], [DamagedBase]))]
- data Pile' a b = Pile {- p_refseq :: !Refseq
- p_pos :: !Int
- p_snp_stat :: !CallStats
- p_snp_pile :: a
- p_indel_stat :: !CallStats
- p_indel_pile :: b
 
- type Pile = Pile' (BasePile, BasePile) (IndelPile, IndelPile)
- pileup :: Enumeratee [PosPrimChunks] [Pile] IO b
- newtype PileM m a = PileM {}
- type PileF m r = Refseq -> Int -> ([PrimBase], [PrimBase]) -> (Heap, Heap) -> (Stream [Pile] -> Iteratee [Pile] m r) -> Stream [PosPrimChunks] -> Iteratee [PosPrimChunks] m (Iteratee [Pile] m r)
- get_refseq :: PileM m Refseq
- get_pos :: PileM m Int
- upd_pos :: (Int -> Int) -> PileM m ()
- yieldPile :: CallStats -> BasePile -> BasePile -> CallStats -> IndelPile -> IndelPile -> PileM m ()
- pileup' :: PileM m ()
- pileup'' :: PileM m ()
- p'feed_input :: PileM m ()
- p'check_waiting :: PileM m ()
- p'scan_active :: PileM m ((CallStats, BasePile), (CallStats, BasePile), (CallStats, IndelPile), (CallStats, IndelPile))
- data Heap
- unionH :: Heap -> Heap -> Heap
- getMinKeyH :: Heap -> Maybe Int
- viewMinH :: Heap -> Maybe (Int, PrimBase, Heap)
Documentation
data PrimChunks Source #
The primitive pieces for genotype calling: A position, a base represented as four likelihoods, an inserted sequence, and the length of a deleted sequence. The logic is that we look at a base followed by some indel, and all those indels are combined into a single insertion and a single deletion.
Constructors
| Seek !Int PrimBase | skip to position (at start or after N operation) | 
| Indel [Nucleotides] [DamagedBase] PrimBase | observed deletion and insertion between two bases | 
| EndOfRead | nothing anymore | 
Instances
Constructors
| Base | more chunks | 
| Fields 
 | |
type PosPrimChunks = (Refseq, Int, Bool, PrimChunks) Source #
data DamagedBase Source #
Represents our knowledge about a certain base, which consists of the base itself (A,C,G,T, encoded as 0..3; no Ns), the quality score (anything that isn't A,C,G,T becomes A with quality 0), and a substitution matrix representing post-mortem but pre-sequencing substitutions.
Unfortunately, none of this can be rolled into something more simple, because damage and sequencing error behave so differently.
Damage information is polymorphic. We might run with a simple version (a matrix) for calling, but we need more (a matrix and a mutable matrix, I think) for estimation.
Constructors
| DB | reference base from MD field | 
| Fields 
 | |
Instances
decompose :: DmgToken -> BamRaw -> [PosPrimChunks] Source #
Decomposes a BAM record into chunks suitable for piling up. We pick apart the CIGAR and MD fields, and combine them with sequence and quality as appropriate. Clipped bases are removed/skipped as needed. We also apply a substitution matrix to each base, which must be supplied along with the read.
Statistics about a genotype call. Probably only useful for fitlering (so not very useful), but we keep them because it's easy to track them.
Constructors
| CallStats | |
| Fields 
 | |
Constructors
| V_Nuc (Vector Nucleotide) | 
Constructors
| V_Nucs (Vector Nucleotides) | 
data IndelVariant Source #
Constructors
| IndelVariant | |
| Fields 
 | |
Instances
type BasePile = [DamagedBase] Source #
Map quality and a list of encountered bases, with damage information and reference base if known.
type IndelPile = [(Qual, ([Nucleotides], [DamagedBase]))] Source #
Map quality and a list of encountered indel variants. The deletion has the reference sequence, if known, an insertion has the inserted sequence with damage information.
Running pileup results in a series of piles.  A Pile has the
 basic statistics of a VarCall, but no likelihood values and a
 pristine list of variants instead of a proper call.  We emit one pile
 with two BasePiles (one for each strand) and one IndelPile (the
 one immediately following) at a time.
Constructors
| Pile | |
| Fields 
 | |
type Pile = Pile' (BasePile, BasePile) (IndelPile, IndelPile) Source #
Raw pile. Bases and indels are piled separately on forward and backward strands.
pileup :: Enumeratee [PosPrimChunks] [Pile] IO b Source #
The pileup enumeratee takes BamRaws, decomposes them, interleaves
 the pieces appropriately, and generates Piles.  The output will
 contain at most one BasePile and one IndelPile for each position,
 piles are sorted by position.
This top level driver receives BamRaws.  Unaligned reads and
 duplicates are skipped (but not those merely failing quality checks).
 Processing stops when the first read with invalid br_rname is
 encountered or a t end of file.
The pileup logic keeps a current coordinate (just two integers) and
 two running queues: one of active PrimBases that contribute to
 current genotype calling and on of waiting PrimBases that will
 contribute at a later point.
Oppan continuation passing style!  Not only is the CPS version of the
 state monad (we have five distinct pieces of state) somewhat faster,
 we also need CPS to interact with the mechanisms of Iteratee.  It
 makes implementing yield, peek, and bump straight forward.
type PileF m r = Refseq -> Int -> ([PrimBase], [PrimBase]) -> (Heap, Heap) -> (Stream [Pile] -> Iteratee [Pile] m r) -> Stream [PosPrimChunks] -> Iteratee [PosPrimChunks] m (Iteratee [Pile] m r) Source #
The things we drag along in PileM.  Notes:
 * The active queue is a simple stack.  We add at the front when we
   encounter reads, which reverses them.  When traversing it, we traverse
   reads backwards, but since we accumulate the BasePile, it gets reversed
   back.  The new active queue, however, is no longer reversed (as it should
   be).  So after the traversal, we reverse it again.  (Yes, it is harder to
   understand than using a proper deque type, but it is cheaper.
   There may not be much point in the reversing, though.)
get_refseq :: PileM m Refseq Source #
yieldPile :: CallStats -> BasePile -> BasePile -> CallStats -> IndelPile -> IndelPile -> PileM m () Source #
Sends one piece of output downstream.  You are not expected to
 understand how this works, but inlining eneeCheckIfDone plugged an
 annoying memory leak.
pileup' :: PileM m () Source #
The actual pileup algorithm. If active contains something, continue here. Else find the coordinate to continue from, which is the minimum of the next waiting coordinate and the next coordinate in input; if found, continue there, else we're all done.
p'feed_input :: PileM m () Source #
Feeds input as long as it starts at the current position
p'check_waiting :: PileM m () Source #
Checks waiting queue. If there is anything waiting for the current position, moves it to active queue.
p'scan_active :: PileM m ((CallStats, BasePile), (CallStats, BasePile), (CallStats, IndelPile), (CallStats, IndelPile)) Source #