Safe Haskell	None
Language	Haskell2010

Bio.Bam.Pileup

Description

Pileup, similar to Samtools

Pileup turns a sorted sequence of reads into a sequence of "piles", one for each site where a genetic variant might be called. Each read's CIGAR line and MD field are scanned in concert with its sequence and effective quality. The effective quality is the lowest of the available quality scores among QUAL, MAPQ, and BQ. For aDNA calling, a base is represented as four probabilities derived from a position-dependent damage model.
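
The effective-quality rule described above can be illustrated with a minimal sketch. The Qual newtype and the function effectiveQual below are hypothetical stand-ins written for this example; they are not the library's definitions, and the module may combine the scores differently in detail.

```haskell
import Data.Maybe (catMaybes)

-- Hypothetical stand-in for the library's Qual type (a Phred-scaled score).
newtype Qual = Qual Int deriving (Eq, Ord, Show)

-- | Lowest of the quality scores that are actually present
-- (QUAL, MAPQ, BQ in that order). Returns Nothing when none is available.
effectiveQual :: Maybe Qual -> Maybe Qual -> Maybe Qual -> Maybe Qual
effectiveQual qual mapq bq =
    case catMaybes [qual, mapq, bq] of
        [] -> Nothing
        qs -> Just (minimum qs)
```

For instance, effectiveQual (Just (Qual 30)) (Just (Qual 37)) Nothing evaluates to Just (Qual 30): the missing BQ is ignored and the smaller of the remaining scores wins.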

Synopsis

data PrimChunks
- = Seek !Int PrimBase
- | Indel [Nucleotides] [DamagedBase] PrimBase
- | EndOfRead
data PrimBase = Base {
- _pb_wait :: !Int
- _pb_base :: !DamagedBase
- _pb_mapq :: !Qual
- _pb_chunks :: PrimChunks
}
type PosPrimChunks = (Refseq, Int, Bool, PrimChunks)
data DamagedBase = DB {
- db_call :: !Nucleotide
- db_qual :: !Qual
- db_dmg_tk :: !DmgToken
- db_dmg_pos :: !Int
- db_ref :: !Nucleotides
}
newtype DmgToken = DmgToken {
- fromDmgToken :: Int
}
decompose :: DmgToken -> BamRaw -> [PosPrimChunks]
data CallStats = CallStats {
- read_depth :: !Int
- reads_mapq0 :: !Int
- sum_mapq :: !Int
- sum_mapq_squared :: !Int
}
newtype V_Nuc = V_Nuc (Vector Nucleotide)
newtype V_Nucs = V_Nucs (Vector Nucleotides)
data IndelVariant = IndelVariant {
- deleted_bases :: !V_Nucs
- inserted_bases :: !V_Nuc
}
type BasePile = [DamagedBase]
type IndelPile = [(Qual, ([Nucleotides], [DamagedBase]))]
data Pile' a b = Pile {
- p_refseq :: !Refseq
- p_pos :: !Int
- p_snp_stat :: !CallStats
- p_snp_pile :: a
- p_indel_stat :: !CallStats
- p_indel_pile :: b
}
type Pile = Pile' (BasePile, BasePile) (IndelPile, IndelPile)
pileup :: Enumeratee [PosPrimChunks] [Pile] IO b
newtype PileM m a = PileM {
- runPileM :: forall r. (a -> PileF m r) -> PileF m r
}
type PileF m r = Refseq -> Int -> ([PrimBase], [PrimBase]) -> (Heap, Heap) -> (Stream [Pile] -> Iteratee [Pile] m r) -> Stream [PosPrimChunks] -> Iteratee [PosPrimChunks] m (Iteratee [Pile] m r)
get_refseq :: PileM m Refseq
get_pos :: PileM m Int
upd_pos :: (Int -> Int) -> PileM m ()
yieldPile :: CallStats -> BasePile -> BasePile -> CallStats -> IndelPile -> IndelPile -> PileM m ()
pileup' :: PileM m ()
pileup'' :: PileM m ()
p'feed_input :: PileM m ()
p'check_waiting :: PileM m ()
p'scan_active :: PileM m ((CallStats, BasePile), (CallStats, BasePile), (CallStats, IndelPile), (CallStats, IndelPile))
data Heap
- = Empty
- | Node !Int PrimBase Heap Heap
unionH :: Heap -> Heap -> Heap
getMinKeyH :: Heap -> Maybe Int
viewMinH :: Heap -> Maybe (Int, PrimBase, Heap)

Documentation

data PrimChunks Source #

The primitive pieces for genotype calling: A position, a base represented as four likelihoods, an inserted sequence, and the length of a deleted sequence. The logic is that we look at a base followed by some indel, and all those indels are combined into a single insertion and a single deletion.

Constructors

Seek !Int PrimBase	skip to position (at start or after N operation)
Indel [Nucleotides] [DamagedBase] PrimBase	observed deletion and insertion between two bases
EndOfRead	end of the read; nothing more follows

Instances

Show PrimChunks Source #
Instance details Defined in Bio.Bam.Pileup Methods showsPrec :: Int -> PrimChunks -> ShowS # show :: PrimChunks -> String # showList :: [PrimChunks] -> ShowS #
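
To make the shape of a chunk chain concrete, here is a small sketch that walks one. The types are simplified stand-ins (Char instead of DamagedBase and Nucleotides, no map quality), the field names pbWait, pbBase, and pbChunks are made up for the example, and the function summarize is hypothetical. It also assumes that adjacent bases are separated by an Indel with empty lists, which the description suggests but does not state outright.

```haskell
-- Simplified stand-ins for the documented types.
data PrimChunks
    = Seek !Int PrimBase            -- skip to a position, then a base
    | Indel [Char] [Char] PrimBase  -- deleted bases, inserted bases, next base
    | EndOfRead

data PrimBase = Base
    { pbWait   :: !Int        -- bases to wait due to a deletion
    , pbBase   :: !Char       -- the called base
    , pbChunks :: PrimChunks  -- remainder of the read
    }

-- | Count the aligned bases and the non-empty indel events of one read.
summarize :: PrimChunks -> (Int, Int)
summarize = go 0 0
  where
    go b i EndOfRead = (b, i)
    go b i (Seek _ pb) = go (b + 1) i (pbChunks pb)
    go b i (Indel del ins pb)
        | null del && null ins = go (b + 1) i (pbChunks pb)
        | otherwise            = go (b + 1) (i + 1) (pbChunks pb)

-- | A three-base read starting at position 100 with a one-base deletion.
exampleRead :: PrimChunks
exampleRead =
    Seek 100 (Base 0 'A' (Indel [] [] (Base 0 'C' (Indel "G" [] (Base 1 'T' EndOfRead)))))
```

Here summarize exampleRead yields (3, 1): three bases and one indel event.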

data PrimBase Source #

Constructors

Base
Fields
- _pb_wait :: !Int	number of bases to wait due to a deletion
- _pb_base :: !DamagedBase
- _pb_mapq :: !Qual	map quality
- _pb_chunks :: PrimChunks	more chunks

Instances

Show PrimBase Source #
Instance details Defined in Bio.Bam.Pileup Methods showsPrec :: Int -> PrimBase -> ShowS # show :: PrimBase -> String # showList :: [PrimBase] -> ShowS #

type PosPrimChunks = (Refseq, Int, Bool, PrimChunks) Source #

data DamagedBase Source #

Represents our knowledge about a certain base, which consists of the base itself (A,C,G,T, encoded as 0..3; no Ns), the quality score (anything that isn't A,C,G,T becomes A with quality 0), and a substitution matrix representing post-mortem but pre-sequencing substitutions.

Unfortunately, none of this can be rolled into something more simple, because damage and sequencing error behave so differently.

Damage information is polymorphic. We might run with a simple version (a matrix) for calling, but we need more (a matrix and a mutable matrix, I think) for estimation.

Constructors

DB
Fields
- db_call :: !Nucleotide	called base
- db_qual :: !Qual	quality of called base
- db_dmg_tk :: !DmgToken	damage information
- db_dmg_pos :: !Int	damage information
- db_ref :: !Nucleotides	reference base from MD field

Instances

Show DamagedBase Source #
Instance details Defined in Bio.Bam.Pileup Methods showsPrec :: Int -> DamagedBase -> ShowS # show :: DamagedBase -> String # showList :: [DamagedBase] -> ShowS #

newtype DmgToken Source #

Constructors

DmgToken
Fields fromDmgToken :: Int

decompose :: DmgToken -> BamRaw -> [PosPrimChunks] Source #

Decomposes a BAM record into chunks suitable for piling up. We pick apart the CIGAR and MD fields, and combine them with sequence and quality as appropriate. Clipped bases are removed/skipped as needed. We also apply a substitution matrix to each base, which must be supplied along with the read.

data CallStats Source #

Statistics about a genotype call. Probably only useful for filtering (so not very useful), but we keep them because they are easy to track.

Constructors

CallStats
Fields
- read_depth :: !Int
- reads_mapq0 :: !Int
- sum_mapq :: !Int
- sum_mapq_squared :: !Int

Instances

Eq CallStats Source #
Instance details Defined in Bio.Bam.Pileup Methods (==) :: CallStats -> CallStats -> Bool # (/=) :: CallStats -> CallStats -> Bool #
Show CallStats Source #
Instance details Defined in Bio.Bam.Pileup Methods showsPrec :: Int -> CallStats -> ShowS # show :: CallStats -> String # showList :: [CallStats] -> ShowS #
Generic CallStats Source #
Instance details Defined in Bio.Bam.Pileup Associated Types type Rep CallStats :: * -> * # Methods from :: CallStats -> Rep CallStats x # to :: Rep CallStats x -> CallStats #
Semigroup CallStats Source #
Instance details Defined in Bio.Bam.Pileup Methods (<>) :: CallStats -> CallStats -> CallStats # sconcat :: NonEmpty CallStats -> CallStats # stimes :: Integral b => b -> CallStats -> CallStats #
Monoid CallStats Source #
Instance details Defined in Bio.Bam.Pileup Methods mempty :: CallStats # mappend :: CallStats -> CallStats -> CallStats # mconcat :: [CallStats] -> CallStats #
type Rep CallStats Source #
Instance details Defined in Bio.Bam.Pileup type Rep CallStats = D1 (MetaData "CallStats" "Bio.Bam.Pileup" "biohazard-1.0.4-3XlcK2SyOMd8MdyOraimjZ" False) (C1 (MetaCons "CallStats" PrefixI True) ((S1 (MetaSel (Just "read_depth") SourceUnpack SourceStrict DecidedStrict) (Rec0 Int) :*: S1 (MetaSel (Just "reads_mapq0") SourceUnpack SourceStrict DecidedStrict) (Rec0 Int)) :*: (S1 (MetaSel (Just "sum_mapq") SourceUnpack SourceStrict DecidedStrict) (Rec0 Int) :*: S1 (MetaSel (Just "sum_mapq_squared") SourceUnpack SourceStrict DecidedStrict) (Rec0 Int))))
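
The Semigroup and Monoid instances suggest that statistics are accumulated read by read. The sketch below assumes a field-wise sum, which fits the field names but is not confirmed by this page; oneRead and rmsMapq are hypothetical helpers added for illustration.

```haskell
data CallStats = CallStats
    { read_depth       :: !Int
    , reads_mapq0      :: !Int
    , sum_mapq         :: !Int
    , sum_mapq_squared :: !Int
    } deriving (Eq, Show)

-- Assumption: combining two sets of statistics adds them field by field.
instance Semigroup CallStats where
    CallStats a b c d <> CallStats a' b' c' d' =
        CallStats (a + a') (b + b') (c + c') (d + d')

instance Monoid CallStats where
    mempty = CallStats 0 0 0 0

-- | Statistics contributed by a single read with the given map quality.
oneRead :: Int -> CallStats
oneRead mq = CallStats 1 (if mq == 0 then 1 else 0) mq (mq * mq)

-- | Root mean square map quality, or Nothing for an empty pile.
rmsMapq :: CallStats -> Maybe Double
rmsMapq cs
    | read_depth cs == 0 = Nothing
    | otherwise =
        Just (sqrt (fromIntegral (sum_mapq_squared cs) / fromIntegral (read_depth cs)))
```

Under that assumption, mconcat (map oneRead [0, 30, 40]) gives CallStats 3 1 70 2500, and keeping the sum of squares is what lets a filter derive the RMS map quality without revisiting the reads.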

newtype V_Nuc Source #

Constructors

V_Nuc (Vector Nucleotide)

Instances

Eq V_Nuc Source #
Instance details Defined in Bio.Bam.Pileup Methods (==) :: V_Nuc -> V_Nuc -> Bool # (/=) :: V_Nuc -> V_Nuc -> Bool #
Ord V_Nuc Source #
Instance details Defined in Bio.Bam.Pileup Methods compare :: V_Nuc -> V_Nuc -> Ordering # (<) :: V_Nuc -> V_Nuc -> Bool # (<=) :: V_Nuc -> V_Nuc -> Bool # (>) :: V_Nuc -> V_Nuc -> Bool # (>=) :: V_Nuc -> V_Nuc -> Bool # max :: V_Nuc -> V_Nuc -> V_Nuc # min :: V_Nuc -> V_Nuc -> V_Nuc #
Show V_Nuc Source #
Instance details Defined in Bio.Bam.Pileup Methods showsPrec :: Int -> V_Nuc -> ShowS # show :: V_Nuc -> String # showList :: [V_Nuc] -> ShowS #

newtype V_Nucs Source #

Constructors

V_Nucs (Vector Nucleotides)

Instances

Eq V_Nucs Source #
Instance details Defined in Bio.Bam.Pileup Methods (==) :: V_Nucs -> V_Nucs -> Bool # (/=) :: V_Nucs -> V_Nucs -> Bool #
Ord V_Nucs Source #
Instance details Defined in Bio.Bam.Pileup Methods compare :: V_Nucs -> V_Nucs -> Ordering # (<) :: V_Nucs -> V_Nucs -> Bool # (<=) :: V_Nucs -> V_Nucs -> Bool # (>) :: V_Nucs -> V_Nucs -> Bool # (>=) :: V_Nucs -> V_Nucs -> Bool # max :: V_Nucs -> V_Nucs -> V_Nucs # min :: V_Nucs -> V_Nucs -> V_Nucs #
Show V_Nucs Source #
Instance details Defined in Bio.Bam.Pileup Methods showsPrec :: Int -> V_Nucs -> ShowS # show :: V_Nucs -> String # showList :: [V_Nucs] -> ShowS #

data IndelVariant Source #

Constructors

IndelVariant
Fields
- deleted_bases :: !V_Nucs
- inserted_bases :: !V_Nuc

Instances

Eq IndelVariant Source #
Instance details Defined in Bio.Bam.Pileup Methods (==) :: IndelVariant -> IndelVariant -> Bool # (/=) :: IndelVariant -> IndelVariant -> Bool #
Ord IndelVariant Source #
Instance details Defined in Bio.Bam.Pileup Methods compare :: IndelVariant -> IndelVariant -> Ordering # (<) :: IndelVariant -> IndelVariant -> Bool # (<=) :: IndelVariant -> IndelVariant -> Bool # (>) :: IndelVariant -> IndelVariant -> Bool # (>=) :: IndelVariant -> IndelVariant -> Bool # max :: IndelVariant -> IndelVariant -> IndelVariant # min :: IndelVariant -> IndelVariant -> IndelVariant #
Show IndelVariant Source #
Instance details Defined in Bio.Bam.Pileup Methods showsPrec :: Int -> IndelVariant -> ShowS # show :: IndelVariant -> String # showList :: [IndelVariant] -> ShowS #
Generic IndelVariant Source #
Instance details Defined in Bio.Bam.Pileup Associated Types type Rep IndelVariant :: * -> * # Methods from :: IndelVariant -> Rep IndelVariant x # to :: Rep IndelVariant x -> IndelVariant #
type Rep IndelVariant Source #
Instance details Defined in Bio.Bam.Pileup type Rep IndelVariant = D1 (MetaData "IndelVariant" "Bio.Bam.Pileup" "biohazard-1.0.4-3XlcK2SyOMd8MdyOraimjZ" False) (C1 (MetaCons "IndelVariant" PrefixI True) (S1 (MetaSel (Just "deleted_bases") NoSourceUnpackedness SourceStrict DecidedStrict) (Rec0 V_Nucs) :*: S1 (MetaSel (Just "inserted_bases") NoSourceUnpackedness SourceStrict DecidedStrict) (Rec0 V_Nuc)))

type BasePile = [DamagedBase] Source #

Map quality and a list of encountered bases, with damage information and reference base if known.

type IndelPile = [(Qual, ([Nucleotides], [DamagedBase]))] Source #

Map quality and a list of encountered indel variants. The deletion has the reference sequence, if known, an insertion has the inserted sequence with damage information.

data Pile' a b Source #

Running pileup results in a series of piles. A Pile has the basic statistics of a VarCall, but no likelihood values and a pristine list of variants instead of a proper call. We emit one pile with two BasePiles (one for each strand) and one IndelPile (the one immediately following) at a time.

Constructors

Pile
Fields
- p_refseq :: !Refseq
- p_pos :: !Int
- p_snp_stat :: !CallStats
- p_snp_pile :: a
- p_indel_stat :: !CallStats
- p_indel_pile :: b

Instances

(Show a, Show b) => Show (Pile' a b) Source #
Instance details Defined in Bio.Bam.Pileup Methods showsPrec :: Int -> Pile' a b -> ShowS # show :: Pile' a b -> String # showList :: [Pile' a b] -> ShowS #

type Pile = Pile' (BasePile, BasePile) (IndelPile, IndelPile) Source #

Raw pile. Bases and indels are piled separately on forward and backward strands.

pileup :: Enumeratee [PosPrimChunks] [Pile] IO b Source #

The pileup enumeratee takes BamRaws, decomposes them, interleaves the pieces appropriately, and generates Piles. The output will contain at most one BasePile and one IndelPile for each position; piles are sorted by position.

This top-level driver receives BamRaws. Unaligned reads and duplicates are skipped (but not those merely failing quality checks). Processing stops when the first read with an invalid br_rname is encountered or at end of file.

newtype PileM m a Source #

The pileup logic keeps a current coordinate (just two integers) and two running queues: one of active PrimBases that contribute to current genotype calling and one of waiting PrimBases that will contribute at a later point.

Oppan continuation passing style! Not only is the CPS version of the state monad (we have five distinct pieces of state) somewhat faster, we also need CPS to interact with the mechanisms of Iteratee. It makes implementing yield, peek, and bump straightforward.

Constructors

PileM
Fields runPileM :: forall r. (a -> PileF m r) -> PileF m r

Instances

Monad (PileM m) Source #
Instance details Defined in Bio.Bam.Pileup Methods (>>=) :: PileM m a -> (a -> PileM m b) -> PileM m b # (>>) :: PileM m a -> PileM m b -> PileM m b # return :: a -> PileM m a # fail :: String -> PileM m a #
Functor (PileM m) Source #
Instance details Defined in Bio.Bam.Pileup Methods fmap :: (a -> b) -> PileM m a -> PileM m b # (<$) :: a -> PileM m b -> PileM m a #
Applicative (PileM m) Source #
Instance details Defined in Bio.Bam.Pileup Methods pure :: a -> PileM m a # (<*>) :: PileM m (a -> b) -> PileM m a -> PileM m b # liftA2 :: (a -> b -> c) -> PileM m a -> PileM m b -> PileM m c # (*>) :: PileM m a -> PileM m b -> PileM m b # (<*) :: PileM m a -> PileM m b -> PileM m a #
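
The CPS state monad idea behind PileM can be shown in miniature. The sketch below threads a single Int instead of the five pieces of state and the Iteratee plumbing, and it fixes the result type r as a type parameter rather than quantifying over it. CpsState, getPos, updPos, and evalCps are hypothetical names for this example only.

```haskell
-- A CPS state monad over one Int: a computation receives a continuation
-- (which expects the result and the new state) and the current state.
newtype CpsState r a = CpsState
    { runCps :: (a -> Int -> r) -> Int -> r }

instance Functor (CpsState r) where
    fmap f (CpsState m) = CpsState (\k -> m (k . f))

instance Applicative (CpsState r) where
    pure a = CpsState (\k -> k a)
    CpsState mf <*> CpsState ma = CpsState (\k -> mf (\f -> ma (k . f)))

instance Monad (CpsState r) where
    CpsState m >>= f = CpsState (\k -> m (\a -> runCps (f a) k))

-- | Read the current position (analogous in spirit to get_pos).
getPos :: CpsState r Int
getPos = CpsState (\k s -> k s s)

-- | Modify the current position (analogous in spirit to upd_pos).
updPos :: (Int -> Int) -> CpsState r ()
updPos f = CpsState (\k s -> k () (f s))

-- | Run a computation, returning its result and the final state.
evalCps :: CpsState (a, Int) a -> Int -> (a, Int)
evalCps m = runCps m (\a s -> (a, s))
```

As a usage example, evalCps (updPos (+1) >> updPos (*2) >> getPos) 10 returns (22, 22). Because every step ends by calling a continuation, an operation such as yield can hand control to something other than the "normal" continuation, which is presumably what makes the interaction with Iteratee convenient.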

type PileF m r = Refseq -> Int -> ([PrimBase], [PrimBase]) -> (Heap, Heap) -> (Stream [Pile] -> Iteratee [Pile] m r) -> Stream [PosPrimChunks] -> Iteratee [PosPrimChunks] m (Iteratee [Pile] m r) Source #

The things we drag along in PileM. Notes:

- The active queue is a simple stack. We add at the front when we encounter reads, which reverses them. When traversing it, we traverse reads backwards, but since we accumulate the BasePile, it gets reversed back. The new active queue, however, is no longer reversed (as it should be). So after the traversal, we reverse it again. (Yes, it is harder to understand than using a proper deque type, but it is cheaper. There may not be much point in the reversing, though.)

get_refseq :: PileM m Refseq Source #

get_pos :: PileM m Int Source #

upd_pos :: (Int -> Int) -> PileM m () Source #

yieldPile :: CallStats -> BasePile -> BasePile -> CallStats -> IndelPile -> IndelPile -> PileM m () Source #

Sends one piece of output downstream. You are not expected to understand how this works, but inlining eneeCheckIfDone plugged an annoying memory leak.

pileup' :: PileM m () Source #

The actual pileup algorithm. If active contains something, continue here. Else find the coordinate to continue from, which is the minimum of the next waiting coordinate and the next coordinate in input; if found, continue there, else we're all done.
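
The choice of the coordinate to continue from, when the active queue is empty, amounts to taking the smaller of two optional values. A minimal sketch follows; nextCoordinate is a hypothetical helper, and it is written for any ordered coordinate type so that a (reference index, position) pair works as well as a plain position.

```haskell
-- | Smallest of the coordinates that are present.
-- Nothing means neither waiting reads nor input remain: all done.
nextCoordinate :: Ord a => Maybe a -> Maybe a -> Maybe a
nextCoordinate (Just waiting) (Just input) = Just (min waiting input)
nextCoordinate (Just waiting) Nothing      = Just waiting
nextCoordinate Nothing        input        = input
```

For example, nextCoordinate (Just (0, 150)) (Just (0, 120)) is Just (0, 120), and nextCoordinate applied to two Nothing values is Nothing, the "all done" case.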

pileup'' :: PileM m () Source #

p'feed_input :: PileM m () Source #

Feeds input as long as it starts at the current position

p'check_waiting :: PileM m () Source #

Checks waiting queue. If there is anything waiting for the current position, moves it to active queue.

p'scan_active :: PileM m ((CallStats, BasePile), (CallStats, BasePile), (CallStats, IndelPile), (CallStats, IndelPile)) Source #

Separately scans the two active queues and makes one BasePile from each. Also sees what's next in the PrimChunks: Indels contribute to two separate IndelPiles, Seeks are pushed back to the waiting queue, EndOfReads are removed, and everything else is added to two fresh active queues.

data Heap Source #

We need a simple priority queue. Here's a skew heap (specialized to strict Int priorities and PrimBase values).

Constructors

Empty
Node !Int PrimBase Heap Heap

unionH :: Heap -> Heap -> Heap Source #

getMinKeyH :: Heap -> Maybe Int Source #

viewMinH :: Heap -> Maybe (Int, PrimBase, Heap) Source #
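
For readers unfamiliar with skew heaps, the following is a minimal sketch of how the three operations could be implemented. It is not the module's source: the payload is left polymorphic instead of being fixed to PrimBase so that the example is self-contained, and insertH and drainKeys are hypothetical helpers that do not appear in the export list.

```haskell
-- Skew heap keyed by a strict Int priority; v stands in for PrimBase.
data Heap v = Empty | Node !Int v (Heap v) (Heap v)

-- | Merge two heaps. The root with the smaller key wins; its children are
-- swapped while merging, which keeps the heap balanced in the amortized sense.
unionH :: Heap v -> Heap v -> Heap v
unionH Empty h = h
unionH h Empty = h
unionH l@(Node lk lv ll lr) r@(Node rk rv rl rr)
    | lk <= rk  = Node lk lv (unionH lr r) ll
    | otherwise = Node rk rv (unionH rr l) rl

-- | Insert by merging with a singleton heap (hypothetical helper).
insertH :: Int -> v -> Heap v -> Heap v
insertH k v = unionH (Node k v Empty Empty)

-- | Smallest key, if any.
getMinKeyH :: Heap v -> Maybe Int
getMinKeyH Empty          = Nothing
getMinKeyH (Node k _ _ _) = Just k

-- | Smallest key, its value, and the heap without that entry.
viewMinH :: Heap v -> Maybe (Int, v, Heap v)
viewMinH Empty          = Nothing
viewMinH (Node k v l r) = Just (k, v, unionH l r)

-- | Remove entries one by one, collecting keys in ascending order.
drainKeys :: Heap v -> [Int]
drainKeys h = case viewMinH h of
    Nothing         -> []
    Just (k, _, h') -> k : drainKeys h'
```

With these definitions, drainKeys (foldr (\k -> insertH k ()) Empty [5, 1, 4, 1, 3]) yields [1, 1, 3, 4, 5]. In the pileup setting the key would be the position at which a waiting PrimBase becomes active.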