Safe Haskell | None |
---|---|
Language | Haskell98 |
- data PrimChunks
- data PrimBase = Base {
- _pb_wait :: !Int
- _pb_likes :: !DamagedBase
- _pb_mapq :: !Qual
- _pb_rev :: !Bool
- _pb_chunks :: PrimChunks
- data DamagedBase = DB {}
- decompose :: BamRaw -> [Mat44D] -> PrimChunks
- data CallStats = CallStats {
- read_depth :: !Int
- reads_mapq0 :: !Int
- sum_mapq :: !Int
- sum_mapq_squared :: !Int
- type GL = Vector Prob
- newtype V_Nuc = V_Nuc (Vector Nucleotide)
- data IndelVariant = IndelVariant {
- deleted_bases :: !Int
- inserted_bases :: !V_Nuc
- type BasePile = [(Qual, DamagedBase)]
- type IndelPile = [(Qual, (Int, [DamagedBase]))]
- data Pile' a b = Pile {
- p_refseq :: !Refseq
- p_pos :: !Int
- p_snp_stat :: !CallStats
- p_snp_pile :: a
- p_indel_stat :: !CallStats
- p_indel_pile :: b
- type Pile = Pile' (BasePile, BasePile) IndelPile
- type Calls = Pile' GL (GL, [IndelVariant])
- pileup :: Monad m => DamageModel Double -> Enumeratee [BamRaw] [Pile] m a
- newtype PileM m a = PileM {}
- type PileF m r = Refseq -> Int -> [PrimBase] -> Heap -> DamageModel Double -> (Stream [Pile] -> Iteratee [Pile] m r) -> Stream [BamRaw] -> Iteratee [BamRaw] m (Iteratee [Pile] m r)
- get_refseq :: PileM m Refseq
- get_pos :: PileM m Int
- upd_pos :: (Int -> Int) -> PileM m ()
- set_pos :: (Refseq, Int) -> PileM m ()
- get_active :: PileM m [PrimBase]
- upd_active :: ([PrimBase] -> [PrimBase]) -> PileM m ()
- get_waiting :: PileM m Heap
- upd_waiting :: (Heap -> Heap) -> PileM m ()
- get_damage_model :: PileM m (DamageModel Double)
- yield :: Monad m => Pile -> PileM m ()
- peek :: PileM m (Maybe BamRaw)
- bump :: PileM m ()
- consume_active :: a -> (a -> PrimBase -> PileM m a) -> PileM m a
- pileup' :: Monad m => PileM m ()
- pileup'' :: Monad m => PileM m ()
- partitionPairEithers :: [(a, Either b c)] -> ([(a, b)], [(a, c)])
- data Heap
- union :: Heap -> Heap -> Heap
- insert :: Int -> PrimBase -> Heap -> Heap
- getMinKey :: Heap -> Maybe Int
- viewMin :: Heap -> Maybe (Int, PrimBase, Heap)
TODO*
A whole lot of testing.
Actual genotype calling.
ML fitting and evaluation of parameters for different possible
Maybe specialize to ploidy one and two.
data PrimChunks Source
The primitive pieces for genotype calling: A position, a base represented as four likelihoods, an inserted sequence, and the length of a deleted sequence. The logic is that we look at a base followed by some indel, and all those indels are combined into a single insertion and a single deletion.
Base | more chunks |
|
data DamagedBase Source
Represents our knowledge about a certain base, which consists of the base itself (A,C,G,T, encoded as 0..3; no Ns), the quality score (anything that isn't A,C,G,T becomes A with quality 0), and a substitution matrix representing post-mortem but pre-sequencing substitutions.
Unfortunately, none of this can be rolled into something more simple, because damage and sequencing error behave so differently.
decompose :: BamRaw -> [Mat44D] -> PrimChunks Source
Decomposes a BAM record into chunks suitable for piling up. We
pick apart the CIGAR field, and combine it with sequence and quality
as appropriate. We ignore the MD
field, even if it is present.
Clipped bases are removed/skipped as appropriate. We also ignore the
reference allele, in fact, we don't even know it, which nicely avoids
any possible reference bias by construction. But we do apply a
substitution matrix to each base, which must be supplied along with
the read.
Statistics about a genotype call. Probably only useful for fitlering (so not very useful), but we keep them because it's easy to track them.
CallStats | |
|
Genotype likelihood values. A variant call consists of a position,
some measure of qualities, genotype likelihood values, and a
representation of variants. A note about the GL values: VCF
would
normalize them so that the smallest one becomes zero. We do not do
that here, since we might want to compare raw values for a model
test. We also store them in a Double
to make arithmetics easier.
Normalization is appropriate when converting to VCF
.
If GL is given, we follow the same order used in VCF: "the ordering of genotypes for the likelihoods is given by: F(jk) = (k*(k+1)2)+j. In other words, for biallelic sites the ordering is: AA,AB,BB; for triallelic sites the ordering is: AA,AB,BB,AC,BC,CC, etc."
data IndelVariant Source
IndelVariant | |
|
type BasePile = [(Qual, DamagedBase)] Source
type IndelPile = [(Qual, (Int, [DamagedBase]))] Source
Running pileup results in a series of piles. A Pile
has the
basic statistics of a VarCall
, but no GL values and a pristine list
of variants instead of a proper call. We emit one pile with two
BasePile
s (one for each strand) and one IndelPile
(the one
immediately following) at a time.
Pile | |
|
pileup :: Monad m => DamageModel Double -> Enumeratee [BamRaw] [Pile] m a Source
The pileup enumeratee takes BamRaw
s, decomposes them, interleaves
the pieces appropriately, and generates Pile
s. The output will
contain at most one BasePile
and one IndelPile
for each position,
piles are sorted by position.
This top level driver receives BamRaw
s. Unaligned reads and
duplicates are skipped (but not those merely failing quality checks).
Processing stops when the first read with invalid br_rname
is
encountered or a t end of file.
The pileup logic keeps a current coordinate (just two integers) and
two running queues: one of active PrimBase
s that contribute to
current genotype calling and on of waiting PrimBase
s that will
contribute at a later point.
Oppan continuation passing style! Not only is the CPS version of the
state monad (we have five distinct pieces of state) somewhat faster,
we also need CPS to interact with the mechanisms of Iteratee
. It
makes implementing yield
, peek
, and bump
straight forward.
type PileF m r = Refseq -> Int -> [PrimBase] -> Heap -> DamageModel Double -> (Stream [Pile] -> Iteratee [Pile] m r) -> Stream [BamRaw] -> Iteratee [BamRaw] m (Iteratee [Pile] m r) Source
The things we drag along in PileM
. Notes:
* The active queue is a simple stack. We add at the front when we
encounter reads, which reverses them. When traversing it, we traverse
reads backwards, but since we accumulate the BasePile
, it gets reversed
back. The new active queue, however, is no longer reversed (as it should
be). So after the traversal, we reverse it again. (Yes, it is harder to
understand than using a proper deque type, but it is cheaper.
There may not be much point in the reversing, though.)
get_refseq :: PileM m Refseq Source
get_active :: PileM m [PrimBase] Source
upd_active :: ([PrimBase] -> [PrimBase]) -> PileM m () Source
get_waiting :: PileM m Heap Source
upd_waiting :: (Heap -> Heap) -> PileM m () Source
get_damage_model :: PileM m (DamageModel Double) Source
peek :: PileM m (Maybe BamRaw) Source
Inspect next input element, if any. Returns Just b
if b
is the
next input element, Nothing
if no such element exists. Waits for
more input if nothing is available immediately.
Discard next input element, if any. Does nothing if input has already ended. Waits for input to discard if nothing is available immediately.
consume_active :: a -> (a -> PrimBase -> PileM m a) -> PileM m a Source
partitionPairEithers :: [(a, Either b c)] -> ([(a, b)], [(a, c)]) Source