Safe Haskell	None
Language	Haskell98

Bio.TwoBit

Documentation

module Bio.Base

data TwoBitFile

Would you believe it? The 2bit format stores blocks of Ns in a table at the beginning of a sequence, then packs four bases into a byte. So it is neither possible nor necessary to store Ns in the main sequence, and you would think they aren't stored there, right? And they aren't. Instead, Ts are stored, which the reader has to replace with Ns.
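The packing described above can be sketched as follows. This is a minimal illustration, not this module's code: the names baseChar, unpackByte and overlayNs are hypothetical, and the N blocks are assumed to be (start, length) pairs. The code assignment (T=0, C=1, A=2, G=3, first base in the most significant bits) follows the 2bit format description.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Word (Word8)

-- | Map a 2-bit code to its base: T=0, C=1, A=2, G=3.
baseChar :: Word8 -> Char
baseChar code = "TCAG" !! fromIntegral (code .&. 3)

-- | Unpack one byte into four bases, most significant pair first.
unpackByte :: Word8 -> String
unpackByte w = [ baseChar (w `shiftR` s) | s <- [6, 4, 2, 0] ]

-- | Overlay N blocks (assumed to be (start, length) pairs) onto decoded
-- bases. Positions inside a block become N, whatever was stored there.
overlayNs :: [(Int, Int)] -> String -> String
overlayNs blocks = zipWith pick [0 ..]
  where
    pick i c
      | any (inside i) blocks = 'N'
      | otherwise             = c
    inside i (start, len) = i >= start && i < start + len
```

A byte of all zero bits decodes to "TTTT", so a gap stored as zeros only turns into Ns once the block table is applied: overlayNs [(1, 2)] (unpackByte 0) gives "TNNT".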

The sensible way to treat these blocks is probably to just say there are two kinds of implied annotation (repeats and large gaps for a typical genome), which can be interpreted in whatever way fits. That is why we have Mask and getSubseqWith.

TODO: use binary search for the Int->Int mappings on the raw data?

Constructors

TBF
Fields
tbf_raw :: ByteString
tbf_seqs :: !(HashMap Seqid TwoBitSequence)

data TwoBitSequence

Constructors

TBS
Fields
tbs_n_blocks :: !(IntMap Int)
tbs_m_blocks :: !(IntMap Int)
tbs_dna_offset :: !Int
tbs_dna_size :: !Int

openTwoBit :: FilePath -> IO TwoBitFile

Brings a 2bit file into memory. The file is mmap'ed, so it will not work on streams that are not actual files. It's also unsafe if the file is modified in any way.

getFwdSubseqWith :: TwoBitFile -> TwoBitSequence -> (Word8 -> Mask -> a) -> Int -> [a]

getSubseq :: TwoBitFile -> Range -> [Nucleotide]

Extract a subsequence without masking.

getSubseqWith :: (Nucleotide -> Mask -> a) -> TwoBitFile -> Range -> [a]

Extract a subsequence and apply masking. A 2bit file can represent two kinds of masking (hard and soft), where hard masking is usually realized by replacing everything with Ns and soft masking by lowercasing. Here, we take a user-supplied function to apply the masking.
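To show what such a user-supplied function might look like, here is a small standalone sketch. It does not import this module: Mask is redeclared locally with the same constructors, Char stands in for Nucleotide, and applyMask is a hypothetical stand-in for the traversal that getSubseqWith performs. Rendering Both as a lowercase n is a choice made for this sketch only.

```haskell
import Data.Char (toLower, toUpper)

-- Local stand-in for the Mask type documented in this module.
data Mask = None | Soft | Hard | Both
  deriving (Eq, Ord, Enum, Show)

-- | Masking for display: lowercase for soft, N for hard.
-- Rendering Both as 'n' is an assumption of this sketch.
asciiMask :: Char -> Mask -> Char
asciiMask c None = toUpper c
asciiMask c Soft = toLower c
asciiMask _ Hard = 'N'
asciiMask _ Both = 'n'

-- | Ignore soft masking and replace anything hard masked with N.
hardOnly :: Char -> Mask -> Char
hardOnly c m
  | m == Hard || m == Both = 'N'
  | otherwise              = toUpper c

-- | Hypothetical driver: apply a masking function to annotated bases.
applyMask :: (Char -> Mask -> a) -> [(Char, Mask)] -> [a]
applyMask f = map (uncurry f)
```

Because the masking function is a parameter, the same extraction code can serve both conventions; only the function passed in changes.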

getSubseqAscii :: TwoBitFile -> Range -> String

Extract a subsequence with masking for biologists: soft masking is done by lowercasing, hard masking by printing an N.

getSubseqMasked :: TwoBitFile -> Range -> [Nucleotides]

Extract a subsequence with typical masking: soft masking is ignored, hard masked regions are replaced with Ns.

getLazySubseq :: TwoBitFile -> Position -> [Nucleotide]

Works only in the forward direction.

getSeqnames :: TwoBitFile -> [Seqid]

lookupSequence :: TwoBitFile -> Seqid -> Maybe TwoBitSequence

getSeqLength :: TwoBitFile -> Seqid -> Int

clampPosition :: TwoBitFile -> Range -> Range

Limits a range to a position within the actual sequence.
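The idea of clamping can be sketched with plain integers. This is a simplification under stated assumptions: a range is taken to be a (start, length) pair on a sequence of known length, and strand and sequence lookup are ignored. The name clampRange is hypothetical and is not the function documented above.

```haskell
-- | Clamp a (start, length) range to the interval [0, seqLen).
-- The result may have length 0 if the range lies entirely outside.
clampRange :: Int -> (Int, Int) -> (Int, Int)
clampRange seqLen (start, len) = (start', end' - start')
  where
    start' = max 0 (min seqLen start)
    end'   = max start' (min seqLen (start + len))
```

For a sequence of length 100, clampRange 100 (90, 20) yields (90, 10), and a range that starts before the sequence, such as (-5, 10), is cut down to (0, 5).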

getRandomSeq

Arguments

:: RandomGen g
=> TwoBitFile	2bit file
-> Int	desired length
-> g	RNG
-> ((Range, [Nucleotide]), g)	position, sequence, new RNG

Sample a piece of random sequence uniformly from the genome. Only pieces that are not hard masked are sampled; soft masking is allowed, but not reported. On a 32-bit platform, this will fail for genomes larger than 1G bases. However, if you're running this code on a 32-bit platform, you have bigger problems to worry about.
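Sampling uniformly from the genome means every base position is equally likely, so longer sequences receive proportionally more draws. One way to get this, shown here only as a sketch and not necessarily how getRandomSeq does it, is to draw a single index into the concatenation of all sequences and map it back to a sequence name and offset. The function locate and the (name, length) list are hypothetical; the random draw itself and the rejection of hard masked pieces are left out.

```haskell
-- | Map an index into the concatenated genome to (sequence, offset).
-- Returns Nothing if the index is negative or past the end.
locate :: [(String, Int)] -> Int -> Maybe (String, Int)
locate [] _ = Nothing
locate ((name, len) : rest) i
  | i < 0     = Nothing
  | i < len   = Just (name, i)
  | otherwise = locate rest (i - len)
```

With two sequences of lengths 10 and 5, index 12 falls at offset 2 of the second one. A caller would draw the index from the range 0 to total length minus one and retry whenever the resulting piece overlaps a hard masked block or runs off the end of its sequence.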

takeOverlap :: Int -> IntMap Int -> [(Int, Int)]

mergeBlocks :: [(Int, Int)] -> [(Int, Int)] -> [(Int, Int, Mask)]

Merge blocks of Ns and blocks of Ms into a single list of blocks with masking annotation. Gaps remain. Used internally only.
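A merge of this kind can be sketched as below. The sketch rests on assumptions that the documentation does not state: blocks are (start, length) pairs with positive length, and each input list is sorted and free of internal overlaps. Mask is redeclared locally, and mergeSketch and dropFront are hypothetical names, not the internal implementation.

```haskell
-- Local stand-in for the Mask type documented in this module.
data Mask = None | Soft | Hard | Both
  deriving (Eq, Ord, Enum, Show)

-- | Drop the first l positions of a block; a used-up block disappears.
dropFront :: Int -> (Int, Int) -> [(Int, Int)] -> [(Int, Int)]
dropFront l (s, len) rest
  | l < len   = (s + l, len - l) : rest
  | otherwise = rest

-- | Merge N blocks (hard) and M blocks (soft) into annotated blocks.
-- Regions covered by neither list are not emitted, so gaps remain.
mergeSketch :: [(Int, Int)] -> [(Int, Int)] -> [(Int, Int, Mask)]
mergeSketch hs [] = [ (s, l, Hard) | (s, l) <- hs ]
mergeSketch [] ss = [ (s, l, Soft) | (s, l) <- ss ]
mergeSketch (h@(hs0, hl) : hs) (m@(ms0, ml) : ms)
  | hs0 < ms0 = let l = min hl (ms0 - hs0)
                in (hs0, l, Hard) : mergeSketch (dropFront l h hs) (m : ms)
  | ms0 < hs0 = let l = min ml (hs0 - ms0)
                in (ms0, l, Soft) : mergeSketch (h : hs) (dropFront l m ms)
  | otherwise = let l = min hl ml
                in (hs0, l, Both) : mergeSketch (dropFront l h hs) (dropFront l m ms)
```

For an N block (0, 10) and an M block (5, 10), the result is [(0,5,Hard),(5,5,Both),(10,5,Soft)]: the overlap is split off and annotated as Both, and nothing is emitted for unmasked stretches.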

data Mask

Constructors

None
Soft
Hard
Both

Instances

Enum Mask
Eq Mask
Ord Mask
Show Mask