Safe Haskell | None |
---|---|
Language | Haskell2010 |
Would you believe it? The 2bit format stores blocks of Ns in a table at the beginning of a sequence, then packs four bases into a byte. So it is neither possible nor necessary to store Ns in the main sequence, and you would think they aren't stored there, right? And they aren't. Instead, Ts are stored, which the reader has to replace with Ns.
The sensible way to treat these is probably to just say there are two kinds of implied annotation (repeats and large gaps for a typical genome), which can be interpreted in whatever way fits. And that's why we have Mask and getSubseqWith.
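To make the packing concrete, here is a minimal sketch of unpacking one byte into four bases. It assumes the first base sits in the most significant two bits and that the codes 0..3 map to T, C, A, G (the order mentioned for getFragment below). The names `unpackByte`, `codeToChar` and `decodeByte` are made up for illustration and are not part of this module.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Word (Word8)

-- | Split one packed byte into its four 2-bit codes, first base first.
-- Assumption: the first base occupies the most significant two bits.
unpackByte :: Word8 -> [Word8]
unpackByte b = [ (b `shiftR` s) .&. 3 | s <- [6, 4, 2, 0] ]

-- | Render a 2-bit code, assuming the order T, C, A, G for 0..3.
codeToChar :: Word8 -> Char
codeToChar c = "TCAG" !! fromIntegral (c .&. 3)

-- | Decode a whole byte into four bases.
decodeByte :: Word8 -> String
decodeByte = map codeToChar . unpackByte
```

Under these assumptions, a zero byte decodes to four Ts, which is exactly what a reader sees inside an N block before consulting the block table.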
- data TwoBitFile = TBF { tbf_raw :: ByteString, tbf_seqs :: !(HashMap Seqid TwoBitSequence) }
- data TwoBitSequence = TBS { tbs_n_blocks :: !(IntMap Int), tbs_m_blocks :: !(IntMap Int), tbs_dna_offset :: !Int, tbs_dna_size :: !Int }
- openTwoBit :: FilePath -> IO TwoBitFile
- getFwdSubseqWith :: TwoBitFile -> TwoBitSequence -> (Word8 -> Mask -> a) -> Int -> [a]
- getSubseq :: TwoBitFile -> Range -> [Nucleotide]
- getSubseqWith :: (Nucleotide -> Mask -> a) -> TwoBitFile -> Range -> [a]
- getSubseqAscii :: TwoBitFile -> Range -> String
- getSubseqMasked :: TwoBitFile -> Range -> [Nucleotides]
- getLazySubseq :: TwoBitFile -> Position -> [Nucleotide]
- getFragment :: TwoBitFile -> Seqid -> Int -> Int -> Vector Word8
- getFwdSubseqV :: TwoBitFile -> TwoBitSequence -> Int -> Int -> Vector Word8
- getSeqnames :: TwoBitFile -> [Seqid]
- lookupSequence :: TwoBitFile -> Seqid -> Maybe TwoBitSequence
- getSeqLength :: TwoBitFile -> Seqid -> Int
- clampPosition :: TwoBitFile -> Range -> Range
- getRandomSeq :: RandomGen g => TwoBitFile -> Int -> g -> ((Range, [Nucleotide]), g)
- takeOverlap :: Int -> IntMap Int -> [(Int, Int)]
- mergeBlocks :: [(Int, Int)] -> [(Int, Int)] -> [(Int, Int, Mask)]
- data Mask
Documentation
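data TwoBitFile
- TBF
  - tbf_raw :: ByteString
  - tbf_seqs :: !(HashMap Seqid TwoBitSequence)

data TwoBitSequence
- TBS
  - tbs_n_blocks :: !(IntMap Int)
  - tbs_m_blocks :: !(IntMap Int)
  - tbs_dna_offset :: !Int
  - tbs_dna_size :: !Int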
openTwoBit :: FilePath -> IO TwoBitFile
Brings a 2bit file into memory. The file is mmap'ed, so it will not work on streams that are not actual files. It's also unsafe if the file is modified in any way.
getFwdSubseqWith :: TwoBitFile -> TwoBitSequence -> (Word8 -> Mask -> a) -> Int -> [a]
getSubseq :: TwoBitFile -> Range -> [Nucleotide]
Extract a subsequence without masking.
getSubseqWith :: (Nucleotide -> Mask -> a) -> TwoBitFile -> Range -> [a]
Extract a subsequence and apply masking. A 2bit file can represent two kinds of masking (hard and soft), where hard masking is usually realized by replacing every base with N and soft masking is done by lowercasing. Here, we take a user-supplied function to apply masking.
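As an illustration of what such a user-supplied function might look like, the sketch below lowercases soft-masked bases and prints N for hard-masked ones. It is self-contained, so it uses stand-ins: `MaskKind` is a hypothetical substitute for the library's Mask (whose constructors are not listed on this page), `Char` stands in for Nucleotide, and `applyMasking` only mimics the way a masking function is mapped over a sequence.

```haskell
import Data.Char (toLower)

-- Hypothetical stand-in for the library's Mask type; the real constructor
-- names are not shown on this page.
data MaskKind = Unmasked | SoftMasked | HardMasked | SoftAndHard
  deriving (Eq, Show)

-- | Lowercase soft-masked bases, print N for hard-masked ones.
-- Char stands in for Nucleotide here.
asciiMasking :: Char -> MaskKind -> Char
asciiMasking c Unmasked    = c
asciiMasking c SoftMasked  = toLower c
asciiMasking _ HardMasked  = 'N'
asciiMasking _ SoftAndHard = 'N'  -- a choice made for this sketch

-- | Apply a masking function to a list of (base, mask) pairs.
applyMasking :: (Char -> MaskKind -> a) -> [(Char, MaskKind)] -> [a]
applyMasking f = map (uncurry f)
```

With the real API, a function of type `Nucleotide -> Mask -> a` would be passed as the first argument of getSubseqWith instead.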
getSubseqAscii :: TwoBitFile -> Range -> String
Extract a subsequence with masking for biologists: soft masking is done by lowercasing, hard masking by printing an N.
getSubseqMasked :: TwoBitFile -> Range -> [Nucleotides]
Extract a subsequence with typical masking: soft masking is ignored, hard masked regions are replaced with Ns.
getLazySubseq :: TwoBitFile -> Position -> [Nucleotide]
Works only in the forward direction.
getFragment :: TwoBitFile -> Seqid -> Int -> Int -> Vector Word8
Gets a fragment from a 2bit file. The result always has the desired length; if necessary, it is padded with Ns. Be careful about the unconventional encoding: the codes 0..4 stand for T, C, A, G, N, in that order.
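The encoding and the padding behaviour can be sketched on plain lists. This is only an illustration: `renderCode`, `padWithN` and `renderFragment` are hypothetical helpers, lists replace `Vector Word8` to keep the example free of extra dependencies, and padding is shown at the end only, which may not match where getFragment actually pads.

```haskell
import Data.Word (Word8)

-- | Render one code using the 0..4 == TCAGN convention.
-- Codes outside that range are shown as '?'.
renderCode :: Word8 -> Char
renderCode c
  | c <= 4    = "TCAGN" !! fromIntegral c
  | otherwise = '?'

-- | Pad (or truncate) to exactly n codes, filling with 4, the code for N.
-- Simplification: padding is appended at the end only.
padWithN :: Int -> [Word8] -> [Word8]
padWithN n codes = take n (codes ++ repeat 4)

-- | Produce a printable fragment of exactly n bases.
renderFragment :: Int -> [Word8] -> String
renderFragment n = map renderCode . padWithN n
```

For example, `renderFragment 6 [2, 1, 3, 0]` yields `"ACGTNN"`: four real bases followed by two padding Ns.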
getFwdSubseqV :: TwoBitFile -> TwoBitSequence -> Int -> Int -> Vector Word8
getSeqnames :: TwoBitFile -> [Seqid]
lookupSequence :: TwoBitFile -> Seqid -> Maybe TwoBitSequence
getSeqLength :: TwoBitFile -> Seqid -> Int
clampPosition :: TwoBitFile -> Range -> Range
Limits a range to a position within the actual sequence.
getRandomSeq
:: RandomGen g | |
=> TwoBitFile | 2bit file |
-> Int | desired length |
-> g | RNG |
-> ((Range, [Nucleotide]), g) | position, sequence, new RNG |
Sample a piece of random sequence uniformly from the genome. Only pieces that are not hard masked are sampled; soft masking is allowed, but not reported. On a 32bit platform, this will fail for genomes larger than 1G bases. However, if you're running this code on a 32bit platform, you have bigger problems to worry about.
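One way to picture uniform sampling over a whole genome is to draw a single offset in the range from 0 to the total genome length and map it back to a sequence and a position within it. The sketch below shows only that mapping; it is not necessarily how getRandomSeq works internally, it leaves out the rejection of hard-masked pieces, and `locate` and `genomeLength` are hypothetical names. It also shows why the total length must fit into an `Int`.

```haskell
-- | Total number of bases across all sequences.
-- This sum has to fit into an Int, hence the 32bit caveat.
genomeLength :: [(String, Int)] -> Int
genomeLength = sum . map snd

-- | Map a flat genome offset to (sequence name, offset within that sequence).
-- Returns Nothing if the offset lies outside the genome.
locate :: [(String, Int)] -> Int -> Maybe (String, Int)
locate [] _ = Nothing
locate ((name, len) : rest) off
  | off < 0   = Nothing
  | off < len = Just (name, off)
  | otherwise = locate rest (off - len)
```

Drawing the offset uniformly makes longer sequences proportionally more likely to be hit, which is what sampling uniformly from the genome requires.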