Safe Haskell | None |
---|---|
Language | Haskell98 |
- parseFastq :: Monad m => Enumeratee ByteString [BamRec] m a
- parseFastq' :: Monad m => (ByteString -> BamRec -> BamRec) -> Enumeratee ByteString [BamRec] m a
- parseFastqCassava :: Monad m => Enumeratee ByteString [BamRec] m a
Documentation
parseFastq :: Monad m => Enumeratee ByteString [BamRec] m a
Warning: parseFastq no longer removes syntactic warts!
Parser for FastA/FastQ, Iteratee style, based on Data.Attoparsec, and written such that it is compatible with module Bam. This gives import of FastA/FastQ while respecting some local conventions.
Reader for DNA (not protein) sequences in FastA and FastQ. We read everything vaguely looking like FastA or FastQ, then shoehorn it into a BAM record. We strive to extract information following more or less established conventions from the header, but we won't support everything under the sun. The recognized syntactical warts are converted into appropriate flags and removed. Only the canonical variant of FastQ is supported (qualities stored as raw bytes with base 33).
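To illustrate what "raw bytes with base 33" means for the quality line, here is a minimal sketch of decoding such a string into numeric scores. The function name `decodePhred33` is hypothetical and works on `String` for simplicity; it is not part of this module, which produces a BamRec directly.

```haskell
import Data.Char (ord)

-- | Decode a FastQ quality string stored with offset 33 ("Phred+33").
-- Returns Nothing if a character lies outside the printable range 33..126.
-- Illustrative helper only; not the library's implementation.
decodePhred33 :: String -> Maybe [Int]
decodePhred33 = traverse step
  where
    step c
      | n >= 33 && n <= 126 = Just (n - 33)
      | otherwise           = Nothing
      where
        n = ord c
```

For example, `decodePhred33 "!5I"` yields `Just [0,20,40]`, while a string containing a space yields `Nothing`.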
Supported additional conventions:
- A name suffix of /1 or /2 is turned into the first mate or second mate flag and the read is flagged as paired.
- Same for name prefixes of F_ or R_, respectively.
- A name prefix of M_ flags the sequence as unpaired and merged.
- A name prefix of T_ flags the sequence as unpaired and trimmed.
- A name prefix of C_, either before or after any of the other prefixes, is turned into the extra flag XP:i:-1 (result of duplicate removal with unknown duplicate count).
- A collection of tags separated from the name by an octothorpe is removed and put into the fields XI and XJ as text.
- In parseFastqCassava only, if the first word of the description has at least four colon separated subfields, the first is used to flag first/second mate, the second is the "QC failed" flag, and the fourth is the index sequence.
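The prefix and suffix conventions listed above can be sketched as a small pure function. This is only an illustration under simplifying assumptions: the `Flag` type, `prefixTable` and `classifyName` are hypothetical names, read names are plain `String`s, prefixes are stripped repeatedly in any order, and the octothorpe tags are not handled. The library itself records this information in a BamRec, and per the deprecation warning it may leave the name untouched.

```haskell
import Data.List (isSuffixOf, nub, stripPrefix)

-- | Hypothetical stand-in for the flags implied by the naming conventions.
data Flag = Paired | FirstMate | SecondMate | Merged | Trimmed | Duplicate
  deriving (Eq, Show)

-- | Recognized name prefixes and the flags each one implies.
prefixTable :: [(String, [Flag])]
prefixTable =
  [ ("F_", [Paired, FirstMate])
  , ("R_", [Paired, SecondMate])
  , ("M_", [Merged])
  , ("T_", [Trimmed])
  , ("C_", [Duplicate])   -- corresponds to the extra field XP:i:-1
  ]

-- | Strip recognized prefixes and a /1 or /2 suffix, collecting flags.
classifyName :: String -> (String, [Flag])
classifyName raw = (name, nub (prefixFlags ++ suffixFlags))
  where
    (afterPrefixes, prefixFlags) = stripPrefixes raw
    (name, suffixFlags)
      | "/1" `isSuffixOf` afterPrefixes = (dropEnd2 afterPrefixes, [Paired, FirstMate])
      | "/2" `isSuffixOf` afterPrefixes = (dropEnd2 afterPrefixes, [Paired, SecondMate])
      | otherwise                       = (afterPrefixes, [])
    dropEnd2 s = take (length s - 2) s

-- | Repeatedly remove any prefix from the table, so that C_ may appear
-- before or after the other prefixes.
stripPrefixes :: String -> (String, [Flag])
stripPrefixes s = go prefixTable
  where
    go [] = (s, [])
    go ((p, fs) : rest) = case stripPrefix p s of
      Just s' -> let (n, more) = stripPrefixes s' in (n, fs ++ more)
      Nothing -> go rest
```

With these definitions, `classifyName "C_F_read7/1"` gives `("read7",[Duplicate,Paired,FirstMate])` and a name without any recognized wart is returned unchanged with an empty flag list.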
Everything before the first sequence header is ignored. Headers can start with > or @; we treat both equally. The first word of the header becomes the read name, the remainder of the header is ignored. The sequence can be split across multiple lines; whitespace, dashes and dots are ignored, IUPAC ambiguity codes are accepted as bases, anything else causes an error. The sequence ends at a line that is either a header or starts with +; in the latter case, that line is ignored and must be followed by quality scores. There must be exactly as many Q-scores as there are bases, followed immediately by a header or end-of-file. Whitespace is ignored.
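The rule for sequence lines described above can be sketched as follows. The names `cleanSequence` and `iupac` are hypothetical, and the exact set of accepted letters (here the nucleotide IUPAC codes including U, in either case) is an assumption; the library's actual parser is built on Attoparsec and may differ in detail.

```haskell
import Data.Char (isSpace, toUpper)

-- | Nucleotide IUPAC codes assumed to be acceptable as bases.
iupac :: String
iupac = "ACGTURYSWKMBDHVN"

-- | Collect the bases from (possibly multi-line) sequence text.
-- Whitespace, dashes and dots are skipped; the first character that is
-- neither ignorable nor an IUPAC code is reported as an error.
cleanSequence :: String -> Either Char String
cleanSequence = traverse check . filter (not . ignorable)
  where
    ignorable c = isSpace c || c == '-' || c == '.'
    check c
      | toUpper c `elem` iupac = Right c
      | otherwise              = Left c
```

For instance, `cleanSequence "ACG-T\nnn.R"` gives `Right "ACGTnnR"`, whereas `cleanSequence "ACXG"` gives `Left 'X'`.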
parseFastq' :: Monad m => (ByteString -> BamRec -> BamRec) -> Enumeratee ByteString [BamRec] m a
Warning: parseFastq' no longer removes syntactic warts!
Same as parseFastq, but a custom function can be applied to the description string (the part of the header after the sequence name), which can modify the parsed record. Note that the quality field can end up empty.
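As a rough illustration of what such a custom function might look like, the sketch below interprets a Cassava-style description (colon separated subfields in the first word) and updates a record. Everything here is an assumption made for the example: `Rec` is a hypothetical stand-in for BamRec, `String` stands in for ByteString, and the encodings "1"/"2" for the mate and "Y" for "QC failed" follow the common Illumina convention rather than anything stated in this module.

```haskell
-- | Hypothetical stand-in for the fields of interest; not the library's BamRec.
data Rec = Rec
  { recFirstMate  :: Bool
  , recSecondMate :: Bool
  , recQcFailed   :: Bool
  , recIndex      :: String
  } deriving (Eq, Show)

emptyRec :: Rec
emptyRec = Rec False False False ""

-- | Split a string on a separator character, keeping empty fields.
splitOn :: Char -> String -> [String]
splitOn sep s = case break (== sep) s of
  (w, [])       -> [w]
  (w, _ : rest) -> w : splitOn sep rest

-- | A callback in the shape parseFastq' expects (description, then record).
-- Only the first word of the description is inspected; records whose
-- description has fewer than four subfields are returned unchanged.
cassavaDescr :: String -> Rec -> Rec
cassavaDescr descr r =
  case concatMap (splitOn ':') (take 1 (words descr)) of
    (mate : qc : _ : idx : _) ->
      r { recFirstMate  = mate == "1"
        , recSecondMate = mate == "2"
        , recQcFailed   = qc == "Y"
        , recIndex      = idx
        }
    _ -> r
```

Applied to the description `"2:Y:0:TTAGGC"`, this sets the second mate and QC failed flags and stores `"TTAGGC"` as the index sequence.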
parseFastqCassava :: Monad m => Enumeratee ByteString [BamRec] m a