biohazard-0.6.2: bioinformatics support library

Safe Haskell: None




parseFastq :: Monad m => Enumeratee ByteString [BamRec] m a

Warning: parseFastq no longer removes syntactic warts!

Parser for FastA/FastQ, Iteratee style, based on Data.Attoparsec, and written such that it is compatible with module Bam. This allows importing FastA/FastQ while respecting some local conventions.

Reader for DNA (not protein) sequences in FastA and FastQ. We read everything vaguely looking like FastA or FastQ, then shoehorn it into a BAM record. We strive to extract information following more or less established conventions from the header, but we won't support everything under the sun. The recognized syntactical warts are converted into appropriate flags and removed. Only the canonical variant of FastQ is supported (qualities stored as raw bytes with base 33).
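
The "raw bytes with base 33" encoding mentioned above can be illustrated with a minimal sketch. This is not the library's code; the name decodeQuals is hypothetical, and rejecting bytes outside the printable range 33..126 is an assumption made for the example.

```haskell
import qualified Data.ByteString as B
import Data.Word (Word8)

-- | Decode a FastQ quality string stored as raw bytes with offset 33.
-- Hypothetical helper: yields Nothing if any byte falls outside 33..126.
decodeQuals :: B.ByteString -> Maybe [Word8]
decodeQuals = mapM dec . B.unpack
  where
    dec w
      | w >= 33 && w <= 126 = Just (w - 33)  -- '!' maps to 0, 'I' maps to 40
      | otherwise           = Nothing
```

For instance, the bytes of "!I5" (33, 73, 53) decode to Just [0,40,20].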

Supported additional conventions:

  • A name suffix of /1 or /2 is turned into the first mate or second mate flag and the read is flagged as paired.
  • Same for name prefixes of F_ or R_, respectively.
  • A name prefix of M_ flags the sequence as unpaired and merged.
  • A name prefix of T_ flags the sequence as unpaired and trimmed.
  • A name prefix of C_, either before or after any of the other prefixes, is turned into the extra flag XP:i:-1 (result of duplicate removal with unknown duplicate count).
  • A collection of tags separated from the name by an octothorpe is removed and put into the fields XI and XJ as text.
  • In parseFastqCassava only, if the first word of the description has at least four colon-separated subfields, the first is used to flag first/second mate, the second is the "QC failed" flag, and the fourth is the index sequence.
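
The name conventions listed above can be sketched as a small pure function. This is an illustration, not the library's implementation: the types NameInfo and Mate, the function classifyName, and the order in which suffix, tags and prefixes are examined are all assumptions. The text after the octothorpe is kept as a single string, since how it is divided between XI and XJ is not described here.

```haskell
import qualified Data.ByteString.Char8 as C

data Mate = FirstMate | SecondMate deriving (Eq, Show)

-- | Hypothetical summary of what a read name tells us.
data NameInfo = NameInfo
  { niName      :: C.ByteString        -- name with recognized warts removed
  , niMate      :: Maybe Mate          -- Just _ means the read is paired
  , niMerged    :: Bool                -- M_ prefix
  , niTrimmed   :: Bool                -- T_ prefix
  , niDuplicate :: Bool                -- C_ prefix, corresponds to XP:i:-1
  , niTags      :: Maybe C.ByteString  -- text after '#', if any
  } deriving (Eq, Show)

-- | A trailing /1 or /2 selects the mate and is dropped from the name.
applySuffix :: NameInfo -> NameInfo
applySuffix ni
  | ends "/1" = chop ni { niMate = Just FirstMate }
  | ends "/2" = chop ni { niMate = Just SecondMate }
  | otherwise = ni
  where
    nm       = niName ni
    ends s   = C.pack s `C.isSuffixOf` nm
    chop ni' = ni' { niName = C.take (C.length nm - 2) nm }

-- | Everything after the first '#' is split off as tags.
splitTags :: NameInfo -> NameInfo
splitTags ni = case C.break (== '#') (niName ni) of
  (nm, rest)
    | C.null rest -> ni
    | otherwise   -> ni { niName = nm, niTags = Just (C.drop 1 rest) }

-- | Strip recognized two-character prefixes repeatedly, so that C_ is
-- accepted both before and after the other prefixes.
applyPrefixes :: NameInfo -> NameInfo
applyPrefixes ni
  | has "C_"  = next ni { niDuplicate = True }
  | has "F_"  = next ni { niMate = Just FirstMate }
  | has "R_"  = next ni { niMate = Just SecondMate }
  | has "M_"  = next ni { niMerged = True }
  | has "T_"  = next ni { niTrimmed = True }
  | otherwise = ni
  where
    has p    = C.pack p `C.isPrefixOf` niName ni
    next ni' = applyPrefixes ni' { niName = C.drop 2 (niName ni') }

-- | Assumed order: suffix first, then tags, then prefixes.
classifyName :: C.ByteString -> NameInfo
classifyName raw =
  applyPrefixes . splitTags . applySuffix $
    NameInfo raw Nothing False False False Nothing
```

With these definitions, the name C_F_read7#ACGT/1 is reduced to read7, flagged as first mate and as a duplicate-removal result, with ACGT kept as its tag text.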

Everything before the first sequence header is ignored. Headers can start with > or @; we treat both equally. The first word of the header becomes the read name; the remainder of the header is ignored. The sequence can be split across multiple lines; whitespace, dashes and dots are ignored, IUPAC ambiguity codes are accepted as bases, and anything else causes an error. The sequence ends at a line that is either a header or starts with +; in the latter case, that line is ignored and must be followed by quality scores. There must be exactly as many Q-scores as there are bases, followed immediately by a header or end-of-file. Whitespace is ignored.
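
The record layout described above can be made concrete with a few helper functions. This is a rough sketch over already-split lines, not the Attoparsec parser the module uses; the names splitHeader, cleanSeq and checkQuals are hypothetical, and the accepted alphabet (ACGT, N and the IUPAC ambiguity codes, in either case, without U) is an assumption.

```haskell
import qualified Data.ByteString.Char8 as C
import Data.Char (isSpace, toUpper)

-- | Split a header line into (read name, description).
-- Both '>' and '@' are accepted as the header marker.
splitHeader :: C.ByteString -> Maybe (C.ByteString, C.ByteString)
splitHeader line = case C.uncons line of
  Just (c, rest)
    | c == '>' || c == '@' ->
        let (name, descr) = C.break isSpace rest
        in  Just (name, C.dropWhile isSpace descr)
  _ -> Nothing

-- | Join sequence lines, dropping whitespace, dashes and dots.
-- Any remaining character outside the assumed alphabet is an error.
cleanSeq :: [C.ByteString] -> Either String C.ByteString
cleanSeq ls =
  case C.find (not . isBase) s of
    Nothing -> Right s
    Just c  -> Left ("unexpected character in sequence: " ++ show c)
  where
    s         = C.filter (not . ignored) (C.concat ls)
    ignored c = isSpace c || c == '-' || c == '.'
    isBase c  = toUpper c `elem` "ACGTNRYSWKMBDHV"

-- | Join quality lines, dropping whitespace, and insist on exactly
-- one score per base.
checkQuals :: C.ByteString -> [C.ByteString] -> Either String C.ByteString
checkQuals sq qlines
  | C.length q == C.length sq = Right q
  | otherwise                 = Left "number of Q-scores differs from number of bases"
  where
    q = C.filter (not . isSpace) (C.concat qlines)
```

Counting scores against bases, rather than scanning for the next header, matters because @ is itself a valid quality character in the base-33 encoding, so a quality line may legitimately begin with it.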

parseFastq' :: Monad m => (ByteString -> BamRec -> BamRec) -> Enumeratee ByteString [BamRec] m a

Warning: parseFastq' no longer removes syntactic warts!

Same as parseFastq, but a custom function can be applied to the description string (the part of the header after the sequence name), which can modify the parsed record. Note that the quality field can end up empty.
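
To show the shape of such a custom function, here is a sketch that interprets a Cassava-style description along the lines given for parseFastqCassava. It is not the library's code: the record type Rec is a hypothetical stand-in for BamRec (whose fields are not described here), the name cassavaDescr is made up, and reading "Y" in the second subfield as "QC failed" is an assumption based on the usual Illumina convention.

```haskell
import qualified Data.ByteString.Char8 as C

-- | Hypothetical stand-in for BamRec, holding only what the example needs.
data Rec = Rec
  { recName       :: C.ByteString
  , recFirstMate  :: Bool
  , recSecondMate :: Bool
  , recQcFailed   :: Bool
  , recIndex      :: Maybe C.ByteString
  } deriving (Eq, Show)

-- | Has the same shape as the first argument of parseFastq':
-- description string in, record in, modified record out.
cassavaDescr :: C.ByteString -> Rec -> Rec
cassavaDescr descr r =
  case C.words descr of
    (w : _) -> case C.split ':' w of
      -- at least four colon-separated subfields in the first word
      (mate : qc : _ : idx : _) ->
        r { recFirstMate  = mate == C.pack "1"
          , recSecondMate = mate == C.pack "2"
          , recQcFailed   = qc == C.pack "Y"   -- assumption: Y means failed
          , recIndex      = Just idx
          }
      _ -> r   -- too few subfields: leave the record unchanged
    [] -> r    -- empty description: leave the record unchanged
```

A description such as 2:Y:0:ATCACG would mark the record as second mate, QC failed, with index sequence ATCACG, while a description without the four subfields leaves the record untouched.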