biohazard-2.1: bioinformatics support library

Safe Haskell: None
Language: Haskell2010

Bio.Bam.Fastq

Description

Parser for FastA/FastQ, ByteStream style, written such that it works well with module Bio.Bam.

Input streams are broken into numbered lines, then into records. A record may be preceded by empty lines, which are ignored, or by random junk, which is also ignored but results in a warning. The record itself starts with a header line indicating either a FastA record (begins with > or ;) or a FastQ record (begins with @). Further description lines beginning with ; are allowed and silently ignored. All following lines not starting with +, >, ; or @ are sequence lines. In a FastQ record only, these are followed by a separator line starting with +, which is ignored, and exactly as many quality lines as there were sequence lines. A missing separator results in a warning, and the record is parsed without quality scores.
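The record grammar above can be sketched in a few lines. This is a simplified illustration on plain Strings, not the library's implementation: the names Record, Warning, and parseRecords are hypothetical, and the real parser works on streamed Bytes.

```haskell
import Data.Char (isSpace)

-- | Hypothetical record type: header text, sequence lines, optional quality lines.
data Record = Record String [String] (Maybe [String])
  deriving (Eq, Show)

-- | Hypothetical warnings, tagged with a 1-based line number.
data Warning = Junk Int String | NoSeparator Int String
  deriving (Eq, Show)

parseRecords :: String -> ([Warning], [Record])
parseRecords = go . zip [1 ..] . lines
  where
    go [] = ([], [])
    go ((n, l) : rest) = case l of
      _ | all isSpace l -> go rest                 -- empty line: ignored
      (c : name)
        | c `elem` ">;" -> fasta name rest         -- FastA header
        | c == '@'      -> fastq n name rest       -- FastQ header
      _ -> let (ws, rs) = go rest                  -- junk: warn and skip
           in (Junk n l : ws, rs)

    -- Skip ';' description lines, then collect sequence lines up to the
    -- next line that starts with '+', '>', ';' or '@'.
    body rest =
      let rest'         = dropWhile (startsWith ";" . snd) rest
          (sqs, rest'') = span (not . startsWith "+>;@" . snd) rest'
      in (map snd sqs, rest'')

    fasta name rest =
      let (sqs, rest') = body rest
          (ws, rs)     = go rest'
      in (ws, Record name sqs Nothing : rs)

    fastq n name rest =
      let (sqs, rest') = body rest
      in case rest' of
           -- Separator found: take exactly as many quality lines as
           -- there were sequence lines, whatever they start with.
           ((_, '+' : _) : rest'') ->
             let (qs, rest''') = splitAt (length sqs) rest''
                 (ws, rs)      = go rest'''
             in (ws, Record name sqs (Just (map snd qs)) : rs)
           -- Separator missing: warn and keep the record without qualities.
           _ ->
             let (ws, rs) = go rest'
             in (NoSeparator n name : ws, Record name sqs Nothing : rs)

    startsWith cs (c : _) = c `elem` cs
    startsWith _  []      = False
```

Counting quality lines instead of inspecting their first character matters because @ and + are valid quality characters, so a quality line can look exactly like a header or a separator.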

In sequence lines, IUPAC-IUB ambiguity codes are converted to Nucleotides and white space is skipped silently. Any other character becomes an unknown base ('=' in SAM) and a warning is emitted. Note that downstream tools are unlikely to handle the resulting unknown bases and/or empty records gracefully. If the quality lines do not have the same total length as the sequence lines (this includes missing quality lines due to end-of-stream), a warning is emitted, and the record receives no quality scores (just as if it were a FastA record). Otherwise, if the quality lines have a different layout than the sequence lines, a warning is emitted, but they are still used.
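A minimal sketch of these two rules follows. It assumes the 4-bit base encoding of the SAM/BAM specification, where the code of a base is its index in "=ACMGRSVTWYHKDBN"; whether Nucleotides uses exactly this representation is an assumption. The function names and warning strings are hypothetical, and lengths are compared on the raw lines for simplicity.

```haskell
import Data.Char (isSpace, toUpper)
import Data.List (elemIndex)

-- | Base codes as in the SAM/BAM specification: the index is the code,
-- so '=' is 0 (used here for unknown bases) and 'N' is 15.
bamCodes :: String
bamCodes = "=ACMGRSVTWYHKDBN"

-- | Right code for an IUPAC-IUB letter, Left for anything else.
encodeBase :: Char -> Either Char Int
encodeBase c = case elemIndex (toUpper c) (drop 1 bamCodes) of
  Just i  -> Right (i + 1)
  Nothing -> Left c

-- | Encode all sequence lines of a record. White space is dropped;
-- unexpected characters become code 0 and are also returned so that
-- the caller can emit a warning.
encodeSeq :: [String] -> ([Int], String)
encodeSeq ls = (map (either (const 0) id) rs, [c | Left c <- rs])
  where
    rs = map encodeBase (filter (not . isSpace) (concat ls))

-- | Decide whether quality lines are usable for the given sequence lines.
-- Returns the concatenated qualities (if any) and warning messages.
attachQuals :: [String] -> [String] -> (Maybe String, [String])
attachQuals sqs qs
  | length (concat sqs) /= length (concat qs) =
      (Nothing, ["quality length differs from sequence length"])
  | map length sqs /= map length qs =
      (Just (concat qs), ["quality layout differs from sequence layout"])
  | otherwise = (Just (concat qs), [])
```

For example, encodeSeq ["AC gt", "nX"] yields the codes [1,2,4,8,15,0] and reports "X" as unexpected.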

Quality scores must be stored as raw bytes with offset 33. (Other variants, like 454's ASCII qualities and Solexa's raw bytes with offset 64, are difficult to detect and extinct in the wild anyway.) If the second word of the header stores multiple fields, we try to extract Illumina's "QC failed" flag and either an index sequence or a read group name from it.
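As an illustration, the sketch below decodes offset-33 qualities and picks apart a second header word of the form read:filtered:control:tag, as written by Illumina's pipeline, where the filtered field is Y for reads that failed QC. The names decodeQuals, headerFlags, and Tag are hypothetical, and the rule used to tell an index sequence from a read group name (only the letters A, C, G, T, N) is an assumption made for this example, not the library's documented behaviour.

```haskell
import Data.Char (ord)

-- | Decode quality characters stored with offset 33.
decodeQuals :: String -> [Int]
decodeQuals = map (\c -> ord c - 33)

-- | Split a string on a delimiter character, keeping empty fields.
splitOn :: Char -> String -> [String]
splitOn d s = case break (== d) s of
  (a, [])       -> [a]
  (a, _ : rest) -> a : splitOn d rest

-- | Hypothetical result type for the last header field.
data Tag = Index String | ReadGroup String
  deriving (Eq, Show)

-- | Extract the "QC failed" flag and the tag from the second word of a
-- header, if that word has exactly four colon-separated fields.
headerFlags :: String -> Maybe (Bool, Tag)
headerFlags hdr = case words hdr of
  (_ : w : _) -> case splitOn ':' w of
    [_, filt, _, tag] -> Just (filt == "Y", classify tag)
    _                 -> Nothing
  _ -> Nothing
  where
    -- Assumption: a tag made only of base letters is an index sequence.
    classify t
      | not (null t) && all (`elem` "ACGTN") t = Index t
      | otherwise                              = ReadGroup t
```

With these definitions, decodeQuals "!I5" gives [0,40,20], and a header without a second word yields Nothing, leaving the record unflagged.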

Other flags are commonly encoded into the sequence names. We do not handle those here, but most of the conventions at MPI EVAN are dealt with by removeWarts.

Documentation

data JunkFound

Emitted when random text is found instead of a header.

Constructors

JunkFound !Int !Bytes 

data QualitiesMissing

Emitted when a quality separator was expected, but not found.

Constructors

QualitiesMissing !Int !Bytes 

data SequenceHasGaps

Emitted when a sequence record contains unexpected characters.

Constructors

SequenceHasGaps !Int !Bytes