biohazard-0.6.5: bioinformatics support library

Safe HaskellNone




rmdup :: (Monad m, Ord l) => (BamRec -> l) -> Bool -> Collapse -> Enumeratee [BamRec] [BamRec] m r Source

Removes duplicates from an aligned, sorted BAM stream.

The incoming stream must be sorted by coordinate, and we check for violations of that assumption. We cannot assume that length was taken into account when sorting (samtools doesn't do so), so duplicates may be separated by reads that start at the same position but have different length or different strand.

We are looking at three different kinds of reads: paired reads, true single ended reads, merged or trimmed reads. They are somewhat different, but here's the situation if we wanted to treat them separately. These conditions define a set of duplicates:

Merged or trimmed: We compare the leftmost coordinates and the aligned length. If the library prep is strand-preserving, we also compare the strand.

Paired: We compare both left-most coordinates (b_pos and b_mpos). If the library prep is strand-preserving, only first-mates can be duplicates of first-mates. Else a first-mate can be the duplicate of a second-mate. There may be pairs with one unmapped mate. This is not a problem as they get assigned synthetic coordinates and will be handled smoothly.

True singles: We compare only the leftmost coordinate. It does not matter if the library prep is strand-preserving, the strand always matters.

Across these classes, we can see more duplicates:

Merged/trimmed and paired: these can be duplicates if the merging failed for the pair. We would need to compare the outer coordinates of the merged reads to the two 5' coordinates of the pair. However, since we don't have access to the mate, we cannot actually do anything right here. This case should be solved externally by merging those pairs that overlap in coordinate space.

Single and paired: in the single case, we only have one coordinate to compare. This will inevitably lead to trouble, as we could find that the single might be the duplicate of two pairs, but those two pairs are definitely not duplicates of each other. We solve it by removing the single read(s).

Single and merged/trimmed: same trouble as in the single+paired case. We remove the single to solve it.

In principle, we might want to allow some wiggle room in the coordinates. So far, this has not been implemented. It adds the complication that groups of separated reads can turn into a set of duplicates because of the appearance of a new reads. Needs some thinking about... or maybe it's not too important.

Once a set of duplicates is collected, we perform a majority vote on the correct CIGAR line. Of all those reads that agree on this CIGAR line, a consensus is called, quality scores are adjusted and clamped to a maximum, the MD field is updated and the XP field is assigned the number of reads in the original cluster. The new MAPQ becomes the RMSQ of the map qualities of all reads.

Treatment of Read Groups: We generalize by providing a "label" function; only reads that have the same label are considered duplicates of each other. The typical label function would extract read groups, libraries or samples.

normalizeTo :: Seqid -> Int -> BamRec -> BamRec Source

Normalize a read's alignment to fall into the canonical region of [0..l]. Takes the name of the reference sequence and its length.

wrapTo :: Int -> BamRec -> [BamRec] Source

Wraps a read to be fully contained in the canonical interval [0..l]. If the read overhangs, it is duplicated and both copies are suitably masked.