.\" Process this file with .\" groff -man -Tascii bam-rmdup.1 .\" .TH JIVEBUNNY 1 "JULY 2015" Applications "User Manuals" .SH NAME jivebunny \- demultiplex Illumina sequences .SH SYNOPSIS .B jivebunny [ .I option .B | .I file .B ... ] .SH DESCRIPTION .B jivebunny demultiplexes double-index Illumina sequencing data from one or more BAM files. In a first pass, the present mixture is analyzed, which serves to estimate possibly uneven mixture ratios and to assess unexpected contaminants. In a second pass, each read is assigned to the most probable read group given the estimated mixture ratios. For both passes, all input files are concatenated. Summary statistics and quality scores are computed globally and per read as appropriate. .SH OPTIONS .IP "-o, --output file" Send BAM output to .IR file . The default is to produce no output and estimate mixture ratios only. If .I file is '-', BAM output is sent to .I stdout and the final tally is instead sent to .IR stderr . .IP "-I, --index-database file" Read the database of possible indices from .IR file . Every combination of a P7 index and a P5 index from this file is considered a possible component of the mix. The default is a file containing all Illumina Truseq indices and indices from the Meyer/Kircher paper. See below for the format of this file. .IP "-r, --read-groups file" Read read group definitions from file, see below for the format of this file. Read group definitions are not necessary to identify a mixture component, but only known components can be named and assigned. .IP "--threshold frac" Set the threshold for the estimation of mixture components to .IR frac . The iteration stops as soon as no estimate for any component changes by more than .IR frac . The default of 1/200000 seems to work well in practice. .IP "--sample num" Sample .I num reads for the mixture estimation. The default is 50000, which is usually good enough. By sampling more, the ability to detect contaminants at low concentration can be improved at the cost of longer computation. .IP "--components num" Print the top .I num components of the mixture after estimation. By default, 25% more than the number of defined read groups, but at least 20 are printed. Setting this to a higher number may be a good idea if you're trying to reverse engineer a pipetting accident. .IP "-s, --single-index" Pretend there is only one index. This switch doesn't change the program's logic (it still pretends everything is doubly indexed), but it helps to make the output more readable if only one index was really sequenced. .IP "--pedantic" Be pedantic about read groups. Normally, .I jivebunny will assign reads to undeclared read groups by placing the names of the two most likely indices into the .I RG field. However, since it is not feasible to provide a header that declares all these potential read groups, the BAM file will technically be invalid. .I samtools will still happily filter on this field, though. If the .I --pedantic switch is set, these reads are not assigned to any read group. .IP "--verbose" Print progress reports during computation. The estimation process can be observed and a counter runs while reading or writing BAM files. .IP "--quiet" Don't print anything, not even the summary statistics. .IP "-h, -?, --help, --usage" Prints a short usage message and exits the program. .IP "-V, --version" Prints the version of biohazard used and exits the program. .SH FILES .SS Input Files All input files must be double index BAM files. Indices are stored in tagged fields where .IR XI " and " XJ contain the first and second ASCII codes index sequence (only A,C,G,T and N are allowed) and .IR YI " and " YJ contain the quality scores encoded as a string just like in FastaQ (ASCII codepoint of value plus 33). Single indexed BAM files should work in principle, but the output may be hard to interpret. .SS Output File The ouput is a single BAM file which is equivalent to the concatenated inputs with the following modifications: The header contains an additional .I @PG line and one .I @RG line for each read group. Each read gains the appropriate .I @RG field if a known read group can be assigned; otherwise the .I @RG field is deleted. The fields .IR @Z0 " and " @Z2 are deleted and the field .I @Z1 is added with a Phred scaled quality score for the assigned read group. (In other words, .I @Z1 is the probability that some other assignment is actually correct, expressed in deciban. If no read group could be assigned, this value may not be of much value.) .SS Read Group File The read group file is TAB separated table containg an optional header line starting with a hash mark ('#') and then one line per read group. The first field is the name of the read group, the second field is the name (not the sequence) of the first index, the third is the name of the second index. Further fields must have the form .I XY:val and are copied into the .I @RG header of the output. This facility can be used to assign libraries or samples to read groups for easier downstream processing. .SS Index Database The index database is a JSON file containing a single object with two fields named ``p7index'' and ``p5index''. Each of these is an object mapping index names to index sequences, the latter encoded as a string. It is permissible for multiple indices to have the same sequence. These will be treated as aliases when parsing read group files, only the first name is used when producing output. .SH AUTHOR Udo Stenzel .SH "SEE ALSO" .BR biohazard (7), bam (5), fastq (5), json (5) Kircher .I et.al (2012). Double indexing overcomes inaccuracies in multiplex sequencing on the Illumina platform. .IR "Nucleic Acids Research, 40" (1), e3. doi:10.1093/nar/gkr771