clustertools: Tools for manipulating sequence clusters


This is a bunch of stuff I needed at some point for manipulating sequence clusters. See the README for details. The tools included are:

filter - remove unwanted sequences from a clustering
hist - produce a histogram of cluster sizes from a "label"-formatted clustering.
clusc - compare clusterings, calculating numerous pair-based and entropy based indices.
add_single - add singletons to a clustering.
ace2contigs - parse an ACE assembly file, and output the contigs in a FASTA file.
ace2fasta - parse an ACE assembly, and output each assembly in a separate FASTA file
ace2clusters - parse an ACE assembly, and output clusters in TGICL format
clusterlibs - given a table of regular expressions and library names, along with a clustering (TGICL-format), output a table of cluster sizes per library.
xcerpt - extract sequences from a list of sequence labels.

The Darcs repository is at: http://malde.org/~ketil/biohaskell/cluster_tools.

Maintainer's Corner

Package maintainers

GwernBranwen, KetilMalde

Versions	0.1, 0.1.1, 0.1.2, 0.1.5
Dependencies	base (>=4 && <5), bio (>=0.4), bytestring, containers, QuickCheck, regex-compat, simpleargs (>=0.1)
Tested with	ghc ==6.10.4
License	LicenseRef-GPL
Author	Ketil Malde
Maintainer	Ketil Malde <ketil@malde.org>
Uploaded	by KetilMalde at 2011-06-06T13:02:03Z
Category	Bioinformatics
Home page	http://malde.org/~ketil/
Reverse Dependencies	1 direct, 0 indirect
Executables	xcerpt, clusterlibs, ace2clusters, ace2fasta, ace2contigs, add_single, clusc, filter
Downloads	3802 total (5 in the last 30 days)
Status	Docs not available. All reported builds failed as of 2016-11-10.

Readme for clustertools-0.1.5

To build these tools, you will need a Haskell compiler (the most likely
candidate being GHC), with my bioinformatics library and the SimpleArgs
module installed (downloadable from <http://malde.org/~ketil/biohaskell/>).

This package contains the following tools:

filter - remove unwanted sequences from a clustering
         Usage: filter seq.list < cluster.L > cluster2.L
         cluster2.L will only contain sequence labels found in seq.list
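
To illustrate the idea behind filter, here is a minimal sketch. It is not
the package's actual source. It assumes that a "label"-formatted clustering
has one cluster per line with whitespace-separated sequence labels, and it
chooses to drop clusters that end up empty; both points are assumptions,
and filterClustering is a hypothetical name.

```haskell
import qualified Data.Set as Set
import System.Environment (getArgs)

-- ASSUMPTION: one cluster per line, labels separated by whitespace.
-- Keep only labels found in the allowed set; clusters left empty are
-- dropped (a choice made for this sketch, not confirmed by the README).
filterClustering :: Set.Set String -> String -> String
filterClustering keep =
  unlines
    . filter (not . null)
    . map (unwords . filter (`Set.member` keep) . words)
    . lines

-- Mirrors the documented usage: filter seq.list < cluster.L > cluster2.L
main :: IO ()
main = do
  [listFile] <- getArgs
  keep <- Set.fromList . words <$> readFile listFile
  interact (filterClustering keep)
```

Holding the labels from seq.list in a set keeps each membership test
logarithmic in the size of the list.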

hist   - produce a histogram of cluster sizes from a "label"-formatted
         clustering.
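
As a rough sketch of what a cluster-size histogram involves (not the
package's code), the following counts how many clusters have each size. The
one-cluster-per-line input layout and the tab-separated output are
assumptions, and sizeHistogram is a hypothetical name.

```haskell
import qualified Data.Map.Strict as Map

-- ASSUMPTION: one cluster per line, labels separated by whitespace.
-- Maps each cluster size to the number of clusters of that size.
sizeHistogram :: String -> Map.Map Int Int
sizeHistogram input =
  Map.fromListWith (+) [ (length (words l), 1) | l <- lines input ]

-- Reads a clustering from standard input and prints "size<TAB>count"
-- rows in ascending order of size (output layout is an assumption).
main :: IO ()
main = do
  input <- getContents
  mapM_ (\(size, n) -> putStrLn (show size ++ "\t" ++ show n))
        (Map.toAscList (sizeHistogram input))
```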

clusc  - compare clusterings, calculating numerous pair-based and
         entropy based indices.
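
The README does not say which indices clusc reports. As an illustration of
what "pair-based" means, the sketch below computes two well-known
pair-counting indices, the Rand index and the Jaccard index. It assumes
both clusterings cover the same set of labels; all names are hypothetical,
and the entropy-based side is not shown.

```haskell
import qualified Data.Map as Map
import Data.List (tails)

type Clustering = [[String]]

-- Map each label to the index of the cluster that contains it.
assignment :: Clustering -> Map.Map String Int
assignment cs = Map.fromList [ (l, i) | (i, c) <- zip [0 ..] cs, l <- c ]

-- Classify every unordered pair of labels by whether the two labels share
-- a cluster in A and in B: (both, only A, only B, neither).
-- ASSUMPTION: both clusterings contain exactly the same labels.
pairCounts :: Clustering -> Clustering -> (Int, Int, Int, Int)
pairCounts a b = foldr tally (0, 0, 0, 0) pairs
  where
    ma = assignment a
    mb = assignment b
    labels = Map.keys ma
    pairs = [ (x, y) | (x : rest) <- tails labels, y <- rest ]
    same m x y = Map.lookup x m == Map.lookup y m
    tally (x, y) (n11, n10, n01, n00) =
      case (same ma x y, same mb x y) of
        (True, True)   -> (n11 + 1, n10, n01, n00)
        (True, False)  -> (n11, n10 + 1, n01, n00)
        (False, True)  -> (n11, n10, n01 + 1, n00)
        (False, False) -> (n11, n10, n01, n00 + 1)

-- Rand index: fraction of pairs on which the two clusterings agree.
randIndex :: Clustering -> Clustering -> Double
randIndex a b = fromIntegral (n11 + n00) / fromIntegral (n11 + n10 + n01 + n00)
  where (n11, n10, n01, n00) = pairCounts a b

-- Jaccard index: like Rand, but ignoring pairs separated in both.
jaccard :: Clustering -> Clustering -> Double
jaccard a b = fromIntegral n11 / fromIntegral (n11 + n10 + n01)
  where (n11, n10, n01, _) = pairCounts a b

main :: IO ()
main = do
  let a = [["s1", "s2", "s3"], ["s4"]]
      b = [["s1", "s2"], ["s3", "s4"]]
  print (pairCounts a b) -- (1,2,1,2)
  print (randIndex a b)  -- 0.5
  print (jaccard a b)    -- 0.25
```

Enumerating all pairs is quadratic in the number of labels, which is fine
for a demonstration but too slow for large clusterings.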

xcerpt - given a file containing a list of sequence labels (e.g. a
         "label" formatted clustering), extract matching sequences
         from a FASTA file.  Like "agrep -d '^>'" without the bugs.

         Usage: xcerpt list.txt fasta.seq
         creates "fasta.seq.match" and "fasta.seq.rest"
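
The splitting step can be sketched as follows. This is a simplified
illustration, not xcerpt's source: it reads the whole FASTA file as a
String, and it assumes that a sequence label is the first
whitespace-delimited word after '>' on the header line. The function names
are hypothetical.

```haskell
import qualified Data.Set as Set
import Data.List (partition)
import System.Environment (getArgs)

-- Split FASTA text into records, each beginning with a '>' header line.
-- Any text before the first header is ignored.
records :: String -> [[String]]
records = go . dropWhile (not . isHeader) . lines
  where
    isHeader l = take 1 l == ">"
    go [] = []
    go (h : rest) =
      let (body, more) = break isHeader rest
      in (h : body) : go more

-- ASSUMPTION: the label is the first word after '>' on the header line.
label :: [String] -> String
label (h : _) = takeWhile (not . (`elem` " \t")) (drop 1 h)
label []      = ""

-- Returns (records whose label is wanted, all remaining records).
splitFasta :: Set.Set String -> String -> (String, String)
splitFasta wanted fasta = (render hits, render rest)
  where
    (hits, rest) = partition ((`Set.member` wanted) . label) (records fasta)
    render = unlines . concat

-- Mirrors the documented usage: xcerpt list.txt fasta.seq
main :: IO ()
main = do
  [listFile, fastaFile] <- getArgs
  wanted <- Set.fromList . words <$> readFile listFile
  (hits, rest) <- splitFasta wanted <$> readFile fastaFile
  writeFile (fastaFile ++ ".match") hits
  writeFile (fastaFile ++ ".rest") rest
```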

add_single - add singletons to a clustering.
        Usage: add_single all.L clustering.L
        creates clustering.L_s listing all sequences in all.L but not in
        clustering.L, one per line.
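
The description above amounts to a set difference. A minimal sketch
(hypothetical names, not the package's code), assuming that labels in both
files are separated by whitespace or newlines:

```haskell
import qualified Data.Set as Set
import System.Environment (getArgs)

-- Labels that occur in the full list but in no cluster, in their
-- original order.
singletons :: String -> String -> [String]
singletons allLabels clustering =
  filter (`Set.notMember` clustered) (words allLabels)
  where
    clustered = Set.fromList (words clustering)

-- Mirrors the documented usage: add_single all.L clustering.L
-- Writes the missing labels, one per line, to clustering.L_s.
main :: IO ()
main = do
  [allFile, clusterFile] <- getArgs
  everything <- readFile allFile
  clustering <- readFile clusterFile
  writeFile (clusterFile ++ "_s") (unlines (singletons everything clustering))
```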

ace2contigs - parse an ACE assembly file, and output the contigs in a
        FASTA file (named by tacking on .fasta to the ACE file name),
        and the corresponding quality information (.qual).

ace2fasta - parse an ACE assembly, and output each assembly in a separate
        FASTA formatted file, with the necessary gaps inserted to align the
        sequences (suitable for import into e.g. Seaview)

ace2clusters - parse an ACE assembly, and output clusters composed of the
        sequences used for each contig.  The format is similar to TGICL's,
        with cluster output as one line consisting of a '>' and the contig name,
        and the next line containing the names of the sequences that comprise
        the cluster.
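
A rough sketch of how such output could be derived (not the tool's actual
implementation): in an ACE file each contig starts with a "CO <name> ..."
line, and each read assigned to it is listed on an "AF <read> ..." line.
The code below collects those and prints them in the two-line layout
described above. Separating the read names with single spaces is an
assumption, and the function names are hypothetical.

```haskell
-- Collect (contig name, read names) from the CO and AF lines of ACE text.
-- All other lines (consensus sequence, BQ, BS, RD, ...) are skipped.
clusters :: String -> [(String, [String])]
clusters = go . map words . lines
  where
    go [] = []
    go (("CO" : name : _) : rest) =
      let (body, more) = break isContig rest
      in (name, [ r | ("AF" : r : _) <- body ]) : go more
    go (_ : rest) = go rest
    isContig ("CO" : _) = True
    isContig _          = False

-- One '>' line with the contig name, then one line with its read names.
-- ASSUMPTION: read names are separated by single spaces.
render :: [(String, [String])] -> String
render cs =
  unlines (concat [ ['>' : name, unwords members] | (name, members) <- cs ])

main :: IO ()
main = interact (render . clusters)
```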

clusterlibs - given a table of regular expressions and library names,
        along with a clustering (TGICL-format), output a table of clusters
        with the library name prepended to the sequences.