New in this version (0.8)
-------------------------
Mostly a maintenance release, but it does bring:
* Cabalized install
* Sparse mode completed and optimized
* More reasonable default parameters
Acquiring RBR
-------------
I'll try to keep a selection of source and binaries at:
http://www.ii.uib.no/~ketil/bioinformatics/downloads/software
If you need binaries for other architectures, drop me a mail at
<ketil@ii.uib.no>. The latest version should always be available from
my darcs repo:
darcs get http://www.ii.uib.no/~ketil/bioinformatics/repos/rbr
Installation instructions
-------------------------
I'm working on a smoother installation process, but this is how it
currently works.
You need GHC (http://haskell.org/ghc). Everything is tested against
version 6.8. Older versions might also work, but expect to have to
make some modifications to the code for them.
You first need to get my 'bio' library, which is available from the
same website, and install it. You can then use cabal to install the
binary. At the top level, do
chmod +x Setup.hs
./Setup.hs configure (add --prefix=$HOME if you don't have root access)
./Setup.hs build
{sudo} ./Setup.hs install
If you want to go the more manual route, cd to the src subdirectory and run
make rbr -- builds a dynamically linked executable
or: make rbr_s -- builds a statically linked executable
The main development platform is Linux/x86, so expect that to be the
best supported. In order to build on an aging Sun with an old gcc
(2.95), I had to comment out 'hooks.o' from the Makefile, and the
static build didn't work either. I'm investigating this, but perhaps it
suffices to have a current GCC available, and/or a newer Solaris.
Usage
-----
I've no real manual page yet, but 'rbr --help' should list the
available options. Basically, masking is determined by examining the
frequencies of words of a certain length (-k), estimating a
distribution around the "modal interval" of the word frequencies with
a certain stringency (-s), and masking words whose frequency exceeds
the mean of this distribution by a certain number of standard
deviations (-t). Defaults are -k 16 -s 1.1 -t 5.0 if -L (lower case
masking) is specified and -k 16 -s 2.0 -t 8.0 if -n (masking with 'n')
is specified. Lower case is now the default.
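As a rough illustration of the procedure just described, here is a
minimal Python sketch. It is not RBR's actual (Haskell) code. The
function names are made up for the example, reverse complements are
ignored, and the way the "modal interval" is derived from the
stringency is an assumption: the sketch simply keeps all word
frequencies within a factor s of the most common frequency.

```python
from collections import Counter
from statistics import mean, pstdev


def word_counts(seqs, k):
    """Count every overlapping word of length k across all sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def mask_threshold(counts, s, t):
    """Frequency above which a word gets masked.

    ASSUMPTION: the modal interval is every frequency within a factor s
    of the most common frequency. RBR may well define it differently.
    """
    freqs = list(counts.values())
    mode = Counter(freqs).most_common(1)[0][0]
    modal = [f for f in freqs if mode / s <= f <= mode * s]
    # Mean of the estimated distribution plus t standard deviations.
    return mean(modal) + t * pstdev(modal)


def mask_lower(seq, counts, k, threshold):
    """Lower-case every base covered by a word whose count exceeds threshold."""
    masked = [False] * len(seq)
    for i in range(len(seq) - k + 1):
        if counts[seq[i:i + k].upper()] > threshold:
            for j in range(i, i + k):
                masked[j] = True
    return "".join(c.lower() if m else c.upper() for c, m in zip(seq, masked))
```

With the three toy sequences AAACGTCC, GGACGTTT and CTACGTGA, k = 4,
s = 1.1 and t = 5.0, only the word ACGT occurs more than once, so the
first sequence comes back as AAacgtCC.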
To mask more aggressively, you can reduce the stringency, use a lower
deviation, or both. Do the opposite if you want more conservative
masking. In general, the differences are small, and typically you can
compensate for a decrease in one parameter with an increase in the
other.
A shorter word length will be more tolerant of SNPs and read errors,
but will increase the variance. A longer word length will be less
tolerant, but will have less variance. In addition, word lengths
beyond 16 will be slower.
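One way to see the tolerance trade-off: a single substitution (a SNP
or a read error) changes every overlapping word that covers it, which
is up to k words. The small Python helper below (a hypothetical name,
not part of RBR) counts how many words of length k contain a given
position.

```python
def words_hit_by_substitution(seq_len, pos, k):
    """Number of overlapping k-words that contain position pos (0-based)."""
    first = max(0, pos - k + 1)      # leftmost word start that covers pos
    last = min(pos, seq_len - k)     # rightmost word start that covers pos
    return max(0, last - first + 1)
```

For a 100-base read with an error in the middle, 16 of the 85 words of
length 16 are affected, while with k = 32 it is 32 of only 69 words.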
There's also a --sparse=X option that will store a fraction (but at
least every X'th) of the words. This will reduce memory consumption
proportionally.
RBR's memory usage can be limited with options to the run time
system's garbage collector. A good rule of thumb may be to limit it
to 80-90% of available physical memory, which will avoid paging to
disk. If RBR is compiled with hooks.o linked in, this will be the
default, but if other behaviour is desired, you can use "+RTS -MxxxM
-RTS" to limit heap use to xxxMB¹. See the GHC documentation
(http://haskell.org/ghc) for more on this. Usually, you'll get better
performance by supplying -HxxxM as well (this will reduce GC time,
again see the GHC docs).
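If you'd rather compute the limit than guess it, the rule of thumb
above can be sketched in Python as follows. The helper names are
hypothetical, the sysconf keys are the ones available on Linux, and
megabytes are taken as 2^20 bytes, which is an assumption about how
the run time system reads the M suffix. The -H value is whatever you
choose to pass in, as no particular value is recommended here.

```python
import os


def physical_memory_bytes():
    """Total physical memory, using sysconf names available on Linux."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def rts_flags(total_bytes, fraction=0.85, hint_mb=None):
    """Build GHC RTS options capping the heap at a fraction of total memory."""
    limit_mb = int(total_bytes * fraction) // (1024 * 1024)
    flags = ["+RTS", f"-M{limit_mb}M"]      # maximum heap size
    if hint_mb is not None:
        flags.append(f"-H{hint_mb}M")       # suggested heap size
    flags.append("-RTS")
    return flags
```

On a Linux machine, " ".join(rts_flags(physical_memory_bytes())) gives
a string you can append to the rbr command line.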
The -v option gives some feedback while RBR runs, which is nice if
you're using it interactively.
There is also a server mode, where RBR will index a data set, and
listen on stdin for sequence names, and answer on stdout with the
original sequence, the masked sequence, and the distribution of word
frequencies along the sequence.
¹) GHC version 6.6 and earlier had a bug that would cause memory
consumption to be measured incorrectly if the system allocated it in
an unusual order. The fix will be in subsequent releases of GHC.