Quickstart

Installing the CAP

To install the CAP, download the repo and run the setup command:

python setup.py install
cap2 --help

Running the CAP

Set Up

After the CAP is installed it can be run on local data. There is no need to install additional programs or databases, as these are installed automatically as needed.

To run the CAP you will need to make two files: a manifest file telling the CAP what data to analyze, and a config file telling the CAP where to store its outputs.

The manifest file should have three columns separated by tabs:
  • The first column is the name of the sample
  • The second column is the path to the first read
  • The third column is the path to the second read

An example manifest could look like this:

sample_1    /path/to/sample_1.R1.fq.gz  /path/to/sample_1.R2.fq.gz
sample_2    /path/to/sample_2.R1.fq.gz  /path/to/sample_2.R2.fq.gz
sample_3    /path/to/sample_3.R1.fq.gz  /path/to/sample_3.R2.fq.gz
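
For projects with many samples it may be easier to generate the manifest than to write it by hand. The following is a minimal Python sketch and is not part of the CAP itself. The function name write_manifest and the .R1.fq.gz / .R2.fq.gz suffixes are assumptions taken from the example file names above, so adjust them to match your data.

```python
import csv
from pathlib import Path


def write_manifest(read_dir, manifest_path,
                   r1_suffix=".R1.fq.gz", r2_suffix=".R2.fq.gz"):
    """Write a three-column, tab-separated manifest for paired-end reads.

    The default suffixes are assumptions based on the example manifest.
    """
    read_dir = Path(read_dir)
    rows = []
    for r1 in sorted(read_dir.glob(f"*{r1_suffix}")):
        # Sample name is the file name with the first-read suffix removed
        sample_name = r1.name[: -len(r1_suffix)]
        r2 = read_dir / f"{sample_name}{r2_suffix}"
        if not r2.exists():
            # Skip samples that have no second-read file
            continue
        rows.append([sample_name, str(r1), str(r2)])
    with open(manifest_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)
    return rows
```

Calling write_manifest("/path/to/reads", "/path/to/manifest.txt") would write one line per sample that has both read files present.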

The config file should be a YAML file with two keys, out_dir and db_dir.

An example config could look like this:
out_dir: /path/to/output/dir
db_dir: /path/to/dir/where/databases/should/be/stored
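
If the config is created by a script, it can be written with plain string formatting. This is a minimal sketch under stated assumptions: write_config is a hypothetical helper, not a CAP function, and it only handles simple paths that contain no characters with special meaning in YAML (such as ':' or '#').

```python
from pathlib import Path


def write_config(config_path, out_dir, db_dir):
    """Write a minimal YAML config containing the out_dir and db_dir keys."""
    # Plain formatting avoids a YAML library dependency; it is only safe
    # for simple paths without YAML special characters.
    text = f"out_dir: {out_dir}\ndb_dir: {db_dir}\n"
    Path(config_path).write_text(text)
    return text
```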

Once you have the manifest and config file you can run the pipeline.

Running the Pipeline

To run the pipeline use the following command:

cap2 run pipeline -c /path/to/config.yaml /path/to/manifest.txt

The CAP uses subpipelines to group modules. There are five such subpipelines: quality control, preprocessing, short read analysis, assembly, and databases. To run a specific subpipeline add --stage <stage name> to the above command. The available stage names are listed when you run cap2 run pipeline --help.

You can also run more than one task at the same time using --workers <number of simultaneous tasks>, or increase the number of threads per worker with --threads <num threads>.

An example of running the quality control pipeline with two workers that have four threads each would be:

cap2 run pipeline --stage qc --workers 2 --threads 4 -c /path/to/config.yaml /path/to/manifest.txt
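
When the pipeline is launched from a script, the same command can be assembled as an argument list. The sketch below is an illustration rather than part of the CAP; build_cap2_command and run_cap2 are hypothetical helper names, and only the flags shown above (--stage, --workers, --threads, -c) are used.

```python
import subprocess


def build_cap2_command(config_path, manifest_path,
                       stage=None, workers=None, threads=None):
    """Assemble the argument list for `cap2 run pipeline`."""
    cmd = ["cap2", "run", "pipeline"]
    # Optional flags are only added when a value is given
    if stage is not None:
        cmd += ["--stage", str(stage)]
    if workers is not None:
        cmd += ["--workers", str(workers)]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    cmd += ["-c", str(config_path), str(manifest_path)]
    return cmd


def run_cap2(*args, **kwargs):
    """Run the assembled command; requires cap2 to be installed and on PATH."""
    # check=True raises CalledProcessError if cap2 exits with a non-zero status
    return subprocess.run(build_cap2_command(*args, **kwargs), check=True)
```

Passing the command as a list avoids shell quoting problems with paths that contain spaces.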

See the Demo for more details.

Advanced Topics

Installing Subprograms

By default the CAP installs all necessary subprograms itself. If a particular version of a subprogram is required it is usually possible to override the defaults and supply the program as a parameter.

Databases

The CAP uses a number of large reference databases. Building and indexing these databases is considered a first-class part of the pipeline, and modules are included to do so. However, in most cases users will simply want to download prebuilt versions of the databases rather than build them from scratch. This is the pipeline default and will happen automatically.

To preload databases on a machine run this from the command line:

cap2 run db -c /path/to/config/file

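or from Python:

# Assumes the installed package is importable as cap2;
# a relative import (from .api) only works inside the package itself.
from cap2.api import run_db_stage

config = '/path/to/config/file'  # the path must be a quoted string
run_db_stage(config_path=config, cores=1, workers=1)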

Running Tests

Unit tests can be run with pytest from the repo:

python -m pytest tests

Configuration

By default CAP2 downloads all necessary programs and databases when it is run. For users running CAP2 multiple times on the same system, it is beneficial to set up configuration so that downloads occur only once.

Configuration consists of setting three environment variables. These should go in your .bashrc or equivalent.

export CAP2_DB_DIR=<some local path...>
export CAP2_CONDA_SPEC_DIR=<some local path...>
export CAP2_CONDA_BASE_PATH=<some local path...>
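
Before a long run it can be useful to confirm that these variables are visible to the process. This is a small sketch, not a CAP feature; missing_cap2_env_vars is a hypothetical helper that only checks the three names listed above.

```python
import os

CAP2_ENV_VARS = ("CAP2_DB_DIR", "CAP2_CONDA_SPEC_DIR", "CAP2_CONDA_BASE_PATH")


def missing_cap2_env_vars(environ=None):
    """Return the names of the CAP2 variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in CAP2_ENV_VARS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_cap2_env_vars()
    if missing:
        print("Not set: " + ", ".join(missing))
    else:
        print("All CAP2 environment variables are set")
```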

You can also use a YAML configuration file. See the API for details and all options.

Running in the cloud

Running CAP2 in the cloud often requires some additional setup. This is what we needed to do to get CAP2 running on DigitalOcean Ubuntu servers:

sudo apt update
sudo apt install build-essential python-dev libfreetype6-dev pkg-config
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc
cd CAP2/
git checkout feat/single-ended-reads
python setup.py develop
cd
mkdir workdir
cd workdir
cap2 --help