Quickstart¶
Installing the CAP¶
To install, download the repo and run the setup command:
python setup.py install
cap2 --help
Running the CAP¶
Set Up¶
After the CAP is installed it can be run on local data. There is no need to install additional programs or databases manually, as these will be installed automatically as needed.
To run the CAP you will need to make two files: a manifest file telling the CAP what data to analyze and a config file telling the CAP where to store its outputs.
- The manifest file should have three columns separated by tabs:
- The first column is the name of the sample.
- The second column is the path to the first read.
- The third column is the path to the second read.
An example manifest could look like this:
sample_1 /path/to/sample_1.R1.fq.gz /path/to/sample_1.R2.fq.gz
sample_2 /path/to/sample_2.R1.fq.gz /path/to/sample_2.R2.fq.gz
sample_3 /path/to/sample_3.R1.fq.gz /path/to/sample_3.R2.fq.gz
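For projects with many samples it can be convenient to generate the manifest programmatically rather than by hand. A minimal Python sketch (the sample names and file paths are hypothetical):

# build_manifest.py - write a tab-separated CAP2 manifest
samples = [
    ('sample_1', '/path/to/sample_1.R1.fq.gz', '/path/to/sample_1.R2.fq.gz'),
    ('sample_2', '/path/to/sample_2.R1.fq.gz', '/path/to/sample_2.R2.fq.gz'),
]
with open('manifest.txt', 'w') as f:
    for name, read1, read2 in samples:
        # one sample per line: name, first read, second read
        f.write(f'{name}\t{read1}\t{read2}\n')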
The config file should be a YAML file with two keys: out_dir and db_dir.
An example config could look like this:
out_dir: /path/to/output/dir
db_dir: /path/to/dir/where/databases/should/be/stored
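The config file can be written by hand or generated from Python; a minimal sketch, assuming the PyYAML package is available:

import yaml  # PyYAML, assumed to be installed

config = {
    'out_dir': '/path/to/output/dir',
    'db_dir': '/path/to/dir/where/databases/should/be/stored',
}
with open('config.yaml', 'w') as f:
    yaml.safe_dump(config, f)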
Once you have the manifest and config file, you can run the pipeline.
Running the Pipeline¶
To run the pipeline, use the following command:
cap2 run pipeline -c /path/to/config.yaml /path/to/manifest.txt
The CAP uses subpipelines to group modules. There are five such subpipelines: quality control, preprocessing, short read analysis, assembly, and databases. To run a specific subpipeline, add --stage <stage name> to the above command. The different stage names are listed when you run cap2 run pipeline --help
You can also run more than one task at the same time using --workers <number of simultaneous tasks>, or increase the number of threads per worker with --threads <num threads>.
An example of running the quality control pipeline with two workers that have four threads each would be:
cap2 run pipeline --stage qc --workers 2 --threads 4 -c /path/to/config.yaml /path/to/manifest.txt
See the Demo for more details.
Advanced Topics¶
Installing Subprograms¶
By default, the CAP installs all necessary subprograms itself. If a particular version of a subprogram is required, it is usually possible to override the defaults and supply the program as a parameter.
Databases¶
The CAP uses a number of large reference databases. Building and indexing these databases is considered a first-class part of the pipeline and modules are included to do so. However, in most cases users will simply want to download prebuilt versions of the databases rather than build them from scratch. This is the pipeline default and will happen automatically.
To preload databases on a machine, run this from the command line:
cap2 run db -c /path/to/config/file
or from Python:
from cap2.api import run_db_stage

config = '/path/to/config/file'  # path to the YAML config described above
run_db_stage(config_path=config, cores=1, workers=1)
Configuration¶
By default, CAP2 downloads all necessary programs and databases when it is run. For users running CAP2 multiple times on the same system, it is beneficial to set up configuration so that downloads only occur once.
Configuration consists of setting three environment variables. These should go in your .bashrc or equivalent.
CAP2_DB_DIR=<some local path...>
CAP2_CONDA_SPEC_DIR=<some local path...>
CAP2_CONDA_BASE_PATH=<some local path...>
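For example, in a .bashrc the variables would typically be exported so that cap2 inherits them (the paths below are placeholders):

export CAP2_DB_DIR=$HOME/cap2/databases
export CAP2_CONDA_SPEC_DIR=$HOME/cap2/conda_specs
export CAP2_CONDA_BASE_PATH=$HOME/cap2/conda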
You can also use a YAML configuration file. See the API for details and all options.
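A sketch of such a file, assuming the keys mirror the environment variables above (the key names are assumptions; check the API documentation for the exact option names):

# key names below are assumed, not confirmed by the API docs
db_dir: <some local path...>
conda_spec_dir: <some local path...>
conda_base_path: <some local path...>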
Running in the Cloud¶
Running the CAP2 in the cloud often requires some additional setup. This is what we needed to do to get the CAP2 running on DigitalOcean Ubuntu Servers:
sudo apt update
sudo apt install build-essential python-dev libfreetype6-dev pkg-config
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc
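# clone the CAP2 repository so the next step has a directory to enter
# (the GitHub URL is an assumption; adjust to your fork or source)
git clone https://github.com/MetaSUB/CAP2.git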
cd CAP2/
git checkout feat/single-ended-reads
python setup.py develop
cd
mkdir workdir
cd workdir
cap2 --help
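With a manifest and config file prepared as described above, the pipeline can then be run from the working directory:
cap2 run pipeline -c /path/to/config.yaml /path/to/manifest.txt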