Analyze human PacBio whole genome sequencing (WGS) data to produce sequence alignment and variant calls for a sample or cohort.
A workflow for analyzing human PacBio whole genome sequencing (WGS) data, written in Workflow Description Language (WDL). The human WGS workflow performs read alignment, variant calling, and phasing. Joint calling of small variants and structural variants for cohorts, and optional variant filtering and annotation, are also available.
Workflow Inputs
An input template file can be found here.
Templates for each of the cloud backends and the HPC with reference dataset input locations filled in can be found here:
Input | Description
---|---
cohort_id | A unique name for the cohort; used to name outputs
samples | The set of samples for the cohort. At least one sample must be defined. Each sample has the following fields:
phenotypes | Human Phenotype Ontology (HPO) phenotypes associated with the cohort. If no particular phenotypes are desired, the root HPO term,
reference_data | Files associated with the reference genome. These files are hosted publicly in each of the cloud backends.
slivar_data | Files associated with slivar. These files are hosted publicly in each of the cloud backends.
deepvariant_version | Version of DeepVariant to use [“1.5.0”]
deepvariant_model | Optional alternate DeepVariant model file to use
pbsv_call_mem_gb | Optionally set RAM (GB) for pbsv_call during cohort analysis
glnexus_mem_gb | Optionally set RAM (GB) for GLnexus during cohort analysis
run_tertiary_analysis | Run the optional tertiary analysis steps [false]
backend | Backend where the workflow will be executed [“Azure”, “AWS”, “GCP”, “HPC”]
zones | Zones where compute will take place; required if backend is set to ‘AWS’ or ‘GCP’.
aws_spot_queue_arn | Queue ARN for the spot batch queue; required if backend is set to ‘AWS’ and
aws_on_demand_queue_arn | Queue ARN for the on-demand batch queue; required if backend is set to ‘AWS’ and
container_registry | Container registry where workflow images are hosted. If left blank, PacBio’s public Quay.io registry will be used.
preemptible | If set to
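The constraints stated in the table (at least one sample, a fixed set of backends, and zones being required for AWS and GCP) can be checked before submitting a run. The following is a minimal sketch in Python, not part of the workflow itself. It assumes the inputs have been loaded into a flat dictionary keyed by the input names above; the real inputs JSON may nest or prefix these keys, and `check_inputs` is a hypothetical helper name.

```python
from typing import Any

# Backend values listed in the inputs table.
ALLOWED_BACKENDS = {"Azure", "AWS", "GCP", "HPC"}
# Backends for which the table says `zones` is required.
ZONE_BACKENDS = {"AWS", "GCP"}


def check_inputs(inputs: dict[str, Any]) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed.

    Assumption: `inputs` is a flat dict keyed by the input names in the table.
    """
    problems: list[str] = []

    # "At least one sample must be defined."
    if not inputs.get("samples"):
        problems.append("at least one sample must be defined")

    backend = inputs.get("backend")
    if backend not in ALLOWED_BACKENDS:
        problems.append(f"backend must be one of {sorted(ALLOWED_BACKENDS)}")
    elif backend in ZONE_BACKENDS and not inputs.get("zones"):
        # zones is "required if backend is set to 'AWS' or 'GCP'".
        problems.append(f"zones is required when backend is {backend}")

    return problems
```

A call such as `check_inputs({"samples": [], "backend": "GCP"})` reports both the missing samples and the missing zones. The sketch deliberately does not check the queue ARN inputs or `preemptible`, since their exact conditions are not fully stated here.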
Workflow Outputs
The set of workflow outputs depends on which analyses run. This is determined by the number of samples in the cohort and by whether options such as run_tertiary_analysis are set to true.
Output | Description
---|---
Sample analysis | These files will be output for each sample defined in the cohort.
Cohort analysis | These files will be output if the cohort includes more than one sample.
Tertiary analysis | These files will be output for each run of the workflow if
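The relationship between cohort size, options, and output groups can be sketched in a few lines of Python. This is an illustration only: `expected_output_groups` is a hypothetical helper, and it assumes that tertiary analysis outputs appear exactly when run_tertiary_analysis is true, which is how the inputs table describes that option.

```python
def expected_output_groups(num_samples: int, run_tertiary_analysis: bool = False) -> list[str]:
    """List the output groups a run is expected to produce.

    Assumption: tertiary outputs are produced whenever run_tertiary_analysis is true.
    """
    if num_samples < 1:
        raise ValueError("at least one sample must be defined")

    # Sample-level files are produced for every sample in the cohort.
    groups = ["Sample analysis"]

    # Cohort-level files require more than one sample.
    if num_samples > 1:
        groups.append("Cohort analysis")

    if run_tertiary_analysis:
        groups.append("Tertiary analysis")

    return groups
```

For a single sample with default options, only the sample analysis group is returned; a trio with tertiary analysis enabled returns all three groups.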
References
Reference datasets are hosted publicly for use in the pipeline. For data locations, see the backend-specific documentation and the template input files for each backend, which have the paths to the publicly hosted reference files filled in.
Containers
Docker image definitions used by this workflow can be found in the wdl-dockerfiles repository. Images are hosted in PacBio’s quay.io registry. Docker images used in the workflow are pinned to specific versions by referring to their digests rather than tags.
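A digest reference has the form `name@sha256:<64 hex characters>`, whereas a tag reference has the form `name:tag`. A tag can later be moved to a different image, while a digest identifies one exact image. The sketch below shows one way to check whether an image reference is digest-pinned when mirroring images to a custom container_registry. The helper name and the image names in the usage note are hypothetical, not images used by the workflow.

```python
import re

# A digest-pinned reference ends with "@sha256:" followed by 64 lowercase hex characters.
_DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """Return True if the image reference is pinned by a sha256 digest."""
    return _DIGEST_RE.search(image_ref) is not None
```

With this check, a reference like `quay.io/example/tool:1.0.0` is reported as not pinned, while `quay.io/example/tool@sha256:` followed by a full 64-character digest is reported as pinned.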