Analyze targeted HiFi sequence datasets using PacBio read data. Call and phase small and structural variants.
Workflow for performing target enrichment analysis on PacBio HiFi data, written in Workflow Description Language (WDL). At minimum, the target enrichment workflow runs demultiplexing, duplicate marking, alignment to the reference, small variant calling using DeepVariant, and structural variant (SV) calling using pbsv. It also phases and haplotags samples. Cohort analysis, QC, HS metrics, and PharmCAT steps are optional.
Workflow Inputs
The HifiTargetEnrichment FAQ section has details about file formats.
Input | Description |
---|---|
batch_id | Batch name; used for naming files. |
hifi_reads | HiFi reads in BAM format. |
barcode_sample_map | FASTA file containing forward and reverse barcode sequences; used to demultiplex reads. |
target_regions_bed | BED file specifying the coordinates of the regions of interest. |
reference | Files associated with the reference genome. |
run_cohort_analysis | Run optional cohort analysis steps. |
run_qc | Run optional QC steps. |
qc_low_coverage | Low coverage cutoff for QC [10] |
probes_bed | BED file specifying the coordinates of the probes used to prepare the target capture library. If you do not have access to the probes BED file, the file used for target_regions_bed may be used in its place. If this file is specified, the HS metrics workflow will run. |
picard_sample_size | Sample size for Picard CollectHsMetrics; the sample size used for Theoretical Het Sensitivity sampling. [1000] |
picard_near_distance | Near distance cutoff for Picard CollectHsMetrics; the maximum distance between a read and the nearest probe/bait/amplicon for the read to be considered ‘near probe’ and included in the percent selected. [5000] |
run_pharmcat | Run optional PharmCAT and pangu_cyp2d6 steps. |
pharmcat_positions | VCF file and index specifying PharmCAT positions; required if run_pharmcat is set to true. |
pharmcat_min_coverage | Minimum coverage cutoff used to filter the preprocessed VCF passed to PharmCAT [10] |
deepvariant_version | Version of DeepVariant to use [1.4.0] |
deepvariant_model | Optional alternate DeepVariant model file to use. |
backend | Backend where the workflow will be executed [“Azure”, “AWS”, “GCP”, “HPC”] |
zones | Zones where compute will take place; required if backend is set to ‘AWS’ or ‘GCP’. |
aws_spot_queue_arn | Queue ARN for the spot batch queue; required if backend is set to ‘AWS’ and |
aws_on_demand_queue_arn | Queue ARN for the on demand batch queue; required if backend is set to ‘AWS’ and |
container_registry | Container registry where workflow images are hosted. If left blank, PacBio’s public Quay.io registry will be used. |
preemptible | If set to true, run tasks preemptibly where possible. On-demand VMs will be used only for tasks that run for >24 hours if the backend is set to GCP. If set to false, on-demand VMs will be used for every task. Ignored if backend is set to HPC. |
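
The conditional requirements in the table can be checked before a run is submitted. The following is a minimal Python sketch and is not part of the workflow. The function name `check_inputs` and the flat dictionary layout are hypothetical (a real inputs JSON would typically prefix each key with the workflow name), optional flags are assumed to default to false when absent, and only the rules that the table states in full are encoded.

```python
ALLOWED_BACKENDS = {"Azure", "AWS", "GCP", "HPC"}


def check_inputs(inputs: dict) -> list:
    """Return a list of problems found; an empty list means none were found."""
    problems = []

    backend = inputs.get("backend")
    if backend not in ALLOWED_BACKENDS:
        problems.append(f"backend must be one of {sorted(ALLOWED_BACKENDS)}")

    # zones is required if backend is set to 'AWS' or 'GCP'.
    if backend in {"AWS", "GCP"} and not inputs.get("zones"):
        problems.append("zones is required when backend is AWS or GCP")

    # pharmcat_positions is required if run_pharmcat is set to true.
    # Assumption: run_pharmcat is treated as false when it is absent.
    if inputs.get("run_pharmcat", False) and not inputs.get("pharmcat_positions"):
        problems.append("pharmcat_positions is required when run_pharmcat is true")

    return problems


# Hypothetical example values, chosen only to trigger two of the checks.
example = {
    "batch_id": "batch01",
    "backend": "GCP",
    "run_pharmcat": True,
}
for problem in check_inputs(example):
    print(problem)
```

A check like this only catches missing or inconsistent values; it does not verify that the referenced files exist or are in the expected format.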
Workflow Outputs
The set of workflow outputs depends on which analyses are run. This is determined by the inputs that are provided and by whether options such as run_cohort_analysis, run_qc, and run_pharmcat are set to true.
Output | Description |
---|---|
demultiplex_failed_samples | A file listing samples that failed demultiplexing. |
Sample analysis | Sample analysis outputs are produced for each demultiplexed sample generated from the input HiFi reads. |
Cohort analysis | Cohort analysis outputs will be produced if the input |
Quality control (QC) | QC outputs will be produced if the input |
Hybrid selection (HS) metrics | HS metrics outputs will be generated if the input |
PharmCAT | PharmCAT outputs will be produced if the input |
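
The relationship between inputs and output groups can be sketched in Python. This is an illustration only: the function name `expected_output_groups` is hypothetical, the mapping from each option to its output group is inferred from the input descriptions (for example, HS metrics running when probes_bed is specified), and it assumes that demultiplex_failed_samples and the sample analysis outputs are always produced.

```python
def expected_output_groups(inputs: dict) -> list:
    """List the output groups a run would be expected to produce (inferred mapping)."""
    # Assumption: these two are produced on every run.
    groups = ["demultiplex_failed_samples", "Sample analysis"]

    if inputs.get("run_cohort_analysis", False):
        groups.append("Cohort analysis")
    if inputs.get("run_qc", False):
        groups.append("Quality control (QC)")
    # The HS metrics workflow runs when a probes_bed file is specified.
    if inputs.get("probes_bed"):
        groups.append("Hybrid selection (HS) metrics")
    if inputs.get("run_pharmcat", False):
        groups.append("PharmCAT")

    return groups


# Hypothetical inputs: QC enabled and a probes BED file supplied.
print(expected_output_groups({"run_qc": True, "probes_bed": "probes.bed"}))
```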
References
Reference datasets are hosted publicly for use in the pipeline.
Containers
Docker image definitions used by this workflow can be found in the wdl-dockerfiles repository. Images are hosted in PacBio’s quay.io. Docker images used in the workflow are pinned to specific versions by referring to their digests rather than their tags.
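
To illustrate the difference between a digest reference and a tag reference, the Python sketch below checks whether an image reference ends in a `sha256` digest. The function name `is_digest_pinned` and the example image names are hypothetical; the check relies only on the standard `name@sha256:<64 hex characters>` form of a digest reference.

```python
import re

# A digest-pinned reference ends with "@sha256:" followed by 64 lowercase hex characters.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """Return True if the image reference is pinned by a sha256 digest."""
    return DIGEST_RE.search(image_ref) is not None


# Hypothetical references: the first is pinned by digest, the second uses a mutable tag.
pinned = "quay.io/example/tool@sha256:" + "a" * 64
tagged = "quay.io/example/tool:1.0"
print(is_digest_pinned(pinned), is_digest_pinned(tagged))
```

A tag can be moved to point at a different image later, whereas a digest identifies one specific image, which is why digest references make runs more reproducible.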
The Docker image used by a particular step of the workflow can be identified by looking at the docker key in the runtime block for the given task.
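
As a rough aid for that lookup, the Python sketch below collects the value of the first `docker` key that follows each task declaration in a WDL file. It is a simple regular-expression scan and not a WDL parser: it does not confirm that the key sits inside the runtime block, and the function name, task names, and image references in the sample text are hypothetical.

```python
import re

TASK_RE = re.compile(r"^\s*task\s+(\w+)\s*\{", re.MULTILINE)
DOCKER_RE = re.compile(r"^\s*docker\s*:\s*(.+?)\s*$", re.MULTILINE)


def docker_images_by_task(wdl_text: str) -> dict:
    """Map each task name to the raw expression given for its docker key."""
    images = {}
    tasks = list(TASK_RE.finditer(wdl_text))
    for i, match in enumerate(tasks):
        # A task's text runs from its declaration to the next task (or end of file).
        end = tasks[i + 1].start() if i + 1 < len(tasks) else len(wdl_text)
        docker = DOCKER_RE.search(wdl_text[match.end():end])
        if docker:
            images[match.group(1)] = docker.group(1)
    return images


# Hypothetical WDL fragment used only to exercise the scan.
SAMPLE_WDL = '''
task align {
    command <<< echo align >>>
    runtime {
        docker: "quay.io/example/aligner@sha256:0123"
        cpu: 4
    }
}

task call_variants {
    command <<< echo call >>>
    runtime {
        docker: "quay.io/example/caller@sha256:4567"
    }
}
'''

for task, image in docker_images_by_task(SAMPLE_WDL).items():
    print(task, image)
```

The value is returned as written in the file, including quotes, because it may be an interpolated expression and not a literal string.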