HiFi Target Enrichment

Analyze targeted HiFi sequence datasets using PacBio read data. Call and phase small and structural variants.

This workflow, written in Workflow Description Language (WDL), analyzes target-enriched PacBio HiFi data. At minimum, it runs demultiplexing, duplicate marking, alignment to the reference, small variant calling using DeepVariant, and structural variant (SV) calling using pbsv. It also phases and haplotags each sample. Cohort analysis, QC, HS metrics, and PharmCAT steps are optional.

Workflow Inputs

The HifiTargetEnrichment FAQ section has details about file formats.

Input Description

batch_id

Batch name; used for naming files

hifi_reads

HiFi reads in BAM format.

barcode_sample_map

FASTA file containing forward and reverse barcode sequences; used to demultiplex reads.

target_regions_bed

BED file specifying the coordinates of the regions of interest.

reference

Files associated with the reference genome.

name	Reference name; used to name outputs
fasta	Reference genome and index to align reads to
chromosome_lengths	File specifying the lengths of each of the reference chromosomes
tandem_repeat_bed	Tandem repeat locations in the reference genome
exons_bed	BED file specifying reference exon locations

run_cohort_analysis

Run optional cohort analysis steps

run_qc

Run optional QC steps

qc_low_coverage

Low coverage cutoff for QC [10]

probes_bed

BED file specifying the coordinates for the probes used to prepare the target capture library. The same file used for target_regions_bed may be used in place of the probes_bed if you do not have access to the probes_bed. If this file is specified, the HS metrics workflow will run.

picard_sample_size

Sample size for Picard CollectHsMetrics; the sample size used for Theoretical Het Sensitivity sampling. [1000]

picard_near_distance

Near distance cutoff for Picard CollectHsMetrics; the maximum distance between a read and the nearest probe/bait/amplicon for the read to be considered ‘near probe’ and included in the percent selected. [5000]

run_pharmcat

Run optional PharmCAT and pangu_cyp2d6 steps

pharmcat_positions

VCF file and index specifying PharmCAT positions; required if run_pharmcat is set to true.

pharmcat_min_coverage

Minimum coverage cutoff used to filter the preprocessed VCF passed to PharmCAT [10]

deepvariant_version

Version of DeepVariant to use [1.4.0]

deepvariant_model

Optional alternate DeepVariant model file to use

backend

Backend where the workflow will be executed [“Azure”, “AWS”, “GCP”, “HPC”]

zones

Zones where compute will take place; required if backend is set to ‘AWS’ or ‘GCP’.

aws_spot_queue_arn

Queue ARN for the spot batch queue; required if backend is set to ‘AWS’ and preemptible is set to true

aws_on_demand_queue_arn

Queue ARN for the on demand batch queue; required if backend is set to ‘AWS’ and preemptible is set to false

container_registry

Container registry where workflow images are hosted. If left blank, PacBio’s public Quay.io registry will be used.

preemptible

If set to true, run tasks preemptibly where possible. If the backend is set to GCP, on-demand VMs will then be used only for tasks that run for more than 24 hours. If set to false, on-demand VMs will be used for every task. Ignored if backend is set to HPC.
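
Several of the inputs above are required only in combination with others (zones, the AWS queue ARNs, pharmcat_positions). The sketch below shows one way to check those combinations before submitting a run. It is an illustration, not part of the workflow: the function name check_inputs and the use of bare input names as dictionary keys are assumptions. In an actual inputs JSON, keys are usually prefixed with the workflow name.

```python
# Hypothetical pre-submission check; only encodes the conditional
# requirements stated in the input descriptions above.
VALID_BACKENDS = {"Azure", "AWS", "GCP", "HPC"}


def check_inputs(inputs):
    """Return a list of human-readable problems found in an inputs dict."""
    problems = []
    backend = inputs.get("backend")

    if backend not in VALID_BACKENDS:
        problems.append("backend must be one of Azure, AWS, GCP, HPC")

    # zones is required for the AWS and GCP backends.
    if backend in {"AWS", "GCP"} and not inputs.get("zones"):
        problems.append("zones is required when backend is AWS or GCP")

    # The AWS queue that is required depends on the preemptible setting.
    if backend == "AWS":
        preemptible = inputs.get("preemptible")
        if preemptible is True and not inputs.get("aws_spot_queue_arn"):
            problems.append("aws_spot_queue_arn is required for preemptible AWS runs")
        if preemptible is False and not inputs.get("aws_on_demand_queue_arn"):
            problems.append("aws_on_demand_queue_arn is required for on-demand AWS runs")

    # pharmcat_positions is required whenever the PharmCAT steps are enabled.
    if inputs.get("run_pharmcat") and not inputs.get("pharmcat_positions"):
        problems.append("pharmcat_positions is required when run_pharmcat is true")

    return problems


if __name__ == "__main__":
    example = {"backend": "GCP", "run_pharmcat": True}
    for problem in check_inputs(example):
        print(problem)
```

An empty list means none of the conditional requirements listed above were violated; it does not validate file formats or paths.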

Workflow Outputs

The workflow outputs depend on which analyses are run. This is determined by the inputs that are provided and by whether options such as run_cohort_analysis, run_qc, and run_pharmcat are set to true.

Output Description

demultiplex_failed_samples

A file listing samples that failed demultiplexing.

Sample analysis

Sample analysis outputs are produced for each demultiplexed sample generated from the input HiFi reads.

pbsv_vcf	Structural variants called by pbsv (with index)
sample_phased_vcfs	Phased VCFs and indices called by DeepVariant and phased by WhatsHap
haplotagged_bams	Haplotagged BAM output by WhatsHap

Cohort analysis

Cohort analysis outputs will be produced if the input run_cohort_analysis is set to true.

cohort_phased_joint_called_vcf	Phased cohort VCF called by GLnexus and phased by WhatsHap

Quality control (QC)

QC outputs will be produced if the input run_qc is set to true.

Sample
sample_readcount_beds	Per-sample BED file containing counts of intersections between the input `target_regions_bed` and the aligned BAM
sample_readcount_csvs	Per-sample CSV file containing counts of intersections between the input `target_regions_bed` and the aligned BAM
sample_coverage_fraction_csvs	Per-sample base coverage fractions in CSV format
sample_read_metrics_csvs	Per-sample read metrics in CSV format
sample_duplicate_lengths_csvs	Per-sample PCR/optical read duplicate lengths in CSV format
sample_merged_read_metrics_csvs	Per-sample merged read metrics, target read metrics, and exons per read information in CSV format
sample_mean_base_coverage_by_target_plots	Per-sample PNG plot of mean base coverage by target
sample_coverage_plots	Per-sample PNG plot of coverage per target
Batch
batch_covered_fraction_summary_csvs	Batch-level covered fraction summary for targets and exons in CSV format
batch_coverage_summary_csvs	Batch-level coverage summary for targets and exons in CSV format
batch_dropped_issue_elements_csvs	Batch-level dropped elements for targets and exons in CSV format
batch_lowcov_issue_elements_csvs	Batch-level low-coverage elements for targets and exons in CSV format
batch_gc_content_csvs	Batch-level high GC content sites for targets and exons in CSV format
batch_duplicate_lengths_csv	Batch-level PCR/optical read duplicate lengths in CSV format
batch_read_data_csv	Batch-level read metrics
batch_mean_base_coverage_plot_png	Batch-level PNG plot of mean base coverage
batch_multi_coverage_by_target_png	Batch-level PNG plot of coverage
batch_read_categories_png	Batch-level PNG plot of read categories
batch_read_length_by_sample_csv	Batch-level CSV denoting read lengths by sample

Hybrid selection (HS) metrics

HS metrics outputs will be generated if the input probes_bed is defined.

sample_hs_metrics	Picard hybrid-selection (HS) metrics
batch_consolidated_hs_metrics_tsv	Consolidated HS metrics TSV
batch_consolidated_hs_metrics_quickview_tsv	Consolidated HS metrics quickview TSV
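
The picard_near_distance input (described under Workflow Inputs) sets how far a read may lie from the nearest probe and still count as "near probe". The sketch below illustrates that distance test on plain coordinate pairs. It is a simplified illustration, not Picard's implementation: the function names are hypothetical, intervals are assumed to be half-open (BED-style) and on the same chromosome, and a read that overlaps a probe is simply given distance 0.

```python
def distance_to_interval(read_start, read_end, probe_start, probe_end):
    """Gap in bases between two half-open intervals; 0 if they overlap or touch."""
    if read_end <= probe_start:
        return probe_start - read_end
    if probe_end <= read_start:
        return read_start - probe_end
    return 0


def is_near_probe(read, probes, near_distance=5000):
    """True if the read lies within near_distance bases of any probe.

    read is a (start, end) tuple; probes is a non-empty list of (start, end)
    tuples on the same chromosome. The default of 5000 mirrors the default
    listed for picard_near_distance.
    """
    nearest = min(
        distance_to_interval(read[0], read[1], probe[0], probe[1])
        for probe in probes
    )
    return nearest <= near_distance


if __name__ == "__main__":
    probes = [(10_000, 10_120), (50_000, 50_120)]
    print(is_near_probe((4_000, 6_000), probes))    # 4,000 bases from the first probe
    print(is_near_probe((20_000, 22_000), probes))  # farther than 5,000 from both
```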

PharmCAT

PharmCAT outputs will be produced if the input run_pharmcat is set to true.

pangu_jsons	Pangu report JSON
pangu_tsvs	Pangu TSV output; used by PharmCAT
fixed_pangu_tsvs	Pangu TSV with missing calls fixed
pharmcat_missing_pgx_vcfs	Phased VCF with missing calls converted to ref calls
pharmcat_preprocessed_filtered_vcfs	Phased VCF with low-coverage ref calls removed
pharmcat_match_jsons	PharmCAT match results in JSON format
pharmcat_phenotype_jsons	PharmCAT phenotype results in JSON format
pharmcat_report_htmls	PharmCAT report in HTML format
pharmcat_report_jsons	PharmCAT report in JSON format
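
The conditions stated for each output group above can be collected in one place. The sketch below is only a summary aid: the function name expected_output_groups is hypothetical, the group labels are the headings used in this section, and the False/None defaults are assumptions made for the example rather than the workflow's documented defaults.

```python
def expected_output_groups(run_cohort_analysis=False, run_qc=False,
                           probes_bed=None, run_pharmcat=False):
    """List the output groups expected for a given combination of inputs."""
    # Sample analysis outputs are produced for every demultiplexed sample.
    groups = ["Sample analysis"]
    if run_cohort_analysis:
        groups.append("Cohort analysis")
    if run_qc:
        groups.append("Quality control (QC)")
    # HS metrics run whenever probes_bed is defined; there is no separate flag.
    if probes_bed is not None:
        groups.append("Hybrid selection (HS) metrics")
    if run_pharmcat:
        groups.append("PharmCAT")
    return groups


if __name__ == "__main__":
    print(expected_output_groups(run_qc=True, probes_bed="probes.bed"))
```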

References

Reference datasets are hosted publicly for use in the pipeline.

Containers

Docker image definitions used by this workflow can be found in the wdl-dockerfiles repository. Images are hosted in PacBio’s Quay.io registry. Docker images used in the workflow are pinned to specific versions by referring to their digests rather than tags.

The Docker image used by a particular step of the workflow can be identified by looking at the docker key in the runtime block for the given task.
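
As a rough illustration of that lookup, the sketch below scans WDL text line by line and records the value of the docker key for each task. It is a simplification based on regular expressions, not a WDL parser, and it assumes each task opens with `task <name> {` on one line and that docker appears on its own line. The function name, the example task, and the image string are all hypothetical placeholders.

```python
import re

TASK_LINE = re.compile(r"\s*task\s+(\w+)\s*\{")
DOCKER_LINE = re.compile(r"\s*docker\s*:\s*(.+?)\s*$")


def docker_images_by_task(wdl_text):
    """Map each task name to the value of the docker key found in that task."""
    images = {}
    current_task = None
    for line in wdl_text.splitlines():
        task_match = TASK_LINE.match(line)
        if task_match:
            current_task = task_match.group(1)
            continue
        docker_match = DOCKER_LINE.match(line)
        if docker_match and current_task is not None:
            # Drop the surrounding quotes of the WDL string literal.
            images[current_task] = docker_match.group(1).strip('"')
    return images


# Hypothetical task; the image name and digest are placeholders.
EXAMPLE_WDL = '''
task example_task {
  command <<<
    echo "hello"
  >>>
  runtime {
    docker: "example.registry/example-image@sha256:<digest>"
  }
}
'''

if __name__ == "__main__":
    for task, image in docker_images_by_task(EXAMPLE_WDL).items():
        print(task, image)
```

Where the docker value is built from a WDL expression rather than a literal string, the sketch returns the expression text unchanged.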