HiFi Target Enrichment

Analyze targeted HiFi sequence datasets using PacBio read data. Call and phase small and structural variants.

A workflow, written in Workflow Description Language (WDL), for analyzing target-enriched PacBio HiFi data. At minimum, it runs demultiplexing, duplicate marking, alignment to the reference, small variant calling with DeepVariant, and structural variant (SV) calling with pbsv, then phases and haplotags each sample. Cohort analysis, QC, HS metrics, and PharmCAT steps are optional.

HiFi Target Enrichment workflow diagram

Workflow Inputs

The HifiTargetEnrichment FAQ section has details about file formats.

Input Description
batch_id

Batch name; used for naming files

hifi_reads

HiFi reads in BAM format.

barcode_sample_map

FASTA file containing forward and reverse barcode sequences; used to demultiplex reads.

target_regions_bed

BED file specifying the coordinates of the regions of interest.

reference

Files associated with the reference genome.

name Reference name; used to name outputs
fasta Reference genome and index to align reads to
chromosome_lengths File specifying the lengths of each of the reference chromosomes
tandem_repeat_bed Tandem repeat locations in the reference genome
exons_bed BED file specifying reference exon locations
run_cohort_analysis

Run optional cohort analysis steps

run_qc

Run optional QC steps

qc_low_coverage

Low coverage cutoff for QC [10]

probes_bed

BED file specifying the coordinates of the probes used to prepare the target capture library. If you do not have access to the probes BED file, the file used for target_regions_bed may be supplied instead. If this input is specified, the HS metrics workflow will run.

picard_sample_size

Sample size for Picard CollectHsMetrics; the sample size used for Theoretical Het Sensitivity sampling. [1000]

picard_near_distance

Near distance cutoff for Picard CollectHsMetrics; the maximum distance between a read and the nearest probe/bait/amplicon for the read to be considered ‘near probe’ and included in the percent selected. [5000]

run_pharmcat

Run optional pharmcat and pangu_cyp2d6 steps

pharmcat_positions

VCF file and index specifying PharmCAT positions; required if run_pharmcat is set to true.

pharmcat_min_coverage

Minimum coverage cutoff used to filter the preprocessed VCF passed to pharmcat [10]

deepvariant_version

Version of DeepVariant to use [1.4.0]

deepvariant_model

Optional alternate DeepVariant model file to use

backend

Backend where the workflow will be executed [“Azure”, “AWS”, “GCP”, “HPC”]

zones

Zones where compute will take place; required if backend is set to ‘AWS’ or ‘GCP’.

aws_spot_queue_arn

Queue ARN for the spot batch queue; required if backend is set to ‘AWS’ and preemptible is set to true

aws_on_demand_queue_arn

Queue ARN for the on demand batch queue; required if backend is set to ‘AWS’ and preemptible is set to false

container_registry

Container registry where workflow images are hosted. If left blank, PacBio’s public Quay.io registry will be used.

preemptible

If set to true, run tasks preemptibly where possible. On-demand VMs will be used only for tasks that run for >24 hours if the backend is set to GCP. If set to false, on-demand VMs will be used for every task. Ignored if backend is set to HPC.
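
The conditional requirements described in the inputs above can be checked before a run is submitted. The following is a minimal sketch and not part of the workflow itself: `validate_inputs` is a hypothetical helper, and it assumes the inputs are held in a plain dictionary keyed by the bare input names listed above (a real inputs JSON would typically prefix each key with the workflow name). The zone value in the usage lines is only illustrative.

```python
# Hypothetical pre-flight check; the rules mirror the input descriptions above.
VALID_BACKENDS = {"Azure", "AWS", "GCP", "HPC"}


def validate_inputs(inputs):
    """Return a list of human-readable problems found in `inputs`."""
    problems = []
    backend = inputs.get("backend")
    preemptible = inputs.get("preemptible")

    if backend not in VALID_BACKENDS:
        problems.append(f"backend must be one of {sorted(VALID_BACKENDS)}")

    # zones is required only for the AWS and GCP backends.
    if backend in {"AWS", "GCP"} and not inputs.get("zones"):
        problems.append("zones is required when backend is AWS or GCP")

    # Which AWS queue ARN is needed depends on the preemptible setting.
    if backend == "AWS":
        if preemptible is True and not inputs.get("aws_spot_queue_arn"):
            problems.append("aws_spot_queue_arn is required when preemptible is true")
        if preemptible is False and not inputs.get("aws_on_demand_queue_arn"):
            problems.append("aws_on_demand_queue_arn is required when preemptible is false")

    # The positions VCF is only needed when the PharmCAT steps are enabled.
    if inputs.get("run_pharmcat") and not inputs.get("pharmcat_positions"):
        problems.append("pharmcat_positions is required when run_pharmcat is true")

    return problems


# Usage: a GCP run with PharmCAT enabled but no zones or positions VCF.
for problem in validate_inputs({"backend": "GCP", "preemptible": True, "run_pharmcat": True}):
    print(problem)
```

Returning a list of problems rather than raising on the first one lets all missing inputs be reported in a single pass.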

Workflow Outputs

The set of workflow outputs will depend on which set of analyses are specified to run, determined by the set of inputs that are provided as well as whether options such as run_cohort_analysis, run_qc, and run_pharmcat are set to true.

Output Description
demultiplex_failed_samples

A file listing samples that failed demultiplexing.

Sample analysis

Sample analysis outputs are produced for each demultiplexed sample generated from the input HiFi reads.

pbsv_vcf Structural variants called by pbsv (with index)
sample_phased_vcfs Phased VCFs and indices called by DeepVariant and phased by WhatsHap
haplotagged_bams Haplotagged BAM output by WhatsHap
Cohort analysis

Cohort analysis outputs will be produced if the input run_cohort_analysis is set to true.

cohort_phased_joint_called_vcf Phased cohort VCF called by GLnexus and phased by WhatsHap
Quality control (QC)

QC outputs will be produced if the input run_qc is set to true.

Sample
sample_readcount_beds Per-sample BED file containing counts of intersections between the input target_regions_bed and the aligned BAM
sample_readcount_csvs Per-sample CSV file containing counts of intersections between the input target_regions_bed and the aligned BAM
sample_coverage_fraction_csvs Per-sample base coverage fractions in CSV format
sample_read_metrics_csvs Per-sample read metrics in CSV format
sample_duplicate_lengths_csvs Per-sample PCR/optical read duplicate lengths in CSV format
sample_merged_read_metrics_csvs Per-sample merged read metrics, target read metrics, and exons per read information in CSV format
sample_mean_base_coverage_by_target_plots Per-sample PNG plot of mean base coverage by target
sample_coverage_plots Per-sample PNG plot of coverage per target
Batch
batch_covered_fraction_summary_csvs Batch-level covered fraction summary for targets and exons in CSV format
batch_coverage_summary_csvs Batch-level coverage summary for targets and exons in CSV format
batch_dropped_issue_elements_csvs Batch-level dropped elements for targets and exons in CSV format
batch_lowcov_issue_elements_csvs Batch-level low-coverage elements for targets and exons in CSV format
batch_gc_content_csvs Batch-level high GC content sites for targets and exons in CSV format
batch_duplicate_lengths_csv Batch-level PCR/optical read duplicate lengths in CSV format
batch_read_data_csv Batch-level read metrics
batch_mean_base_coverage_plot_png Batch-level PNG plot of mean base coverage
batch_multi_coverage_by_target_png Batch-level PNG plot of coverage
batch_read_categories_png Batch-level PNG plot of read categories
batch_read_length_by_sample_csv Batch-level CSV denoting read lengths by sample
Hybrid selection (HS) metrics

HS metrics outputs will be generated if the input probes_bed is defined.

sample_hs_metrics Picard hybrid-selection (HS) metrics
batch_consolidated_hs_metrics_tsv Consolidated HS metrics TSV
batch_consolidated_hs_metrics_quickview_tsv Consolidated HS metrics quickview TSV
PharmCAT

PharmCAT outputs will be produced if the input run_pharmcat is set to true.

pangu_jsons Pangu report JSON
pangu_tsvs Pangu TSV output; used by PharmCAT
fixed_pangu_tsvs Pangu TSV with missing calls fixed
pharmcat_missing_pgx_vcfs Phased VCF with missing calls converted to ref calls
pharmcat_preprocessed_filtered_vcfs Phased VCF with low-coverage ref calls removed
pharmcat_match_jsons PharmCAT match results in JSON format
pharmcat_phenotype_jsons PharmCAT phenotype results in JSON format
pharmcat_report_htmls PharmCAT report in HTML format
pharmcat_report_jsons PharmCAT report in JSON format
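
The relationship between inputs and output groups described in this section can be summarized in a few lines. This is a minimal sketch under stated assumptions: `expected_output_groups` and the group labels it returns are hypothetical names chosen for illustration, and the inputs are assumed to be a plain dictionary keyed by the bare input names.

```python
def expected_output_groups(inputs):
    """List the output groups a run should produce for the given inputs."""
    # Sample analysis outputs are produced for every demultiplexed sample.
    groups = ["sample_analysis"]
    if inputs.get("run_cohort_analysis"):
        groups.append("cohort_analysis")
    if inputs.get("run_qc"):
        groups.append("qc")
    # HS metrics are triggered by the presence of probes_bed, not by a flag.
    if inputs.get("probes_bed"):
        groups.append("hs_metrics")
    if inputs.get("run_pharmcat"):
        groups.append("pharmcat")
    return groups


# Usage: QC enabled and a probes BED supplied (the file name is illustrative).
print(expected_output_groups({"run_qc": True, "probes_bed": "probes.bed"}))
```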

References

Reference datasets are hosted publicly for use in the pipeline.

Containers

Docker image definitions used by this workflow can be found in the wdl-dockerfiles repository. Images are hosted in PacBio’s quay.io. Docker images used in the workflow are pinned to specific versions by referring to their digests rather than tags.

The Docker image used by a particular step of the workflow can be identified by looking at the docker key in the runtime block for the given task.
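
Looking up the docker key by hand can be tedious across many tasks, so the lookup can be scripted. The following is a rough sketch, not a WDL parser: it scans the text line by line, remembers the most recent `task` declaration, and records the value of any `docker:` line that follows. `docker_images_by_task`, the sample task, and the image reference in it are all hypothetical and do not come from this workflow.

```python
import re

TASK_RE = re.compile(r"^\s*task\s+(\w+)\s*\{")
DOCKER_RE = re.compile(r"^\s*docker\s*:\s*(.+?)\s*$")


def docker_images_by_task(wdl_text):
    """Map each task name to the value of its `docker` runtime key."""
    images = {}
    current_task = None
    for line in wdl_text.splitlines():
        task_match = TASK_RE.match(line)
        if task_match:
            current_task = task_match.group(1)
            continue
        docker_match = DOCKER_RE.match(line)
        if docker_match and current_task is not None:
            # Drop the surrounding quotes of the WDL string literal.
            images[current_task] = docker_match.group(1).strip('"')
    return images


# Hypothetical task used only to exercise the function.
SAMPLE_WDL = '''
task example_task {
    command <<<
        echo "hello"
    >>>
    runtime {
        docker: "registry.example/example_image@sha256:0123abcd"
        cpu: 2
    }
}
'''

print(docker_images_by_task(SAMPLE_WDL))
```

Because the scan is line-based, it would be confused by a `docker:` line inside a command block; a proper WDL parser would be the safer choice for anything beyond a quick inspection.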
