DRAGEN - Whole Genome Germline Single Sample Analysis

DRAGEN functional equivalence germline SNP and indel discovery in human whole genome sequencing data.

The Whole Genome Germline Single Sample pipeline implements data pre-processing and initial variant calling according to the GATK Best Practices for germline SNP and indel discovery in human whole genome sequencing data. When the pipeline runs in DRAGEN-GATK mode, it produces outputs that are functionally equivalent to those of the DRAGEN pipeline.

This workflow is maintained by the Broad Institute and is written in Workflow Description Language (WDL). Further documentation can be found here.

[Figure: DRAGEN Whole Genome Single Sample workflow diagram]

Workflow Inputs

The workflow requires sample and reference information. The user may decide whether or not to run the pipeline in DRAGEN functional equivalence mode by setting the value of the dragen_functional_equivalence_mode input.
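As a minimal sketch of how this flag might be set, the snippet below builds a partial inputs JSON mapping for the workflow. The workflow name `WholeGenomeGermlineSingleSample` used as the key prefix and the helper `build_inputs` are assumptions made for illustration; the exact input names should be checked against the WDL itself.

```python
import json

# Assumed workflow name, used as the key prefix in the inputs JSON.
WORKFLOW = "WholeGenomeGermlineSingleSample"


def build_inputs(dragen_mode: bool) -> dict:
    """Return a partial inputs mapping that only sets the mode flag."""
    return {f"{WORKFLOW}.dragen_functional_equivalence_mode": dragen_mode}


inputs = build_inputs(dragen_mode=True)
print(json.dumps(inputs, indent=2))
```

A real inputs file would also need the sample, reference, and settings inputs described in the following sections.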

The Broad Institute provides various test inputs hosted in GCP that can be used to run the pipeline.

Input	Description

sample_and_unmapped_bams

Information and files associated with the sample.

base_file_name	String used for output files; can be set to a read group ID.
final_gvcf_base_name	Base name for the output GVCF file; can be set to a read group ID.
flowcell_unmapped_bams	Human whole-genome paired-end sequencing data in unmapped BAM (uBAM) format; each uBAM file contains one or more read groups all belonging to a single sample (SM).
sample_name	A string to describe the sample; can be set to a read group ID.
unmapped_bam_suffix	The suffix for the input uBAM files; must be consistent across files (e.g., ".unmapped.bam").
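The fields above can be sketched as a small Python helper that assembles the `sample_and_unmapped_bams` values and checks that every uBAM path ends with the same suffix. The helper name `make_sample_struct`, the sample name, and the bucket paths are hypothetical; reusing the sample name for the two base names is only one possible choice.

```python
from typing import Dict, List


def make_sample_struct(
    sample_name: str, ubams: List[str], suffix: str = ".unmapped.bam"
) -> Dict[str, object]:
    """Build the sample_and_unmapped_bams mapping after checking the suffix."""
    # The suffix must be consistent across all input uBAM files.
    bad = [path for path in ubams if not path.endswith(suffix)]
    if bad:
        raise ValueError(f"uBAMs missing suffix {suffix!r}: {bad}")
    return {
        "sample_name": sample_name,
        "base_file_name": sample_name,        # illustrative choice
        "final_gvcf_base_name": sample_name,  # illustrative choice
        "flowcell_unmapped_bams": list(ubams),
        "unmapped_bam_suffix": suffix,
    }


# Hypothetical sample name and file locations.
sample = make_sample_struct(
    "SAMPLE1",
    [
        "gs://example-bucket/SAMPLE1.rg1.unmapped.bam",
        "gs://example-bucket/SAMPLE1.rg2.unmapped.bam",
    ],
)
```

Failing early on a mismatched suffix is cheaper than discovering the problem after the workflow has started.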

references

Data associated with the reference genome.

contamination_sites_ud	Contamination site files for the CheckContamination task.
contamination_sites_bed	Contamination site files for the CheckContamination task.
contamination_sites_mu	Contamination site files for the CheckContamination task.
calling_interval_list	Interval list used for variant calling.
reference_fasta	Reference FASTA, index, dictionary, and associated BWA index files. See the struct definition for the full list of associated reference files.
known_indels_sites_vcfs	Set of known indel site VCFs.
known_indels_sites_indices	Set of known indel site VCF indices.
dbsnp_vcf	dbSNP VCF file.
dbsnp_vcf_index	dbSNP VCF file index.
evaluation_interval_list	File containing the target set of genomic intervals.
haplotype_database_file	File containing known haplotype major and minor alleles and frequencies.

dragmap_reference

Files used by the DRAGMAP aligner.

reference_bin	Binary representation of the reference FASTA file used for the DRAGEN mode DRAGMAP aligner.
hash_table_cfg_bin	Binary representation of the configuration for the hash table used for the DRAGEN mode DRAGMAP aligner.
hash_table_cmp	Compressed representation of the hash table that is used for the DRAGEN mode DRAGMAP aligner.

scatter_settings

Information for variant calling scatter settings.

haplotype_scatter_count	Scatter count used for variant calling.
break_bands_at_multiples_of	Breaks reference bands up at genomic positions that are multiples of this number; used to reduce GVCF file size.
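To illustrate what breaking reference bands at multiples of a number means, here is a minimal sketch. It assumes 1-based inclusive coordinates, that a new band begins at every position that is a multiple of the setting, and that a non-positive value disables breaking; the function `break_band` is hypothetical and is not the variant caller's own implementation.

```python
from typing import List, Tuple


def break_band(start: int, end: int, multiple: int) -> List[Tuple[int, int]]:
    """Split the inclusive band [start, end] so a new band starts at each multiple."""
    if multiple <= 0:
        # Assumption: a non-positive value means "do not break bands".
        return [(start, end)]
    bands = []
    cur = start
    while cur <= end:
        # Smallest multiple of `multiple` that is strictly greater than cur.
        next_break = (cur // multiple + 1) * multiple
        stop = min(end, next_break - 1)
        bands.append((cur, stop))
        cur = stop + 1
    return bands


bands = break_band(950, 2100, 1000)
print(bands)  # [(950, 999), (1000, 1999), (2000, 2100)]
```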

papi_settings

Information regarding the number of preemptions allowed.

preemptible_tries	Number of times the workflow can be preempted.
agg_preemptible_tries	Number of preemptible machine tries for the BamToCram task.

wgs_coverage_interval_list

Interval list for the CollectWgsMetrics tool.

Workflow Outputs

The pipeline outputs variant calls, aligned reads, and various metrics files.

Output	Description

UnmappedBamToAlignedBam

Quality control metrics and files output during alignment.

quality_yield_metrics	The quality metrics calculated for the unmapped BAM files.
unsorted_read_group_base_distribution_by_cycle_pdf	PDF of the base distribution for each unsorted, readgroup-specific BAM.
unsorted_read_group_base_distribution_by_cycle_metrics	Metrics of the base distribution by cycle for each unsorted, readgroup-specific BAM.
unsorted_read_group_insert_size_histogram_pdf	Histograms of insert size for the unsorted, readgroup-specific BAMs.
unsorted_read_group_insert_size_metrics	Insert size metrics for the unsorted, readgroup-specific BAMs.
unsorted_read_group_quality_by_cycle_pdf	Quality by cycle PDF for the unsorted, readgroup-specific BAMs.
unsorted_read_group_quality_by_cycle_metrics	Quality by cycle metrics for the unsorted, readgroup-specific BAMs.
unsorted_read_group_quality_distribution_pdf	Quality distribution PDF for the unsorted, readgroup-specific BAMs.
unsorted_read_group_quality_distribution_metrics	Quality distribution metrics for the unsorted, readgroup-specific BAMs.
cross_check_fingerprints_metrics	Fingerprint metrics file if optional fingerprinting is performed.
selfSM	Contamination estimate from VerifyBamID2.
contamination	Estimated contamination from the CheckContamination task.
duplicate_metrics	Duplicate read metrics from the MarkDuplicates tool.
output_bqsr_reports	BQSR reports, if the BQSR tool is run.
output_bam	Aligned, recalibrated output BAM; produced only if `provided_output_bam` is true.
output_bam_index	Index for the aligned, recalibrated BAM; produced only if `provided_output_bam` is true.

AggregatedBamQC

Outputs from aggregating the aligned recalibrated BAM and calculating quality control metrics.

read_group_alignment_summary_metrics	Alignment summary metrics for the aggregated BAM.
read_group_gc_bias_detail_metrics	GC bias detail metrics for the aggregated BAM.
read_group_gc_bias_pdf	PDF of the GC bias by readgroup for the aggregated BAM.
read_group_gc_bias_summary_metrics	GC bias summary metrics by readgroup for the aggregated BAM.
calculate_read_group_checksum_md5	MD5 checksum for the aggregated BAM.
agg_alignment_summary_metrics	Alignment summary metrics for the aggregated BAM.
agg_bait_bias_detail_metrics	Bait bias detail metrics for the aggregated BAM.
agg_bait_bias_summary_metrics	Bait bias summary metrics for the aggregated BAM.
agg_gc_bias_detail_metrics	GC bias detail metrics for the aggregated BAM.
agg_gc_bias_pdf	PDF of GC bias for the aggregated BAM.
agg_gc_bias_summary_metrics	GC bias summary metrics for the aggregated BAM.
agg_insert_size_histogram_pdf	Histogram of insert size for the aggregated BAM.
agg_insert_size_metrics	Insert size metrics for the aggregated BAM.
agg_pre_adapter_detail_metrics	Detail metrics for artifacts that occur prior to the addition of adapters for the aggregated BAM.
agg_pre_adapter_summary_metrics	Summary metrics for artifacts that occur prior to the addition of adapters for the aggregated BAM.
agg_quality_distribution_pdf	PDF of the quality distribution for the aggregated BAM.
agg_quality_distribution_metrics	Quality distribution metrics for the aggregated BAM.
agg_error_summary_metrics	Error summary metrics for the aggregated BAM.
fingerprint_summary_metrics	Optional fingerprint summary metrics for the aggregated BAM.
fingerprint_detail_metrics	Optional fingerprint detail metrics for the aggregated BAM.

CollectWgsMetrics

WGS metrics collected using stringent thresholds.

wgs_metrics	Metrics from the CollectWgsMetrics tool.

CollectRawWgsMetrics

WGS metrics collected using less stringent thresholds.

raw_wgs_metrics	Metrics from the CollectRawWgsMetrics tool.

BamToGvcf

HaplotypeCaller variant calling outputs.

gvcf_summary_metrics	(g)VCF summary metrics.
gvcf_detail_metrics	(g)VCF detail metrics.
output_vcf	Final reblocked gVCF with variant calls produced by HaplotypeCaller (read more in the Reblocking section).
output_vcf_index	Index for the final gVCF.

BamToCram

Files associated with converting the aggregated recalibrated BAM to CRAM.

output_cram	Aligned, recalibrated output CRAM.
output_cram_index	Index for the aligned recalibrated CRAM.
output_cram_md5	MD5 checksum for the aligned recalibrated CRAM.
validate_cram_file_report	Validation report for the CRAM, created with the ValidateSam tool.
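After downloading these outputs, the checksum in `output_cram_md5` can be used to confirm that a file transferred intact. The sketch below assumes the checksum file holds the hexadecimal digest as its first whitespace-delimited token; the function names and any file paths passed to them are hypothetical.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(data_path: str, md5_path: str) -> bool:
    """Compare a file's MD5 with the digest stored in a checksum file."""
    with open(md5_path) as handle:
        tokens = handle.read().split()
    if not tokens:
        return False  # empty checksum file: nothing to compare against
    # Assumption: the first token is the hex digest.
    return md5_of_file(data_path) == tokens[0].lower()
```

Reading in chunks keeps memory use flat, which matters for whole-genome files.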

References

Reference data hosted in GCP may be found here.

Containers

Containers used by the pipeline are hosted in the Broad Institute’s public container registry and in the public BioContainers registry on quay.io.
