DRAGEN – Whole Genome Germline Single Sample Analysis

DRAGEN functional equivalence germline SNP and indel discovery in human whole genome sequencing data.

The Whole Genome Germline Single Sample pipeline implements data pre-processing and initial variant calling according to the GATK Best Practices for germline SNP and indel discovery in human whole genome sequencing data. When the pipeline runs in the DRAGEN-GATK mode, it produces functionally equivalent outputs to the DRAGEN pipeline.

This workflow is maintained by the Broad Institute and is written in Workflow Description Language (WDL). Further documentation can be found here.

DRAGEN Whole Genome Single Sample workflow diagram

Workflow Inputs

The workflow requires sample and reference information. The user may decide whether or not to run the pipeline in DRAGEN functional equivalence mode by setting the value of the dragen_functional_equivalence_mode input.

The Broad Institute provides various test inputs hosted in GCP that can be used to run the pipeline.
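As a rough illustration of how these inputs fit together, the Python sketch below assembles part of an inputs JSON. The workflow name WholeGenomeGermlineSingleSample used as the key prefix, the bucket path, the sample name, and the helper build_inputs are assumptions for illustration only; the field names are the ones listed in the input table below. Only the sample struct and the mode flag are shown; a real run also needs the references, scatter_settings, and papi_settings structs and wgs_coverage_interval_list.

```python
import json

# Assumed workflow name used as the key prefix; check the WDL you actually run.
WORKFLOW = "WholeGenomeGermlineSingleSample"


def build_inputs(sample_name, ubam_paths, suffix=".unmapped.bam", dragen_mode=False):
    """Assemble a partial inputs dictionary (hypothetical helper)."""
    # The uBAM suffix must be consistent across all input files.
    bad = [path for path in ubam_paths if not path.endswith(suffix)]
    if bad:
        raise ValueError(f"uBAMs without suffix {suffix!r}: {bad}")
    return {
        f"{WORKFLOW}.sample_and_unmapped_bams": {
            "sample_name": sample_name,
            "base_file_name": sample_name,
            "final_gvcf_base_name": sample_name,
            "flowcell_unmapped_bams": list(ubam_paths),
            "unmapped_bam_suffix": suffix,
        },
        # Toggle DRAGEN functional equivalence mode.
        f"{WORKFLOW}.dragen_functional_equivalence_mode": dragen_mode,
    }


if __name__ == "__main__":
    # Placeholder sample name and path, not real test data.
    inputs = build_inputs(
        "sample1",
        ["gs://example-bucket/sample1.rg1.unmapped.bam"],
        dragen_mode=True,
    )
    print(json.dumps(inputs, indent=2))
```

Checking the suffix before submission is a design choice of this sketch; it catches an inconsistent unmapped_bam_suffix early rather than partway through a run.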

Input Description
sample_and_unmapped_bams

Information and files associated with the sample.

base_file_name String used for output files; can be set to a read group ID.
final_gvcf_base_name Base name for the output GVCF file; can be set to a read group ID.
flowcell_unmapped_bams Human whole-genome paired-end sequencing data in unmapped BAM (uBAM) format; each uBAM file contains one or more read groups all belonging to a single sample (SM).
sample_name A string to describe the sample; can be set to a read group ID.
unmapped_bam_suffix The suffix for the input uBAM files; must be consistent across files (for example, “.unmapped.bam”).

references

Data associated with the reference genome.

contamination_sites_ud Contamination site files for the CheckContamination task.
contamination_sites_bed Contamination site files for the CheckContamination task.
contamination_sites_mu Contamination site files for the CheckContamination task.
calling_interval_list Interval list used for variant calling.
reference_fasta Reference FASTA, index, dictionary, and associated BWA index files. See the struct definition for the full list of associated reference files.
known_indels_sites_vcfs Set of known indel site VCFs.
known_indels_sites_indices Set of known indel site VCF indices.
dbsnp_vcf dbSNP VCF file.
dbsnp_vcf_index dbSNP VCF file index.
evaluation_interval_list File containing the target set of genomic intervals.
haplotype_database_file File containing known haplotype major and minor alleles and frequencies.

dragmap_reference

Files used by the DRAGMAP aligner.

reference_bin Binary representation of the reference FASTA file used for the DRAGEN mode DRAGMAP aligner.
hash_table_cfg_bin Binary representation of the configuration for the hash table used for the DRAGEN mode DRAGMAP aligner.
hash_table_cmp Compressed representation of the hash table that is used for the DRAGEN mode DRAGMAP aligner.

scatter_settings

Information for variant calling scatter settings.

haplotype_scatter_count Scatter count used for variant calling.
break_bands_at_multiples_of Breaks reference bands up at genomic positions that are multiples of this number; used to reduce GVCF file size.

papi_settings

Information regarding the number of preemptions allowed.

preemptible_tries Number of times the workflow can be preempted.
agg_preemptible_tries Number of preemptible machine tries for the BamToCram task.

wgs_coverage_interval_list

Interval list for the CollectWgsMetrics tool.
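To make the break_bands_at_multiples_of input described above more concrete, here is a simplified Python model of the splitting rule. It is not the GATK implementation: the function name break_band, the inclusive coordinate convention, and treating a non-positive value as "no breaking" are assumptions made for this sketch.

```python
def break_band(start, end, multiple):
    """Split the inclusive band [start, end] so that a new band begins
    at every position that is a multiple of `multiple` (illustrative model)."""
    if multiple <= 0:
        # Assumption: a non-positive value disables band breaking.
        return [(start, end)]
    pieces = []
    current = start
    while current <= end:
        # Smallest multiple strictly greater than the current position.
        next_break = (current // multiple + 1) * multiple
        piece_end = min(end, next_break - 1)
        pieces.append((current, piece_end))
        current = piece_end + 1
    return pieces


if __name__ == "__main__":
    for piece in break_band(950, 2100, 1000):
        print(piece)
```

For example, with a multiple of 1000, a band covering positions 950 through 2100 is split into (950, 999), (1000, 1999), and (2000, 2100).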

Workflow Outputs

The pipeline outputs variant calls, aligned reads, and various metrics files.

Output Description
UnmappedBamToAlignedBam

Quality control metrics and files output during alignment.

quality_yield_metrics The quality metrics calculated for the unmapped BAM files.
unsorted_read_group_base_distribution_by_cycle_pdf PDF of the base distribution for each unsorted, readgroup-specific BAM.
unsorted_read_group_base_distribution_by_cycle_metrics Metrics of the base distribution by cycle for each unsorted, readgroup-specific BAM.
unsorted_read_group_insert_size_histogram_pdf Histograms of insert size for the unsorted, readgroup-specific BAMs.
unsorted_read_group_insert_size_metrics Insert size metrics for the unsorted, readgroup-specific BAMs.
unsorted_read_group_quality_by_cycle_pdf Quality by cycle PDF for the unsorted, readgroup-specific BAMs.
unsorted_read_group_quality_by_cycle_metrics Quality by cycle metrics for the unsorted, readgroup-specific BAMs.
unsorted_read_group_quality_distribution_pdf Quality distribution PDF for the unsorted, readgroup-specific BAMs.
unsorted_read_group_quality_distribution_metrics Quality distribution metrics for the unsorted, readgroup-specific BAMs.
cross_check_fingerprints_metrics Fingerprint metrics file if optional fingerprinting is performed.
selfSM Contamination estimate from VerifyBamID2.
contamination Estimated contamination from the CheckContamination task.
duplicate_metrics Duplicate read metrics from the MarkDuplicates tool.
output_bqsr_reports BQSR reports, if the BQSR tool is run.
output_bam Aligned, recalibrated BAM; output only if provided_output_bam is true.
output_bam_index Index for the aligned, recalibrated BAM; output only if provided_output_bam is true.

AggregatedBamQC

Outputs from aggregating the aligned recalibrated BAM and calculating quality control metrics.

read_group_alignment_summary_metrics Alignment summary metrics for the aggregated BAM.
read_group_gc_bias_detail_metrics GC bias detail metrics for the aggregated BAM.
read_group_gc_bias_pdf PDF of the GC bias by readgroup for the aggregated BAM.
read_group_gc_bias_summary_metrics GC bias summary metrics by readgroup for the aggregated BAM.
calculate_read_group_checksum_md5 MD5 checksum for the aggregated BAM.
agg_alignment_summary_metrics Alignment summary metrics for the aggregated BAM.
agg_bait_bias_detail_metrics Bait bias detail metrics for the aggregated BAM.
agg_bait_bias_summary_metrics Bait bias summary metrics for the aggregated BAM.
agg_gc_bias_detail_metrics GC bias detail metrics for the aggregated BAM.
agg_gc_bias_pdf PDF of GC bias for the aggregated BAM.
agg_gc_bias_summary_metrics GC bias summary metrics for the aggregated BAM.
agg_insert_size_histogram_pdf Histogram of insert size for the aggregated BAM.
agg_insert_size_metrics Insert size metrics for the aggregated BAM.
agg_pre_adapter_detail_metrics Detail metrics for artifacts that occur prior to the addition of adaptors for the aggregated BAM.
agg_pre_adapter_summary_metrics Summary metrics for artifacts that occur prior to the addition of adaptors for the aggregated BAM.
agg_quality_distribution_pdf PDF of the quality distribution for the aggregated BAM.
agg_quality_distribution_metrics Quality distribution metrics for the aggregated BAM.
agg_error_summary_metrics Error summary metrics for the aggregated BAM.
fingerprint_summary_metrics Optional fingerprint summary metrics for the aggregated BAM.
fingerprint_detail_metrics Optional fingerprint detail metrics for the aggregated BAM.

CollectWgsMetrics

WGS metrics collected using stringent thresholds.

wgs_metrics Metrics from the CollectWgsMetrics tool.

CollectRawWgsMetrics

WGS metrics collected using less stringent thresholds.

raw_wgs_metrics Metrics from the CollectRawWgsMetrics tool.

BamToGvcf

HaplotypeCaller variant calling outputs.

gvcf_summary_metrics (g)VCF summary metrics.
gvcf_detail_metrics (g)VCF detail metrics.
output_vcf Final reblocked gVCF with variant calls produced by HaplotypeCaller (read more in the Reblocking section).
output_vcf_index Index for the final gVCF.

BamToCram

Files associated with converting the aggregated recalibrated BAM to CRAM.

output_cram Aligned, recalibrated output CRAM.
output_cram_index Index for the aligned recalibrated CRAM.
output_cram_md5 MD5 checksum for the aligned recalibrated CRAM.
validate_cram_file_report Validation report for the CRAM, created with the ValidateSam tool.
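
After copying the outputs locally, the output_cram_md5 file can be used to confirm that the CRAM was transferred intact. The Python sketch below is one way to do this. It assumes the MD5 file holds the hex digest as its first whitespace-separated token; the function names and the example file names are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_file(data_path, md5_path):
    """Return True if the digest of data_path equals the one stored in md5_path."""
    with open(md5_path) as handle:
        tokens = handle.read().split()
    if not tokens:
        # An empty checksum file cannot confirm anything.
        return False
    # Assumption: the first token in the file is the hex digest.
    return md5_of_file(data_path) == tokens[0].lower()


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file names): python check_md5.py sample1.cram sample1.cram.md5
    if len(sys.argv) == 3:
        ok = matches_md5_file(sys.argv[1], sys.argv[2])
        print("match" if ok else "MISMATCH")
```

Reading in chunks keeps memory use flat, which matters for whole-genome CRAM files that can be tens of gigabytes.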

References

Reference data hosted in GCP may be found here.

Containers

Containers used by the pipeline are hosted in the Broad Institute’s public container registry and in the public biocontainers registry on quay.io.
