Identify high-quality metagenome-assembled genomes (MAGs) from PacBio HiFi data.
Workflow for identifying high-quality metagenome-assembled genomes (MAGs) from PacBio HiFi data, written in Workflow Description Language (WDL).
Workflow Inputs
The workflow can run with either FASTQ- or BAM-format HiFi reads as input. If BAM reads are supplied, they are first converted to FASTQ before being run through the remainder of the MAG pipeline.
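The one-of rule for read inputs can be checked before a run is submitted. The following is a minimal sketch, not part of the workflow itself: the helper name `build_reads_inputs` is hypothetical, the keys are shown without any workflow-name prefix that an inputs JSON file may require, and treating "both supplied" as an error is an assumption of this sketch.

```python
import json


def build_reads_inputs(sample_id, hifi_reads_bam=None, hifi_reads_fastq=None):
    """Return the read-related inputs, requiring exactly one reads file.

    Rejecting the case where both files are given is an assumption of
    this sketch; the workflow only states that one of the two is required.
    """
    supplied = [p for p in (hifi_reads_bam, hifi_reads_fastq) if p]
    if len(supplied) != 1:
        raise ValueError(
            "Supply exactly one of hifi_reads_bam or hifi_reads_fastq"
        )
    inputs = {"sample_id": sample_id}
    if hifi_reads_bam:
        inputs["hifi_reads_bam"] = hifi_reads_bam
    else:
        inputs["hifi_reads_fastq"] = hifi_reads_fastq
    return inputs


if __name__ == "__main__":
    # Hypothetical sample name and file path, for illustration only.
    example = build_reads_inputs("sample1", hifi_reads_fastq="sample1.fastq.gz")
    print(json.dumps(example, indent=2))
```

The resulting dictionary could then be merged with the remaining inputs described in the table below before being written to an inputs file.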
Input | Description |
---|---|
sample_id | Sample ID; used for naming files. |
hifi_reads_bam | HiFi reads in BAM format. If supplied, the reads will first be converted to a FASTQ. One of [hifi_reads_bam, hifi_reads_fastq] is required. |
hifi_reads_fastq | HiFi reads in FASTQ format. One of [hifi_reads_bam, hifi_reads_fastq] is required. |
checkm2_ref_db | The CheckM2 DIAMOND reference database (UniRef100/KO), used to predict the completeness and contamination of MAGs. |
min_contig_length | Minimum size of a contig for it to be considered a long contig. [500000] |
min_contig_completeness | Minimum completeness percentage (from CheckM2) to mark a contig as complete and place it in a distinct bin; this value should not be lower than 90%. [93] |
metabat2_min_contig_size | The minimum size of contig to be included in binning for MetaBAT2. [30000] |
semibin2_model | The trained model to be used in SemiBin2. If set to ‘TRAIN’, a new model will be trained from your data. (‘TRAIN’, ‘human_gut’, ‘human_oral’, ‘dog_gut’, ‘cat_gut’, ‘mouse_gut’, ‘pig_gut’, ‘chicken_caecum’, ‘ocean’, ‘soil’, ‘built_environment’, ‘wastewater’, ‘global’) [‘global’] |
dastool_search_engine | The engine for single copy gene searching used in DAS Tool. (‘blast’, ‘diamond’, ‘usearch’) [‘diamond’] |
dastool_score_threshold | Score threshold until selection algorithm will keep selecting bins (0..1); used by DAS Tool. [0.2] |
min_mag_completeness | Minimum completeness percent for a genome bin. [70] |
max_mag_contamination | Maximum contamination threshold for a genome bin. [10] |
max_contigs | The maximum number of contigs allowed in a genome bin. [20] |
gtdbtk_data_tar_gz | A .tar.gz file of GTDB-Tk (Genome Taxonomy Database Toolkit) reference data, release207_v2, used for assigning taxonomic classifications to bacterial and archaeal genomes. |
backend | Backend where the workflow will be executed [“Azure”, “AWS”, “GCP”, “HPC”] |
zones | Zones where compute will take place; required if backend is set to ‘AWS’ or ‘GCP’. |
aws_spot_queue_arn | Queue ARN for the spot batch queue; required if backend is set to ‘AWS’ and |
aws_on_demand_queue_arn | Queue ARN for the on demand batch queue; required if backend is set to ‘AWS’ and |
container_registry | Container registry where workflow images are hosted. If left blank, PacBio’s public Quay.io registry will be used. |
preemptible | If set to |
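Two of the rules in the table above can be illustrated in a few lines: the quality thresholds applied to a genome bin, and the backend-dependent requirement for zones. This is a sketch under stated assumptions, not the workflow's own code: the `BinStats` class, the function names, and the example bins are hypothetical, and whether the workflow compares inclusively (`>=`, `<=`) or strictly is assumed here.

```python
from dataclasses import dataclass


@dataclass
class BinStats:
    """Hypothetical per-bin summary (completeness and contamination in percent)."""
    name: str
    completeness: float
    contamination: float
    n_contigs: int


def passes_filters(b, min_mag_completeness=70, max_mag_contamination=10,
                   max_contigs=20):
    """Apply the three bin thresholds; defaults match the table above.

    Inclusive comparisons are an assumption of this sketch.
    """
    return (b.completeness >= min_mag_completeness
            and b.contamination <= max_mag_contamination
            and b.n_contigs <= max_contigs)


def backend_problems(backend, zones=None):
    """Return a list of problems with the backend-related inputs."""
    problems = []
    if backend not in {"Azure", "AWS", "GCP", "HPC"}:
        problems.append(f"unknown backend: {backend}")
    if backend in {"AWS", "GCP"} and not zones:
        problems.append(f"zones is required when backend is {backend}")
    return problems


# Hypothetical bins, for illustration only.
bins = [
    BinStats("bin_a", completeness=95.2, contamination=1.1, n_contigs=3),
    BinStats("bin_b", completeness=65.0, contamination=2.0, n_contigs=5),
    BinStats("bin_c", completeness=88.0, contamination=4.0, n_contigs=42),
]
kept = [b.name for b in bins if passes_filters(b)]
```

With the default thresholds, only `bin_a` is kept: `bin_b` falls below the completeness minimum and `bin_c` exceeds the contig limit.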
Workflow Outputs
The set of outputs generated by the Metagenomics workflow depends on whether any long or incomplete contigs pass quality filters.
Output | Description |
---|---|
Metagenome assembly | |
Contig binning | Long contig binning; incomplete contig binning |
Taxonomy assignment | Taxonomy assignment outputs will be generated if there is at least one long or incomplete bin passing filters. |
References
Reference datasets are hosted publicly for use in the pipeline.
Containers
Docker image definitions used by this workflow can be found in the wdl-dockerfiles repository. Images are hosted in PacBio’s quay.io. Docker images used in the workflow are pinned to specific versions by referring to their digests rather than tags.
The Docker image used by a particular step of the workflow can be identified by looking at the docker key in the runtime block for the given task.
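Looking up the image for each task can be automated with a simple line-based scan of a WDL file. The sketch below is illustrative only: it is not a real WDL parser, the function names are hypothetical, and the sample task, registry, and digest are placeholders rather than values taken from this workflow. It assumes each `docker` value is written as a double-quoted string on a single line.

```python
import re


def docker_by_task(wdl_text):
    """Map each task name to the docker value found in its runtime block.

    A simplified line scan: it tracks the current task, notes when a
    runtime block opens, and stops at that block's closing brace.
    """
    images = {}
    task = None
    in_runtime = False
    for line in wdl_text.splitlines():
        stripped = line.strip()
        task_match = re.match(r"task\s+(\w+)\s*\{", stripped)
        if task_match:
            task = task_match.group(1)
            in_runtime = False
        elif re.match(r"runtime\s*\{", stripped):
            in_runtime = True
        elif in_runtime and stripped.startswith("}"):
            in_runtime = False
        elif in_runtime and task:
            docker_match = re.match(r'docker\s*:\s*"([^"]*)"', stripped)
            if docker_match:
                images[task] = docker_match.group(1)
    return images


def is_pinned_by_digest(image):
    """True if the image reference uses a sha256 digest rather than a tag."""
    return "@sha256:" in image


# Placeholder task; a real digest is 64 hexadecimal characters.
SAMPLE_WDL = '''
task example_task {
    command <<<
        echo hello
    >>>
    runtime {
        docker: "example.registry/org/tool@sha256:0123abcd"
        cpu: 2
    }
}
'''

images = docker_by_task(SAMPLE_WDL)
```

For a robust inventory, a dedicated WDL parser would be a better choice than this scan, which ignores comments, multi-line strings, and nested structures.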